DEVELOPING AND VALIDATING NGSS-ALIGNED 3D LEARNING PROGRESSION FOR ELECTRICAL INTERACTIONS IN THE CONTEXT OF 9TH GRADE PHYSICAL SCIENCE CURRICULUM

By

Leonora Kaldaras

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Curriculum, Instruction and Teacher Education—Doctor of Philosophy
Measurement and Quantitative Methods—Dual Major

2020

ABSTRACT

DEVELOPING AND VALIDATING NGSS-ALIGNED 3D LEARNING PROGRESSION FOR ELECTRICAL INTERACTIONS IN THE CONTEXT OF 9TH GRADE PHYSICAL SCIENCE CURRICULUM

By

Leonora Kaldaras

The Framework for K-12 Science Education (the Framework) and the Next Generation Science Standards (NGSS) emphasize the usefulness of learning progressions (LPs) in aligning curriculum, instruction and assessment. The three dimensions of science form the basis of the theoretical LPs described in the Framework and used to develop the NGSS. The three dimensions are disciplinary core ideas (DCIs), scientific and engineering practices (SEPs) and crosscutting concepts (CCCs). The Framework defines three-dimensional learning (3D learning) as the ability to integrate DCIs, SEPs and CCCs to make sense of phenomena and solve problems. Engaging in 3D learning leads to developing a deep, useable understanding of science. While the Framework outlines LPs for each of the three dimensions, we currently have limited empirical evidence to show that LPs for 3D learning (3D LPs) can be developed and validated in practice. This dissertation shows the feasibility of developing and validating a large-grain 3D LP and a finer-grain 3D construct map in the context of an NGSS-aligned curriculum. The 3D LP focuses on the construct of electrical interactions, and the 3D construct map focuses on the construct of chemical bonding. Conceptually, the 3D construct map for chemical bonding is an integral part of the 3D LP for electrical interactions, but more narrowly scoped. The feasibility of using the assessment tools designed to probe levels of the 3D LP and the 3D construct map for assigning levels to individual answers and for characterizing student learning is demonstrated. These properties of a validated LP are essential for successful implementation of the NGSS.

This thesis is dedicated to Mr. Allen Baldwin, or simply Big Al. Thank you for helping me change my life for the better.

ACKNOWLEDGEMENTS

I would like to thank the following people, without whom I would not have made it through my PhD degree! My supervisor, Dr. Joseph (Joe) Krajcik, for his support both in work and life situations, constant encouragement, and valuable feedback. My friend and colleague, Dr. Hope Akaeze, for invaluable help in completing this dissertation. My academic advisor, Dr. Gail Richmond, for giving me an opportunity to pursue a career in education, and for supporting me in all my educational endeavors. My amazing support team, including Dr. Bob Geier, Cathrene (Sue) Carpenter and Dr. Joe Krajcik, who have helped me and my brother Kosta navigate through the most challenging times of our lives and remain on the path to pursuing our educational goals. My super-star dissertation committee, Dr. William Schmidt, Dr. Melanie Cooper, Dr. Gail Richmond and Dr. Mark Reckase, whose feedback helped me learn and improve my understanding of measurement and education. My amazing colleagues at the CREATE for STEM Institute for constant support and amazing work spirit. I am blessed to be working and learning alongside each and every one of you!
My brother, Kosta, for being with me throughout the PhD years, and still willing to keep up with me. I love and appreciate you always! My fiancé, Alonso, for not giving up on me, and for constantly pushing me to “get it done”. Your patience, love, and support made it happen. My little daughter Marina-Luisa, who is my sunshine. My parents, Marina and Nikolay: I cannot thank you enough; I have been truly blessed with the most amazing parents in the world. My host family, Neocles and Vassiliki Leontis, who have always believed in me and supported me unconditionally. My dear friend, Al Baldwin (Big Al), for being the most amazing human being I have ever met. May your soul rest in peace!

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1 A Methodology for Determining and Validating Latent Factor Dimensionality of Complex Multi-Factor Science Constructs Measuring Knowledge-In-Use
  Introduction
  Methodology
  Results
  Discussion
  APPENDIX
  BIBLIOGRAPHY
CHAPTER 2 Developing and Validating an NGSS-Aligned Learning Progression to Track Three-Dimensional Learning of Electrical Interactions in High School Physical Science
  Introduction
  Theoretical Framework
  Methodology
  Data Analysis
  Results
  Discussion
  APPENDIX
  BIBLIOGRAPHY
CHAPTER 3 Exploring Student Reasoning about Chemical Bonds from the Perspective of Energy and Force in the Context of an NGSS Classroom
  Introduction
  Theoretical Framework
  Methodology
  Results
  Discussion
  APPENDIX
  BIBLIOGRAPHY
CONCLUDING REMARKS

LIST OF TABLES

Table 1.1 Example of modified ECD process
Table 1.2 Scoring rubric example
Table 1.3 3D structure of unit 1 assessment
Table 1.4 3D structure of unit 2 assessment
Table 1.5 Summary of two time points EFA model fit for unit 1 and unit 2 assessment
Table 1.6 Summary of CFA model fit for unit 1 pre/post and unit 2 pre/post
Table 1.7 Reliability
Table 1.8 Two time points EFA model fit for unit 1 assessment
Table 1.9 Two time points EFA factor loadings for unit 1 assessment
Table 1.10 Two time points EFA model fit for unit 2 assessment
Table 1.11 Two time points EFA factor loadings for unit 2 assessment
Table 1.12 Measurement invariance analysis for unit 1 pre/post assessment
Table 1.13 Measurement invariance analysis for unit 2 pre/post assessment
Table 2.1 Hypothetical 3D LP for electrical interactions
Table 2.2 Example of mECD process
Table 2.3 Sample responses for every 3D LP level for paper and rod
Table 2.4 Sample responses for every 3D LP level for the foil experiment
Table 2.5 Sample responses that fall between levels of the 3D LP for paper and rod
Table 2.6 Sample responses that fall between levels of the 3D LP for the foil experiment
Table 2.7 Student score/3D LP level for each interview phenomenon
Table 2.8 Model comparison for GPCM and GRM
Table 2.9 S-X2 item fit statistics
Table 3.1 Hypothetical 3D construct map for chemical bonding
Table 3.2 Example of mECD process
Table 3.3 Sample responses for every 3D construct map level for the match on the hot plate
Table 3.4 Sample responses for every 3D construct map level for atoms forming a bond
Table 3.5 Sample responses that fall between levels of the 3D construct map
Table 3.6 Student score and 3D construct map level for each interview phenomenon
Table 3.7 Model comparison for GPCM and GRM
Table 3.8 S-X2 item fit statistics

LIST OF FIGURES

Figure 1.1 Summary of modified evidence centered design process
Figure 1.2 Theoretical latent structure of unit 1 assessment instrument
Figure 1.3 Theoretical latent structure of unit 2 assessment instrument
Figure 1.4 Measurement invariance model for unit 1 assessment instrument
Figure 1.5 Measurement invariance model for unit 2 assessment instrument
Figure 1.6 95% confidence interval for factor scores at different levels of observed (raw) score for unit 1 assessment instrument
Figure 1.7 95% confidence interval for factor scores at different levels of observed (raw) score for unit 2 assessment instrument
Figure 2.1 Summary of modified evidence centered design process
Figure 2.2 Wright map showing learning progression levels for unit 1 assessment items
Figure 2.3 Wright map showing distribution of respondents who provided answers on pre and post unit 1 test
Figure 2.4 Wright map showing learning progression levels for unit 1 pretest assessment items and distribution of respondents for the relevant cut points for students who provided answers on both pre and posttest
Figure 2.5 Wright map showing learning progression levels for unit 1 posttest assessment items and distribution of respondents for the relevant cut points for students who provided answers on both pre and posttest
Figure 2.6 Modified Wright map for pre unit 1 test showing student proficiency estimates and standard error bands from lowest to highest
Figure 2.7 Modified Wright map for post unit 1 test showing student proficiency estimates and standard error bands from lowest to highest
Figure 2.8 Q3 matrix
Figure 2.9 Person fit Zh statistics
Figure 3.1 Summary of modified evidence centered design process
Figure 3.2 Wright map showing 3D construct map levels for unit 2 assessment items
Figure 3.3 Wright map with respondents who provided answers on pre/post unit 2 test
Figure 3.4 Modified Wright map for pre unit 2 test showing student proficiency estimates and standard error bands from lowest to highest
Figure 3.5 Modified Wright map for post unit 2 test showing student proficiency estimates and standard error bands from lowest to highest
Figure 3.6 Q3 matrix
Figure 3.7 Person fit Zh statistics

INTRODUCTION

The Framework for K-12 Science Education (the Framework) and the Next Generation Science Standards (NGSS) emphasize the developmental nature of science understanding and stress the importance of supporting students in developing useable understanding of big ideas in science coherently over time through three-dimensional (3D) learning strategies (National Research Council [NRC], 2012; Standards [NGSS], 2013). The Framework defines 3D learning as the ability to integrate the three dimensions of science, which include scientific and engineering practices (SEPs) and crosscutting concepts (CCCs), to make sense of the disciplinary core ideas (DCIs).
The developmental nature of student understanding is reflected in the idea of a learning progression, which describes the development of science understanding as a series of increasingly sophisticated steps towards understanding of big ideas in science (NRC, 2012). While the Framework and NGSS promote using learning progressions as a tool to help organize curriculum, instruction and assessment, validated NGSS-aligned learning progressions are not currently available in practice. While both the Framework and NGSS provide outlines of theoretical learning progressions for DCIs, SEPs and CCCs, detailed validated learning progressions that integrate the three dimensions of science and show what student understanding looks like at each level of sophistication in terms of the ability to integrate the three dimensions are yet to be developed. Without such three-dimensional learning progressions (3D LPs) developed and validated in practice, successful implementation of the ideas of the NGSS and the Framework will be more difficult. This work presents the first example of an NGSS-aligned learning progression that integrates the three dimensions of science and demonstrates immediate pedagogical use in terms of providing information about the location of each individual student on the 3D LP with 68% confidence. The 3D LP presented here focuses on electrical interactions and is validated in the context of a previously designed NGSS-aligned curriculum for 9th grade Physical Science called “Interactions”.¹ The curriculum helps students build understanding of electrical interactions starting from the macroscopic and moving to the atomic-molecular level. The curriculum consists of four units. Unit 1 focuses on building student understanding of electrical interactions grounded in ideas of electrical charges, fields and forces at the macroscopic level, and introduces the atomic nature of matter. Unit 2 adds ideas of energy at the macro and atomic-molecular level and helps students build an integrated understanding of electrical forces and energy to explain phenomena related to intermolecular interactions and chemical bonding. Units 3 and 4 help students build their understanding of electrical interactions at the atomic-molecular level by focusing on hydrophobic/hydrophilic interactions and protein folding. The 3D LP presented in this study is aligned to the same NGSS performance expectations as the “Interactions” curriculum. The 3D LP uses the same DCIs as the “Interactions” curriculum to describe the progression of understanding of electrical interactions as a continuum of ideas focused on principles of electrostatic attraction and energy that can be used to explain interactions between charged macroscopic objects, formation of chemical bonds, and intermolecular interactions. The process of developing a learning progression involves specifying in detail what student understanding looks like at each level of sophistication (Duschl, Schweingruber, & Shouse, 2007). In this study this process was grounded in relevant research literature, including the Framework and NGSS, and feedback from disciplinary and pedagogical experts.

¹ This work is supported in part by the NSF grant “Developing and Testing a Model to Support Student Understanding of the Sub-Microscopic Interactions that Govern Biological and Chemical Processes”, National Science Foundation, DRL-1232388. All opinions in the dissertation are those of the author and not NSF.
The validation of a learning progression is carried out by developing assessment instruments capable of probing student understanding at each level of the learning progression, and then collecting and analyzing assessment data from these instruments to see whether student response data supports the theoretically suggested progression of understanding described by the LP (Wilson, 2009). In the context of this study, assessment instruments were designed to probe student understanding of electrical interactions before and after each of the four curriculum units was completed. Due to limited time and resources, items were selected from the Unit 1 and Unit 2 assessment instruments only and analyzed using Item Response Theory (IRT) approaches to obtain validity evidence for the theoretically suggested progression of student understanding of electrical interactions. Additionally, two items from the Unit 1 and Unit 2 assessment instruments were used to conduct interview analysis with selected students to gain qualitative validity evidence and help describe student 3D understanding of electrical interactions at each level of sophistication in greater detail. The resulting study presented in this dissertation consists of three interconnected parts. The first part (chapter 1) focuses on demonstrating internal latent structure validity evidence and reliability of the Unit 1 and Unit 2 assessment instruments using confirmatory and exploratory factor analysis approaches. This study provides important evidence on the latent dimensionality of the two assessment instruments, which must be examined prior to conducting IRT analysis in order to choose an IRT model that most accurately represents a given data sample. The study presented in chapter 1 focuses on the construct of electrical interactions; even though this construct is represented by multiple DCIs that are introduced in a specific sequence throughout the “Interactions” curriculum, the only latent construct of interest that all the DCIs contribute towards is electrical interactions. However, the construct of electrical interactions is not merely focused on student recollection of the DCIs, but on the ability to apply understanding of electrical interactions to explain various phenomena by integrating relevant DCIs with SEPs and CCCs. Therefore, these NGSS dimensions might affect the latent dimensionality of the construct such that even if the hypothesized latent structure is described as one-dimensional (e.g., student understanding of electrical interactions), in practice the combination of DCIs, SEPs and CCCs is inherently multidimensional (Gorin & Mislevy, 2013) and might therefore manifest as separate latent dimensions. To the author’s knowledge, there have not been any studies that focus on examining the relationship between the dimensions of NGSS (DCIs, SEPs, CCCs) and the latent dimensionality of assessment instruments built following principles of integrating the three dimensions. In the first part of this study the data from the Unit 1 and Unit 2 assessment instruments is used to demonstrate that student ability to integrate the three dimensions of NGSS is in fact manifested as a single latent construct. Specifically, the study presented in chapter 1 shows that student ability to integrate relevant DCIs (including ideas related to understanding of Coulomb’s law in Unit 1, and energy and chemical reactions in Unit 2) with SEPs and CCCs to explain phenomena related to electrical interactions manifests as single latent constructs.
The results of this study have important implications for developing and validating assessments that measure student understanding of complex constructs integrating the three dimensions of NGSS in practice. In the context of this work, the study presented in chapter 1 provides important information about the latent dimensionality of the constructs measured by the Unit 1 and Unit 2 assessment instruments related to student understanding of electrical interactions, which makes it possible to conduct a more accurate IRT analysis to obtain quantitative validity evidence for the studies described in chapters 2 and 3. The second part of the study (chapter 2) introduces the 3D LP for electrical interactions aligned to NGSS performance expectations and validated using assessment data from the Unit 1 assessment instrument only. The 3D LP described in chapter 2 represents a large-grain 3D LP for electrical interactions spanning one academic year. Chapter 2 provides both quantitative (IRT) and qualitative (student oral interview) analyses of the Unit 1 assessment data to demonstrate validity of the 3D LP for electrical interactions. The 3D LP presented in chapter 2 describes aspects of relevant DCIs, SEPs and CCCs at each level of sophistication. The qualitative analysis of student interviews made it possible to construct detailed descriptions of what student understanding of ideas related to electrical forces, fields and charges looks like at each level of the 3D LP. IRT analysis further demonstrated that the progression of understanding described by the 3D LP is supported by large-scale student response data. Finally, the study in chapter 2 demonstrates that the resulting 3D LP can be used to place individual students on a level of the 3D LP with 68% confidence, which suggests immediate pedagogical applicability of the designed LP. The study presented in chapter 3 introduces a finer grain size 3D construct map for chemical bonding validated using Unit 2 assessment data. As mentioned above, in the “Interactions” curriculum students develop understanding of chemical bonding as an extension of the same ideas related to electric forces and energy that govern interactions between charged macroscopic objects and molecules. Therefore, in essence, the 3D construct map for chemical bonding presented in chapter 3 is an integral part of the 3D LP for electrical interactions discussed in chapter 2, but more narrowly focused on exploring student reasoning about chemical bonding, and validated using assessment items specifically focused on electrical interactions in the context of chemical bonding. The major contribution of the work presented in chapter 3 is that it presents a 3D construct map for chemical bonding grounded in principles outlined in the Framework and NGSS, focusing on building student understanding of chemical bonding not as a heuristic based on the octet rule and memorization of valency states, but as a state of a system of interacting atoms defined by the balance of attractive and repulsive interactions, which leads to energy minimization. The validity evidence from both student interviews and IRT analysis demonstrates consistency between the hypothesized progression of student understanding outlined in the 3D construct map and empirical student response data. The 3D construct map for chemical bonding presented in chapter 3 can also be used to place individual students on a level with 68% confidence, which likewise demonstrates immediate pedagogical applicability of the 3D construct map.
To summarize, this dissertation presents studies that demonstrate how 3D learning can be successfully described and measured in practice: developing NGSS-aligned 3D LPs that integrate DCIs, SEPs and CCCs, developing valid and reliable assessment instruments capable of measuring student progress along the levels of the 3D LPs, and obtaining quantitative and qualitative validity evidence for the 3D LPs using the designed assessment instruments. The studies presented here provide valuable insights into how student progress in the NGSS classroom can be measured effectively, therefore helping enact the vision of the Framework and NGSS in practice.

CHAPTER 1

A Methodology for Determining and Validating Latent Factor Dimensionality of Complex Multi-Factor Science Constructs Measuring Knowledge-In-Use

Introduction

Historically, curriculum, instruction and assessment based on state and local standards have focused on memorization of a large number of scientific facts without understanding fundamental scientific principles. In the age of technology, when information is readily available through a variety of sources, a memorization-based system of education is becoming obsolete. Instead, there is increasing demand for integrated, deep understanding of the key ideas in science that translates into the ability to apply scientific concepts to explain phenomena and solve real-life problems (National Research Council [NRC], 2007; National Research Council [NRC], 2013a; Ercikan & Oliveri, 2016). Because of the growing mismatch between the educational demands of 21st century society and the actual products of the present-day educational system, there has been significant effort from scientists and educational researchers to shift the focus of classroom instruction from fact-based memorization of ideas to supporting complex cognitive processes aimed at developing knowledge application skills. These efforts resulted in the publication of several reports by the National Research Council and the release of the Next Generation Science Standards reflecting this vision (NRC, 2013a; National Research Council [NRC], 2012; Standards [NGSS], 2013). To meet the requirements consistent with the new vision, assessment also needs to change. Specifically, it needs to shift from measuring simple constructs reflecting a fact-based memorization learning trajectory to measuring complex constructs comprised of elements related to content and skills focusing on application of knowledge (Ercikan & Oliveri, 2016; NRC, 2013a; Pellegrino & Hilton, 2012). The challenges associated with developing these types of assessment are multiple. First, we need to understand the fundamental difference between simple and complex constructs. In the context of science disciplines, simple constructs focus primarily on content. For example, traditional tests assess student ability to recite formulas, reproduce definitions, calculate outcomes based on memorized equations, and so on. Additionally, test items are usually devoid of context, and represent content-based assessment unrelated to real-life situations. This approach is not suitable for measuring student ability to apply knowledge, which demands that assessment be situated in a real-life problem that requires a solution, or a phenomenon that needs to be explained using appropriate science ideas (Pellegrino, Wilson, Koenig, & Beatty, 2014; DeBarger, Penuel, Harris, & Kennedy, 2016).
The focus of this assessment is not pure content, but also the skills and competencies associated with the ability to apply that content (DeBarger et al., 2016). These skills/competencies, along with content, represent different components of learning that combine to form a complex construct (Ercikan & Oliveri, 2016). The integration of these components makes complex constructs multidimensional in the sense that these constructs no longer reflect content knowledge alone, but also the relevant skills/competencies required to apply it. This fundamental difference between simple and complex constructs has important implications for developing valid assessments. In order to ensure valid interpretation of assessment results, the assessment development process should focus on complex constructs modeled using relevant learning theories and supported by observed student response data (Pellegrino, Wilson, Koenig, & Beatty, 2014). The two major uses of assessment results are measuring growth in performance and assessing subdomains (Reckase, 2017). Policy makers use growth in performance on high-stakes tests for the purpose of holding schools and teachers accountable, while individual teachers and schools use assessment of subdomains to obtain information regarding student performance on assessment related to specific instructional content (Reckase, 2017). Using assessment results for evaluating growth in performance is only valid if the assumption of a common unidimensional continuum (a single latent construct reflecting science proficiency) across grades holds. Using assessment results for evaluating student performance on various subdomains is only valid if the latent dimensions presumably measured by the assessment instrument indeed reflect student performance on the subdomains of interest. Therefore, for both high-stakes and diagnostic assessment, understanding the dimensionality of a complex construct in a psychometric context is fundamental to designing valid assessments (Gorin & Mislevy, 2013). Specifically, we need to understand whether the different components of complex constructs (content and skills/competencies) manifest as separate latent dimensions psychometrically. This will help in understanding how components of complex constructs relate to dimensions of variation in student response data, leading to more meaningful interpretation of student performance on both types of assessments. The current work demonstrates a development and validation process for assessment of complex constructs focusing on latent structure validation. It starts by demonstrating the process of developing assessment instruments for complex constructs grounded in the learning theories used to develop the most current science standards. It further describes steps for creating a theoretical validity argument for measuring complex constructs, designing an operational assessment instrument based on this argument, and obtaining empirical evidence for the theoretical validity of the argument and assessment instrument. The presented validity evidence includes response process-based validity and internal latent structure-based validity. This work is situated in the context of the Framework for K-12 Science Education (the Framework), which defines deep science understanding as a student’s ability to integrate Scientific and Engineering Practices (SEPs) and Crosscutting Concepts (CCCs) to make sense of Disciplinary Core Ideas (DCIs) in the context of real-life phenomena. DCIs, SEPs and CCCs are referred to as the three dimensions of science.
Disciplinary core ideas are different from content traditionally defined in previous standards in that they represent a few fundamental ideas in science that are essential for building deep understanding and explaining phenomena. Scientific and engineering practices are the authentic practices scientists engage in when making sense of phenomena or solving problems. Crosscutting concepts are lenses one can use to look at natural phenomena when making sense of the world. A student’s ability to integrate the three dimensions of science is called “three-dimensional learning”. Three-dimensional learning (3D learning) and the vision of science education expressed in the Framework became the basis of the Next Generation Science Standards (NGSS), which are expressed as performance expectations that combine a DCI, an SEP and a CCC, and focus on explaining phenomena or solving problems using the three dimensions, thereby promoting development of knowledge application ability in students. In the context of assessment, the three dimensions of science become components of the complex constructs that need to be assessed. The theoretical premise of 3D understanding, according to the Framework, is that these three dimensions are inseparable and should be integrated together in curriculum, instruction and assessment. This argument is grounded in situated learning, which states that students cannot learn content separate from context (NRC, 2007), and in a developmental approach, which states that deep understanding takes time and appropriate scaffolding to develop (Smith, Wiser, Anderson, & Krajcik, 2006). In other words, the integration of the three dimensions discussed in the Framework suggests that these three dimensions are components of one complex construct that reflects student understanding of a specific aspect of science, and should manifest as a single latent construct in psychometric analysis. However, it is also possible that the three dimensions manifest as separate latent constructs (Gorin & Mislevy, 2013). These different outcomes for the resulting dimensionality have implications for developing 3D assessments and reporting student progress in the context of NGSS on both high-stakes and diagnostic assessment. To the author’s knowledge, no formal studies on the dimensionality of NGSS-aligned assessments have been conducted. Previously conducted studies provide compelling validity evidence based on response process, showing that the designed tasks indeed elicit the responses expected from the evidence centered design (ECD) argument (DeBarger et al., 2016; Gane, McElhaney, Zaidi, & Pellegrino, 2018; Gane, McElhaney, Zaidi, & Pellegrino, 2019). However, no research is currently available on the dimensionality of NGSS-aligned tasks, which is a prerequisite for choosing appropriate psychometric models that allow one to quantitatively evaluate the performance of items and students on the assessment and provide information about student growth. The current study builds on previously conducted research on argument-based validation (DeBarger et al., 2016; Gane et al., 2018; Gane et al., 2019), and demonstrates a detailed investigation of the dimensionality of an NGSS-aligned assessment instrument that incorporates the theoretical assumptions of 3D learning, including integration of the three dimensions, situated learning, and the developmental nature of student understanding.
The methodology presented here expands validity evidence for NGSS-aligned assessment of complex constructs beyond qualitative response process based on the ECD approach to include internal latent structure-based validity. This is an essential piece of validity required for conducting meaningful latent trait analysis. To develop 3D assessment instruments aligned to NGSS, the study employs the modified evidence-centered design (mECD) approach (Harris, Krajcik, Pellegrino, & DeBarger, 2019). The theoretical argument resulting from the mECD process specifies what aspects of the three dimensions related to the complex construct of interest are being measured, and what evidence from student answers is needed to draw conclusions about proficiency levels, which are also clearly specified as part of the argument. The degree to which observed item difficulty and overall student performance on the test are consistent with those suggested by the mECD argument provides evidence towards validity of inferences based on student performance on the assessment instrument, or response process-based validity (Geisinger, Bracken, Carlson, Hansen, Kuncel, Reise, & Rodriguez, 2013). These assessment instruments are further used to gather validity evidence based on internal latent structure analysis to show that the integration of the three dimensions suggested by theory is indeed supported in practice. Specifically, since the theoretical premise of 3D learning is that deep scientific understanding is manifested in student ability to use the three dimensions of science simultaneously when making sense of phenomena, it suggests that 3D understanding of a complex construct should be manifested as a single latent construct when evaluating the internal latent structure of an assessment instrument. For example, one of the assessment instruments in this work focuses on measuring student 3D understanding of Coulomb’s Law, which is a complex construct. Each assessment item measuring the construct contains an aspect of a disciplinary core idea related to Coulomb’s Law, a scientific practice, and a crosscutting concept. If 3D learning theory is to be supported in practice, a single latent construct should be observed in the internal latent structure analysis instead of three different ones pertaining to the DCI, CCC and SEP. To evaluate the feasibility of this theoretical statement, without imposing any supposition on the measures, two-time-point exploratory factor analysis with an invariant loading structure is used to explore the theoretical dimensionality of an assessment instrument suggested by the mECD argument across time. It is important to point out that while the Framework suggests interpreting the three dimensions of science as integral parts of a single complex construct, it is not clear whether the three dimensions manifest as separate latent dimensions psychometrically. Therefore, there are not enough theoretical grounds to initially use a confirmatory factor analysis approach, which requires rigid specification of the latent structure. Exploratory factor analysis (EFA), on the other hand, allows one to explore and generate hypotheses for the most plausible factor solution by taking into consideration the possibility that the three dimensions may in fact manifest as separate latent constructs.
Finally, the most plausible model based on the EFA results is used to conduct a confirmatory factor analysis (CFA)-based measurement invariance examination to verify that the theoretical dimensionality of the assessment instruments is supported by student response patterns across time. The CFA-based invariance analysis provides an additional source of validity evidence based on internal latent structure (Geisinger et al., 2013; Dimitrov, 2010). The following sections demonstrate in detail the process of obtaining all three types of validity evidence: response process, EFA, and CFA-based measurement invariance.

Methodology

Assessment Context: “Interactions” curriculum

The assessment instrument developed here aligns with NGSS-aligned curriculum materials for 9th grade Physical Science called “Interactions”. The curriculum focuses on helping students develop understanding of three-dimensional NGSS performance expectations (PEs) related to electrical interactions at the macroscopic and microscopic levels. The materials consist of four units: Unit 1 focuses on electric charge and forces at the macroscopic and atomic scales; Unit 2 focuses on energy and its relation to electric forces; and Units 3 and 4 apply ideas from prior units to build understanding of intermolecular interactions. Each unit has an associated assessment instrument aligned to relevant NGSS PEs, administered before and after the unit is studied in the classroom. In this paper, only the assessment instruments from Unit 1 and Unit 2 are examined because the other units were yet to be implemented when the data was collected.

Development of assessment instrument for measuring complex NGSS constructs

The modified evidence-centered design (mECD) process (Harris et al., 2019) is used to develop assessments that show evidence of 3D learning. This approach ensures that scores on every item of the test can be meaningfully interpreted in terms of what level of understanding of complex science constructs students have developed, as defined by NGSS and the Framework, and what pieces and skills students are missing to develop higher levels of understanding. The mECD process is shown in Figure 1.1, a modified schematic from Harris et al. (2019).

Figure 1.1 Summary of modified evidence centered design process

The first step of the mECD approach involves unpacking NGSS performance expectations (NGSS PEs) in order to develop a claim that describes what students should be able to do to demonstrate their 3D understanding of complex science constructs. Each claim incorporates an aspect of a DCI, an SEP and a CCC. The next step involves specifying the evidence that shows students have met the requirements of the claim. Both claim and evidence are closely related to the learning goals of the “Interactions” curriculum and aligned to NGSS PEs. Finally, assessment tasks are developed that provide the necessary evidence to measure the claim. An example of how this process was used to develop assessments is discussed below. For example, one of the NGSS PEs addressed in Unit 1 is:

NGSS PE: HS-PS2-4. Use mathematical representations of Newton’s Law of Gravitation and Coulomb’s Law to describe and predict the gravitational and electrostatic forces between objects.

Once each of the dimensions was thoroughly unpacked, a three-dimensional claim and evidence were developed to specify which part of the NGSS PE above would be assessed as part of the Unit 1 test.
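For reference, the mathematical representations named in this PE are the familiar inverse-square laws; the curriculum itself addresses only the qualitative relationships they encode, namely that the forces grow with the masses or charges and weaken with distance:

\[ F_g = G\,\frac{m_1 m_2}{r^2}, \qquad F_e = k\,\frac{|q_1 q_2|}{r^2} \]

where \(G\) is the gravitational constant, \(k\) is Coulomb's constant, \(m_1, m_2\) are the interacting masses, \(q_1, q_2\) the charges, and \(r\) the distance between the objects.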
Since the curriculum only focused on a qualitative representation of the Coulomb’s Law relationship, the parts of the NGSS PE that were assessed are those concerned with describing and predicting the electrostatic forces between objects. Table 1.1 below illustrates the process of developing assessment items to track 3D learning in more detail, beginning with the claim and evidence that stem from the unpacking process.

Table 1.1 Example of modified ECD process

Claim: Students should construct a causal model that shows how objects become charged using electron transfer to explain attractive and repulsive interactions between objects.

Evidence: Students will include these ideas in their models to explain phenomena:
1. Objects are initially neutral (# e = # p).
2. Transfer of electrons between atoms of one object and the atoms of another object causes both objects to become charged.
3. Electron transfer is caused by contact (touching or rubbing).
4. Net charge on an object is caused by gaining electrons (“-” charge) or losing electrons (“+” charge).
5. # e lost = # e gained; charge is conserved.
6. Models show a causal relationship between components of atoms (electrons), distance, and the generated electric forces and fields.

Task: Students are shown a video where fur and rod do not attract paper before they are rubbed together. Upon being rubbed together, both fur and rod start attracting paper. Draw a model that shows what happens to the rod and fur when they are rubbed together to cause the paper to move towards the rod. Make sure to label everything in your model. Describe what happens to the rod and fur during the process of rubbing them together.

The item presented in Table 1.1 requires students to engage in the SEP of Developing and Using Models using aspects of the DCI of Types of Interactions, specifically related to Coulomb’s Law. In addition, it links to the CCC of Cause and Effect, as students must provide a causal relationship between distance, charge, and the associated attractive force and electric field. Notice that the original NGSS PE contains the CCC of Patterns and the SEP of Using Mathematical and Computational Thinking. However, it is acceptable to use a CCC and SEP different from those in the PE for the purposes of assessment, as long as each item contains all three dimensions: a DCI, an SEP and a CCC. Table 1.2 below shows the scoring rubric for the item in Table 1.1. The scoring rubric is built on the idea that the three dimensions (DCI, SEP, CCC) are inseparable from each other and represent elements of one complex construct. Therefore, the score for each item is assigned based on student ability to integrate all three dimensions of a given complex construct when modeling and/or explaining a particular phenomenon.

Table 1.2 Scoring rubric example

Item: Draw a model that shows what happens to the rod and fur when they are rubbed together to cause the paper to move towards the rod. Label everything in your model. Describe what happens to the rod and fur during the process of rubbing them together.

0 pts: No answer/justification is not clear.

1 pt:
DCI (Types of Interactions):
• Neutral objects: charge represented as static/fuzz/magnet.
• Charged objects: rubbing causes static/magnetizes rod, causing attraction.
• No relationship between distance and magnitude of force (Coulomb’s Law) to explain the attractive force between charged rod and paper.
• Paper/rod have no micro-level components.
SEP and CCC (Developing and Using Models; Cause and Effect):
• Models/explanations are causal at the macro level; they show that rubbing causes a “static” effect (or “magnetic” effect) that causes paper to stick to the rod.

2 pts:
DCI (Types of Interactions):
• Neutral objects: before rubbing, rod is neutral (equal # of +/- charges; charges represented as point charges, not atoms). Paper/rod contain point charges (+/-).
• Charged objects: rubbing causes charge transfer, charging the rod (unequal # of +/-).
• The closer the charged rod is to the paper, the greater the attraction (Coulomb’s Law).
SEP and CCC (Developing and Using Models; Cause and Effect):
• Models/explanations are causal at the macro/micro level; rubbing causes transfer of charge (fur to rod); paper sticks to the charged rod because neutral and charged objects attract.

3 pts:
DCI (Types of Interactions):
• Neutral objects: before rubbing, rod is neutral (charges shown as components of the atoms making up the objects; equal # of protons (+) and electrons (-) in the atoms of the rod).
• Charged objects: rubbing causes electron transfer; atoms of the rod gain electrons.
• Charged rod attracts neutral paper when brought close to it because an electric field is generated around the charged rod, causing an attractive force between electrons in the charged rod and protons in the atoms of the neutral paper (Coulomb’s Law). (Note: mention of electric field is optional.)
• Paper/rod are made up of atoms (Bohr or probabilistic model of electrons).
SEP and CCC (Developing and Using Models; Cause and Effect):
• Models/explanations are causal at the microscopic level; they show that as the charged rod moves close to the neutral paper, a repulsive force is generated between electrons in the atoms of the paper and electrons of the rod. This force causes paper electrons to move away, resulting in temporary charge separation within the paper, which causes attraction between electrons in the rod and the temporary partial positive charge on the paper.

The scoring rubric emphasizes causal, microscopic-level mechanistic thinking, and reflects the developmental nature of student understanding. In this case, the highest score (3 pts) reflects student ability to integrate all three dimensions at the microscopic level and provide a detailed, microscopic-level causal mechanism to explain the phenomenon in question. Lower scores, on the other hand, reflect macroscopic-level thinking with a causal mechanism at the macro/micro level (2 pts), macroscopic-level thinking with elements of a causal mechanism at the macro level (1 pt), or no answer (0 pts). The scoring rubric is closely aligned to the mECD argument, and shows teachers what aspects of 3D thinking are missing from student answers if they do not get full credit. The rubric therefore guides teachers as to what aspects of the NGSS performance expectation related to Coulomb’s Law need to be further emphasized during instruction in order to help students further develop understanding of this complex construct.

Theoretical Latent Structure of Assessment Instruments

Figure 1.2 Theoretical latent structure of unit 1 assessment instrument (a single latent factor, F1: 3D understanding of Coulomb’s Law, with items Q1-Q8 loading on it)

The theoretical latent structure for the complex construct measured on the Unit 1 assessment is shown in Figure 1.2.
Unit 1 measured the complex construct of students’ 3D understanding of phenomena involving Coulomb’s Law, focusing on electrical forces, fields and charges at the macroscopic and atomic levels. There were eight 3D items. For both units, all items were open-ended and scored by a proficient team of graders using a rubric similar to the one described in Table 1.2. All eight questions assessed an aspect of the DCI of PS2: Motion and Stability: Forces and Interactions (specifically, PS2.B: Types of Interactions) and the CCC of Cause and Effect. Five questions focused on the SEP of Developing and Using Models, and three focused on the SEP of Constructing Explanations. Therefore, in total, four different aspects of the NGSS dimensions were assessed on the Unit 1 test. The items were designed in the form of testlets. The number of items in each testlet was not the same, and depended on what aspects of the phenomenon in question students were expected to focus on in their answers to fully evaluate the claim produced from the mECD process. Table 1.3 summarizes the three dimensions assessed by each item.

Table 1.3 3D structure of unit 1 assessment

Item  DCI                     SEP                           CC
1     Types of Interactions   Developing and Using Models   Cause and Effect
2     Types of Interactions   Developing and Using Models   Cause and Effect
3     Types of Interactions   Developing and Using Models   Cause and Effect
4     Types of Interactions   Constructing Explanations     Cause and Effect
5     Types of Interactions   Developing and Using Models   Cause and Effect
6     Types of Interactions   Developing and Using Models   Cause and Effect
7     Types of Interactions   Constructing Explanations     Cause and Effect
8     Types of Interactions   Constructing Explanations     Cause and Effect
(The eight items were grouped into two testlets.)

In Unit 2 students continued to investigate phenomena related to electrical interactions. They added the complex construct related to energy to their model of how charged objects interact. They also started discussing the complex construct of chemical reactions, focusing on bond-formation and bond-breaking processes and the energy changes during bond formation and bond breaking. The mECD argument discussed below describes student 3D understanding of energy and of chemical reactions as distinct complex constructs because each requires mastery of a different set of DCIs and SEPs, as specified in Table 1.4 below. In order to demonstrate mastery of both complex constructs, students should demonstrate the ability to integrate the appropriate DCIs with SEPs and CCCs. Simple recollection of facts related to the DCIs for the two constructs is not sufficient to demonstrate mastery of these complex 3D constructs. There were a total of eight items in the Unit 2 assessment instrument. The theoretical latent structure for the Unit 2 assessment instrument based on the mECD argument is shown in Figure 1.3 below.

Figure 1.3 Theoretical latent structure of unit 2 assessment instrument (two correlated latent factors: F1, 3D understanding of Energy, measured by Q1-Q5, and F2, 3D understanding of Chemical Reactions, measured by Q6-Q8)
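To make the hypothesized structures in Figures 1.2 and 1.3 concrete, the model statements below sketch how they could be specified in Mplus, the software used for the latent structure analyses reported here. This is an illustrative sketch only, not the actual input files from the Appendix: file names such as unit1.dat are hypothetical, and the two MODEL blocks belong in two separate input files.

! Unit 1: single-factor structure (Figure 1.2)
DATA:     FILE = unit1.dat;            ! hypothetical file name
VARIABLE: NAMES = q1-q8;
          CATEGORICAL = q1-q8;         ! polytomous 0-3 item scores
ANALYSIS: ESTIMATOR = WLSMV;           ! robust estimator for ordered-categorical items
MODEL:    f1 BY q1-q8;                 ! F1: 3D understanding of Coulomb's Law

! Unit 2: correlated two-factor structure (Figure 1.3), in a separate input file
MODEL:    f1 BY q1-q5;                 ! F1: 3D understanding of Energy
          f2 BY q6-q8;                 ! F2: 3D understanding of Chemical Reactions
          f1 WITH f2;                  ! factor correlation (the arrow in Figure 1.3)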
The first five questions assessed aspects of the DCI of PS3: Energy (specifically, PS3.C: relationship between energy and forces), CC of Cause and Effect, and the SEP of Developing and Using Models. Questions 6-8 assess aspects of the DCI of PS1: Matter and its Interactions (specifically, PS1.B: Chemical Reactions) and the CC of Cause and Effect. Question 6-8 focuses on SEP of Constructing Explanations. Therefore, there were total five different aspects of NGSS dimensions assessed on Unit 2 test. The items were designed in a form of testlets. The number of items in each testlet was not the same, and depended on what aspects of a phenomenon in question students needed to focus on in their answers to fully evaluate a claim from the mECD process. Additionally, F1 latent dimension had more items than F2 latent dimension because in the Unit 2 curriculum more time and instruction was focused on building student 3D 20 understanding of Energy than student 3D understanding of chemical reactions. Table 4 below summarizes the aspects of the three dimensions assessed by each item on the test. Table 1.4 3D structure of unit 2 assessment Testlet 1 Item DCI 1 2 3 4 5 Rel.btw. Energy and Forces Rel.btw. Energy and Forces Rel.btw. Energy and Forces Rel.btw. Energy and Forces Rel.btw. Energy and Forces 2 3 6 7 8 Chemical Reactions Chemical Reactions Chemical Reactions SEP Dev. and Using Models Dev. and Using Models Dev. and Using Models Dev. and Using Models Dev. and Using Models Construct. Explanations Construct. Explanations Construct. Explanations CC Cause and effect Cause and effect Cause and effect Cause and effect Cause and effect Cause and effect Cause and effect Cause and effect Sample. The assessment instruments for units 1 and 2 were administered in six schools in the Mid-West and five schools in West United States. Schools in the Mid-West were rural type with 28% free and reduced lunch. Schools in the Western part of US were urban type with 72.4% free and reduced lunch. The assessment was administered in classrooms where the “Interactions” curriculum was piloted during Fall 2016 and Spring 2017. The total sample size is 899 students. Teachers in the Mid-West schools have taught the “Interactions” curriculum prior to data collection year, and teachers in Western part of the US were first time users of the curriculum. Students on average had very little prior knowledge of the constructs measured by the two assessment instruments as based on pre-unit interview data (see Chapter1 and Chapter 3 for interview data analysis). Data Analysis. Each assessment was administered before and after the corresponding unit was completed. Each assessment item was scored on 0-3-point scale following the rubric similar to 21 the one shown in Table 2. Inter-rater reliability was established qualitatively with multiple scorers. Both EFA and CFA-based measurement invariance were conducted using proper techniques for dealing with categorical data. Details of model estimation are provided in the appendix. Reliability Reliability analysis was conducted separately on the pre- and post-tests for each unit using summed item scores following methods described in Green & Yang, 2009. Reliability analysis was conducted in SAS software. The code is provided in the Appendix. 
Exploratory Factor Analysis to Study Latent Structure of the Complex Constructs

This work uses two-time-point exploratory factor analysis (EFA) with factor loading invariance and correlated residuals across time to explore the number of latent dimensions for the Unit 1 and Unit 2 assessment instruments. This approach uses a flexible EFA framework to explore possible latent structures without the rigid specification of factor loadings required by confirmatory factor analysis (CFA) based approaches (Asparouhov & Muthén, 2009). This flexibility helps provide evidence for accurately determining the dimensionality of an assessment instrument by approximating multiple possible latent structures. Additionally, the approach imposes factor loading equality across time, which allows the latent structure to be approximated more accurately by considering the pre- and post-unit assessment data jointly. The two-time-point EFA used 40% of the sample (322 students). The EFA was conducted in Mplus, and the code is provided in the Appendix.

For Unit 1, four different aspects of the NGSS dimensions were assessed on the test (see Table 3). Therefore, EFA models with one, two, three, and four factors were estimated, and only the 1-factor model converged to an admissible solution.2 For Unit 2, although five different aspects of the NGSS dimensions were assessed on the test, EFA models with one through four factors were estimated; a 5-factor EFA model could not be estimated due to lack of degrees of freedom. Only the 2-factor model converged to an admissible solution.3 Table 5 below shows model fit for the two EFA models that converged.

Table 1.5 Summary of two time points EFA model fit for unit 1 and unit 2 assessment

Parameter  χ2     χ2 p value  CFI/TLI      RMSEA
Unit 1     174.2  <0.001      0.995/0.994  0.047
Unit 2     98.4   0.2540      0.999/0.998  0.017

The chi-square model fit test for the Unit 1 1-factor EFA solution rejects the proposed hypothesis. However, the chi-square test is sample sensitive: it rejects reasonable models if the sample size is large (Kline, 2010; McDonald & Ho, 2002). Therefore, other indexes were used to evaluate model fit, including CFI/TLI and RMSEA. As suggested in the literature, CFI/TLI > 0.900 and RMSEA < 0.08 were used as criteria (McDonald & Ho, 2002; Kline, 2010; Van de Schoot et al., 2012). Based on these criteria, the model fit for the Unit 1 1-factor EFA model is acceptable. Similarly, the Unit 2 2-factor model fit is acceptable on all indexes (chi-square, CFI/TLI, RMSEA).

2 Two-, three-, and four-factor models estimated for Unit 1 gave inadmissible solutions due to negative variances and unacceptable model fit.
3 One-, three-, and four-factor models estimated for Unit 2 gave inadmissible solutions due to negative variances and rejected model fit.

Confirmatory Factor Analysis for Unit 1 and Unit 2 at Each Time Point

Unlike in EFA, where all measured variables (items) are related to all factors at a given time point, CFA requires rigid specification of the latent structure: the analyst indicates which items load on which factors, and all other loadings are set to zero. This analysis provides additional evidence for the validity of the internal latent structure of the assessment instruments (Geisinger et al., 2013; Dimitrov, 2010). Based on the EFA results, a 1-factor CFA for Unit 1 and a 2-factor CFA for Unit 2 were used as the hypothesized structures.
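A light-weight way to probe the same dimensionality question in R is EFA on polychoric correlations. The sketch below is illustrative only: it assumes the psych package and a hypothetical data frame items of the eight scored responses at a single time point, and it ignores the two-time-point structure of the Mplus analysis reported above.

# Hedged sketch: EFA on polychoric correlations, assuming the R
# package psych; 'items' is a hypothetical single-time-point data
# frame of the eight ordered item scores.
library(psych)

for (k in 1:2) {
  efa_k <- fa(items, nfactors = k, rotate = "varimax",
              fm = "minres", cor = "poly")  # polychoric correlations
  print(efa_k$loadings, cutoff = 0.3)
  # inadmissible solutions (e.g., Heywood cases) surface as warnings
  # or as communalities above 1
}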
The CFA was conducted in Mplus software (code provided in the Appendix) using the reserved 60% of the sample (577 students), separately for the pre- and post-test data for Units 1 and 2, to test the plausibility of the hypotheses suggested by EFA. The CFA model fit results for both units on the pre- and post-assessments are shown in Table 6 below.

Table 1.6 Summary of CFA model fit for unit 1 pre/post and unit 2 pre/post

Parameter    χ2    χ2 p value  CFI/TLI      RMSEA
Unit 1 pre   49.5  <0.001      0.997/0.996  0.041
Unit 1 post  61    <0.001      0.998/0.998  0.048
Unit 2 pre   82    <0.001      0.984/0.978  0.040
Unit 2 post  119   <0.001      0.996/0.995  0.053

Following the same model fit guidelines used for EFA, model fit is acceptable for all four models based on the CFI/TLI > 0.900 and RMSEA < 0.08 criteria (McDonald & Ho, 2002; Kline, 2010; Van de Schoot et al., 2012). CFA-based measurement invariance analysis is conducted next to ascertain that the same constructs are measured over time for Units 1 and 2.

Measurement Invariance Analysis

The latent factor structure for the configural CFA-based measurement invariance analysis for the Unit 1 assessment is shown in Figure 4. One-way arrows from a latent factor (e.g., F1T1) to items (e.g., Q1T1, Q2T1) represent factor loadings at each time point (T1 for pre, T2 for post).

Figure 1.4 Measurement invariance model for unit 1 assessment instrument

Figure 1.5 Measurement invariance model for unit 2 assessment instrument

The latent construct F1T1 represents 3D Understanding of Coulomb's Law. Figure 5 shows the configural measurement invariance model for Unit 2. The two latent constructs for Unit 2 are F1T1 (3D Understanding of Energy) and F2T1 (3D Understanding of Chemical Reactions). To assess measurement invariance across time, a series of four nested hierarchical models is tested: configural/form invariance, weak/loading invariance, strong/threshold invariance, and strict invariance (Van de Schoot, Lugtig, & Hox, 2012; Liu, Millsap, West, Tein, Tanaka, & Grimm, 2017; Dimitrov, 2010).

The configural invariance model represents the most basic type of invariance and tests the hypothesis that items load on the same constructs across time. In the configural model, factor loadings, intercepts, and the unique factor variance matrix are freely estimated across time (Liu et al., 2017). Once configural invariance is established, the subsequent models are estimated by sequentially adding constraints to these three sets of parameters. If configural invariance cannot be established, the latent construct of interest is not represented by the same number of factors and the same pattern of loadings across time, indicating that the construct changes over time; higher-order invariance then cannot be tested (Van de Schoot et al., 2012).

Weak (or loading) invariance tests the hypothesis that factor loadings are equal across time (Van de Schoot et al., 2012; Liu et al., 2017; Dimitrov, 2010). Factor loadings for the same items are therefore constrained to be equal on the pre- and post-test. A factor loading reflects the degree to which differences in students' responses to an item reflect differences in their levels on the construct being measured. To assess weak invariance, the plausibility of the equal-loading constraint is tested using the DIFF test function in Mplus to compare the weak invariance model fit to the configural invariance model fit (Asparouhov, Muthén, & Muthén, 2006).
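An analogous comparison can be sketched in R with lavaan, whose lavTestLRT() performs a scaled chi-square difference test for nested WLSMV models. The sketch below is illustrative only: 'wide' is a hypothetical data frame with pre (QkT1) and post (QkT2) item scores, and the threshold and marker-variable identification constraints detailed in the Appendix are omitted for brevity.

# Hedged sketch: configural vs. weak (equal-loading) longitudinal
# models and a scaled difference test, assuming lavaan; 'wide' is a
# hypothetical data frame; identification details are simplified.
library(lavaan)

items <- paste0("Q", 1:8)
# lagged residual covariances: the same item measured at both times
lagged <- paste0(items, "T1 ~~ ", items, "T2", collapse = "\n")

configural <- paste(
  "F1T1 =~", paste0(items, "T1", collapse = " + "), "\n",
  "F1T2 =~", paste0(items, "T2", collapse = " + "), "\n", lagged)

lab <- paste0("l", 1:8, "*")  # equality labels for the loadings
weak <- paste(
  "F1T1 =~", paste0(lab, items, "T1", collapse = " + "), "\n",
  "F1T2 =~", paste0(lab, items, "T2", collapse = " + "), "\n", lagged)

fit_conf <- cfa(configural, data = wide, ordered = TRUE,
                estimator = "WLSMV", parameterization = "theta")
fit_weak <- cfa(weak, data = wide, ordered = TRUE,
                estimator = "WLSMV", parameterization = "theta")
lavTestLRT(fit_conf, fit_weak)  # analogue of the Mplus DIFF test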
The DIFF test allows accurate comparison of model fit between nested ordered-categorical CFA models (Liu et al., 2017). If there is no significant difference in model fit, factor loadings are invariant across administrations.

For strong invariance, both factor loadings and thresholds (indicator intercepts, or means) have to be equal across administrations. Strong invariance allows factor means to be compared across administrations: it indicates that differences in observed mean scores on the items on the pre- and post-test can be attributed to differences in the latent common factor means (Liu et al., 2017). To establish strong invariance, the additional constraint of equal thresholds is imposed and tested by comparing the strong invariance model fit to the weak invariance model fit using the DIFF test. If the difference in fit is not significant, strong invariance is supported.

Finally, strict invariance sets the corresponding factor loadings, intercepts, and unique factor variances equal over time. The strict invariance model tests whether residual error variance is equivalent across administrations (Van de Schoot et al., 2012). Strict invariance is evaluated in the same fashion, by conducting a DIFF test comparing the fit of the strict invariance model to the strong invariance model. If strict invariance is supported, the construct is measured with the same precision across administrations, and the difference in pre- and post-test scores on every item is due only to the difference in the level of the factor. Details of measurement invariance model identification and the Mplus code are provided in the Appendix.

Further, if the DIFF test for a given measurement invariance model is significant, modification indexes are examined for the corresponding parameters to decide which of the constrained parameters can be freed to achieve better model fit (Dimitrov, 2010). The modification index (MI) for a given parameter indicates the expected drop in the model's chi-square value if that parameter is freely estimated; generally, MI > 3.84 indicates statistical significance (Dimitrov, 2010). If MIs above 3.84 were identified for a measurement invariance model with a significant DIFF test but acceptable RMSEA and CFI values, the corresponding parameters were freed, starting from the largest MI, until a nonsignificant DIFF test chi-square value was obtained. Higher-level invariance was then tested, with the DIFF test comparing the fit of the higher-order model to the modified lower-level invariance model (Dimitrov, 2010).

Results

Reliability

Table 7 shows the reliability coefficients. In general, the tests are highly reliable, with the post-test being more reliable than the pre-test for the given sample of examinees for both Units 1 and 2.

Table 1.7 Reliability

Unit              EFA Reliability Coefficient
Unit 1 pre test   0.872
Unit 1 post test  0.934
Unit 2 pre test   0.823
Unit 2 post test  0.932

Unit 1 EFA model fit analysis is shown in Table 8. If, contrary to the theory proposed in the Framework, the three dimensions of science (DCI, SEP, CC) are distinct constructs rather than integral parts of student understanding of the same complex construct, a better model fit should be observed for models with a larger number of factors and potential cross-loadings than for the theoretically suggested 1-factor model. The plausibility of 1- to 4-factor structures was tested, and only the 1-factor model converged to an admissible solution.
Model fit for the 1-factor EFA model for Unit 1 is shown in Table 8. The chi-square model fit test for the 1-factor solution rejects the proposed hypothesis. However, the chi-square test is sample sensitive: it rejects reasonable models if the sample size is large (Kline, 2010; McDonald & Ho, 2002).

Table 1.8 Two time points EFA model fit for unit 1 assessment

Parameter  χ2     χ2 p-value  CFI/TLI      RMSEA
1 Factor   174.2  <0.001      0.995/0.994  0.047

Therefore, other indexes were used to evaluate model fit, including CFI/TLI and RMSEA. As suggested in the literature, CFI/TLI > 0.900 and RMSEA < 0.08 were used as criteria (McDonald & Ho, 2002; Kline, 2010; Van de Schoot et al., 2012). Based on these criteria, the model fit for the 1-factor model is acceptable. Table 9 shows that all factor loadings for both the Unit 1 pre- and post-assessments are above 0.5 (Hair, Black, Babin, Anderson, & Tatham, 2009), suggesting that each item measures the dimension of interest reasonably well.

Table 1.9 Two time points EFA factor loadings for unit 1 assessment

Pre and Post Unit 1 Test  Factor Loading Estimate  Standard Error
Q1                        0.724                    0.032
Q2                        0.760                    0.026
Q3                        0.907                    0.013
Q4                        0.897                    0.015
Q5                        0.908                    0.015
Q6                        0.841                    0.018
Q7                        0.876                    0.018
Q8                        0.834                    0.019

The Unit 2 two-time-point EFA was conducted in the same way. Latent structures with 1-4 factors were explored, and only the 2-factor solution converged to an admissible solution. Following the same guidelines as before, the theoretically proposed 2-factor solution has good model fit, as shown in Table 10. Table 11 indicates that the theoretically proposed factor loading structure is also observed in the EFA: questions 1-5 load on factor 1 with loadings all above 0.5, and questions 6-8 load on factor 2 with loadings all above 0.5.

Table 1.10 Two time points EFA model fit for unit 2 assessment

Parameter  χ2    χ2 p-value  CFI/TLI      RMSEA
2 Factors  98.4  0.2540      0.999/0.998  0.017

There are no cross-loadings above 0.5, supporting the plausibility of the theoretically proposed latent structure. To summarize, the two-time-point EFA showed that the theoretically proposed 1-factor model for Unit 1 and 2-factor model for Unit 2 are plausible.

Table 1.11 Two time points EFA factor loadings for unit 2 assessment

Item  Loading F1  SE(Loading) F1  Loading F2  SE(Loading) F2
Q1    0.692       0.071           0.233       0.072
Q2    0.651       0.075           0.276       0.082
Q3    0.755       0.072           0.100       0.079
Q4    0.693       0.065           0.239       0.065
Q5    0.541       0.099           0.294       0.100
Q6    0.277       0.082           0.606       0.077
Q7    -0.052      0.049           0.909       0.066
Q8    0.183       0.089           0.696       0.092

Unit 1 Measurement Invariance Analysis Results

Table 12 shows the results of the measurement invariance analysis for Unit 1. Configural invariance has good model fit (CFI/TLI > 0.995, RMSEA < 0.05). Weak invariance is supported based on the DIFF test. For all models, RMSEA is below 0.05 (Rutkowski & Svetina, 2017) and CFI/TLI are above 0.995 (Asparouhov et al., 2006), indicating good model fit.

Table 1.12 Measurement invariance analysis for unit 1 pre/post assessment

Parameter   χ2       df   CFI    TLI    RMSEA  DIFF test  DIFF test df  DIFF test p
Configural  148.1    93   0.998  0.997  0.032  N/A        N/A           N/A
Weak        160      100  0.998  0.997  0.032  13.1       7             0.0529
Strong(1)   171.993  105  0.997  0.997  0.033  13.7       5             0.0177
StrongM(2)  164.3    104  0.998  0.997  0.032  6.6        4             0.1595
Strict(3)   194.9    111  0.997  0.996  0.036  29.5       7             <0.001
StrictM(4)  178.1    110  0.997  0.997  0.033  17.032     6             0.0092
StrictM2(5) 166.5    109  0.998  0.997  0.030  7.763      5             0.1698

(1) All thresholds fixed; (2) Threshold for item 3 freed; (3) Residual variance for item 3 freed only; (4) Residual variances for items 1 and 3 freed; (5) Residual variances for items 1, 2, and 3 freed.
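Alongside the DIFF test, the changes in approximate fit indexes between nested models can be computed directly; the following continues the hypothetical lavaan set-up from the earlier sketch.

# Hedged sketch: change in scaled CFI and RMSEA between nested
# invariance models (values near zero support invariance)
idx <- c("cfi.scaled", "rmsea.scaled")
round(fitMeasures(fit_weak, idx) - fitMeasures(fit_conf, idx), 4)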
Even though strong invariance did not yield a satisfactory chi-square difference test (p(DIFF) = 0.0177), the RMSEA value was below 0.05, ∆RMSEA was below 0.05, and the change in CFI was within the −0.004 criterion (Rutkowski & Svetina, 2017), all of which indicate good model fit. Examination of modification indexes for strong invariance showed a high modification index (8.395) for item 3. The threshold for item 3 was freed, which resulted in a nonsignificant DIFF test p value of 0.1595.

Strict invariance was evaluated next. The strict invariance model is nested in the StrongM model, and therefore the residual variance for item 3 was freed. The resulting p(DIFF) < 0.001 is significant. Modification indexes for residual variances indicated large MIs for items 1 and 2. The item 1 residual variance on the post-test was freed first, as it had the largest modification index (20.2), which resulted in a larger but still significant p(DIFF) = 0.0092. The residual variance of item 2 on the post-test was then freed (MI = 13.5), which resulted in a nonsignificant p(DIFF) = 0.1698, shown for model StrictM2 in Table 12.

Overall, based on the DIFF test, RMSEA, and CFI indexes, partial measurement invariance is supported for Unit 1, with invariance of all factor loadings, all thresholds except one for item 3, and all residual variances except three (items 1, 2, and 3). In total, the invariance of 35 parameters was evaluated (14 loadings, 12 thresholds, 9 unique variances), and 5 were freed during the evaluation. The proportion of invariant parameters is therefore 30/35 = 0.85, or 85%, and 15% of parameters were freed. According to the literature, if 20% or fewer of the parameters are freed in the process of establishing partial measurement invariance, the results can be used for practical applications (Dimitrov, 2010). Establishing partial measurement invariance for Unit 1 thus provides additional evidence for the validity of the internal latent structure of the instrument and indicates that differences in student performance on the pre- and post-unit 1 tests can be attributed to differences in the level of the latent factor measured.

Unit 2 Measurement Invariance Analysis Results

As shown in Table 13, configural invariance has good model fit: RMSEA below 0.055 and CFI/TLI above 0.995 (Rutkowski & Svetina, 2017; Asparouhov et al., 2006). Next, the DIFF test for weak invariance, with all loadings constrained equal across time, yielded a significant p < 0.001. However, the weak invariance model has good fit based on the RMSEA value (below 0.05; Rutkowski & Svetina, 2017) and acceptable fit based on CFI/TLI > 0.9 (Asparouhov et al., 2006). Further, ∆RMSEA was below 0.05 and the change in CFI was within the −0.004 criterion (Rutkowski & Svetina, 2017), providing evidence to support weak invariance. Modification indexes for factor loadings showed a high value for item 5 (29.1); freeing the loading for item 5 resulted in a nonsignificant p(DIFF) = 0.1649 for model WeakM.

Strong invariance was examined next. The strong invariance model is nested in the WeakM model, and therefore the threshold for item 5 was freed. The strong invariance model has a significant p(DIFF) = 0.0022 but good fit based on RMSEA, CFI, ∆RMSEA, and ∆CFI. Examination of modification indexes showed a high MI for item 1 (16.3). The threshold parameter for item 1 was freed, resulting in a nonsignificant p(DIFF) = 0.6433 for StrongM.
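The modification-index step described above can likewise be sketched in R. In lavaan, modification indices for equality constraints correspond to univariate score tests, exposed via lavTestScore(); this continues the hypothetical set-up (the study itself used Mplus MODINDICES output).

# Hedged sketch: score tests for releasing each equality constraint in
# the weak invariance model; values above 3.84 correspond to p < .05
# on 1 df, mirroring the MI > 3.84 rule used in this chapter.
st <- lavTestScore(fit_weak)
st$uni  # one row per constraint: X2, df, p-value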
Table 1.13 Measurement invariance analysis for unit 2 pre/post assessment

Parameter   χ2       df   CFI    TLI    RMSEA  DIFF test  DIFF test df  DIFF test p
Configural  138.419  90   0.997  0.996  0.031  N/A        N/A           N/A
Weak(1)     172.4    96   0.995  0.994  0.037  31.4       6             <0.001
WeakM(2)    144.2    95   0.997  0.996  0.030  7.847      5             0.1649
Strong(3)   161.2    100  0.996  0.995  0.033  18.6       5             0.0022
StrongM(4)  145.6    99   0.997  0.996  0.029  2.508      4             0.6433
Strict      159.5    105  0.997  0.996  0.030  15.9       6             0.0141

(1) All factor loadings fixed; (2) Factor loading for item 5 on the pre- and post-test freed; (3) All intercepts except for item 5 fixed; (4) Intercepts freed for items 5 and 1.

Finally, strict invariance was examined. The DIFF test p value was significant, indicating unsatisfactory fit by that criterion. However, RMSEA < 0.05, CFI/TLI > 0.9, ∆RMSEA < 0.05, and a change in CFI within the −0.004 criterion indicate good model fit for the strict invariance model. No modification indexes could have improved model fit at the strict invariance level. Therefore, there is evidence for strict invariance based on the CFI/TLI and RMSEA indexes, but not based on the DIFF test.

Following the same procedure as described for Unit 1, the invariance of 33 parameters was evaluated (12 loadings, 12 thresholds, 9 unique variances), and 6 parameters were freed during estimation. A total of (33 − 6)/33 = 0.82, or 82%, of the parameters were invariant, and 18% were freed. Since fewer than 20% of the parameters were freed in achieving partial strong measurement invariance for Unit 2, the results are acceptable for practical applications (Dimitrov, 2010). This analysis provides evidence for partial measurement invariance: invariance of the latent factor structure on the pre- and post-test (configural invariance), invariance of all factor loadings except for item 5 (weak invariance), and invariance of all thresholds except for items 1 and 5 (strong invariance). There is also evidence of equality of residual errors across administrations (strict invariance) based on the values and magnitudes of change of the RMSEA and CFI/TLI indexes (Rutkowski & Svetina, 2017; Asparouhov et al., 2006). Therefore, establishing partial measurement invariance provides additional evidence for the validity of the internal latent structure of the instrument and indicates that differences in student performance on the pre- and post-unit 2 tests can be attributed to differences in the level of the latent factors measured by the instrument.

Consistency of mECD argument with student performance on the test

The EFA and CFA-based measurement invariance analyses conducted above provide evidence for the validity of the theoretically suggested internal latent structure of both the Unit 1 and Unit 2 assessment instruments. Factor scores resulting from the CFA-based measurement invariance analysis reflect the level of student understanding of the latent construct being measured. The mECD argument hypothesizes that the scoring rubric for each item should be consistent with levels of the latent variable being measured, as reflected by factor scores: higher latent factor scores should correspond to higher observed (raw) scores on each item. If this trend is observed in practice, it provides an additional source of validity evidence based on response process. It is important to note that, on both the Unit 1 and Unit 2 pre- and post-tests, the highest score of 3 was not observed on any item.
A possible explanation is that, consistent with the developmental approach, student 3D understanding takes time to develop. By the end of Units 1 and 2, which are each approximately three months long, students had not yet developed the microscopic-level causal thinking required to achieve a score of 3 on each item. As they progress further in the curriculum, they will likely develop higher levels of 3D understanding.

The next step compares average factor scores on each item, which reflect student level of understanding of the latent construct being measured, with the observed scores on each item across time. Figure 6 shows the 95% confidence interval for factor scores at different levels of raw scores for each item on the pre- and post-unit 1 assessment. As the graph shows, the hypothesized trend of higher raw scores corresponding to higher factor scores holds on the pre- and post-test at each raw score level: for raw scores of 0, 1, and 2, the corresponding factor score for each of the eight items is consistently higher, and the separation between raw score levels is evident in the graph.

Figure 1.6 95% Confidence interval for factor scores at different level of observed (raw) score for unit 1 assessment instrument

Figure 1.7 Confidence interval (95%) for factor scores at different level of observed (raw) score for unit 2 assessment instrument

However, the degree of separation differs between the pre- and post-test. Specifically, on the pre-test the trend holds for raw scores of 0 and 1, but a raw score of 2 corresponds to only a slightly higher factor score than a raw score of 1 for all eight items, so the separation of levels suggested by the mECD argument is not very pronounced. A possible explanation is that on the pre-test, students' overall level of 3D understanding was much lower than the level measured by the assessment instrument, which is supported by the fact that most students scored very low on the pre-test. This is not surprising, since no prior knowledge of the construct of interest was assumed on the pre-test. It was therefore harder to differentiate between students of lower ability on the pre-test, likely due to the lack of sufficiently easy items measuring lower factor levels. This explanation is also supported by the fact that the pre-test had a slightly lower reliability coefficient than the post-test for both units, suggesting lower sensitivity of the pre-test, possibly due to the lack of easy items measuring lower-level understanding. In the future, it might be beneficial to include items measuring lower-level 3D understanding to ensure better level separation on the pre-test.

A similar trend holds for the Unit 2 assessment items shown in Figure 7. The pre-test likewise does not provide good separation between factor score levels as related to raw scores, especially for raw score categories 1 and 2 on both factors. The post-test, however, differentiates much better between factor score levels and provides clear level separation for each observed raw score category. The constructs of 3D understanding of Energy (Factor 1) and Chemical Reactions (Factor 2) were new to students at the beginning of Unit 2, and pre-unit 2 test performance was generally very poor. It is possible that, as with Unit 1, the Unit 2 test lacked sufficiently easy items to measure student 3D understanding at lower levels of the latent construct.
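The factor-score comparison shown in Figures 1.6 and 1.7 can be sketched as follows, continuing the hypothetical lavaan set-up (fit_rel and unit1_pre from the reliability sketch); this is illustrative and not the plotting code used to produce the figures.

# Hedged sketch: 95% confidence interval of the mean factor score at
# each observed raw-score level of an item, as in Figures 1.6/1.7.
fscores <- lavPredict(fit_rel)[, 1]  # model-based factor scores

ci_by_raw <- function(item) {
  sapply(sort(unique(item)), function(s) {
    fs <- fscores[item == s]
    m  <- mean(fs); se <- sd(fs) / sqrt(length(fs))
    c(raw = s, lower = m - 1.96 * se, mean = m, upper = m + 1.96 * se)
  })
}
ci_by_raw(unit1_pre$Q1)  # higher raw scores should give higher intervals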
Developing and including easier items of this kind in future test revisions would be useful. In general, the Unit 1 and Unit 2 post-test results are consistent with the mECD argument regarding the hypothesized distribution of difficulty levels for each item, which provides validity evidence based on response process for both assessment instruments.

Notably, the factor scores from the analysis software were not standardized to have a mean of 0.0 and a standard deviation of 1.0; other constraints were applied to set the scale of the solution. As a result, the mean pre- and post-test factor scores are not directly comparable. However, since partial strong invariance is supported, this comparison becomes possible once proper standardization of factor scores is carried out. This extension of the analysis is beyond the scope of the current study and will be investigated in the studies that follow.

Discussion

Developing valid assessments for evaluating student understanding of complex constructs is essential for successful implementation of the NGSS and for helping students develop 21st-century skills and competencies. To draw accurate conclusions about student understanding, it is necessary to provide sound validity arguments grounded in both theoretical and empirical evidence (Kane, 2016; Messick, 1995). The Framework for K-12 Science Education describes a complex construct in science as the ability to blend Scientific and Engineering Practices (SEPs) and Crosscutting Concepts (CCCs) to make sense of the Disciplinary Core Ideas (DCIs) in the context of real-life phenomena. This ability is achieved through the process of 3D learning, grounded in the developmental approach and situated cognition theories. To make inferences about student progress in 3D learning, validity evidence is needed that shows how the theory behind complex 3D constructs relates to observed student response patterns. An argument-based approach with elements of evidence-centered design (ECD) has been suggested by multiple researchers for developing assessments of complex constructs (Pellegrino et al., 2014; Harris et al., 2019; Huff, Steinberg, & Matts, 2010). The mECD approach makes it possible to clearly specify, in a theoretical argument, the aspects of content, skills, and types of responses that reflect levels of understanding of a given complex construct (Mislevy, 2009; Mislevy & Haertel, 2006).

Previous studies have used the ECD approach to develop 3D assessments and have shown, using student interviews and detailed analysis of student response patterns, that the theoretical objectives outlined in an ECD argument are supported by student response data (DeBarger, 2016; Gane, 2018; Gane, 2019). These studies provide rich evidence of the usefulness of ECD-based methods for designing valid 3D assessments and give educators valuable information about how student 3D understanding develops in the classroom. However, to draw more quantitative inferences about student progress on complex 3D constructs, it is necessary to conduct mathematical modeling to obtain standardized item parameters, including item difficulty, using IRT approaches, for example. Such analysis can further be used to develop standardized measures of student progress for both diagnostic and high-stakes assessment. To select a mathematical model that will provide accurate item parameters, it is essential to investigate the dimensionality of a given assessment instrument.
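As one concrete illustration of such a next step (not an analysis performed in this chapter), a unidimensional graded response model could be fit in R, assuming the mirt package and the hypothetical item data used in the earlier sketches.

# Hedged sketch: graded response model yielding discrimination (a)
# and difficulty (b) parameters per item, assuming the R package mirt.
library(mirt)

grm <- mirt(items, model = 1, itemtype = "graded")
coef(grm, IRTpars = TRUE, simplify = TRUE)$items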
Complex constructs such as those that arise from the 3D learning process are inherently multidimensional because they combine elements of a DCI, an SEP, and a CCC (Gorin & Mislevy, 2013). However, the Framework also suggests that the blending of the three dimensions reflects deep understanding of science, implying that complex NGSS constructs may manifest as a single latent construct.

The studies on 3D assessment development and validation conducted so far assume unidimensional 3D constructs and use unidimensional IRT models to obtain item parameters (DeBarger, 2016; Gane, 2018; Gane, 2019). However, if the assumption of unidimensionality does not hold, the obtained item parameters cannot be used to draw valid conclusions about student performance. To the author's knowledge, this study represents the first investigation of the dimensionality of a 3D assessment instrument.

The current work builds on previously published work using the mECD approach (DeBarger, 2016; Gane, 2018; Gane, 2019; Harris et al., 2019) to develop assessment instruments grounded in 3D learning theories and aligned to the NGSS. It further presents a methodology for evaluating the internal latent structure of 3D assessment instruments developed using the mECD approach, showing that the theoretical dimensionality specified by the mECD argument is confirmed in practice. Specifically, it shows that the blending of the three dimensions of NGSS (DCI, SEP, CC) suggested by the Framework is confirmed by empirical evidence.

This work provides multiple sources of evidence for the validity of this assumption. First, two-time-point EFA is used to show that the most plausible latent structure for both assessment instruments is the one hypothesized by the mECD argument. Since the analysis is exploratory, it allows investigating the possibility of DCI, SEP, and CC manifesting as separate latent constructs by estimating 1-, 2-, and 3-factor EFA models. The results suggest that for Unit 1, the most plausible latent structure of the assessment instrument is unidimensional, with all loadings above 0.5 and small standard errors (Tables 8 and 9). Similarly, for Unit 2, the most plausible latent structure is two-dimensional, as suggested by the mECD argument, with the loading pattern supporting the hypothesized structure (Table 11). Further, CFA-based measurement invariance was used to gain additional evidence for the plausibility of the latent structure suggested by EFA. Since CFA-based measurement invariance requires rigid specification of the latent structure, it provides confirmatory evidence for the validity of the internal latent structure hypothesized by the mECD argument. Also, since partial measurement invariance is supported for both units, student performance on the pre- and post-tests can be compared to draw more accurate conclusions about how student 3D understanding developed during each unit, which will be the focus of future work. Establishing partial measurement invariance also serves as a source of validity evidence based on internal latent structure (Geisinger et al., 2013; Dimitrov, 2010). Once the evidence for latent dimensionality is evaluated, CFA-based factor scores are compared with the item difficulty hypothesized by the mECD argument, as reflected in the observed scores.
As can be seen in Figures 6 and 7, for both units increasing factor scores for all items correspond to increasing observed scores, suggesting that the assumptions of the mECD argument are indeed supported by student response data. This provides additional validity evidence based on response process.

The dimensionality analysis presented here is the first study investigating the relationship between the theoretical dimensionality suggested by 3D learning theories and the empirically tested latent dimensionality of student response data. The results suggest that 3D tasks are better described by single-factor models, which has several implications for the assessment of 3D constructs.

The first implication concerns how the NGSS dimensions that comprise 3D tasks should be analyzed. One of the major assessment challenges for 3D tasks, as mentioned above, is evaluating their dimensionality, which is a prerequisite for accurate scaling and reporting of any assessment. Since these tasks are inherently multidimensional, and the various NGSS dimensions are likely to contribute to individual items and to the overall assessment to different extents, it has been suggested that NGSS assessments will most likely present a complex multidimensional structure that is difficult, if not impossible, to handle with present-day measurement techniques (Gorin & Mislevy, 2013). The current work suggests that the three dimensions in fact manifest as a single latent construct, and therefore 3D tasks need to be analyzed as a whole rather than as separate dimensions, making unidimensional models a potentially appropriate modeling tool for 3D tasks. This is an important implication because it could make unidimensionality, which is always the goal in educational testing, considerably more achievable, or significantly reduce the dimensionality of multidimensional tests, as demonstrated for the Unit 2 assessment instrument in this study. This would make psychometric analysis of NGSS assessments more feasible in practice by reducing the time and resources required for such analysis.

The second implication relates to the cognitive and instructional inferences that can be made based on the current work. The finding that 3D tasks are better described by single-factor models indicates the complexity of 3D constructs: a 3D construct combines the three dimensions of NGSS (SEPs, CCCs, and DCIs), as opposed to being three separate constructs. A 3D construct is therefore a complex conceptual dimension that manifests as a single psychometric dimension. This finding is consistent with the situated cognition premise stated in the Framework and NGSS, which holds that learning content (DCIs) is inseparable from engaging in practices (SEPs) and crosscutting concepts (CCCs) along with the content. This has important instructional implications, including that students cannot gain deep understanding of the content (DCIs) without the context (CCCs and SEPs). For instance, if a student can construct a model in physical science, it does not mean that the student can construct a model in the context of biology, since the content in the two cases is very different. For assessment design, this finding suggests that content should not be measured separately from context, and that the three dimensions of NGSS should be integrated in both instruction and assessment.
The third implication relates to the need to follow a systematic process in assessment design to ensure alignment between the NGSS standards and the assessment items. In the context of this work, it is important to emphasize that while the assessment items were administered in the context of the "Interactions" curriculum, the items were designed to align to NGSS PEs, not to the curriculum learning goals. As shown in the methods section, NGSS PEs were carefully unpacked using the mECD process to specify the aspects of the three dimensions targeted in the assessment. I believe that the results obtained in this study are due in part to the good alignment between the NGSS PEs and the assessment instruments that resulted from following the mECD process. While the mECD process is somewhat time-consuming at the initial development stage, because it requires careful detailing of what will be assessed and how it relates to NGSS PEs, the result is well-aligned assessments that provide accurate information about the degree of student understanding of NGSS PEs and, as this work suggests, demonstrate the good psychometric properties that are essential for making valid conclusions based on assessment results. Similarly, the "Interactions" curriculum was built following the same principles of aligning NGSS PEs and curriculum learning goals. The assessment was unpacked separately from the curriculum, with a new set of mECD arguments designed for the assessment, ensuring that the assessment was entirely independent of the curriculum. The unpacking through the mECD process, however, ensured that the entire system, including the curriculum and the assessment, is tightly aligned to the NGSS, which produced the informative outcomes demonstrated here, specifically concerning the dimensionality of the assessment instruments.

If assessment developers do not follow systematic procedures like mECD (Harris et al., 2019), the alignment of the resulting assessments with the standards remains largely unspecified and implied, and the assessments essentially become a "black box." As a result, inferences made from those assessments about student progress toward mastering NGSS PEs will have little validity, which is a fundamental property of any well-designed assessment (Messick, 1995). As suggested by multiple documents, alignment between standards, assessment, and curriculum, as well as instruction and professional development, is a critical feature of any well-functioning system of assessment (National Academies of Sciences, Engineering, and Medicine, 2019; NRC, 2013a; NRC, 2007). To achieve good alignment, it is critical to follow a systematic process such as mECD (Harris et al., 2019). Lack of alignment results in a broken system in which curriculum, instruction, and assessment are disconnected and do not share the same learning goals. The need for alignment is not unique to the context described in this work, but applies to other fields as well.

One of the limitations of this work is that it focuses on a very small number of NGSS PEs, including a small number of DCIs, SEPs, and CCs. To ensure the reproducibility of the results presented here, studies need to be conducted both in other domains and with a different set of NGSS PEs. Additionally, the current work did not examine the instructional aspect of alignment specifically, and therefore no conclusions can be drawn about fidelity of implementation of the "Interactions" curriculum.
However, as part of implementing the "Interactions" curriculum, summer professional development (PD) sessions were conducted with teachers, including a 3-day PD that focused on demonstrating the fundamental design principles of the curriculum and engaging teachers in student activities. For example, teachers experienced various electrostatic phenomena and went through the process of modeling and revising their models as part of the learning process, to ensure they understood what is required of their students. Additionally, "Interactions" had extensive teacher materials available to teachers at any time. In the future, however, fidelity of implementation is important to examine, to ensure that the results of the study are reproducible irrespective of the implementation context.

Another important point to emphasize is that dimensionality is a function of instruction and of item sensitivity to instruction (Lord, 1976). Since the assessment instruments presented here were validated in the context of the "Interactions" curriculum, it is possible that, with a different population of students who did not experience the "Interactions" curriculum, the dimensionality of both assessment instruments might be different. However, this would not necessarily disqualify the assumptions about dimensionality reflected in the mECD argument. Different dimensionality would imply that students' ability to blend the three NGSS dimensions assessed on the test is not as uniform in a population that did not go through the "Interactions" curriculum. At the same time, the assessment items used in this study were aligned to NGSS PEs. Thus, it is reasonable to assume that, for a population of students who had instruction on the same DCIs and SEPs as those targeted by the items on the assessment instruments presented here, the response pattern should be similar to that of students who went through the "Interactions" curriculum, and therefore the dimensionality of both instruments should not change. Nevertheless, more research is needed on how the dimensions of complex NGSS constructs manifest in response patterns as a function of different instructional content (Gorin & Mislevy, 2013).

Related to the previous point, if a different set of SEPs and CCCs were chosen to evaluate student ability to integrate the three dimensions with the same DCIs as presented here, the resulting dimensionality might differ if students did not have the same opportunities to practice integrating the new SEPs and CCCs with the DCIs. In the language of situated cognition theory, if students did not have opportunities to practice integrating the DCIs with a different set of SEPs and CCCs than those presented here, these new SEPs and CCCs might contribute to the latent trait being measured to different degrees depending on students' familiarity with them, which might result in a different dimensionality of the assessment instrument. This is another important issue to investigate in future studies. However, returning to the Framework, which defines 3D learning as the ability to integrate SEPs and CCCs to make sense of DCIs, it becomes clear that possible multidimensionality resulting from differing degrees of contribution of the NGSS dimensions to the overarching complex construct should be viewed as something to be avoided rather than aimed for.
This is because the Framework defines science proficiency as the ability to integrate the three dimensions, so attempting to measure the dimensions separately in an assessment is uninformative for evaluating student knowledge-in-use and goes against the very principles outlined in the Framework and NGSS. Instead, NGSS-aligned assessments should consist of tasks that are tightly integrated across the relevant DCIs, SEPs, and CCCs and that measure a well-defined complex construct of interest resulting from careful unpacking of NGSS PEs following a systematic process such as mECD (Harris et al., 2019). Only tasks designed following these principles will be informative for drawing conclusions about student ability to integrate SEPs and CCCs to make sense of DCIs, that is, their degree of 3D understanding, which is the ultimate goal of the Framework and NGSS.

APPENDIX

Reliability

The model-implied polychoric correlation matrix was computed in RStudio using the code below.

#####Pre Unit 1#####
# lambda: factor loading matrix (items x factors)
lambda <- matrix(c(0.780, 0.752, 0.939, 0.832, 0.887, 0.830, 0.808, 0.765), nrow = 8)
# fcor: latent factor correlation matrix (factors x factors);
# for Unit 2: matrix(c(1, 0.62, 0.62, 1), nrow = 2)
fcor <- matrix(c(1), nrow = 1)
# model-implied polychoric correlation matrix
POLYR <- lambda %*% fcor %*% t(lambda)
diag(POLYR) <- 1

#####Post Unit 1#####
lambda <- matrix(c(0.799, 0.795, 0.936, 0.922, 0.951, 0.921, 0.909, 0.884), nrow = 8)
fcor <- matrix(c(1), nrow = 1)
POLYR <- lambda %*% fcor %*% t(lambda)
diag(POLYR) <- 1

#####Pre Unit 2#####
# 8 x 2 loading matrix: items 1-5 load on factor 1, items 6-8 on factor 2
lambda <- matrix(c(0.780, 0.860, 0.716, 0.799, 0.876, 0, 0, 0,
                   0, 0, 0, 0, 0, 0.846, 0.902, 0.810), nrow = 8)
fcor <- matrix(c(1, 0.784, 0.784, 1), nrow = 2)
POLYR <- lambda %*% fcor %*% t(lambda)
diag(POLYR) <- 1

#####Post Unit 2#####
lambda <- matrix(c(0.862, 0.933, 0.883, 0.953, 0.834, 0, 0, 0,
                   0, 0, 0, 0, 0, 0.898, 0.917, 0.904), nrow = 8)
fcor <- matrix(c(1, 0.928, 0.928, 1), nrow = 2)
POLYR <- lambda %*% fcor %*% t(lambda)
diag(POLYR) <- 1

The polychoric correlation matrix computed above was then used as input for the reliability calculation in SAS, using the code provided below.
Unit 1 pre test SAS reliability code

proc iml;
RESET fuzz;
/* item 2 pre: second response category unobserved in this sample
   (see the measurement invariance identification notes below),
   hence the extreme second threshold */
THRESH={1.381 2.257, 1.621 19.1, 0.772 2.045, 1.043 1.865,
        0.879 2.708, 1.061 1.815, 1.010 2.198, 1.125 2.404};
LOAD={0.780, 0.752, 0.939, 0.832, 0.887, 0.830, 0.808, 0.765};
FACCOR={1};
POLY={1 0.586560 0.732420 0.648960 0.691860 0.64740 0.630240 0.596700,
      0.58656 1 0.706128 0.625664 0.667024 0.62416 0.607616 0.575280,
      0.73242 0.706128 1 0.781248 0.832893 0.77937 0.758712 0.718335,
      0.64896 0.625664 0.781248 1 0.737984 0.69056 0.672256 0.636480,
      0.69186 0.667024 0.832893 0.737984 1 0.73621 0.716696 0.678555,
      0.64740 0.624160 0.779370 0.690560 0.736210 1 0.670640 0.634950,
      0.63024 0.607616 0.758712 0.672256 0.716696 0.67064 1 0.618120,
      0.59670 0.575280 0.718335 0.636480 0.678555 0.63495 0.618120 1};
NTHRESH=Ncol(THRESH);
NCAT=NTHRESH+1;
NITEM=Nrow(LOAD);
NFACT=Ncol(LOAD);
POLYR=LOAD*FACCOR*T(LOAD);
do j=1 to NITEM;
  POLYR[j,j]=1;
end;
DIFFPOLY=POLY-POLYR;
Print NTHRESH[label="Number of Thresholds"], NITEM[label="Number of items"],
      NCAT[label="Number of response categories"], NFACT[label="Number of factors"],
      THRESH[label="Response Thresholds"], LOAD[label="Factor Loadings"],
      FACCOR[label="Factor Correlation Matrix"],
      POLY[label="Polychoric Correlation Matrix among Continuous Items"];
print "The matrix below is the difference between polychoric correlation matrix generated by factors and inputted polychoric correlation matrix. Nonzero values should represent the estimated correlated errors, as specified by the user, or an error in inputted data.";
print DIFFPOLY[label=" "];
sumnum=0;
addden=0;
do j=1 to NITEM;
  do jp=1 to NITEM;
    sumprobn2=0;
    addprobn2=0;
    do c=1 to NTHRESH;
      do cp=1 to NTHRESH;
        sumrvstar=0;
        do k=1 to NFACT;
          do kp=1 to NFACT;
            sumrvstar=sumrvstar+LOAD[j,k]*LOAD[jp,kp]*FACCOR[k,kp];
          end;
        end;
        sumprobn2=sumprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],sumrvstar);
        addprobn2=addprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],POLY[j,jp]);
      end;
    end;
    sumprobn1=0;
    sumprobn1p=0;
    do cc=1 to NTHRESH;
      sumprobn1=sumprobn1+CDF('NORMAL',THRESH[j,cc]);
      sumprobn1p=sumprobn1p+CDF('NORMAL',THRESH[jp,cc]);
    end;
    sumnum=sumnum+(sumprobn2-sumprobn1*sumprobn1p);
    addden=addden+(addprobn2-sumprobn1*sumprobn1p);
  end;
end;
reliab=sumnum/addden;
print sumnum[label="Numerator of Eq. (21)"],
      addden[label="Denominator of Eq. (21)"],
      reliab[label="Nonlinear SEM Reliability Coefficient"];
quit;
(21)"], reliab[label="Nonlinear SEM Reliability Coefficient"]; quit; Unit 1 post test SAS reliability code proc iml; RESET fuzz; THRESH={0.837 1.675, 0.714 1.507, 0.970 1.525, 0.815 1.464, 1.025 2.036, 0.732 1.923, 0.637 1.577, 0.743 1.503}; LOAD={0.862 0, 0.933 0, 0.883 0, 0.953 0, 0.834 0, 0 .898, 0 .917, 0 .904}; FACCOR={1 0.928, 0.928 1}; POLY={1.0 0.8042460 0.7611460 0.8214860 0.7189080 0.7183425 0.7335413 0.7231421, 0.8042460 1.0 0.8238390 0.8891490 0.7781220 0.7775100 0.7939606 0.7827049, 0.7611460 0.8238390 1.0 0.8414990 0.7364220 0.7358428 0.7514118 0.7407593, 0.8214860 0.8891490 0.8414990 1.0 0.7948020 0.7941768 0.8109801 0.7994831, 0.7189080 0.7781220 0.7364220 0.7948020 1.0 0.6950089 0.7097140 0.6996526, 0.7183425 0.7775100 0.7358428 0.7941768 0.6950089 1.0 0.8234660 0.8117920, 0.7335413 0.7939606 0.7514118 0.8109801 0.7097140 0.8234660 1.0 0.8289680, 0.7231421 0.7827049 0.7407593 0.7994831 0.6996526 0.8117920 0.8289680 1.0}; NTHRESH=Ncol(thresh); NCAT=NTHRESH+1; NITEM=Nrow(LOAD); NFACT=Ncol(LOAD); POLYR=LOAD*FACCOR*T(LOAD); do j=1 to NITEM; POLYR[j,j]=1; end; DIFFPOLY=POLY-POLYR; Print NTHRESH[label="Number of Thresholds"], NITEM[label="Number of items"], NCAT[label="Number of response categories"], NFACT[label="Number of factors"], THRESH[label="Response Thresholds"],LOAD[label="Factor Loadings"], FACCOR[label="Factor Correlation Matrix"], POLY[label="Polychoric Correlation Matrix among Continuous Items"] ; print "The matrix below is the difference between polychoric 49 correlation matrix generated by factors and inputted polychoric correlation matrix. Nonzero values should represent the estimated correlated errors, as specified by the user, or an error in inputted data."; print DIFFPOLY[label=" "]; sumnum=0; addden=0; do j=1 to NITEM; do jp=1 to NITEM; sumprobn2=0; addprobn2=0; do c=1 to NTHRESH; do cp=1 to NTHRESH; sumrvstar=0; do k=1 to NFACT; do kp=1 to NFACT; sumrvstar=sumrvstar+LOAD[j,k]*LOAD[jp,kp]*FACCOR[k,kp]; end; end; sumprobn2=sumprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],sumrvstar); addprobn2=addprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],POLY[j,jp]); end; end; sumprobn1=0; sumprobn1p=0; do cc=1 to NTHRESH; sumprobn1=sumprobn1+CDF('NORMAL',THRESH[j,cc]); sumprobn1p=sumprobn1p+CDF('NORMAL',THRESH[jp,cc]); end; sumnum=sumnum+(sumprobn2-sumprobn1*sumprobn1p); addden=addden+(addprobn2-sumprobn1*sumprobn1p); end; end; reliab=sumnum/addden; print sumnum[label="Numerator of Eq. (21)"], addden[label="Denominator of Eq. 
(21)"], reliab[label="Nonlinear SEM Reliability Coefficient"]; quit; Unit 2 pre test SAS reliability code proc iml; RESET fuzz; THRESH={1.748 2.311, 1.519 3.205, 1.816 2.857, 1.576 2.830, 1.442 2.560, 1.096 2.877, 0.995 2.464, 1.058 2.141}; LOAD={0.780 0, 0.860 0, 0.716 0, 0.799 0, 0.876 0, 0 .846, 0 .902, 0 .810}; FACCOR={1 0.784, 0.784 1}; POLY= {1.0 0.6708000 0.5584800 0.6232200 0.6832800 0.5173459 0.5515910 0.4953312, 0.6708000 1.0 0.6157600 0.6871400 0.7533600 0.5704070 0.6081645 0.5461344, 0.5584800 0.6157600 1.0 0.5720840 0.6272160 0.4748970 0.5063323 0.4546886, 0.6232200 0.6871400 0.5720840 1.0 0.6999240 0.5299479 0.5650272 0.5073970, 0.6832800 0.7533600 0.6272160 0.6999240 1.0 0.5810193 0.6194792 0.5562950, 0.5173459 0.5704070 0.4748970 0.5299479 0.5810193 1.0 0.7630920 0.6852600, 0.5515910 0.6081645 0.5063323 0.5650272 0.6194792 0.7630920 1.0 0.7306200, 50 0.4953312 0.5461344 0.4546886 0.5073970 0.5562950 0.6852600 0.7306200 1.0}; NTHRESH=Ncol(thresh); NCAT=NTHRESH+1; NITEM=Nrow(LOAD); NFACT=Ncol(LOAD); POLYR=LOAD*FACCOR*T(LOAD); do j=1 to NITEM; POLYR[j,j]=1; end; DIFFPOLY=POLY-POLYR; Print NTHRESH[label="Number of Thresholds"], NITEM[label="Number of items"], NCAT[label="Number of response categories"], NFACT[label="Number of factors"], THRESH[label="Response Thresholds"],LOAD[label="Factor Loadings"], FACCOR[label="Factor Correlation Matrix"], POLY[label="Polychoric Correlation Matrix among Continuous Items"] ; print "The matrix below is the difference between polychoric correlation matrix generated by factors and inputted polychoric correlation matrix. Nonzero values should represent the estimated correlated errors, as specified by the user, or an error in inputted data."; print DIFFPOLY[label=" "]; sumnum=0; addden=0; do j=1 to NITEM; do jp=1 to NITEM; sumprobn2=0; addprobn2=0; do c=1 to NTHRESH; do cp=1 to NTHRESH; sumrvstar=0; do k=1 to NFACT; do kp=1 to NFACT; sumrvstar=sumrvstar+LOAD[j,k]*LOAD[jp,kp]*FACCOR[k,kp]; end; end; sumprobn2=sumprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],sumrvstar); addprobn2=addprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],POLY[j,jp]); end; end; sumprobn1=0; sumprobn1p=0; do cc=1 to NTHRESH; sumprobn1=sumprobn1+CDF('NORMAL',THRESH[j,cc]); sumprobn1p=sumprobn1p+CDF('NORMAL',THRESH[jp,cc]); end; sumnum=sumnum+(sumprobn2-sumprobn1*sumprobn1p); addden=addden+(addprobn2-sumprobn1*sumprobn1p); end; end; reliab=sumnum/addden; print sumnum[label="Numerator of Eq. (21)"], addden[label="Denominator of Eq. 
(21)"], reliab[label="Nonlinear SEM Reliability Coefficient"]; quit; 51 Unit 2 post test SAS reliability code proc iml; RESET fuzz; THRESH={0.837 1.675, 0.714 1.507, 0.970 1.525, 0.815 1.464, 1.025 2.036, 0.732 1.923, 0.637 1.577, 0.743 1.503}; LOAD={0.862 0, 0.933 0, 0.883 0, 0.953 0, 0.834 0, 0 .898, 0 .917, 0 .904}; FACCOR={1 0.928, 0.928 1}; POLY={1.0 0.8042460 0.7611460 0.8214860 0.7189080 0.7183425 0.7335413 0.7231421, 0.8042460 1.0 0.8238390 0.8891490 0.7781220 0.7775100 0.7939606 0.7827049, 0.7611460 0.8238390 1.0 0.8414990 0.7364220 0.7358428 0.7514118 0.7407593, 0.8214860 0.8891490 0.8414990 1.0 0.7948020 0.7941768 0.8109801 0.7994831, 0.7189080 0.7781220 0.7364220 0.7948020 1.0 0.6950089 0.7097140 0.6996526, 0.7183425 0.7775100 0.7358428 0.7941768 0.6950089 1.0 0.8234660 0.8117920, 0.7335413 0.7939606 0.7514118 0.8109801 0.7097140 0.8234660 1.0 0.8289680, 0.7231421 0.7827049 0.7407593 0.7994831 0.6996526 0.8117920 0.8289680 1.0}; NTHRESH=Ncol(thresh); NCAT=NTHRESH+1; NITEM=Nrow(LOAD); NFACT=Ncol(LOAD); POLYR=LOAD*FACCOR*T(LOAD); do j=1 to NITEM; POLYR[j,j]=1; end; DIFFPOLY=POLY-POLYR; Print NTHRESH[label="Number of Thresholds"], NITEM[label="Number of items"], NCAT[label="Number of response categories"], NFACT[label="Number of factors"], THRESH[label="Response Thresholds"],LOAD[label="Factor Loadings"], FACCOR[label="Factor Correlation Matrix"], POLY[label="Polychoric Correlation Matrix among Continuous Items"] ; print "The matrix below is the difference between polychoric correlation matrix generated by factors and inputted polychoric correlation matrix. Nonzero values should represent the estimated correlated errors, as specified by the user, or an error in inputted data."; print DIFFPOLY[label=" "]; sumnum=0; addden=0; do j=1 to NITEM; do jp=1 to NITEM; sumprobn2=0; addprobn2=0; do c=1 to NTHRESH; do cp=1 to NTHRESH; sumrvstar=0; do k=1 to NFACT; do kp=1 to NFACT; sumrvstar=sumrvstar+LOAD[j,k]*LOAD[jp,kp]*FACCOR[k,kp]; end; end; sumprobn2=sumprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],sumrvstar); addprobn2=addprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],POLY[j,jp]); end; 52 end; sumprobn1=0; sumprobn1p=0; do cc=1 to NTHRESH; sumprobn1=sumprobn1+CDF('NORMAL',THRESH[j,cc]); sumprobn1p=sumprobn1p+CDF('NORMAL',THRESH[jp,cc]); end; sumnum=sumnum+(sumprobn2-sumprobn1*sumprobn1p); addden=addden+(addprobn2-sumprobn1*sumprobn1p); end; end; reliab=sumnum/addden; print sumnum[label="Numerator of Eq. (21)"], addden[label="Denominator of Eq. 
(21)"], reliab[label="Nonlinear SEM Reliability Coefficient"]; quit; 53 Two Time Point EFA Code MPlus EFA Code Unit 1 TITLE: Unit 1 EFA at two timepoints with factor loading invariance and correlated residuals across time DATA: FILE IS EFAforty.dat; VARIABLE: NAMES ARE STUID TCID Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; MODEL: f1 BY Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 (*t1 1); f2 BY Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2 (*t2 1); f1 WITH f2; Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 WITH Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; ANALYSIS: ROTATION=CF-VARIMAX; OUTPUT: TECH1 STANDARDIZED; EFA Code Unit 2 TITLE: Unit 2 EFA at two timepoints with factor loading invariance and correlated residuals across time DATA: FILE IS EFAforty.dat; VARIABLE: NAMES ARE STUID TCID Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; MODEL: f1-f2 BY Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 (*t1 1); f3-f4 BY Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2 (*t2 1); f1-f2 WITH f3-f4; Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1WITH Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; ANALYSIS: ROTATION=CF-VARIMAX; OUTPUT: TECH1 STANDARDIZED; 54 Measurement Invariance Model Identification The items on both unit 1 and unit 2 assessment represent ordered-categorical variables, and measurement invariance was evaluated following procedures described in Liu et al., 2017, with some modifications. Specifically, the following constrains were imposed for model identification purposes: 1. At the reference measurement occasion (pre-test was chosen to be the reference measurement occasion), common factor mean for Unit 1 was constrained to zero, and unique factor variances were constrained to one. On the post test, unique factor variances were freely estimated. For Unit 2, unique factor variances were constrained to one at the reference measurement occasion, but common factor mean were not estimated due to limited degrees of freedom. 2. On pre and post the same item is chosen as a marker variable and the factor loading of the marker variable is constrained to 1. Constraining the loading of the marker variable gives the latent common factor a scale that is in the same unit as one of the items chosen to be the marker variable (Liu et. al. 2017). Marker variable for longitudinal measurement invariance should have an invariant factor loading across all measurement occasions, and have at least two invariant thresholds (Liu et. al. 2017). Choosing marker variable to identify variance structure of the latent common factor in measurement invariance model identification Procedure described in Liu et. al. (2017) was used to identify variance structure of the latent common factor using marker variable approach. Specifically, confirmatory factor analysis (CFA) was conducted and factor loadings were examined on pre and post as well as thresholds for all the items to choose specific marker variables. Item 7 on unit 1 pre/post had the smallest difference in factor loading and the most invariant thresholds on pre and post test. 
Item 7 on the unit 1 pre/post test had the smallest difference in factor loading and the most invariant thresholds across pre and post. Therefore, item 7 was chosen as the marker variable (factor loading fixed at 1 on the pre- and post-test, and thresholds 1 and 2 set equal across pre and post). The same procedure was used to choose marker variables for the unit 2 pre/post test. However, since unit 2 has two latent common factors, a marker variable was chosen separately for each factor. Items 1-5 load on factor 1 in unit 2, so factor loadings and thresholds were examined for these items first; following similar guidelines, item 3 was chosen as the marker variable for factor 1. Items 6-8 load on factor 2 in unit 2, so factor loadings and thresholds were then examined for these items, and item 7 was chosen as the marker variable for factor 2.

Excluding a threshold from invariance analysis due to sample limitations

The total sample of 899 students was split 40%/60%. The 40% sample was used to conduct EFA, while the 60% sample was used to conduct the CFA-based measurement invariance analysis (a minimal sketch of such a split is shown below). It was observed that for item 2 on the unit 1 pre/post assessment, the second response category was not observed in the 60% random split. Therefore, that threshold was excluded from the measurement invariance analysis. See the measurement invariance code below for details.
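As a point of reference, a random 40/60 split like the one described above can be produced as follows. The file names and column layout are hypothetical, and the original split may well have been done in other software; this is only a sketch.

# Hypothetical sketch of a 40/60 random split of the student response data.
# File names and column layout are illustrative, not the original files.
import pandas as pd

df = pd.read_csv("unit1_item_scores.csv")        # one row per student
efa = df.sample(frac=0.40, random_state=2017)    # 40% subsample for EFA
cfa = df.drop(efa.index)                         # remaining 60% for CFA/invariance

# MPlus expects whitespace-delimited data with no header row.
efa.to_csv("EFAforty.dat", sep=" ", header=False, index=False)
cfa.to_csv("CFAsixty_U1.dat", sep=" ", header=False, index=False)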
Unit 1 measurement invariance code for MPlus

TITLE: Configural Longitudinal Invariance Model for unit 1 pre/post assessment
DATA: FILE is CFAsixty_U1.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
MODEL:
! Factor loadings;
Time1F1 BY Q1T1* Q2T1* Q3T1* Q4T1* Q5T1* Q6T1* Q7T1@1 Q8T1*; !unit 1 pre test items
Time2F1 BY Q1T2* Q2T2* Q3T2* Q4T2* Q5T2* Q6T2* Q7T2@1 Q8T2*; !unit 1 post test items
! Thresholds;
[Q1T1$1 Q1T2$1](1); !item 1 threshold 1 held equal across pre and post
[Q1T1$2 Q1T2$2]; !item 1 threshold 2
[Q2T1$1 Q2T2$1](2); !item 2 threshold 1
[Q2T2$2]; !item 2 pre threshold 2 excluded (not observed in the sample); only post threshold 2 used
[Q3T1$1 Q3T2$1](3); !item 3 threshold 1
[Q3T1$2 Q3T2$2]; !item 3 threshold 2
[Q4T1$1 Q4T2$1](4); !item 4 threshold 1
[Q4T1$2 Q4T2$2]; !item 4 threshold 2
[Q5T1$1 Q5T2$1](5); !item 5 threshold 1
[Q5T1$2 Q5T2$2]; !item 5 threshold 2
[Q6T1$1 Q6T2$1](6); !item 6 threshold 1
[Q6T1$2 Q6T2$2]; !item 6 threshold 2
[Q7T1$1 Q7T2$1](7); !item 7 threshold 1
[Q7T1$2 Q7T2$2](9); !item 7 threshold 2 (marker item: both thresholds held equal)
[Q8T1$1 Q8T2$1](8); !item 8 threshold 1
[Q8T1$2 Q8T2$2]; !item 8 threshold 2
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
! Common factor means;
[Time1F1@0 Time2F1*];
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 1 pre items
Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2; !unit 1 post items (freely estimated)
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q4T1 WITH Q5T1; !item correlation on pre test included following the mECD argument (items measure a similar aspect of the phenomenon in question)
Q4T2 WITH Q5T2; !item correlation on post test included following the mECD argument
OUTPUT: sampstat STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit1_configural.dat;

TITLE: Weak Longitudinal Invariance Model for unit 1 pre/post assessment
DATA: FILE is CFAsixty_U1.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
  DIFFTEST IS unit1_configural.dat;
MODEL:
! Factor loadings (held equal across time via labels 10-16);
Time1F1 BY Q1T1* (10) Q2T1* (11) Q3T1* (12) Q4T1* (13) Q5T1* (14) Q6T1* (15)
  Q7T1@1 Q8T1* (16); !unit 1 pre test items
Time2F1 BY Q1T2* (10) Q2T2* (11) Q3T2* (12) Q4T2* (13) Q5T2* (14) Q6T2* (15)
  Q7T2@1 Q8T2* (16); !unit 1 post test items
! Thresholds (same pattern as the configural model);
[Q1T1$1 Q1T2$1](1);
[Q1T1$2 Q1T2$2];
[Q2T1$1 Q2T2$1](2);
[Q2T2$2]; !item 2 pre threshold 2 excluded (not observed in the sample)
[Q3T1$1 Q3T2$1](3);
[Q3T1$2 Q3T2$2];
[Q4T1$1 Q4T2$1](4);
[Q4T1$2 Q4T2$2];
[Q5T1$1 Q5T2$1](5);
[Q5T1$2 Q5T2$2];
[Q6T1$1 Q6T2$1](6);
[Q6T1$2 Q6T2$2];
[Q7T1$1 Q7T2$1](7);
[Q7T1$2 Q7T2$2](9); !marker item: both thresholds held equal
[Q8T1$1 Q8T2$1](8);
[Q8T1$2 Q8T2$2];
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
! Common factor means;
[Time1F1@0 Time2F1*];
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 1 pre items
Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2; !unit 1 post items (freely estimated)
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q4T1 WITH Q5T1; !item correlation on pre test included following the mECD argument
Q4T2 WITH Q5T2; !item correlation on post test included following the mECD argument
OUTPUT: sampstat STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit1_weak.dat;
TITLE: Strong Longitudinal Invariance Model for unit 1 pre/post assessment
DATA: FILE is CFAsixty_U1.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
  DIFFTEST IS unit1_weak.dat;
MODEL:
! Factor loadings (held equal across time via labels 10-16);
Time1F1 BY Q1T1* (10) Q2T1* (11) Q3T1* (12) Q4T1* (13) Q5T1* (14) Q6T1* (15)
  Q7T1@1 Q8T1* (16); !unit 1 pre test items
Time2F1 BY Q1T2* (10) Q2T2* (11) Q3T2* (12) Q4T2* (13) Q5T2* (14) Q6T2* (15)
  Q7T2@1 Q8T2* (16); !unit 1 post test items
! Thresholds (second thresholds now also held equal across time);
[Q1T1$1 Q1T2$1](1);
[Q1T1$2 Q1T2$2](17);
[Q2T1$1 Q2T2$1](2);
[Q2T2$2](18); !item 2 pre threshold 2 excluded (not observed in the sample)
[Q3T1$1 Q3T2$1](3);
[Q3T1$2 Q3T2$2]; !item 3 threshold 2 freed to achieve (partial) strong invariance
[Q4T1$1 Q4T2$1](4);
[Q4T1$2 Q4T2$2](20);
[Q5T1$1 Q5T2$1](5);
[Q5T1$2 Q5T2$2](21);
[Q6T1$1 Q6T2$1](6);
[Q6T1$2 Q6T2$2](22);
[Q7T1$1 Q7T2$1](7);
[Q7T1$2 Q7T2$2](9);
[Q8T1$1 Q8T2$1](8);
[Q8T1$2 Q8T2$2](23);
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
! Common factor means;
[Time1F1@0 Time2F1*];
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 1 pre items
Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2; !unit 1 post items (freely estimated)
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q4T1 WITH Q5T1; !item correlation on pre test included following the mECD argument
Q4T2 WITH Q5T2; !item correlation on post test included following the mECD argument
OUTPUT: sampstat STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit1_strong.dat;
TITLE: Strict Longitudinal Invariance Model for unit 1 pre/post assessment
DATA: FILE is CFAsixty_U1.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
  DIFFTEST IS unit1_strong.dat;
MODEL:
! Factor loadings (held equal across time via labels 10-16);
Time1F1 BY Q1T1* (10) Q2T1* (11) Q3T1* (12) Q4T1* (13) Q5T1* (14) Q6T1* (15)
  Q7T1@1 Q8T1* (16); !unit 1 pre test items
Time2F1 BY Q1T2* (10) Q2T2* (11) Q3T2* (12) Q4T2* (13) Q5T2* (14) Q6T2* (15)
  Q7T2@1 Q8T2* (16); !unit 1 post test items
! Thresholds (same pattern as the strong model);
[Q1T1$1 Q1T2$1](1);
[Q1T1$2 Q1T2$2](17);
[Q2T1$1 Q2T2$1](2);
[Q2T2$2](18); !item 2 pre threshold 2 excluded (not observed in the sample)
[Q3T1$1 Q3T2$1](3);
[Q3T1$2 Q3T2$2]; !item 3 threshold 2 freed during strong invariance estimation
[Q4T1$1 Q4T2$1](4);
[Q4T1$2 Q4T2$2](20);
[Q5T1$1 Q5T2$1](5);
[Q5T1$2 Q5T2$2](21);
[Q6T1$1 Q6T2$1](6);
[Q6T1$2 Q6T2$2](22);
[Q7T1$1 Q7T2$1](7);
[Q7T1$2 Q7T2$2](9);
[Q8T1$1 Q8T2$1](8);
[Q8T1$2 Q8T2$2](23);
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
! Common factor means;
[Time1F1@0 Time2F1*];
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 1 pre items
Q1T2 Q2T2 Q3T2 Q4T2@1 Q5T2@1 Q6T2@1 Q7T2@1 Q8T2@1; !unit 1 post items
!Q3T2 unique variance freed because the strict invariance model is nested in the
!strong invariance model, and the Q3 threshold was freed during strong invariance
!model estimation above;
!Q1T2 and Q2T2 unique variances freed to achieve (partial) strict invariance;
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q4T1 WITH Q5T1; !item correlation on pre test included following the mECD argument
Q4T2 WITH Q5T2; !item correlation on post test included following the mECD argument
OUTPUT: sampstat STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit1_strict.dat;
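The four models above form a nested ladder (configural, weak, strong, strict), each compared with the previous, less restrictive model via a DIFFTEST chi-square difference test. The following is a schematic Python sketch of that decision sequence; the p-values are placeholders, not results from this study.

# Schematic sketch of the longitudinal invariance testing ladder.
# The p-values below are placeholders, not results from this study.
from typing import List, Tuple

# (model name, DIFFTEST p-value vs. the previous, less restrictive model)
ladder: List[Tuple[str, float]] = [
    ("configural", 1.00),   # baseline; no comparison
    ("weak",       0.31),   # equal loadings vs. configural
    ("strong",     0.08),   # plus equal thresholds vs. weak
    ("strict",     0.02),   # plus equal unique variances vs. strong
]

ALPHA = 0.05
supported = "configural"
for name, p in ladder[1:]:
    if p < ALPHA:
        print(f"{name} invariance rejected (p={p}); "
              f"free the offending parameters (partial invariance) or stop.")
        break
    supported = name
print(f"Highest level of full invariance supported: {supported}")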
Unit 2 measurement invariance code for MPlus

TITLE: Configural Longitudinal Invariance Model for unit 2 pre/post assessment
DATA: FILE is CFAsixty_U2.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1-Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  IDVAR = STUID;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
MODEL:
! Factor loadings;
Time1F1 BY Q1T1* Q2T1* Q3T1@1 Q4T1* Q5T1*;
Time1F2 BY Q6T1* Q7T1@1 Q8T1*;
Time2F1 BY Q1T2* Q2T2* Q3T2@1 Q4T2* Q5T2*;
Time2F2 BY Q6T2* Q7T2@1 Q8T2*;
! Thresholds;
[Q1T1$1 Q1T2$1](1); !item 1 threshold 1 held equal across pre and post
[Q1T1$2 Q1T2$2]; !item 1 threshold 2
[Q2T1$1 Q2T2$1](2); !item 2 threshold 1
[Q2T1$2 Q2T2$2]; !item 2 threshold 2
[Q3T1$1 Q3T2$1](3); !item 3 threshold 1
[Q3T1$2 Q3T2$2](11); !item 3 threshold 2 (marker item for factor 1: both thresholds held equal)
[Q4T1$1 Q4T2$1](4); !item 4 threshold 1
[Q4T1$2 Q4T2$2]; !item 4 threshold 2
[Q5T1$1 Q5T2$1](6); !item 5 threshold 1
[Q5T1$2 Q5T2$2]; !item 5 threshold 2
[Q6T1$1 Q6T2$1](7); !item 6 threshold 1
[Q6T1$2 Q6T2$2]; !item 6 threshold 2
[Q7T1$1 Q7T2$1](8); !item 7 threshold 1
[Q7T1$2 Q7T2$2](12); !item 7 threshold 2 (marker item for factor 2: both thresholds held equal)
[Q8T1$1 Q8T2$1](9); !item 8 threshold 1
[Q8T1$2 Q8T2$2]; !item 8 threshold 2
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
Time1F2 Time2F2 WITH Time1F2 Time2F2;
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 2 pre items
Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2; !unit 2 post items (freely estimated)
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q1T1 WITH Q2T1; !item correlation on pre test
Q1T2 WITH Q2T2; !item correlation on post test
OUTPUT: sampstat residual STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit2_configural.dat;
TITLE: Weak Longitudinal Invariance Model for unit 2 pre/post assessment
DATA: FILE is CFAsixty_U2.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1-Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  IDVAR = STUID;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
  DIFFTEST IS unit2_configural.dat;
MODEL:
! Factor loadings (held equal across time via labels 13-18);
Time1F1 BY Q1T1* (13) Q2T1* (14) Q3T1@1 Q4T1* (15) Q5T1* (16);
Time1F2 BY Q6T1* (17) Q7T1@1 Q8T1* (18);
Time2F1 BY Q1T2* (13) Q2T2* (14) Q3T2@1 Q4T2* (15) Q5T2*; !Q5T2 loading freed to achieve (partial) weak invariance
Time2F2 BY Q6T2* (17) Q7T2@1 Q8T2* (18);
! Thresholds (same pattern as the configural model);
[Q1T1$1 Q1T2$1](1);
[Q1T1$2 Q1T2$2];
[Q2T1$1 Q2T2$1](2);
[Q2T1$2 Q2T2$2];
[Q3T1$1 Q3T2$1](3);
[Q3T1$2 Q3T2$2](11); !marker item for factor 1
[Q4T1$1 Q4T2$1](4);
[Q4T1$2 Q4T2$2];
[Q5T1$1 Q5T2$1](6);
[Q5T1$2 Q5T2$2];
[Q6T1$1 Q6T2$1](7);
[Q6T1$2 Q6T2$2];
[Q7T1$1 Q7T2$1](8);
[Q7T1$2 Q7T2$2](12); !marker item for factor 2
[Q8T1$1 Q8T2$1](9);
[Q8T1$2 Q8T2$2];
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
Time1F2 Time2F2 WITH Time1F2 Time2F2;
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 2 pre items
Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2; !unit 2 post items (freely estimated)
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q1T1 WITH Q2T1; !item correlation on pre test
Q1T2 WITH Q2T2; !item correlation on post test
OUTPUT: sampstat residual STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit2_weak.dat;
TITLE: Strong Longitudinal Invariance Model for unit 2 pre/post assessment
DATA: FILE is CFAsixty_U2.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1-Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  IDVAR = STUID;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
  DIFFTEST IS unit2_weak.dat;
MODEL:
! Factor loadings (held equal across time via labels 13-18);
Time1F1 BY Q1T1* (13) Q2T1* (14) Q3T1@1 Q4T1* (15) Q5T1* (16);
Time1F2 BY Q6T1* (17) Q7T1@1 Q8T1* (18);
Time2F1 BY Q1T2* (13) Q2T2* (14) Q3T2@1 Q4T2* (15) Q5T2*; !Q5T2 loading freed to achieve (partial) weak invariance
Time2F2 BY Q6T2* (17) Q7T2@1 Q8T2* (18);
! Thresholds (second thresholds now also held equal across time);
[Q1T1$1 Q1T2$1](1);
[Q1T1$2 Q1T2$2]; !item 1 threshold 2 freed to achieve (partial) strong invariance
[Q2T1$1 Q2T2$1](2);
[Q2T1$2 Q2T2$2](20);
[Q3T1$1 Q3T2$1](3);
[Q3T1$2 Q3T2$2](11);
[Q4T1$1 Q4T2$1](4);
[Q4T1$2 Q4T2$2](21);
[Q5T1$1 Q5T2$1](6);
[Q5T1$2 Q5T2$2]; !item 5 threshold 2 freed (the strong invariance model is nested in the weak invariance model)
[Q6T1$1 Q6T2$1](7);
[Q6T1$2 Q6T2$2](23);
[Q7T1$1 Q7T2$1](8);
[Q7T1$2 Q7T2$2](12);
[Q8T1$1 Q8T2$1](9);
[Q8T1$2 Q8T2$2](24);
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
Time1F2 Time2F2 WITH Time1F2 Time2F2;
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 2 pre items
Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2; !unit 2 post items (freely estimated)
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q1T1 WITH Q2T1; !item correlation on pre test
Q1T2 WITH Q2T2; !item correlation on post test
OUTPUT: sampstat residual STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit2_strong.dat;
TITLE: Strict Longitudinal Invariance Model for unit 2 pre/post assessment
DATA: FILE is CFAsixty_U2.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1-Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  IDVAR = STUID;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
  DIFFTEST IS unit2_strong.dat;
MODEL:
! Factor loadings (held equal across time via labels 13-18);
Time1F1 BY Q1T1* (13) Q2T1* (14) Q3T1@1 Q4T1* (15) Q5T1* (16);
Time1F2 BY Q6T1* (17) Q7T1@1 Q8T1* (18);
Time2F1 BY Q1T2* (13) Q2T2* (14) Q3T2@1 Q4T2* (15) Q5T2*; !Q5T2 loading freed to achieve (partial) weak invariance
Time2F2 BY Q6T2* (17) Q7T2@1 Q8T2* (18);
! Thresholds (same pattern as the strong model);
[Q1T1$1 Q1T2$1](1);
[Q1T1$2 Q1T2$2]; !item 1 threshold 2 freed during strong invariance estimation
[Q2T1$1 Q2T2$1](2);
[Q2T1$2 Q2T2$2](20);
[Q3T1$1 Q3T2$1](3);
[Q3T1$2 Q3T2$2](11);
[Q4T1$1 Q4T2$1](4);
[Q4T1$2 Q4T2$2](21);
[Q5T1$1 Q5T2$1](6);
[Q5T1$2 Q5T2$2]; !item 5 threshold 2 freed during strong invariance estimation
[Q6T1$1 Q6T2$1](7);
[Q6T1$2 Q6T2$2](23);
[Q7T1$1 Q7T2$1](8);
[Q7T1$2 Q7T2$2](12);
[Q8T1$1 Q8T2$1](9);
[Q8T1$2 Q8T2$2](24);
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
Time1F2 Time2F2 WITH Time1F2 Time2F2;
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 2 pre items
Q1T2 Q2T2@1 Q3T2@1 Q4T2@1 Q5T2 Q6T2@1 Q7T2@1 Q8T2@1; !unit 2 post items
!Unique variances freed for items Q1 and Q5 because the strict invariance model is
!nested in the strong invariance model;
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q1T1 WITH Q2T1; !item correlation on pre test
Q1T2 WITH Q2T2; !item correlation on post test
OUTPUT: sampstat residual STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit2_strict.dat;

BIBLIOGRAPHY

Asparouhov, T., & Muthén, B. (2009). Exploratory structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 16(3), 397-438.

Asparouhov, T., & Muthén, B. (2006). Robust chi square difference testing with mean and variance adjusted test statistics. Mplus Web Notes, 10, 1-6.

DeBarger, A. H., Penuel, W. R., Harris, C. J., & Kennedy, C. A. (2016). Building an assessment argument to design and use next generation science assessments in efficacy studies of curriculum interventions. American Journal of Evaluation, 37(2), 174-192.
Dimitrov, D. M. (2010). Testing for factorial invariance in the context of construct validation. Measurement and Evaluation in Counseling and Development, 43(2), 121.

Ercikan, K., & Oliveri, M. E. (2016). In search of validity evidence in support of the interpretation and use of assessments of complex constructs: Discussion of research on assessing 21st century skills. Applied Measurement in Education, 29(4), 310-318.

Gane, B. D., McElhaney, K. W., Zaidi, S. Z., & Pellegrino, J. W. (2018, March). Analysis of student and item performance on three-dimensional constructed response assessment tasks. Paper presented at the NARST Annual International Conference, Atlanta, GA.

Gane, B. D., McElhaney, K. W., Zaidi, S. Z., & Pellegrino, J. W. (2019). Design and validation of instructionally-supportive assessment: Examining student performance on knowledge-in-use assessment tasks. Paper presented at the AERA Annual International Conference, Toronto, Canada.

Geisinger, K. F., Bracken, B. A., Carlson, J. F., Hansen, J. I. C., Kuncel, N. R., Reise, S. P., & Rodriguez, M. C. (2013). APA handbook of testing and assessment in psychology, Vol. 3: Testing and assessment in school psychology and education. American Psychological Association.

Gorin, J. S., & Mislevy, R. J. (2013, September). Inherent measurement challenges in the next generation science standards for both formative and summative assessment. In Invitational research symposium on science assessment.

Green, S. B., & Yang, Y. (2009). Reliability of summed item scores using structural equation modeling: An alternative to coefficient alpha. Psychometrika, 74(1), 155-167.

Hair, J. F., Jr., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (2009). Multivariate data analysis.

Harris, C. J., Krajcik, J. S., Pellegrino, J. W., & DeBarger, A. H. (2019). Designing knowledge-in-use assessments to promote deeper learning. Educational Measurement: Issues and Practice.

Huff, K., Steinberg, L., & Matts, T. (2010). The promises and challenges of implementing evidence-centered design in large-scale assessment. Applied Measurement in Education, 23(4), 310-324.

Kane, M. T. (2016). Explicating validity. Assessment in Education: Principles, Policy & Practice, 23(2), 198-211.

Kline, R. B. (2015). Principles and practice of structural equation modeling. Guilford.

Liu, Y., Millsap, R. E., West, S. G., Tein, J. Y., Tanaka, R., & Grimm, K. J. (2017). Testing measurement invariance in longitudinal data with ordered-categorical measures. Psychological Methods, 22(3), 486.

Lord, F. M. (1976). A study of item bias using characteristic curve theory.

McDonald, R. P., & Ho, M. H. R. (2002). Principles and practice in reporting structural equation analyses. Psychological Methods, 7(1), 64.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741.

Mislevy, R. J. (2009). Validity from the perspective of model-based reasoning. The concept of validity: Revisions, new directions and applications, 83-108.

Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25(4), 6-20.

National Academies of Sciences, Engineering, and Medicine. (2019). Science and engineering for grades 6-12: Investigation and design at the center. National Academies Press.
National Research Council. (2007). Taking science to school: Learning and teaching science in grades K-8. National Academies Press.

National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. National Academies Press.

National Research Council. (2013a). Education for life and work: Developing transferable knowledge and skills in the 21st century. National Academies Press.

Pellegrino, J. W., & Hilton, M. L. (Eds.). (2012). Education for life and work: Developing transferable knowledge and skills in the 21st century. Washington, DC: National Academies Press.

Pellegrino, J. W., Wilson, M. R., Koenig, J. A., & Beatty, A. S. (2014). Developing assessments for the Next Generation Science Standards. National Academies Press.

Reckase, M. D. (2017). A tale of two models: Sources of confusion in achievement testing. ETS Research Report Series, 2017(1), 1-15.

Rutkowski, L., & Svetina, D. (2017). Measurement invariance in international surveys: Categorical indicators and fit measure performance. Applied Measurement in Education, 30(1), 39-51.

Smith, C. L., Wiser, M., Anderson, C. W., & Krajcik, J. (2006). Focus article: Implications of research on children's learning for standards and assessment: A proposed learning progression for matter and the atomic-molecular theory. Measurement: Interdisciplinary Research & Perspective, 4(1-2), 1-98.

Standards, N. G. S. (2013). Next generation science standards: For states, by states.

Van de Schoot, R., Lugtig, P., & Hox, J. (2012). A checklist for testing measurement invariance. European Journal of Developmental Psychology, 9(4), 486-492.

CHAPTER 2

Developing and Validating an NGSS-Aligned Learning Progression to Track Three-Dimensional Learning of Electrical Interactions in High School Physical Science

Introduction

Historically, the US science education system has focused on broad coverage of multiple topics rather than on developing integrated, deep understanding of the key ideas in science. Deep, useable understanding is essential for applying scientific ideas to solve real-life problems, an ability typically referred to as knowledge-in-use (National Research Council [NRC], 2012, 2013a). Because many commercially available curricula focus on memorization and surface coverage of material, there has been significant effort from the community of scientists and educational researchers to define and describe what knowledge-in-use should look like. These efforts resulted in the publication of the Framework for K-12 Science Education (the Framework) and the Next Generation Science Standards (NGSS) (NRC, 2012; Standards, 2013; National Academies of Sciences, Engineering, and Medicine, 2019), which outline what students should know and be able to do in order to meet the demands of the 21st century. One of the major differences between the older view of science education and the view expressed in the Framework and NGSS has to do with the developmental nature of student understanding and the importance of coherence in the learning process. The previous standards focused on covering a broad range of scientific topics without paying much attention to the need to build connections between them, or to scaffold the learning process in a way that would help students build understanding over time.
The Framework and NGSS, on the other hand, emphasize a developmental approach grounded in decades of research on how students learn (NRC, 2012; Standards, 2013; National Academies of Sciences, Engineering, and Medicine, 2019). They stipulate that the development of deep understanding takes time, careful instruction, and appropriate scaffolding. If used consistently, a developmental approach to learning has been argued to lead to a more meaningful and coherent organization of the learning process and to the development of application and transfer skills in students (NRC, 2000, 2013a; Smith, Wiser, Anderson, & Krajcik, 2006; Krajcik, Sutherland, Drago, & Merritt, 2012; Duschl, Schweingruber, & Shouse, 2007). The essence of the developmental approach is reflected in the idea of a learning progression (LP), which the Framework and NGSS promote as a way to organize the learning process. The Framework emphasizes the usefulness of an LP as a valuable tool for helping educators support the development of deeper, useable knowledge of disciplinary core ideas in science coherently over time. In theory, therefore, an LP represents a "roadmap" of how students can potentially move toward more sophisticated levels of understanding of science over a broad, defined period of time (Duschl et al., 2007; Smith et al., 2006; Alonzo & Gotwals, 2012). It is important to point out that learning progressions are not developmentally inevitable; they depend on specific instruction and student prior knowledge, among other factors (Stevens, Sutherland, & Krajcik, 2009). However, obtaining validity evidence for a specific learning progression helps revise and better align curriculum, instruction, and assessment in a way that better supports students in developing deeper understanding of specific constructs (Neumann, Viering, Boone, & Fischer, 2013). Learning progressions have been described in the literature for various constructs, including atomic-molecular theory (Smith et al., 2006; Talanquer, 2009; Morell, Collier, Black, & Wilson, 2017), evolution (Catley, Lehrer, & Reiser, 2005), environmental literacy (Anderson, 2008; Mohan, Chen, & Anderson, 2009), energy (Lee & Liu, 2010; Neumann, Viering, Boone, & Fischer, 2013), celestial motion (Plummer & Krajcik, 2009; Plummer & Maynard, 2014), and force and motion (Alonzo & Steedle, 2009). There have also been learning progression descriptions that focus on both content and practice (Songer, Butler, Kelcey, & Gotwals, 2009; Gotwals & Songer, 2013), as well as on practice only (Lehrer, Kim, Ayers, & Wilson, 2014; Schwarz et al., 2009; Berland & McNeill, 2010; Osborne, Henderson, MacPherson, Szu, Wild, & Yao, 2016). The Framework builds on this research and defines three dimensions of science as the basis of the theoretical learning progressions described in the document and used to develop NGSS (NRC, 2012; Standards, 2013). The three dimensions are disciplinary core ideas (DCIs), scientific and engineering practices (SEPs), and crosscutting concepts (CCCs). DCIs make it possible to organize K-12 science curriculum, instruction, and assessment around the most important ideas in a scientific discipline. Focusing on a few core ideas that are bigger in scope allows students to develop deep understanding of important ideas in science coherently across school grades and to explain a wide range of phenomena (NRC, 2012, 2013).
CCCs serve as lenses for making sense of phenomena, prompting questions such as "What is the pattern in these data?", "Is this relationship causal or correlational?", and "How does the structure influence the function?". CCCs include patterns; systems and system models; cause and effect; and energy and matter, among others. The third dimension, SEPs, describes the authentic practices that scientists and engineers use to generate and revise knowledge. SEPs differ from simple skills in that doing a practice (like constructing a model to explain a phenomenon) requires not only skill (the hands-on, procedural aspect) but also knowledge specific to each practice (NRC, 2012). The Framework also emphasizes the idea of situated cognition, stating that students learn best when engaged in practices associated with applying the content under study to various real-life situations (Smith et al., 2006). The Framework defines three-dimensional learning (3D learning) as engaging in scientific and engineering practices in order to deepen understanding of crosscutting concepts and disciplinary core ideas (NRC, 2012). Developing the ability to integrate the three dimensions is the goal of 3D learning and is an indicator of deep, useable understanding of science (NRC, 2012). While the Framework describes the theoretical basis of 3D learning, and NGSS outlines possible theoretical learning progressions for the three dimensions of science across grades, we currently have very limited empirical evidence to show that a learning progression for 3D learning can be developed and validated in practice (Wyner & Doherty, 2017). This paper demonstrates the feasibility of developing a three-dimensional learning progression (3D LP) supported by both qualitative and quantitative validity evidence. First, the paper presents a hypothetical 3D LP aligned to a previously designed NGSS-based curriculum. It then presents multiple sources of validity evidence for the hypothetical 3D LP, including analysis of interviews with 17 students and item response theory (IRT) analysis of responses from 899 students, to provide validity evidence for the 3D LP on a large scale. Finally, the paper demonstrates the feasibility of using the assessment tool designed to probe levels of the 3D LP for assigning 3D LP levels to individual student answers, which is essential for the practical applicability of any LP. This work provides an example of a study focused on validating a 3D LP on a large scale in practice. It also demonstrates the usefulness of a validated 3D LP for organizing the learning process in the NGSS classroom, which is essential for successful implementation of NGSS.

Theoretical Framework

Validation of Theoretical Learning Progressions

Learning progressions represent a continuum of increasingly sophisticated ways of thinking about a given concept that develop across a broad, defined period of time (Corcoran, Mosher, & Rogat, 2009; Duschl et al., 2007). Learning progressions are usually grounded in research on how students learn ideas associated with the scientific constructs of interest, as well as in the specific logic of a given discipline. They are bounded by a lower and an upper anchor. The lower anchor describes the prior knowledge and relevant skills that students develop in lower grades, at home, or through other experiences. The upper anchor describes the knowledge and skills students are expected to gain, which could relate to specific learning goals, state or local standards, or any other external criterion.
The intermediate levels describe the skills and knowledge associated with a specific pathway that students take toward mastering the ideas described in the upper anchor. The levels of an LP are expressed as learning performances that summarize what students should be able to do with the scientific knowledge they have (Reiser, Krajcik, Moje, & Marx, 2003). While LPs are promising tools for organizing science instruction, curriculum, and assessment, most of them are theoretical and have not been validated in practice, and therefore cannot be effectively used as "road maps" as suggested by the Framework. To be able to use LPs as diagnostic tools that help educators identify the knowledge and skills students are missing in order to move to higher levels of an LP, we need to develop assessment tools that can probe each level of the LP and accurately place responses on a level (Wilson, 2009). This will help educators gain the information needed to organize instruction and curriculum around the specific core ideas and skills necessary to help students transition to higher levels of a given LP. This process of developing and using assessments to characterize students' understanding, and of determining whether the observed response pattern on the assessment indeed corresponds to the theoretical path described by the levels, constitutes validating an LP in practice (Herrmann-Abell & DeBoer, 2018). There are two common approaches to validating a theoretical LP (Duncan & Hmelo-Silver, 2009). The first is associated with a specific instructional intervention aligned to the theoretical LP, aimed at determining what students are capable of learning given a carefully designed instructional context (Cooper, Underwood, Hilley, & Klymkowsky, 2012; Nordine, Krajcik, & Fortus, 2010). The second approach is associated with the development of a measurement instrument aimed at evaluating student growth along the learning progression as a whole (Mohan, Jing, & Anderson, 2009; Herrmann-Abell & DeBoer, 2018). The instrument may be used to investigate how a previously developed curriculum impacts student learning. Consequently, this approach requires some alignment between the curriculum and the LP. The current study represents the second approach to LP validation. The same research base was used to inform the design of both the curriculum and the learning progression under study, and repeated measures of student understanding were collected as students progressed through the curriculum. Specifically, an NGSS-aligned curriculum was designed to target high school NGSS performance expectations focused on electrical interactions, and the developmental progression of student understanding of the relevant ideas was carefully outlined in the process of curriculum design. This developmental progression was then used as the basis for designing a theoretical learning progression that integrates the three dimensions of NGSS (3D LP). Finally, assessment tasks aligned to specific levels of the theoretical 3D LP were developed to determine how student understanding develops in the context of NGSS-based instruction.

Building the 3D LP according to the principles described in the Framework

The 3D LP presented here is based on the fundamental building principles outlined in the Framework and NGSS. These principles include 1) integrating the three dimensions of scientific knowledge, 2) expressing standards as performance expectations, and 3) focusing on explaining phenomena and solving problems using the three dimensions.
The following paragraphs discuss each of these principles in more detail and show how each was used in this study.

Integrating the three dimensions of scientific knowledge

The three dimensions work together to allow students to make sense of and explain a variety of phenomena or to find solutions to challenging real-world problems. The Framework specifically emphasizes that it is the ability to integrate the three dimensions of scientific knowledge that is indicative of deep, meaningful science understanding (NRC, 2012). This implies that the three dimensions should also be integrated in curriculum, instruction, and assessment. In other words, the learning process should not be centered on instruction and assessment of individual DCIs, SEPs, and CCCs; rather, the focus of learning in the NGSS classroom should be on exploring phenomena by integrating the three dimensions of NGSS. The 3D LP presented here was developed following this principle. Specifically, the 3D LP levels describe increasing sophistication for the relevant DCIs, SEPs, and CCCs together, instead of as separate learning progressions for each dimension. The three dimensions are also integrated in the curriculum and assessment used to validate the 3D LP.

Expressing standards as performance expectations

The three dimensions described above combine to form standards in the NGSS, expressed as performance expectations (PEs) that identify what a student should know and be able to do with the scientific knowledge they have at the end of a grade band. The NGSS provide PEs at each grade level for elementary school and at each grade band for middle and high school. Such a representation of increasing sophistication in students' mastery of the three dimensions reflects a developmental approach and, if used consistently, has been argued to lead to a more meaningful and coherent organization of the learning process (Smith et al., 2006). The 3D LP presented here is aligned to NGSS PEs and describes the appropriate and necessary degree of proficiency in the three dimensions that students should develop at each level.

Focusing on explaining phenomena or solving problems using the three dimensions

Phenomena, in the context of NGSS, are events that can be directly observed in nature and that can be explained using scientific ideas students learn or that build on what students know. Phenomena serve as a gateway that allows students to ask questions and, with help from the teacher, develop an inquiry path toward building understanding of the phenomenon. The focus of the 3D LP presented here is to describe students' ability to integrate the three dimensions of science at different levels of proficiency as related to their ability to explain a wide range of electrostatic phenomena. The assessment instrument designed to probe the levels of the 3D LP asks students to model and explain electrostatic phenomena. Each item represents a storyline containing multiple questions about a phenomenon. The assessment instrument provides detailed information about the degree of student proficiency in integrating the three dimensions of NGSS to explain relevant phenomena.
3D LP context: the "Interactions" curriculum

Understanding electrical interactions is central to developing deep understanding of the DCIs in physical science and is a prerequisite for developing higher-level understanding of more advanced topics. The Framework defines the following DCIs for the physical sciences: "Matter and Its Interactions", "Motion and Stability: Forces and Interactions", "Energy", and "Waves and Their Applications" (NRC, 2012). The emphasis the Framework puts on student understanding of interactions is reflected in questions such as "How can one explain the structure, properties, and interactions of matter?" and "How can one explain and predict interactions between objects and within systems of objects?" (NRC, 2012). According to the Framework, the ability to explain how objects interact at the macroscopic and microscopic levels is indicative of deep understanding of the DCIs in physical science. This work focuses on electrical interactions, which are central to understanding processes in multiple fields of science, including chemical bonding, phase changes, properties of materials, interactions of drugs in cells, the energy contained in hurricanes, and many others. Explaining such processes requires understanding of the atomic nature of matter, electric fields, coulombic interactions, electric forces, and energy. This project uses an NGSS-aligned curriculum for 9th grade physical science called "Interactions". The curriculum is phenomena driven and focuses on helping students build integrated understanding of electrical interactions across time through 3D learning strategies. It focuses on the following aspects of the DCIs, at micro and macro scales, as related to explaining electrical interactions: the atomic nature of matter (the DCI of Matter and Its Interactions, sub-idea of Structure and Properties of Matter), electric forces (the DCI of Motion and Stability: Forces and Interactions, sub-idea of Types of Interactions), and energy (the DCI of Energy, sub-idea of Relationship Between Energy and Forces). The curriculum currently consists of four units. Each unit focuses on investigating engaging natural phenomena using specific aspects of the DCIs, several SEPs, and CCCs. Unit 1 starts with macroscopic-level phenomena related to electrical interactions. The phenomena are presented to students in the form of driving questions that they pursue over the course of the entire unit, or sometimes over a single activity or a few activities. For example, the Unit 1 phenomenon-based driving question is "Why do some clothes stick together when they come out of the dryer?". Students investigate patterns in how charged objects interact and use ideas of electric fields and forces to explain what causes certain types of clothes to stick together when they come out of the dryer. Once students have gained useable knowledge of electrical interactions at the macroscopic level, they proceed to explore atoms and relate charges to atomic structure. This helps students construct more detailed causal models to explain how objects become charged (via transfer of electrons), how electron cloud shifts cause neutral objects to be attracted to charged ones, and so on. By the end of Unit 1, students are expected to have deep understanding of ideas related to charges, electric fields, and forces at the microscopic level. Unit 2 focuses on ideas of energy at the macroscopic and microscopic levels.
Units 3 and 4 focus on applications of the ideas discussed in Units 1 and 2 to explain phenomena related to hydrogen bonding, hydrophobic and hydrophilic interactions, and protein folding. The "Interactions" curriculum is designed to help students develop these ideas over the course of one academic year. The curriculum has gone through an external review process by Achieve (review process: https://www.achieve.org/reviews). Unit 1 of the "Interactions" curriculum received the highest rating, termed "Example of high quality NGSS design", and Unit 2 received the second highest rating, termed "Example of high quality NGSS design if improved". These ratings indicate that the curriculum is a good example of implementing 3D learning in the classroom. Further, the National Science Teachers Association recognizes "Interactions" as being aligned to NGSS and provides classroom videos demonstrating curriculum use on its official webpage (http://ngss.nsta.org/). These pieces of evidence support the choice of this curriculum for developing and validating the 3D LP in this study. The curriculum consists of online materials where all the student activities are located (http://interactions.portal.concord.org/) and paper-based teacher materials that can be accessed online via Google Docs. The curriculum is free and available for anyone to use.

Methodology

Developing and empirically testing the NGSS-aligned 3D LP. A level of the 3D LP can be described as one in a series of comprehensive and developmentally appropriate steps toward more sophisticated application of DCIs, CCCs, and SEPs. The 3D LP presented here focuses on two of the three DCIs covered in the curriculum: the DCI of Matter and Its Interactions (sub-idea of Structure and Properties of Matter) and the DCI of Motion and Stability: Forces and Interactions (sub-idea of Types of Interactions). This is because the current work uses validity evidence collected before and after the implementation of Unit 1 only, and these DCIs were covered in Unit 1 of the curriculum. Further, the 3D LP focuses on the SEP of Developing and Using Models and the CCC of Cause and Effect because those dimensions were most heavily emphasized throughout the curriculum. First, the lower and upper anchors were defined to establish the scope of the LP. The lower anchor was based on students' prior knowledge, characterized from the written assessment and oral interviews with individual students before they started the curriculum. The upper anchor is based on the NGSS PEs. The intermediate levels of the LP were defined based on a combination of the instructional sequence, feedback from disciplinary experts, and literature related to student learning. This process resulted in a hypothetical 3D LP that was then empirically tested based on interviews with students and IRT analysis of the written assessment. Table 2.1 provides a description of the levels of the hypothetical NGSS-aligned 3D LP.
Table 2.1 Hypothetical 3D LP for electrical interactions

Level 3: Microscopic Model for Electrical Interactions
DCI sub-ideas ("Types of Interactions", "Structure and Properties of Matter"):
• causal relationships among amounts of charge, the magnitude of the electric field, the generated attractive/repulsive forces, and the distance between charged objects (Coulomb's Law), with these ideas related to the components of atoms (protons, electrons)
• ideas of force, field, and charge are used to explain phenomena
• matter consists of atoms modeled as having a small, dense, positively charged nucleus with electrons around it; electrons are modeled as point charges or as a cloud
• components of atoms (electrons, protons) are used in explaining interactions between objects
SEP ("Developing and Using Models") and CCC ("Cause and Effect"):
• student models/explanations are causal and explicitly use ideas of electric forces, fields, charges, and the atomic nature of matter to explain phenomena by showing a micro-level mechanism
• models relate changes in the system to changes in forces between interacting atoms to explain phenomena

Level 2: Macroscopic Model for Electrical Interactions
DCI sub-ideas:
• causal relationship among the amount of charge, the magnitude of electric forces, and the distance between charges at the macro level (Coulomb's Law)
• charge viewed as microscopic (might mention electrons, protons, neutrons), but these ideas are not explicitly used to explain phenomena
• matter is made of particles, but this idea is not explicitly used to explain phenomena
• atoms modeled with the plum pudding model or some other inaccurate version of the atomic model
SEP and CCC:
• student models/explanations are causal and explicitly use ideas of electric forces and electric charges to explain phenomena by showing a macro-level mechanism
• models relate changes in the system to changes in forces between interacting objects in the system

Level 1: Incomplete Macroscopic Model for Electrical Interactions
DCI sub-ideas:
• opposite charges attract and like charges repel; charge causes attraction/repulsion
• charge is transferred via contact
• no relationship among the magnitudes of interacting charges, the generated electric force, and distance (Coulomb's Law)
• matter is continuous, or contains particles modeled as circles; charges are modeled as static or point charges; charge is not related to the structure of matter
SEP and CCC:
• models/explanations are not causal and are based on recollection of facts only; no mechanism explaining the phenomenon

Assessment development. A modified evidence-centered design (mECD) process (Harris, Krajcik, Pellegrino, & DeBarger, 2019) was used to develop assessments that show evidence of 3D learning in the context of the curriculum. The mECD approach combines elements of evidence-centered design (ECD) (Mislevy & Haertel, 2006) and the construct-centered design (CCD) process (Shin, Stevens, & Krajcik, 2010) to design tasks for measuring knowledge-in-use. The first step of mECD involves identifying and unpacking an NGSS PE in order to develop a 3D claim that describes what students should be able to do with the corresponding DCIs, SEPs, and CCCs. The process of unpacking specifies the aspects of the DCIs, SEPs, and CCCs that students should master in order to meet a given NGSS PE. It is important to unpack NGSS PEs because they are broad statements covering multiple content areas that are not necessarily the focus of the 3D LP, and therefore not the focus of the assessment designed to measure the 3D LP levels. Unpacking also ensures coherence and alignment among the NGSS PEs, the assessment, and the 3D LP levels. The next step involves specifying the evidence that shows students have met the requirements specified in the claim. The claim and evidence combine to form an mECD argument.
Finally, assessment tasks are developed for each mECD argument that will provide the necessary evidence to measure the claim. This process is shown in Figure 1. Figure 2.1 Summary of the modified evidence-centered design process. An example of the mECD argument for an item that helps characterize the level of students’ understanding of electrical interactions is summarized in Table 2. The item is designed to provide evidence on whether students are at level 1, 2 or 3 of the 3D LP. The mECD argument focuses on the DCIs of Matter and Its Interactions and Motion and Stability: Forces and Interactions, specifically the elements PS1.A (Structure and Properties of Matter) and PS2.B (Types of Interactions). Further, the mECD argument focuses on the SEP of Developing and Using Models and the CCC of Cause and Effect. There were a total of 8 items designed to measure 3D understanding of electrical interactions for Unit 1. Each item is open-ended (see Table 2) and contains an aspect of a DCI, a SEP and a CCC. Items were administered as a pre and post Unit 1 test during the 2016-2017 academic year. Several items, including the one shown in Table 2, were used to conduct interviews before and after Unit 1 to obtain qualitative validity evidence for the 3D LP.

Alignment between the 3D Learning Progression and the Scoring Rubric

Each item was open-ended and measured all 3 levels of the 3D LP. To assign a level on the 3D LP to an answer, scorers used the following criteria: Are the most relevant parts of the DCI present? Does the answer reflect macro- or micro-level understanding? Is the explanation causal? The rubric describes the DCI, SEP and CCC for each item. Each answer was scored directly onto a 3D LP level; for example, a score of 1 on an item corresponds to level 1. Table 3 shows the rubric, the level of the 3D LP, and a sample answer from the oral interview for the item shown in Table 2.

Table 2.2 Example of mECD process
Claim: Students construct a causal model to explain how objects become charged using electron transfer.
Evidence: Students use electron transfer between atoms as their model to explain the mechanism for charging objects. They include these ideas in the models as appropriate:
1. Objects are initially neutral (the # of electrons is equal to the # of protons in the atoms);
2. Transfer of electrons between the atoms of one object and the atoms of another object causes both objects to become charged.
a. Objects are made of atoms;
b. Atoms consist of a positively charged nucleus containing positively charged protons and neutral neutrons that is surrounded by negatively charged electrons;
c. Atoms with the same number of electrons and protons are neutral.
d. Atoms with an unequal number of electrons and protons are charged.
i. If they have more electrons than protons, the atoms will be negatively charged.
ii. If they have fewer electrons than protons, the atoms will be positively charged.
iii. When an atom becomes charged, electrons move from one atom to another.
iv. When electrons transfer from one atom to another, one atom becomes negatively charged and the other becomes positively charged.
3. Electron transfer is caused by contact between objects (touching or rubbing);
a. The effect of electron transfer on an object that gave electrons is a net “+” charge because the atoms of this object have a larger number of protons than electrons; the effect of electron transfer on an object that received the electrons is a net “-“ charge.
b. The # of electrons lost by the atoms of one object equals the # of electrons gained by the atoms of another. Therefore, charge is conserved.
4. Student models will show a causal relationship between components of atoms and the generated electric forces and fields when explaining phenomena involving electrical interactions.
a. An unequal number of electrons and protons within an atom causes a net charge
b. Charged atoms generate an electric field around them
c. When two atoms get close enough for their fields to interact, an electric force is generated between the two atoms
i. An attractive electric force is generated between oppositely charged atoms
ii. A repulsive electric force is generated between similarly charged atoms
iii. The smaller the distance between the atoms, the larger the generated electric force (attractive or repulsive), and vice versa
iv. The larger the charge on each of the interacting atoms, the larger the generated force (attractive or repulsive), and vice versa
5. Less sophisticated models contain fewer microscopic-level components and provide few or no causal relationships to account for observations of electrostatic phenomena.
Task: Students are shown a video where fur and rod don’t attract paper before they are rubbed together. Upon being rubbed together, both fur and rod start attracting paper. Draw a model that shows what happens to the rod and fur when they are rubbed together to cause the paper to move towards the rod. Make sure to label everything in your model. Describe what happens to the rod and fur during the process of rubbing them together.

Data Analysis

Constructing Hypothetical 3D LP and Evaluating Assessment Items and Rubric

The hypothetical 3D LP shown in Table 1 was constructed using the logical sequence of the discipline, relevant research literature, and unpacking of NGSS PEs. The “Interactions” curriculum was piloted in the same schools in the Mid-West a year prior to the data collection described here. Unit 1 assessment items, designed to probe the levels of the 3D LP, were administered during the pilot year via the online “Interactions” portal. Two researchers went through 100 student responses for each item to ensure that the items elicited the types of responses that the researchers anticipated based on the preliminary levels of the 3D LP and the scoring rubric. Based on this analysis, the 3D LP levels, assessment items, and scoring rubric were modified to ensure consistency and improved validity of the 3D LP and the assessment instrument. There were no major changes made to either the 3D LP or the assessment. The assessment items were rephrased, and scaffolds were added to help students understand the questions and address all parts of each question. The 3D LP levels were not modified significantly, but a note was taken of the types of answers that seemed to contain ideas from multiple levels of the 3D LP and therefore represent “in-between-level” 3D understanding. This is discussed in the results section.

Supporting levels of the 3D LP using qualitative analysis of student interviews

The interview data were collected in a Mid-Western public high school where the “Interactions” curriculum was implemented. The school was rural, with 28% free and reduced lunch. Students from three different classrooms were interviewed. Two classrooms had the same teacher, and one classroom had a different teacher. Both teachers had taught the “Interactions” curriculum prior to the data collection year. Students from all three classrooms had very little prior knowledge of electrical interactions based on pre-Unit 1 interview analysis.
Several students from each of the three participating classrooms were interviewed before and after implementation of Unit 1, for a total of 17 students. The students were selected to represent different levels of academic achievement. Items from two different testlets were used in the interview: the foil experiment testlet, and the paper and rod testlet. Sample interview analysis for the paper and rod testlet is shown in Table 3. Sample interview analysis for the foil experiment testlet is shown in Table 4. The mECD argument for the foil experiment testlet is provided in the Appendix. In the foil experiment item, students develop an atomic model consistent with the results of the Rutherford experiment. These two testlets probe ideas related to the three levels of the hypothetical 3D LP shown in Table 1. Student interviews were analyzed using the scoring rubric, and each answer was assigned a level on the 3D LP. Inter-rater reliability was established in the following manner. One researcher scored all 17 interviews first. Then, two other researchers used the same rubric to score the interviews of 3 students from each classroom (9 students total). Once 100% agreement on 3D LP level placement for all 9 students was reached between the 3 scorers, the scoring rubric and the 3D LP levels were modified accordingly, and the rest of the interviews were rescored based on this discussion.

Support for the Validity of Levels of the 3D LP using Item Response Theory (IRT)

The pre and post Unit 1 assessment data were collected in six schools in the Mid-West and five schools in the Western United States. Schools in the Mid-West were rural, with 28% free and reduced lunch. Schools in the Western part of the US were urban, with 72.4% free and reduced lunch. The assessment was administered in classrooms where the “Interactions” curriculum was piloted during Fall 2016 and Spring 2017. The total sample size was 899 students. Teachers in the Mid-West schools had taught the “Interactions” curriculum prior to the data collection year, and teachers in the Western part of the US were first-time users of the curriculum. Students on average had very little prior knowledge of the constructs measured by the two assessment instruments, based on pre-Unit 1 interview data. IRT analysis for the Unit 1 pre/post assessment was carried out following Toland (2014). The sample of 899 students was modeled using the graded response model (GRM) (Samejima, 1969). A score of “0” was imputed for students who had missing values on any of the items. This was deemed appropriate because students were given an unlimited amount of time to finish the assessment; therefore, it was safe to assume that if they did not provide an answer for an item, they did not know it. Pre/post assessment data were combined in model estimation to allow for comparison of ability distributions on the pre and posttest. Unidimensionality and longitudinal invariance are discussed in Chapter 1. Pre and post measures were highly reliable (pre Unit 1 = 0.872, post Unit 1 = 0.934) and supported by validity evidence (Chapter 1). This suggests a unidimensional IRT model is appropriate for the data. The Appendix provides R code for model selection, specification, and estimation using the mirt package (Chalmers, 2012) in RStudio (RStudio Team, 2015). The results section presents the IRT analysis relevant to the 3D LP validation.
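The Appendix code is not reproduced here, but a minimal sketch of this estimation step with the mirt package might look as follows (object names such as `resp` are hypothetical, and the actual appendix script may differ):

```r
# Minimal sketch of the GRM estimation described above, using the mirt package.
# `resp` is a hypothetical data frame of the 8 open-ended items scored 0-2,
# with pre- and post-test administrations stacked as separate rows.
library(mirt)

resp[is.na(resp)] <- 0                 # impute a score of "0" for missing responses
grm <- mirt(resp, model = 1,           # one latent dimension (unidimensional)
            itemtype = "graded")       # Samejima's graded response model
coef(grm, IRTpars = TRUE, simplify = TRUE)  # item slopes and category difficulties
theta <- fscores(grm, method = "EAP")  # person proficiency estimates on the logit scale
```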
Table 2.3 Sample responses for every 3D LP level for paper and rod

Level/Score: 1
3D LP Scoring Rubric
DCI, Structure and Properties of Matter:
• matter is continuous, or contains particles modeled as circles
• charges are modeled as static or point charge
• don’t relate charge to structure of matter
DCI, Types of Interactions:
• same charges repel and opposite charges attract
• charge is transferred via contact
• no relationship between magnitude of interacting charges and generated electric force and distance (Coulomb’s Law)
• charge causes attraction/repulsion
SEP and CC:
• models/explanations are not causal, based on recollection of facts only; no mechanism explaining phenomenon
Question: Draw a model that explains what happens to the rod and fur when they are rubbed together to cause paper bits to move towards the rod. Label your drawing.
DCI: Structure and Properties of Matter
• Charges might be shown as parts of objects (rod, paper, fur) but are not used to construct a causal explanation
• Model does not explain what makes objects initially neutral
• Model does not use charge transfer to explain how the rod becomes charged. Atoms are not shown
DCI: Types of Interactions
• Fuzz represents static electricity and charge transfer
• Fuzz/static is transferred through rubbing
• Model doesn’t use the electric force and charge relationship to explain the phenomenon
SEP and CC
• Model explains attraction between paper and rod using ideas related to magnets/magnetic force
• No causal mechanism beyond recollection of facts
• Static/fuzz causes attraction
Sample Student Response. Student: as the cloth rubs on the rod, it causes the rod to have some kind of “magnetic” effect. Kind of like rubbing a piece of cloth on a balloon causes a kind of “electric” charge. Paper is attracted to this “magnetized” rod.
Comment: relevant components of the DCI are not present; the model contains only observable components and no causal mechanism to explain why paper is attracted to the rod.

Table 2.3 (cont’d).
Level/Score: 2
3D LP Scoring Rubric
DCI, Structure and Properties of Matter:
• matter is made of particles, but this idea is not explicitly used to explain phenomena
• atoms modeled with the plum pudding model or some other inaccurate version of the atomic model
DCI, Types of Interactions:
• causal relationship between amount of charge, magnitude of electric forces and distance between charges at the macro level (Coulomb’s Law)
• charge viewed as microscopic (might mention electrons, protons, neutrons), but these ideas are not explicitly used to explain phenomena
SEP and CC:
• student models/explanations are causal and explicitly use ideas of electric forces and electric charges to explain phenomena by showing a macro-level mechanism
• models relate changes in the system to changes in forces between interacting objects
Question: Draw a model that explains what happens to the rod and fur when they are rubbed together to cause paper bits to move towards the rod. Label your drawing.
DCI: Structure and Properties of Matter
• Paper, rod and fur all contain charges
• Paper, rod and fur are initially neutral because no interaction is observed; all objects contain equal numbers of + and – charges
• Model shows charges transferred between rod and fur (just positive, just negative, or both) during rubbing
• Charges are modeled as point charges. Atoms are not shown.
DCI: Types of Interactions
• When the charged rod is brought close to the paper bits, an attractive force is generated between the charged rod and charges in the paper
SEP and CC
• Model or written explanation shows that the attractive force between the charged rod and charges in the paper causes the paper bits to move, but doesn’t explain how charges in the neutral paper originate
Sample Student Response. Student: when the rod was not rubbed by the fur, it was neutral, which was why it did not stick to the paper (neutral and neutral objects don’t interact). When the fur was rubbed onto the rod, it gave negative charges over to the rod, which then made it negative. The paper bits were attracted to the negative rod because in the bits there are positives and negative charges and so the negative rod attracted to the positive charges inside paper bits.
Comment: Relevant DCIs are present; the model provides a macro-level causal mechanism (highlighted sentence) but does not fully explain why neutral paper is attracted to the charged rod.

Table 2.3 (cont’d).
Level/Score: 3
3D LP Scoring Rubric
DCI, Structure and Properties of Matter:
• matter consists of atoms modeled as having a small, dense, positively charged nucleus and electrons orbiting around it; electrons are modeled as a point charge or cloud
• components of atoms (electrons, protons) are related to explaining interactions between objects
DCI, Types of Interactions:
• causal relationships between amounts of charge, magnitude of electrical field and the generated attractive/repulsive forces and distance between charged objects (Coulomb’s Law), relating these ideas to components of atoms (p, e)
• use ideas of force, field and charge to explain phenomena
SEP and CC:
• student models/explanations are causal and explicitly use ideas of electric forces, fields, charges and the atomic nature of matter to explain phenomena by showing a micro-level mechanism
• models relate changes in the system to changes in forces between interacting atoms to explain phenomena
Question: Draw a model that explains what happens to the rod and fur when they are rubbed together to cause paper bits to move towards the rod. Label your drawing.
DCI: Structure and Properties of Matter
• Paper, rod and fur all contain charges modeled as parts of atoms
• Paper, rod and fur are initially neutral because no interaction is observed; all objects contain equal numbers of protons (+) and electrons (-) within their atoms
• Model shows electrons transferred from fur to rod or vice versa during rubbing
DCI: Types of Interactions
• Excess electrons in the atoms of the rod cause the rod to have a “-“ charge. Alternatively, lack of electrons causes the rod to have a “+” charge.
• When the charged rod is brought close to the neutral paper bits, a repulsive force is generated between electrons in the atoms of the rod and electrons in the atoms of the paper. This repulsive force causes electrons in the atoms of the paper to move away from the rod, exposing the positively charged nucleus. The attractive force between the nuclei of atoms in the paper and electrons in the rod causes the paper to move towards the rod.
SEP and CC
• Model or written explanation shows that the attractive force between the charged rod and charges in the paper causes the paper bits to move and explains the origin of the attractive force in spite of the fact that paper is neutral.
Sample Student Response. Comments: no level 3 responses were observed by the end of Unit 1 for this interview item either, which is consistent with the developmental approach.
Table 2.4 Sample responses for every 3D LP level for the foil experiment

Level/Score: 1
3D LP Scoring Rubric
DCI, Motion and Stability: Forces and Interactions:
• same charges repel and opposite charges attract
• charge transferred via contact
• charge viewed as macroscopic (point charge)
• no relationship between magnitude of interacting charges and generated electric force and distance (Coulomb’s Law)
• charge is static electricity that causes attraction/repulsion
DCI, Matter and Its Interactions:
• matter is continuous, or made of particles modeled as plain circles; don’t relate charge to structure of matter
SEP and CCs:
• models/explanations are not causal, based on recollection of facts only
• no mechanism explaining phenomenon
Question 1: Draw a model of a silver atom that is consistent with Tom’s results.
• Models show a plum pudding model, or some inaccurate version of the model (matter consists of different charges that are not parts of atoms, charges mixed up, components missing, etc.)
Question 2: Explain why your model is consistent with the observation that relatively few particles were deflected by the sheet of foil (followed Paths B, C, D or similar). Justify your answer.
• Explanations are at the macroscopic level; ideas of forces/fields are not used to explain the pattern, and interactions at a distance between charged particles are not mentioned
Sample Student Response (Question 2). Student: some particles passed through because the foil was unkrinkled…In the krinkled spots on the foil the particles would bounce back
Comments: students model the structure of matter as containing positive and negative point charges and construct a causal explanation of the observed pattern using only macro-level observable components (“crinkled spots” cause the observed pattern). All of this is consistent with level 1 of the 3D LP.

Table 2.4 (cont’d).
Level/Score: 2
3D LP Scoring Rubric
DCI, Motion and Stability: Forces and Interactions:
• causal relationship between amount of charge and magnitude of attractive/repulsive forces and distance between charges at the macroscopic level (Coulomb’s Law)
• charge viewed as microscopic (might mention electrons, protons, neutrons), but these ideas are not explicitly used to explain phenomena
DCI, Matter and Its Interactions:
• matter is made of particles, but this idea is not explicitly used to explain phenomena; particles making up matter modeled with the plum pudding model or some other inaccurate version of the atomic model
Question 1: Draw a model of a silver atom that is consistent with Tom’s results.
• Models include a concentrated positively charged nucleus that takes up a small portion of the total volume of the atom and negatively charged electrons surrounding the nucleus [cloud or points]
Question 2: Explain why your model is consistent with the observation that relatively few particles were deflected by the sheet of foil (followed Paths B, C, D or similar). Justify your answer.
• Explanations use the relationship between electric force and distance between charged particles to explain the pattern
• The explanation evokes a “hitting mechanism”, indicating that particles that went through the foil did not hit the sub-atomic particles directly but passed through the empty space between the atoms
Sample Student Response (Question 2). Student: Particles from the detector are deflected if they hit the nucleus, but not heads-on.
They bounce back if they hit the nucleus directly, and pass through if they pass through the empty space between the atoms
Comments: student models show an accurate structure of the atom (small, dense, positive nucleus with point-charge negative electrons around it). However, explanations don’t mention ideas of electric forces to construct a causal account of the phenomenon. Instead, explanations rely on a macro-level “hitting” mechanism to explain the pattern. This is consistent with level 2 of the 3D LP.

Table 2.4 (cont’d).
Level/Score: 3
3D LP Scoring Rubric
DCI, Motion and Stability: Forces and Interactions:
• causal relationships between amounts of charge, magnitude of electrical field and the generated attractive/repulsive forces and distance between charged objects (Coulomb’s Law), relating these ideas to components of atoms (protons, electrons)
• use ideas of force, field and charge to explain phenomena
DCI, Matter and Its Interactions:
• matter consists of atoms modeled as having a small, dense, positively charged nucleus and electrons orbiting around it; electrons are modeled as a point charge or cloud
• components of atoms (electrons, protons) are related to explaining interactions between objects
Question 1: Draw a model of a silver atom that is consistent with Tom’s results.
• Models include a concentrated positively charged nucleus that takes up a small portion of the total volume of the atom and negatively charged electrons surrounding the nucleus [cloud or points]
Question 2: Explain why your model is consistent with the observation that relatively few particles were deflected by the sheet of foil (followed Paths B, C, D or similar). Justify your answer.
• Explanations use the relationship between electric force, electric field, and distance between charged particles to explain the pattern
• The explanation evokes a microscopic-level mechanism indicating that particles that went through the foil did not hit the sub-atomic particles directly but passed through the empty space between the atoms
Sample Student Response. Comments: no level 3 responses were observed by the end of Unit 1. This observation is consistent with the developmental approach because level 3 understanding reflects deep conceptual understanding of science ideas at the microscopic level and the ability to apply them by blending the three dimensions of NGSS effectively across various situations. This type of understanding takes a long time to develop. The author would expect most students to ultimately develop this level of understanding by the end of the curriculum. The elements of an answer to items 1 and 2 consistent with level 3 include the following:
• Model shows the foil is made of atoms with a dense, positively charged nucleus and electrons as a cloud of negative charge
• Explanations indicate that alpha particles that come close enough to interact with the electric field created by the positively charged nuclei of atoms in the foil are repelled by the generated repulsive force. This repulsive force causes the particles to either bounce back if they interact with the nucleus head-on (Path B), or come out at an angle if they come close to the nucleus (Paths C, D). The alpha particles that don’t come close enough to interact with the electric field generated by the nuclei come out of the foil without changing their original path. Since the nucleus takes up only a small volume of the atom, most alpha particles never interact with the nucleus (Path A).
Results

Supporting the Validity of Levels of the 3D LP using Qualitative Analysis of Student Interviews

Identifying Key Knowledge and Practices for Each Level of the 3D LP

Qualitative analysis of student interviews served as a rich source of information for obtaining validity evidence for the hypothetical 3D LP levels. Analysis of student responses supported the hypothesized progression of student understanding reflected in the 3D LP levels for this phenomenon. Specifically, at level 0 student answers contain no relevant information, so examples for that level are not shown. Level 1 responses reflect macro-level models: the models contain observable components and no relevant causal mechanistic details at either the macro or micro scale. In the context of the paper and rod item, models do not explain how the rod becomes charged (charge transfer as a result of rubbing between the rod and fur that causes the rod to become charged) or why neutral paper bits are attracted to the charged rod (due to the attractive force between the charged rod and charges in the paper). Students use words such as “static” or “magnets” to explain electrostatic phenomena without specifying what they mean by these terms. In the context of the foil experiment items, student models only show macro-level components, or point charges, without explaining how charges in the foil are involved in producing the pattern that is observed in the experiment. At level 2, student models reflect macro-level causal accounts that contain relevant aspects of the DCIs related to charges and attractive forces used to explain phenomena. In the context of the paper and rod item, student models show that rubbing causes charge transfer between rod and fur, which causes the paper and rod to become charged. The models also show that paper bits are attracted to the charged rod as a result of the attractive force generated between charges of the rod and the paper. Charges, however, are modeled as point charges and not parts of atoms. This lack of detail in the level 2 models leads to incomplete or inaccurate explanations of phenomena and a lack of microscopic-level details. For example, to provide a full causal account for why neutral paper is attracted to a charged rod, models need to show where the charges involved in the interaction between rod and paper originate (excess electrons on the rod and the nuclei in the atoms of the paper, which become exposed as a result of the repulsive interaction between electrons of the rod and the paper). Level 2 models lack that level of detail because students do not always relate charges to components of atoms (protons and electrons). Similarly, when explaining how rubbing causes fur and rod to become charged, level 2 models often indicate that both positive and negative charges are transferred between rod and fur. These inaccuracies probably also stem from the fact that students do not relate charges to the structure of the atom and lack the understanding that positive charges are protons, which are located in the nucleus of the atom and therefore cannot be transferred during rubbing; only electrons transfer as a result of rubbing. In the context of the foil experiment items, student models show atomic models of varying degrees of accuracy. The models and explanations use ideas of charges to explain the observed pattern, but with some macro-level inaccuracies.
For example, the sample student model and response shown in Table 4 indicate an accurate model of the atom (small, positively charged nucleus, and electrons around the nucleus), but explain the observed pattern as resulting from a sort of “hitting mechanism” where alpha particles shot at the gold foil either hit the nucleus of the atom directly or on the side. The explanations lack the “interaction at a distance” aspect that would show that students view electrical forces as acting without contact, through the field across space, rather than as the contact forces they are more familiar with at the macroscopic level. Therefore, level 2 of the 3D LP is characterized by students’ ability to develop a macroscopic-level causal relationship between charges and generated electric forces to explain electrostatic phenomena, but a lack of the microscopic-level details needed to provide a full causal mechanistic explanation. These microscopic-level mechanistic details that are missing from level 2 models are present in level 3 responses. At this level students demonstrate mastery of force, field and charge relationships and atomic-level understanding by showing charged particles (electrons, protons) as parts of atoms and explaining the origin, direction and mechanism of action of electric forces.

Evidence in Support of the Developmental Nature of Student 3D Understanding

While there were no level 3 responses observed in the interviews or in the scoring of the entire student sample of written pre and post assessments, there were some responses that could be characterized as transitioning between the levels of the 3D LP. Table 5 provides examples of student answers that were considered to fall between the levels for the Rod and Fur item and explains why. For example, transitioning from level 1 to 2 of the 3D LP is characterized by the types of responses that mention microscopic-level components (e.g., charged particles) either in the explanation or the model, but do not provide a complete causal account for how these components explain the phenomenon in question. For the transition level 1-2 response in Table 5, the model shows only observable components (paper, rod, fur), which is consistent with level 1 of the 3D LP. The explanation for the model states that rubbing causes the rod to become charged, therefore recognizing that charge is generated via contact. However, neither the model nor the explanation shows how rubbing causes the rod and fur to become charged using either point charges or electrons. The explanation further mentions attraction between charged particles in the paper (protons and electrons) and the rod, but does not provide any details on what causes the attraction and where the charged particles are located. Therefore, while the student might be recalling some terms and processes consistent with higher levels of the 3D LP (protons, electrons, charging through rubbing), these ideas are not used to explain how the rod becomes charged or why neutral paper is attracted to the rod. Hence, the model does not reflect the ability to integrate the three dimensions of NGSS consistent with level 2 of the 3D LP, but there are ideas and connections present that make this model more sophisticated than those at level 1 of the 3D LP. This model therefore represents an example of transitioning from level 1 to level 2. Similarly, transitioning from level 2 to 3 of the 3D LP is characterized by the types of models that provide incomplete or inaccurate microscopic-level causal accounts of phenomena.
For example, the sample transition level 2-3 response shown in Table 5 contains all but one microscopic-level detail necessary to provide a full causal account for the phenomenon in question. Specifically, the model explains, at the microscopic level, how the rod becomes charged (via transfer of electrons from atoms of the fur to the atoms of the rod during rubbing), but does not provide a microscopic causal explanation for why neutral paper is attracted to the charged rod, using heuristics instead (because charged and neutral objects attract). Table 6 provides examples of student answers that were considered to fall between the levels for the foil experiment item and explains why. For example, in the sample level 1/2 transitional response, the student is attempting to use unobservable components, such as charge and field, to explain the phenomenon, but the model is vague, and it is not clear from either the model or the explanation what the difference between a field and a charge is, and how both ideas are involved in explaining the observed pattern. Further, in the sample level 2/3 answer, the student shows an accurate model of the atom and uses ideas of fields in the explanation, but still reverts to a “hitting mechanism” when explaining the pattern, instead of using ideas related to interactions at a distance.

Table 2.5 Sample responses that fall between levels of the 3D LP for paper and rod
LP Level 1/2. Student explanation: after being rubbed with fur the rod becomes charged due to friction through rubbing. The paper has neutral charge. But the charged particles (protons and electrons) in the paper become attracted to the charged rod.
LP Level 2/3. Student explanation: rubbing causes electrons from the fur to go to the rod, making the atoms of the rod charged. Paper atoms are neutral, they have equal number of protons and electrons. Neutral paper attracts to the charged rod because neutral and charged objects attract. The closer the rod, the bigger the force.

Therefore, transition levels can be summarized as containing more relevant content (aspects of DCIs) but lacking application of the content for explaining phenomena. This reflects the nature of the 3D understanding the 3D LP aims to describe, which is characterized by achieving knowledge-in-use, or the ability to apply content to explain real-life situations. All in-between-level responses were assigned the lower level on the 3D LP as the final level for online responses because they did not contain all the aspects consistent with the higher level.

Table 2.6 Sample responses that fall between levels of the 3D LP for the foil experiment
LP Level 1/2. Student explanation: few particles bounced back. This is because when the particles inside the foil are scattered around, but sometimes they form a clump of particles that makes it so no other particle being shot at it can go through. When the particles clump together they make very strong electric field, like a charge, which repels the particles that are being shot at it.
Comment: the student is attempting to use microscopic-level ideas to explain the phenomenon, but confuses electric charge and field. The charges are not shown in the model, and it is not clear what the structure of the “clump of particles” is. Overall, the explanation and model represent a transition between purely observable macro-level thinking consistent with level 1 and elements of micro-level-based thinking consistent with level 2.
LP Level 2/3. Student explanation: particles bounce back if they come close to positive or negative charges in the atoms of the foil because there is a strong electric field around these particles. Depending on which side they hit the electric field, they might come out at an angle. They go through between empty space on the foil where there is no electric field.
Comment: the model shows electric charges as parts of atoms; the explanation uses ideas of field to explain the pattern. Both the model and the explanation are mostly at the microscopic level, consistent with level 3. However, the explanation is not accurate in that it says that alpha particles interact with both positive and negative charges in the foil. Also, it doesn’t use the idea of electric force, and instead uses a “hitting” mechanism to explain the interaction between the electric field of atoms and the alpha particles, which is consistent with level 2 of the LP. Finally, atoms are shown to take up most of the space in the foil, so the model would not explain why most particles went through undisturbed.

Consistency in Assigning Responses to 3D LP Level for Different Phenomena

Since students were asked to explain more than one phenomenon, it was possible to study students’ ability to transfer their 3D understanding to different contexts. Specifically, the foil experiment item is an example of an abstract phenomenon that students cannot directly observe, which makes it harder to model and explain. The foil experiment also contains more complex ideas and requires deeper understanding. On the other hand, the paper and rod item focuses on a more familiar, observable phenomenon. This difference in how familiar the phenomena were to students is evident in the levels of the answers provided for both scenarios in the interview. Table 7 shows the assignment of levels for each student on each interview item. Specifically, on the pretest, 14 students scored a level 1 and 3 scored between levels 1 and 2 of the 3D LP on the paper and rod item. With the foil experiment item, only 7 students scored a level 1 and 10 students scored a level 0 of the 3D LP. These results suggest that the abstract foil experiment was more difficult for students to model and explain. Overall, the majority of interviewed students demonstrated proficiency between levels 0 and 1 of the 3D LP on the pre-Unit 1 interview. Similarly, on the posttest, 13 students scored in level 2, 2 scored intermediate level 2/3, and only 2 students remained in level 1 of the 3D LP for the paper and rod item. For the foil experiment, on the other hand, only 6 students moved to level 2 (4 from level 1 and 2 from level 0), 6 moved to level 1, 2 moved to intermediate level 1/2, and 3 moved to intermediate level 2/3. These results suggest that while students develop quite sophisticated macroscopic-level understanding of relatively straightforward electrostatic phenomena like the attraction of neutral paper to a charged rod, they need more time and scaffolding to transition to the microscopic-level 3D understanding of electrical interactions required to explain abstract phenomena like the foil experiment, which involves more complex ideas.
Table 2.7 Student score/3D LP level for each interview phenomenon

Student | Paper and Rod, Pre-Unit 1 | Paper and Rod, Post-Unit 1 | Foil Experiment, Pre-Unit 1 | Foil Experiment, Post-Unit 1
A | 1/2 | 2 | 1 | 2
B | 1 | 2 | 0 | 1
C | 1 | 2 | 0 | 1
D | 1 | 2 | 1 | 2
E | 1 | 1 | 0 | 1
F | 1 | 1 | 0 | 1
G | 1/2 | 2/3 | 1 | 1/2
H | 1 | 2 | 1 | 2/3
I | 1 | 2 | 0 | 1/2
J | 1 | 2 | 1 | 2
K | 1 | 2 | 0 | 2
L | 1 | 2 | 0 | 1
M | 1 | 2 | 0 | 2
N | 1/2 | 2/3 | 1 | 2/3
O | 1 | 2 | 0 | 1
P | 1 | 2 | 0 | 2/3
Q | 1 | 2 | 1 | 2

Supporting the Validity of Levels of the 3D LP using IRT

In this section, Wright Maps resulting from fitting the graded response model (GRM) are used to show additional validity evidence for the 3D LP levels. Wright Maps show ability and item difficulties on the same axis (the y-axis) and items on the x-axis (Wilson, 2004; Wilson, 2009). The GRM is a polytomous item response model, used for items with more than two response categories, like the ones designed for this study. Under the GRM, each response category has its own difficulty parameter (Samejima, 1969). The interpretation of category difficulty under the GRM is the following: a student with ability level equal to the difficulty of a given response category has a fifty percent probability of scoring in that category or above, and fifty percent of scoring below it (Samejima, 1969). In order to use the Wright Map to gain validity evidence for the 3D LP, it is important to keep in mind that item category difficulty (which is on the same scale as ability) relates to the score in the rubric for that category, which in turn relates to a 3D LP level. Therefore, when looking at the Wright Map, we want to see whether the abilities that correspond to difficulties for various item response categories are consistent with those theoretically suggested by the rubric and the 3D LP.
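For reference, the fifty-percent interpretation above follows from the standard logistic form of the GRM; as a sketch (the estimation software may use an equivalent but differently written parameterization):

$$P(X_{ij} \ge k \mid \theta_i) = \frac{1}{1 + \exp\left[-a_j\left(\theta_i - b_{jk}\right)\right]}$$

where $\theta_i$ is the ability of student $i$, $a_j$ the discrimination of item $j$, and $b_{jk}$ the difficulty of response category $k$ of item $j$. Setting $\theta_i = b_{jk}$ gives $P = 0.5$, i.e., a fifty percent chance of scoring in category $k$ or above.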
It means that respondents with ability level above 1.05 are at level 1 of the 3D LP, and respondents with ability level below 1.05 are at level 0 of the 3D LP. Further, the cutoff for level 1-2 is 1.72 and has the same 108 interpretation as level 0-1 cutoff. It was calculated as the median item threshold on logit scale (Doherty et al., 2015). Since no scores corresponding to level 3 of the 3D LP were observed and thresholds for level 3 LP have not been determined, the cutoff for level 2-3 cannot be accurately determined. However, the highest threshold for level 2 is 2.43, and it is likely that level 3 ability level will be located close or slightly above that value. As seen in Figure 2, level 1 difficulties are well separated from level 2 difficulties. Specifically, no level 1 difficulty falls above the cut-off point for level 1, and no level 2 difficulty falls below the cutoff point for level 2. Therefore, all level 1 difficulties are located in approximately the same ability region and do not overlap any of the level 2 difficulties. This suggests that the progression of student understanding predicted by hypothetical 3D LP levels is supported by the data, which provides quantitative validity evidence piece for the 3D LP (Doherty et al., 2015; Wilson, 2004). Pre test Post test Level 1-2 Cutoff=1.72 Level 0-1 Cutoff=1.05 Level 3 Level 2 Level 1 Level 0 Figure 2.2 Wright map showing learning progression levels for unit 1 assessment items 109 Evaluating Student Learning based on unit 1 assessment The data for pre and post assessment was combined when fitting GRM model (see Appendix for details) in order to be able to compare how ability distributions change between pre and posttest. The Wright Map in figure 2 shows distribution of responses (Respondents) for pre and posttest on one graph. As you can see, both pre and post unit 1 contain significant number of respondents below 0 on the logit scale. These are respondents with missing data, for whom zeros were imputed at both time points. Respondents who did not provide any answer on pre and post-test still participated in the curriculum as can be seen from their work in Unit 1 saved in the online portal, and provided responses for assessment on subsequent units. Therefore, even though they had missing data for Unit 1 assessment, they were left in the sample to ensure that we can use their data to further investigate levels of the 3D LP when assessment data for subsequent units is analyzed. To check the extent of learning that occurred before and after Unit 1 was covered, Wald test was conducted to determine if the increase in the mean between pre and post-test was statistically significant. The mean increased from 0.067 to 0.375 on the logit scale between pre and post-test, and the Wald test showed that this increase was statistically significant (W=149.8, df=1, p>0.001), indicating that learning occurred between pre and posttest assessment for the entire sample of students. However, to better understand how the learning occurred in terms of student movement along the levels of the 3D LP, we need to look at the distribution of responses and compare pre and post unit assessment for each level of the 3D LP. Since the respondents who did not provide any answer on pre and post assessment introduce too much noise into the distribution, they were removed from the Wright Map to be able to see the degree of spread in learning for those students who provided the answers. 
This allows to draw more accurate 110 conclusions about student growth upon completion of Unit 1. Figure 3 below shows the Wright Map of reduced data for those who provided answers on pre and post assessment. Pre test Post test 3D LP level cutoffs Distribution maximum on pre and post test Average ability level for each threshold 2.05 1.59 1.32 1.21 Figure 2.3 Wright map showing distribution of respondents who provided answers on pre and post unit 1 test Observe, in Figure 3, that the majority of responses on both pre and posttests lie within level 1 of the 3D LP, but the distribution of responses within level 1 changes between pre and posttest. Specifically, maximum peak is observed for pretest at the value of 1.21, which corresponds to level 1 3D LP, and is located slightly below average level 1 threshold of 1.32. On posttest the peak at 1.21 gets smaller, and a new maximum peak emerges at 1.59, which is above average threshold 1 value. Therefore, clear movement towards high level 1 3D LP region is evident on the post test. Additionally, some responses are observed at level 2 of the 3D LP, compared to essentially no level 2 responses for pretest. This indicates that some respondents moved to level 2 upon completion of unit 1. Below, changes in percent distribution of responses on the Wright Maps for pre and posttest are discussed further. Figures 4 shows a separate Wright Map with relevant peaks and percentage of response distribution for the pretest. The distribution of student responses on pre-test contains 2 well- 111 defined peaks, one at the lower end of the logit scale, at -0.54, and the other one at a higher end of the logit scale at 1.21. The peak at -0.54 lies within level 0 of the 3D LP. It corresponds to only about 8% of the sample, indicating that very few students started very low on the 3D LP. Overall, about 41 % of the sample starts at level 0 of the 3D LP. Similarly, only about 2% of responses start in level 2 of the 3D LP. The majority of respondents on the pre-test, about 57%, lie within level 1 of the 3D LP. The distribution for pre-test peaks at ability level of 1.21, which is slightly below the average ability for level 1 thresholds (1.32). About 21 % of respondents in level 1 of the 3D LP are likely to score above average the threshold 1 value. Similarly, about 36% of respondents in level 1 are likely to score below average threshold 1 value. This indicates that the majority of respondents who start in level 1 of the 3D LP on pretest are not likely to score in level 1 of the 3D LP for all the items. Specifically, they are not likely to score in level 1 of the 3D LP for items 1 and 2 whose level 1 thresholds are located significantly above average level 1 threshold. Therefore, on pretest these respondents have not achieved the level of 3D understanding associated with ability level for these item categories. Items 1 and 2 belong to the foil testlet and focus on evaluating students’ ability to model and construct scientific explanation of particle deflection pattern observed in the Rutherford experiment. These items require ability to construct causal microscopic level accounts of relatively abstract phenomenon, and it is not surprising that the majority of students on posttest have not achieved that level of 3D thinking yet. This response pattern is also consistent with qualitative interviews, where 59% (10 out of 17) of interviewed students started at level 0, and the other 41% started in level 1 of the 3D LP for these items. 
This distribution within level 1 changes on the posttest as shown in Figure 5. In Figure 5, 112 on post Unit 1 assessment, the largest proportion of abilities, about 55%, still lies within level 1 of the 3D LP, but the distribution within level 1 changes. The peak value increased from 1.21 on pretest to 1.59 on posttest. On the posttest, the fraction of respondents above average threshold 1 becomes 34% as opposed to 21% on the pretest. Similarly, the fraction of respondents below average threshold 1 drops to 21% from 36% on the pretest. Additionally, the fraction of responses at level 0 of the 3D LP drops from 41% on pretest to 26% on the post test, and the fraction of responses at level 2 of the 3D LP goes up from 2 % on the pretest to 19 % on the post test. Out of 19% of respondents in level 2 of the 3D LP, about 4 % lie above the average threshold for level 2, and 15% lie below average threshold for level 2. This is in contrast to pretest where all 2% of responses observed at level 2 of the 3D LP lie below average threshold for level 2. Therefore, clear increase in fraction of responses at the higher ability region of level 1, and at level 2 of the 3D LP is evident on the post test. 113 3D LP level cutoffs Relevant distribution peaks (see text) Average ability level for each threshold Pre test Post test Figure 2.4 Wright map showing learning progression levels for unit 1 pretest assessment items and distribution of respondents for the relevant cut points for students who provided answers on both pre and posttest Figure 2.5 Wright map showing learning progression levels for unit 1 posttest assessment items and distribution of respondents for the relevant cut points for students who provided answers on both pre and posttest 114 Assigning Learning Progression level to individual students This section talks about how 3D LP can be used to accurately place student on a level, therefore allowing to use the validated 3D LP and the associated assessment as a diagnostic tool in the classroom. To assign a level on the 3D LP to each individual student, it is important to take into consideration measurement error associated with estimation of each proficiency level. This is especially important for students whose proficiency levels lie close to cut points for 3D LP levels, or provide answers consistent with in-between level assignment as was observed for the oral interviews. To do this, confidence interval (CI) for all proficiency estimates are calculated using one standard error in each direction (see Appendix for the R code). Wright Maps are further modified by arranging student proficiency in ascending order excluding students who had all zeroes on pre and/or post7. The modified Wright Maps for pre and posttest are shown in Figures 6 and 7 respectively. The curved black line shows proficiencies, and the grey band represents upper and lower interval bounds. The horizontal dashed lines represent cutoffs for 3D LP levels, and vertical lines show the area where confidence intervals overlap the cut points. If confidence intervals fall entirely into one of the 3D LP regions (for example, the first 655 students on pretest (Figure 6), and the first 601 students on the posttest (Figure 7)), these students are likely to provide answers consistent with level 0 of the 3D LP, and therefore should be assigned level 0 with high degree of confidence. 
Similarly, students 721-891 on pretest and students 647-796 on the posttest have confidence intervals that fall entirely into level 1 of the 3D LP, so these students can be assigned level 1. Finally, confidence intervals for the students 848-899 on the posttest fall entirely into level 2, and that those students can be assigned level 2 on the 3D LP. 7 The X axis of the Wright Maps shown in figures 6 and 7 was truncated to exclude students who had zeroes on pre and post assessment and highlight the graph better. 115 Level 0 Level 0-1 Level 1 Level 1-2 Level 2 Level 1-2 Level 0-1 Figure 2.6 Modified wright map for pre unit 1 test showing student proficiency estimates and standard error bands from lowest to highest Level 0 Level 0-1 Level 1 Level 1-2 Level 1-2 Level 0-1 Figure 2.7 Modified wright map for post unit 1 test showing student proficiency estimates and standard error bands from lowest to highest 116 However, sometimes the confidence intervals overlap the cut points for the 3D LP levels. For example, students 656-720 on the pretest, and students 602-646 on the posttest have confidence intervals that overlap level 0-1 cutoff, indicating that they are likely to provide answers consistent with in-between level assignment. In this case, there is less certainty about the 3D LP level assignment for these students. Similarly, students 892-899 on pretest all have confidence interval overlapping level 1-2 threshold, indicating that there is less certainty in placing these students in level 2 of the 3D LP. Overall, only 71 students on the pretest and 94 students on the posttest fall in between levels of the 3D LP, which corresponds to 8% and 10% respectively. Therefore, there is high degree of certainty in assigning a level on the 3D LP to individual students for the majority of the sample. To be exact, since the confidence interval was calculated using 1 standard error in each direction, we are 68% confident in placing each individual student on a level of the 3D LP. This provides evidence for validity of the 3D LP as a diagnostic tool that allows placing a student on a level with a high degree of accuracy, and use the information about what student understanding looks like in terms of the three dimensions (DCI, SEP, CCCs) at each given level to characterize their science proficiency. To the author’s knowledge, this is the first validated 3DLP that provides this degree of level assignment certainty and therefore applicability in terms of immediate pedagogical use. Discussion The Framework (NRC, 2012) outlines a novel way to teach and learn science grounded in a developmental approach that states that complex ideas in science take time and appropriate scaffolding to develop (Smith, Wiser, Anderson, & Krajcik, 2006). In practice, a developmental approach is reflected in the idea of a learning progression, which describes increasingly more 117 sophisticated steps towards mastering understanding of a given construct (Duschl et al., 2007). The Framework outlines theoretical learning progressions across grades grounded in relevant disciplinary and educational research. The unique feature of learning progressions described in the Framework is that they focus on the three dimensions of science: DCIs, SEPs, CCCs (NRC, 2012). Integrating the three dimensions when explaining phenomena and solving problems fosters knowledge application ability also called knowledge-in-use, which is indicative of deep understanding of science (Pellegrino, Hilton, 2012). 
Knowledge-in-use is achieved through situated cognition, or engagement in applying the content being studied to real life situations (Pellegrino & Hilton, 2012). In the language of the Framework this is equivalent to being engaged in 3D learning, or developing ability to integrate the three dimensions of science to solve real life problems. According to the Framework, it takes time and appropriate scaffolding to foster this ability in students (NRC, 2012). Learning progressions described in the Framework reflect increasing level of understanding of the three dimensions and have potential to guide educators in supporting students to develop knowledge-in-use. However, these learning progressions have only been described in theory (for example, CCCs, DCIs and SEPs progressions in the Framework). Their level of detail is very general, and their practical applicability in guiding educational process is very narrow. Moreover, while the Framework talks about the need to integrate the three dimensions in curriculum, instruction and assessment to foster knowledge application, learning progression for the three dimensions are still presented separately in the Framework (for example, CCCs, DCIs and SEPs progressions in the Framework). This is because integrating of the three dimensions in practice is still a vague concept. In fact, so far, there has been no research reported that demonstrates the feasibility of developing a validated learning progression that 118 describes the three dimensions, and can be used to accurately place individual students on a level for a large-scale sample in practice. Developing this kind of learning progression requires carefully specifying the aspects of the three dimensions at each level of sophistication, developing assessment tool capable of probing each level, and designing a reporting system based on well-aligned scoring rubric and LP levels that can be easily used to place each individual student response on a level of the LP. The work described in this paper has achieved all of these requirements and provides first-hand example of a learning progression that describes aspects of all the three dimensions at increasing levels of sophistication without separating them, and uses 3D assessment tool to probe the levels of 3D LP and place each individual student on a level with high degree of confidence (68% confidence to be exact). The assessment developed for probing the levels of the 3D LP requires student to apply three dimensions of NGSS to construct causal accounts of electrostatic phenomena. The resulting validity argument provides rich source of information about what student 3D understanding looks like at various levels of sophistication from macro to micro scale. It also provides insights into how we can support students towards achieving higher levels of the 3D LP. Therefore, this work is a valuable contribution to research on design and validation of NGSS-aligned learning progressions because it expands our understanding of how to track, describe and measure 3D learning described in the Framework in practice. There are several major takeaways that this work aims to highlight, and they are discussed further. The first takeaway is based on analysis of student interview data showing that transitioning to higher level of 3D LP requires ability to apply relevant aspects of DCIs to explain phenomenon. Simple recollection of a large number of facts related to a specific DCI does not always translate to the ability to apply them when explaining phenomena. 
Evidence of this is seen in transition-level interview responses. Specifically, these responses tend to contain large numbers of relevant facts, including details about the structure of the atom or heuristics like "neutral and charged objects attract", but the models and explanations do not incorporate these ideas into mechanistic causal explanations. For example, the models fail to explain how components of the atoms relate to the phenomenon, or why neutral objects are attracted to charged ones. Therefore, these responses lack application of the content for explaining phenomena and do not demonstrate the ability to integrate the three dimensions of science that is consistent with higher levels of the 3D LP. For educators, this means that when evaluating student learning in an NGSS classroom, we need to be careful not to confuse student memorization ability with knowledge application ability, which provides an indicator of deep conceptual understanding. Otherwise, our educational efforts risk falling back into old ways. The second takeaway from our study, as suggested by analysis of student interviews, is that it is not possible to transition to a higher level of the 3D LP without being able to apply relevant DCIs at the microscopic level. This is consistent with previous research suggesting that microscopic-level understanding is indicative of deep conceptual understanding of science (NRC, 2012; Smith et al., 2006; Stevens et al., 2010; Stevens et al., 2009). The research presented here shows that the distinctive feature of a truly causal mechanistic explanation lies in the microscopic-level detail and in specifying all important relationships between components of the system. This is seen in the data presented here in several forms. First, complete causal models for the paper and rod item require that all components of the model have full microscopic detail, and that causal relationships between all the components be specified. For example, in the context of the paper and rod item, to provide a complete causal account of why neutral paper is attracted to the charged rod, one should specify the structure and components of the atoms that make up the paper and the rod, and explain how these components interact with each other to cause the observed attraction. Without this level of microscopic detail, the explanation is not fully causal and is likely to be based on memorized heuristics (for example, "because neutral and charged objects attract" as a way to explain the attraction between neutral paper and a charged rod). Therefore, students should be given opportunities to practice constructing micro-level accounts. At the same time, it is important to distinguish between memorization and knowing. Specifically, even if students do not provide a full causal account of a phenomenon in their answer, it does not mean that everything they say is merely memorized. This is especially relevant for level 2 answers, where students do not provide a full causal microscopic-level account but still demonstrate considerable ability to develop causal accounts and apply their understanding to explain the phenomenon. The third takeaway is connected to the previous one and suggests that developing higher sophistication in SEPs and CCCs is not possible without knowledge of relevant DCIs, and vice versa. While this work shows only preliminary evidence for this assertion, it shows that student ability to develop models and construct causal accounts is directly related to the degree of their familiarity with relevant DCIs.
For both interview items, a clear pattern emerges: higher-level models contain more DCIs, which are used to develop causal mechanistic models with all relevant components connected at the microscopic level and directly related to explaining the phenomena. Interview analysis suggests that if students lack knowledge of relevant DCIs, their models are incomplete and lack causal mechanistic accounts. This finding is consistent with previous research on the interconnectedness of content knowledge and practice (Catley, Lehrer, & Reiser, 2005; Songer, 2006). There is a considerable amount of research showing that content and practices (the latter also called reasoning skills) develop in concert (Gotwals & Songer, 2006; Duschl et al., 2007). It is therefore important to develop and validate learning progressions that combine aspects of all three dimensions, including content (DCIs), SEPs, and CCCs, in order to gauge the development of 3D understanding across time. The fourth takeaway from our study suggests that the ability to provide a microscopic-level causal explanation might depend on the context. Analysis of the interview data presented here shows that student response patterns for the paper and rod item and the foil experiment item were slightly different. Specifically, there was a larger number of student responses at higher levels of the 3D LP for the paper and rod item than for the foil experiment item. This finding might have to do with the fact that the paper and rod item focuses on a more familiar phenomenon that is directly observed in the video, while the foil experiment item is abstract and hard to visualize. This finding is consistent with the vision of the Framework, which builds on the idea that knowledge is situated. Previous research suggests that novices tend to have a more fragmented knowledge structure, which in turn translates into different levels of demonstrated ability in solving science problems depending on the context (Chi, Glaser, & Rees, 1981; Sabella & Redish, 2007). In the case of the data presented here, the foil experiment item represents a more complex context than the paper and rod item, and it also requires understanding the structure of the atom at a deeper level, so students have more difficulty applying their fragmented understanding of electrical interactions to this more challenging context involving more complex ideas (Sabella & Redish, 2007). It is therefore extremely important to make sure that, in NGSS classrooms, teachers consistently link the concepts being taught across different contexts and explicitly point out similarities in relation to key concepts across contexts. This will help students transition from the fragmented science understanding of novices to the more uniform and integrated understanding of real scientists. Finally, the fifth and last takeaway of our study has to do with the developmental nature of student understanding and the idea that deep, integrated understanding of science takes time and appropriate scaffolding to develop (Smith et al., 2006). Evidence of this is seen in student interviews, where more high-level responses are observed by the end of unit 1, and student answers fall on a spectrum from less to more sophisticated levels of understanding. A similar pattern holds for the analysis of student written responses using IRT.
Further, the fact that none of the students reached level 3 of the 3D LP by the end of unit 1 indicates that it takes a long time before students develop the microscopic-level causal mechanistic reasoning consistent with the highest level of the 3D LP. This suggests that students need substantial support and many opportunities to engage in 3D learning and to practice constructing causal models and explanations at the microscopic level in order to transition to higher levels of the 3D LP.

Study limitations and future research

The data contained a considerable number of missing values that were replaced with "0". Since students were given an unlimited amount of time to finish the assessments, the researchers assumed that if no answer was provided, the student did not know the answer. The considerable number of zeroes, which is also reflected in the considerable number of responses located at the lower end of the ability spectrum, might indicate that there were not enough items to measure lower ability levels. It would be beneficial to add items that measure the lower end of the ability spectrum to better describe student 3D understanding in that region. Another limitation is that level 3 responses were not observed, which indicates that students do not develop level 3 understanding by the end of unit 1. It would be useful to include some of the Unit 1 assessment items on future unit tests to investigate at what point level 3 responses appear as students progress through the curriculum. It could also be worthwhile to adjust instruction during Unit 1 to emphasize certain ideas and see whether level 3 responses are then observed upon completion of the unit.

APPENDIX

Testing Competing Item Response Theory (IRT) Models

The items on the Unit 1 assessment have four ordinal response categories, where each category corresponds to a level of the 3D LP. Specifically, the 0, 1, 2, and 3-point response categories on each item correspond to 3D LP levels 0, 1, 2, and 3, as can be seen from the examples of scoring rubrics above. Common IRT models for polytomous items are the Graded Response Model (GRM; Samejima, 1969) and the Generalized Partial Credit Model (GPCM; Muraki, 1992). To choose an appropriate IRT model to represent the data in this study, model fits for the GRM and GPCM were compared. To ensure a more accurate representation of the data, and to be able to compare student learning on the pre and post assessments, the pre and post assessment data were combined into a single model specification estimated with both the GRM and the GPCM. Item slopes and the corresponding item intercepts were constrained to be equal across the pre and post assessments for each item. This rigid model specification was safe to assume because the dimensionality and longitudinal invariance of the Unit 1 assessment instrument were extensively studied a priori (Chapter 1). The results of that study showed that the Unit 1 assessment scale is one-dimensional and that partial measurement invariance holds over time for the pre and post assessments. The R code for the GRM and GPCM analyses is provided in this appendix. The results of IRT model estimation are shown in Table 2.8.

Table 2.8 Model comparison for GPCM and GRM

Model   LL      # par   AIC     BIC     M2    df    P value   RMSEA    CFI/TLI
GPCM    -4992   51      10038   10168   516   109   <0.001    0.0645   0.983/0.982
GRM     -4957   51      9968    10098   488   109   <0.001    0.0622   0.983/0.984

A larger log-likelihood (equivalently, a smaller deviance), together with smaller AIC and BIC values, indicates a better-fitting model (Nering & Ostini, 2011; Toland, 2014). Based on these indexes, the GRM is a slightly better-fitting model for this data sample.
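As an aside, the comparison in Table 2.8 can be reproduced with a few lines of mirt code. The sketch below is illustrative only: resp and spec are placeholder names for the combined pre/post response matrix and the constrained model specification given in the R code at the end of this appendix, not objects from the original analysis.

library(mirt)

# Fit both candidate models to the same data and constrained specification
grm  <- mirt(resp, spec, itemtype = "graded", verbose = FALSE)
gpcm <- mirt(resp, spec, itemtype = "gpcm",   verbose = FALSE)

# Information criteria: larger log-likelihood and smaller AIC/BIC favor a model
sapply(list(GRM = grm, GPCM = gpcm), function(m)
  c(logLik = extract.mirt(m, "logLik"),
    AIC    = extract.mirt(m, "AIC"),
    BIC    = extract.mirt(m, "BIC")))

# Limited-information overall fit: M2, RMSEA, CFI/TLI (Maydeu-Olivares & Joe, 2005)
M2(grm)
M2(gpcm)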
Further, the M2 goodness-of-fit statistic was used to evaluate overall model fit (Maydeu-Olivares & Joe, 2005). Smaller M2 values also indicate better model fit (Toland, 2014), and, following this guideline, the GRM again fits the data better than the GPCM. The p-values for both the GPCM and GRM indicate lack of fit. However, lack of fit on the M2 statistic is common when fitting parametric models like the GPCM and GRM to real data (Cai, Maydeu-Olivares, Coffman & Thissen, 2006; Toland, 2014). Therefore, additional model fit indexes were used, including RMSEA and CFI/TLI. Cut-off criteria for good and reasonable model fit were RMSEA < 0.06 and < 0.08, respectively, and CFI/TLI > 0.95 and > 0.90, respectively (Hu & Bentler, 1999; Marsh, Hau & Wen, 2004; Van Dam, Earleywine & Borders, 2010). Based on the RMSEA and CFI/TLI values presented in Table 2.8, the GRM and GPCM have similar model fit: RMSEA for both models is marginally good, and the CFI/TLI indexes represent good model fit. Therefore, based on all of this information, the GRM appears to be the more suitable model for the data and was used further to evaluate model assumptions and obtain item parameters.

Evaluating GRM model assumptions

IRT model assumptions were further evaluated for the GRM following Toland (2014). As mentioned above, unidimensionality and partial measurement invariance were established for the measurement instrument in the previous study (Chapter 1). The assumption of local independence is tested below. Local independence (LI) assumes that student responses on the test are influenced only by their level on the latent trait continuum of interest. The LI assumption is very important for IRT analysis because, if violated, item parameters become distorted, including inflated slopes and more homogeneous thresholds across items (Toland, 2014). In the context of NGSS, the assumption of local independence becomes increasingly harder to meet because 3D assessments call for more contextualized, story-based items where students can use all the information available to them to demonstrate knowledge application ability (Gorin & Mislevy, 2013). These items often take the form of testlets, as is the case for the Unit 1 assessment instrument used here, which makes it especially difficult to meet the assumption of local independence because items within a testlet share more commonalities than items across testlets. This might lead to increased dimensionality and violation of the LI assumption (Gorin & Mislevy, 2013). To evaluate the LI assumption in this study, the Q3 index was used with a cut-off value of |0.2| (Kim, De Ayala, Ferdous & Nering, 2011). This index and cut-off value have an acceptable Type I error rate, and the index is substantially more powerful than the commonly used X2 and G2 LD indexes (Chen & Thissen, 1997). Further, it is also recommended that the 0.2 cut-off value be used in a relative way, to determine what counts as a "large" residual correlation relative to the other residual correlations in the model (R. P. Chalmers, personal communication). Following these guidelines, the Q3 statistic was used to evaluate the local independence assumption. The Q3 matrix is shown in Figure 2.8; only values above 0.2 are shown. Most residual correlations were below the cut-off value of 0.2 in absolute value, and there were no residual correlations that were unusually high relative to the others. Specifically, the largest residual correlation in absolute value was -0.36, between items 4 and 7 on the pre-test.
A slightly elevated residual correlation is not surprising for items 4 and 7 because these items belong to the same testlet. However, this correlation is not unreasonably high compared with the other values, and most of the correlations are below the cut-off value of 0.2. Therefore, there is enough evidence to conclude that the assumption of local independence is met.

Figure 2.8 Q3 matrix

Model-Data Fit

Once an IRT model has been chosen and the model assumptions evaluated, it is appropriate to evaluate how well the GRM fits the data and to obtain the item parameters that will be used in validating the levels of the 3D LP.

Item-level fit. To assess how well the GRM fits each item, the S-X2 item fit statistic for polytomous data was examined (Orlando & Thissen, 2000; Orlando & Thissen, 2003). A statistically significant p-value indicates that the model does not fit a given item. Item fit was evaluated using a 1% significance level together with RMSEA values, because evaluating item fit with the S-X2 statistic involves testing multiple hypotheses, and larger samples lead to a greater likelihood of statistically significant results (Stone & Zhang, 2003; Toland, 2014). The S-X2 item fit statistics are shown in Table 2.9 below. Items 3 and 5 on the pre-test and items 1, 3 and 5 on the post-test have p-values < 0.01, indicating poor model fit for these items. Since larger samples lead to a greater likelihood of statistically significant results, the RMSEA values for these items were also examined. As can be seen from Table 2.9, all RMSEA values are below 0.06, indicating good model fit. Therefore, the GRM fits each item reasonably well.

Table 2.9 S-X2 item fit statistics

Item    Q1T1    Q2T1    Q3T1    Q4T1    Q5T1    Q6T1    Q7T1    Q8T1
S-X2    35.002  15.999  40.075  21.604  42.847  33.622  38.575  20.155
df      22      12      21      13      21      25      22      13
RMSEA   0.026   0.019   0.032   0.000   0.034   0.020   0.029   0.025
p       0.039   0.191   0.007   0.544   0.003   0.116   0.016   0.091

Item    Q1T2    Q2T2    Q3T2    Q4T2    Q5T2    Q6T2    Q7T2    Q8T2
S-X2    74.964  36.894  37.150  36.313  47.552  28.418  26.051  25.723
df      29      21      19      23      19      25      21      23
RMSEA   0.042   0.029   0.033   0.025   0.041   0.012   0.016   0.011
p       0.000   0.017   0.008   0.038   0.000   0.289   0.204   0.314

Person-level fit. To evaluate the consistency of student reasoning across the different contexts represented in the items, the person fit (Zh) statistic was examined (Drasgow, Levine & Williams, 1985). The Zh distribution across the pre and post assessment events for all students is shown in Figure 2.9 below.

Figure 2.9 Person fit Zh statistics

The value of -1.96 was used as a cut-off for the Zh statistic, where students with Zh values above -1.96 show regular response patterns (Drasgow et al., 1985; Felt, Castaneda, Tiemensma & Depaoli, 2017). Figure 2.9 shows that the majority of students are above the cut-off value of -1.96 (dashed line), suggesting that the majority of the sample produced responses consistent with those hypothesized by the 3D LP levels. This provides evidence towards the validity of the hypothesized 3D LP levels (Doherty, Draney, Shin, Kim & Anderson, 2015).

R Code

library(mirt)       # For fitting IRT models
library(foreign)    # For importing SPSS data file
library(WrightMap)  # For Wright maps
library(ggplot2)    # For histograms

Model fit evaluation: Unit 1 pre/post test. Items 1-8 represent the Unit 1 pre-test items, items 9-16 represent the Unit 1 post-test items. Pre and post test items are identical.
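A brief added note on the model statement that follows (this commentary is not part of the original analysis script):

# The specification below defines one latent factor per testing occasion
# (F1 = pre-test items 1-8, F2 = post-test items 9-16) and constrains each
# item's slope (a1/a2) and category intercepts (d1, d2) to be equal across
# occasions, placing pre- and post-test ability estimates on a common scale.
# MEAN and COV free the latent means and the factor covariance.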
Model Statement

FAmodelU1pre_post <- mirt.model('F1 = 1, 2, 3, 4, 5, 6, 7, 8
F2 = 9, 10, 11, 12, 13, 14, 15, 16
CONSTRAIN = (1,9, a1, a2), (2,10, a1, a2), (3,11, a1, a2), (4,12, a1, a2), (5,13, a1, a2), (6,14, a1, a2), (7,15, a1, a2), (8,16, a1, a2), (1,9, d1), (2,10, d1), (3,11, d1), (4,12, d1), (5,13, d1), (6,14, d1), (7,15, d1), (8,16, d1), (1,9, d2), (2,10, d2), (3,11, d2), (4,12, d2), (5,13, d2), (6,14, d2), (7,15, d2), (8,16, d2)
MEAN = F1, F2
COV = F1*F2')

Model Estimation

pre.items <- c("U1T1","U2T1","U3T1","U4T1","U5T1","U6T1","U7T1","U8T1")
post.items <- c("U1T2","U2T2","U3T2","U4T2","U5T2","U6T2","U7T2","U8T2")
all.items <- c(pre.items, post.items)

GRM Model

modgrmU1pre_post <- mirt(newdataU1pre_post[all.items], FAmodelU1pre_post, itemtype="graded", verbose=FALSE, SE=TRUE)
modgrmU1pre_post  # to get AIC/BIC parameters
M2(modgrmU1pre_post, impute=20, CI=.95)  # to get model fit CFI/TLI, RMSEA

GPCM Model

modgpcmU1pre_post <- mirt(newdataU1pre_post[all.items], FAmodelU1pre_post, itemtype="gpcm", verbose=FALSE, SE=TRUE)
modgpcmU1pre_post  # to get AIC/BIC parameters
M2(modgpcmU1pre_post, impute=20, CI=.95)  # to get model fit CFI/TLI, RMSEA

Item analysis with the chosen model (GRM)

Model diagnostics

# Residual diagnostics
residuals(modgrmU1pre_post, type="Q3", suppress=.2)  # To evaluate local independence (LI); only shows pairs with |Q3|>0.2 (possible LI issue)

# Item fit diagnostics
print(item.fit <- itemfit(modgrmU1pre_post, fit_stats="S_X2"))  # To evaluate item fit (cutoff: p<0.01)

# Person fit diagnostics
person.fit <- personfit(modgrmU1pre_post, method="ML")  # To evaluate person fit (Zh stats)
ggplot(person.fit, aes(x=Zh)) +
  geom_histogram(bins=15, colour="black", fill="white") +
  geom_vline(xintercept=-1.96, col="black", linetype="dashed") +
  labs(x="Zh statistic", y="Count") +
  theme_bw(base_size=12) +
  theme_classic()  # histogram of Zh stats (above -1.96 indicates good person fit)

Item parameters and thresholds

item.par <- data.frame(coef(modgrmU1pre_post, simplify=TRUE)$items)  # Item parameters
item.par$T1 <- with(item.par, ifelse(a1>0, -d1/a1, -d1/a2))  # a1 = discrimination; difficulty = (-d/a)
item.par <- item.par[1:8,]  # Keep the first 8 rows; the remaining rows are time-two items with parameters equal to time one
item.par$T2 <- with(item.par, ifelse(a1>0, -d2/a1, -d2/a2))
mean.T1 <- mean(item.par$T1)  # Mean threshold 1
mean.T2 <- mean(item.par$T2)  # Mean threshold 2
t0_1 <- min(item.par$T1)  # cut-off for level 0-1
t1_2 <- median(c(item.par$T1, item.par$T2))  # cut-off for level 1-2
t2_3 <- max(item.par$T2)  # cut-off for level 2-3
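# Added note (not in the original script): under mirt's slope-intercept
# parameterization, each graded-response boundary curve is
# P(X >= k) = 1 / (1 + exp(-(a*theta + d_k))), so the boundary location on
# the theta scale is b_k = -d_k / a. The T1 and T2 columns above apply this
# transformation, and the 3D LP cut points are then taken from the minimum,
# median, and maximum of these boundary locations.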
Ability Wright Maps

# Compute factor scores (y-axis for the Wright map)
AbilityU1Pre_Post <- data.frame(fscores(modgrmU1pre_post))
# Add ability scores to the data file
fulldata <- data.frame(cbind(newdataU1pre_post, AbilityU1Pre_Post))
# Merge students who have complete data with the fulldata file to create the reduced-sample file
reducedata <- merge(U1P2_STUID_allstudentscompletedata, fulldata, by.x = "STUID")

# Complete sample data
wrightMap(with(fulldata, cbind(F1, F2)), matrix(c(item.par$T1, item.par$T2), ncol=2),
  person.side=personDens, cutpoints=c(t0_1, t1_2, mean.T1, mean.T2), min.l=-.9, max.l=2.5)

# Reduced sample data
wrightMap(with(reducedata, cbind(F1, F2)), matrix(c(item.par$T1, item.par$T2), ncol=2),
  person.side=personDens, cutpoints=c(t0_1, t1_2, mean.T1, mean.T2), min.l=-.9, max.l=2.5)

Finding peaks on the reduced-sample Wright map and % of examinees in each level of the 3D LP

# Functions to calculate percentiles for given cut-offs
pct_pre <- ecdf(reducedata$F1)   # Percentile function for the pre-test
pct_post <- ecdf(reducedata$F2)  # Percentile function for the post-test

# Thresholds for levels 0-1 and 1-2 of the 3D LP
t0_1     # lowest difficulty 1
t1_2     # median of the difficulty 1 and difficulty 2 values
mean.T1  # average difficulty 1
mean.T2  # average difficulty 2

# Percentage of examinees between thresholds
# Pre-test
pct_pre(t0_1)  # % prob. density below the level 1 cutoff (% prob. density in level 0 of the 3D LP)
pct_pre(t1_2)  # % prob. density below the level 2 cutoff
pct_pre(t1_2)-pct_pre(t0_1)  # 57% between the level 1 and level 2 cutoffs on the pre-test (% prob. density in level 1 of the 3D LP)
pct_pre(mean.T1)  # % prob. density below average difficulty 1 (1.32)
pct_pre(t1_2)-pct_pre(mean.T1)  # 21% between average difficulty 1 and the level 2 cutoff (% prob. density in level 1 of the 3D LP above average difficulty 1)
(pct_pre(t1_2)-pct_pre(t0_1))-(pct_pre(t1_2)-pct_pre(mean.T1))  # 36% between the level 1 cutoff and average difficulty 1 (% prob. density in level 1 of the 3D LP below average difficulty 1)
pct_pre(mean.T2)  # % prob. density below average difficulty 2 (2.02)

# Post-test
pct_post(t0_1)  # % prob. density below the level 1 cutoff (% prob. density in level 0 of the 3D LP)
pct_post(t1_2)  # % prob. density below the level 2 cutoff
pct_post(t1_2)-pct_post(t0_1)  # 55% between the level 1 and level 2 cutoffs on the post-test (% prob. density in level 1 of the 3D LP)
pct_post(mean.T1)  # % prob. density below average difficulty 1 (1.32)
pct_post(t1_2)-pct_post(mean.T1)  # 34% between average difficulty 1 and the level 2 cutoff (% prob. density in level 1 of the 3D LP above average difficulty 1)
(pct_post(t1_2)-pct_post(t0_1))-(pct_post(t1_2)-pct_post(mean.T1))  # 21% between the level 1 cutoff and average difficulty 1 (% prob. density in level 1 of the 3D LP below average difficulty 1)
pct_post(mean.T2)  # % prob. density below average difficulty 2 (2.02)
pct_post(mean.T2)-pct_post(t1_2)  # 15% between the level 2 cutoff and average difficulty 2 (% prob. density in level 2 of the 3D LP below average difficulty 2)
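# Added note (not in the original script): ecdf() returns the empirical
# cumulative distribution function of the ability estimates, so pct_pre(c)
# gives the proportion of examinees whose pre-test estimate falls below a
# cut point c; differencing the ecdf at two cut points gives the share of
# the sample located within a given 3D LP level band.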
Determine density peak values

# Peak values for the pre-test
print(pre_peak1 <- density(reducedata$F1)$x[which.max(density(reducedata$F1)$y)])  # pre-test peak (larger peak)
print(pre_peak2 <- density(reducedata$F1[which(reducedata$F1<0.5)])$x[which.max(density(reducedata$F1[which(reducedata$F1<0.5)])$y)])  # smaller peak, for pre-test values below 0.5

# Peak values for the post-test
print(post_peak1 <- density(reducedata$F2)$x[which.max(density(reducedata$F2)$y)])  # post-test peak
print(post_peak2 <- density(reducedata$F2[which(reducedata$F2<0.5)])$x[which.max(density(reducedata$F2[which(reducedata$F2<0.5)])$y)])  # second peak, for post-test values below 0.5
print(post_peak3 <- density(reducedata$F2[which(reducedata$F2<1.5)])$x[which.max(density(reducedata$F2[which(reducedata$F2<1.5)])$y)])  # third peak, for post-test scores in the level 1 3D LP region
print(post_peak4 <- density(reducedata$F2[which(reducedata$F2>1.7)])$x[which.max(density(reducedata$F2[which(reducedata$F2>1.7)])$y)])  # fourth peak, for post-test scores in the level 2 3D LP region

Ascending Ability Wright Maps

# Create factor scores with standard errors (UB = upper bound, LB = lower bound)
fulldata_with_SE <- cbind(newdataU1pre_post, data.frame(fscores(modgrmU1pre_post, full.scores.SE=TRUE)))
fulldata_with_SE$UBF1 <- fulldata_with_SE$F1 + fulldata_with_SE$SE_F1
fulldata_with_SE$LBF1 <- fulldata_with_SE$F1 - fulldata_with_SE$SE_F1
fulldata_with_SE$LBF2 <- fulldata_with_SE$F2 - fulldata_with_SE$SE_F2
fulldata_with_SE$UBF2 <- fulldata_with_SE$F2 + fulldata_with_SE$SE_F2

# Create variables flagging students whose CI overlaps each LP cutoff (pre-test)
fulldata_with_SE$LP0_1_F1 <- ifelse(fulldata_with_SE$LBF1 <= t0_1 & fulldata_with_SE$UBF1 >= t0_1, 1, 0)
fulldata_with_SE$LP1_2_F1 <- ifelse(fulldata_with_SE$LBF1 <= t1_2 & fulldata_with_SE$UBF1 >= t1_2, 1, 0)

# Create variables flagging students whose CI overlaps each LP cutoff (post-test)
fulldata_with_SE$LP0_1_F2 <- ifelse(fulldata_with_SE$LBF2 <= t0_1 & fulldata_with_SE$UBF2 >= t0_1, 1, 0)
fulldata_with_SE$LP1_2_F2 <- ifelse(fulldata_with_SE$LBF2 <= t1_2 & fulldata_with_SE$UBF2 >= t1_2, 1, 0)

# Find smallest lower-bound score of F1 (pre-test)
LB_LP0_1_pre_stu <- max(fulldata_with_SE$LBF1[which(fulldata_with_SE$LP0_1_F1==1)])
print(LB_L0_pre <- which(sort(fulldata_with_SE$LBF1)==LB_LP0_1_pre_stu))  # students below level 1 of the LP
LB_LP1_2_pre_stu <- max(fulldata_with_SE$LBF1[which(fulldata_with_SE$LP1_2_F1==1)])
print(LB_L1_pre <- which(sort(fulldata_with_SE$LBF1)==LB_LP1_2_pre_stu))  # students below level 2 of the LP

# Find smallest lower-bound score of F2 (post-test)
LB_LP0_1_post_stu <- max(fulldata_with_SE$LBF2[which(fulldata_with_SE$LP0_1_F2==1)])
print(LB_L0_post <- which(sort(fulldata_with_SE$LBF2)==LB_LP0_1_post_stu))  # students below level 1 of the LP
LB_LP1_2_post_stu <- max(fulldata_with_SE$LBF2[which(fulldata_with_SE$LP1_2_F2==1)])
print(LB_L1_post <- which(sort(fulldata_with_SE$LBF2)==LB_LP1_2_post_stu))  # students below level 2 of the LP

# Find highest upper-bound score of F1 (pre-test)
UB_LP0_1_pre_stu <- min(fulldata_with_SE$UBF1[which(fulldata_with_SE$LP0_1_F1==1)])
print(UB_L0_pre <- which(sort(fulldata_with_SE$UBF1)==UB_LP0_1_pre_stu))  # students below level 1 of the LP
UB_LP1_2_pre_stu <- min(fulldata_with_SE$UBF1[which(fulldata_with_SE$LP1_2_F1==1)])
print(UB_L1_pre <- which(sort(fulldata_with_SE$UBF1)==UB_LP1_2_pre_stu))  # students below level 2 of the LP
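# Added note (not in the original script): LB and UB are the ability
# estimate minus/plus one standard error, so a student whose [LB, UB] band
# straddles a cut point (t0_1 or t1_2) cannot be assigned a single 3D LP
# level at the ~68% confidence implied by a +/-1 SE band (for normal errors,
# pnorm(1) - pnorm(-1) is approximately 0.683). The indicator variables
# above flag exactly these "in-between" students counted in the text.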
# Find highest upper-bound score of F2 (post-test)
UB_LP0_1_post_student <- min(fulldata_with_SE$UBF2[which(fulldata_with_SE$LP0_1_F2==1)])
print(UB_L0_post <- which(sort(fulldata_with_SE$UBF2)==UB_LP0_1_post_student))  # students below level 1 of the LP
UB_LP1_2_post_student <- min(fulldata_with_SE$UBF2[which(fulldata_with_SE$LP1_2_F2==1)])
print(UB_L1_post <- which(sort(fulldata_with_SE$UBF2)==UB_LP1_2_post_student))  # students below level 2 of the LP

# Number of students in the overlap region for each cutoff on the pre-test
LB_L0_pre-UB_L0_pre  # 66 students between levels 0 and 1
LB_L1_pre-UB_L1_pre  # 7 students between levels 1 and 2

# Number of students in the overlap region for each cutoff on the post-test
LB_L0_post-UB_L0_post  # 46 students between levels 0 and 1
LB_L1_post-UB_L1_post  # 50 students between levels 1 and 2

# Sort data by ability score (pre-test)
sort_pre <- fulldata_with_SE[order(fulldata_with_SE$F1),]
sort_pre <- data.frame(x=seq(nrow(sort_pre)), F1=sort_pre$F1, lwr=sort_pre$LBF1, upr=sort_pre$UBF1)

# Sort data by ability score (post-test)
sort_post <- fulldata_with_SE[order(fulldata_with_SE$F2),]
sort_post <- data.frame(x=seq(nrow(sort_post)), F2=sort_post$F2, lwr=sort_post$LBF2, upr=sort_post$UBF2)

# Ascending ability Wright map for the pre-test
plot(sort_pre$F1, xlab="Persons", ylab="Ability", pch=16, ylim=c(-2, 3.1), xlim=c(558, 900), cex=0.5)
with(sort_pre, polygon(c(x, rev(x)), c(lwr, rev(upr)), col="grey75", border=FALSE))
matlines(sort_pre[,1], sort_pre[,-1], lwd=c(1,1), lty=1, col=c("black","black","black"), type=c("p","l","l"), cex=0.4, pch=16)
abline(h=c(t0_1, t1_2), lty=2, v=c(LB_L0_pre, LB_L1_pre, UB_L0_pre, UB_L1_pre))

# Ascending ability Wright map for the post-test
plot(sort_post$F2, xlab="Persons", ylab="Ability", pch=16, ylim=c(-2, 3.1), xlim=c(558, 900), cex=0.5)
with(sort_post, polygon(c(x, rev(x)), c(lwr, rev(upr)), col="grey75", border=FALSE))
matlines(sort_post[,1], sort_post[,-1], lwd=c(1,1), lty=1, col=c("black","black","black"), type=c("p","l","l"), cex=0.4, pch=16)
abline(h=c(t0_1, t1_2), lty=2, v=c(LB_L0_post, LB_L1_post, UB_L0_post, UB_L1_post))

BIBLIOGRAPHY

Alonzo, A. C., & Gotwals, A. W. (Eds.). (2012). Learning progressions in science: Current challenges and future directions. Springer Science & Business Media.
Alonzo, A. C., & Steedle, J. T. (2009). Developing and assessing a force and motion learning progression. Science Education, 93(3), 389-421.
Berland, L. K., & McNeill, K. L. (2010). A learning progression for scientific argumentation: Understanding student work and designing supportive instructional contexts. Science Education, 94(5), 765-793.
Cai, L., Maydeu-Olivares, A., Coffman, D. L., & Thissen, D. (2006). Limited-information goodness-of-fit testing of item response theory models for sparse 2^P tables. British Journal of Mathematical and Statistical Psychology, 59(1), 173-194.
Catley, K., Lehrer, R., & Reiser, B. (2005). Tracing a prospective learning progression for developing understanding of evolution. Paper commissioned by the National Academies Committee on Test Design for K-12 Science Achievement. Washington, DC: National Academies.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1-29. doi: 10.18637/jss.v048.i06
Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265-289.
Chi, M. T., Glaser, R., & Rees, E. (1981). Expertise in problem solving (No. TR-5). Pittsburgh, PA: University of Pittsburgh, Learning Research and Development Center.
Cooper, M. M., Underwood, S. M., Hilley, C. Z., & Klymkowsky, M. W. (2012). Development and assessment of a molecular structure and properties learning progression. Journal of Chemical Education, 89(11), 1351-1357.
Corcoran, T., Mosher, F. A., & Rogat, A. (2009). Learning progressions in science: An evidence-based approach to reform. New York, NY: Columbia University, Teachers College, Center on Continuous Instructional Improvement.
Doherty, J. H., Draney, K., Shin, H. J., Kim, J., & Anderson, C. W. (2015). Validation of a learning progression-based monitoring assessment. Manuscript submitted for publication.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38(1), 67-86.
Duncan, R. G., & Hmelo-Silver, C. E. (2009). Learning progressions: Aligning curriculum, instruction, and assessment. Journal of Research in Science Teaching, 46(6), 606-609.
Duschl, R. A., Schweingruber, H. A., & Shouse, A. (Eds.). (2007). Taking science to school: Learning and teaching science in grades K-8. Washington, DC: National Academy Press.
Felt, J. M., Castaneda, R., Tiemensma, J., & Depaoli, S. (2017). Using person fit statistics to detect outliers in survey research. Frontiers in Psychology, 8, 863.
Gorin, J. S., & Mislevy, R. J. (2013, September). Inherent measurement challenges in the next generation science standards for both formative and summative assessment. In Invitational research symposium on science assessment.
Gotwals, A. W., & Songer, N. B. (2013). Validity evidence for learning progression-based assessment items that fuse core disciplinary ideas and science practices. Journal of Research in Science Teaching, 50(5), 597-626.
Harris, C. J., Krajcik, J. S., Pellegrino, J. W., & DeBarger, A. H. (2019). Designing knowledge-in-use assessments to promote deeper learning. Educational Measurement: Issues and Practice.
Herrmann-Abell, C. F., & DeBoer, G. E. (2018). Investigating a learning progression for energy ideas from upper elementary through high school. Journal of Research in Science Teaching, 55(1), 68-93.
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1-55.
Kim, D., De Ayala, R. J., Ferdous, A. A., & Nering, M. L. (2011). The comparative performance of conditional independence indices. Applied Psychological Measurement, 35(6), 447-471.
Krajcik, J. S., Sutherland, L. M., Drago, K., & Merritt, J. (2012). The promise and value of learning progression research. In S. Bernholt, K. Neumann, & P. Nentwig (Eds.)
Lee, H. S., & Liu, O. L. (2010). Assessing learning progression of energy concepts across middle school grades: The knowledge integration perspective. Science Education, 94(4), 665-688.
Lehrer, R., Kim, M. J., Ayers, E., & Wilson, M. (2014). Toward establishing a learning progression to support the development of statistical reasoning. Learning over time: Learning trajectories in mathematics education, 31-60.
Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in 2^n contingency tables: A unified framework. Journal of the American Statistical Association, 100(471), 1009-1020.
Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25(4), 6-20.
Mohan, L., Chen, J., & Anderson, C. W. (2009). Developing a multi-year learning progression for carbon cycling in socio-ecological systems. Journal of Research in Science Teaching, 46(6), 675-698.
Morell, L., Collier, T., Black, P., & Wilson, M. (2017). A construct-modeling approach to develop a learning progression of how students understand the structure of matter. Journal of Research in Science Teaching, 54(8), 1024-1048.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. ETS Research Report Series, 1992(1), i-30.
National Academies of Sciences, Engineering, and Medicine. (2019). Science and engineering for grades 6-12: Investigation and design at the center. National Academies Press.
National Research Council. (2000). How people learn: Brain, mind, experience, and school: Expanded edition. National Academies Press.
National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. National Academies Press.
National Research Council. (2013a). Education for life and work: Developing transferable knowledge and skills in the 21st century. National Academies Press.
Nering, M. L., & Ostini, R. (Eds.). (2011). Handbook of polytomous item response theory models. Taylor & Francis.
Neumann, K., Viering, T., Boone, W. J., & Fischer, H. E. (2013). Towards a learning progression of energy. Journal of Research in Science Teaching, 50(2), 162-188.
Nordine, J., Krajcik, J., & Fortus, D. (2010). Transforming energy instruction in middle school to support integrated understanding and future learning. Science Education, 95(4), 670-690. DOI: 10.1002/sce.20423
Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24(1), 50-64.
Orlando, M., & Thissen, D. (2003). Further investigation of the performance of S-X2: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27(4), 289-298.
Osborne, J. F., Henderson, J. B., MacPherson, A., Szu, E., Wild, A., & Yao, S. Y. (2016). The development and validation of a learning progression for argumentation in science. Journal of Research in Science Teaching, 53(6), 821-846.
Pellegrino, J. W., & Hilton, M. L. (Eds.). (2012). Education for life and work: Developing transferable knowledge and skills in the 21st century. Washington, DC: The National Academies Press.
Plummer, J. D., & Krajcik, J. (2010). Building a learning progression for celestial motion: Elementary levels from an earth-based perspective. Journal of Research in Science Teaching, 47(7), 768-787.
Plummer, J. D., & Maynard, L. (2014). Building a learning progression for celestial motion: An exploration of students' reasoning about the seasons. Journal of Research in Science Teaching, 51(7), 902-929.
Reiser, B. J., Krajcik, J., Moje, E., & Marx, R. (2003, March). Design strategies for developing science instructional materials. In Annual Meeting of the National Association of Research in Science Teaching, Philadelphia, PA.
RStudio Team (2015). RStudio: Integrated Development for R. RStudio, Inc., Boston, MA. URL http://www.rstudio.com/
Sabella, M. S., & Redish, E. F. (2007). Knowledge organization and activation in physics problem solving. American Journal of Physics, 75(11), 1017-1029.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika monograph supplement.
Schwarz, C. V., Reiser, B. J., Davis, E. A., Kenyon, L., Achér, A., Fortus, D., ... & Krajcik, J. (2009). Developing a learning progression for scientific modeling: Making scientific modeling accessible and meaningful for learners. Journal of Research in Science Teaching, 46(6), 632-654.
Shin, N., Stevens, S. Y., & Krajcik, J. (2010). Tracking student learning over time using construct-centred design. In Using analytical frameworks for classroom research (pp. 56-76). Routledge.
Smith, C. L., Wiser, M., Anderson, C. W., & Krajcik, J. (2006). FOCUS ARTICLE: Implications of research on children's learning for standards and assessment: A proposed learning progression for matter and the atomic-molecular theory. Measurement: Interdisciplinary Research & Perspective, 4(1-2), 1-98.
Songer, N. B., Kelcey, B., & Gotwals, A. W. (2009). How and when does complex reasoning occur? Empirically driven development of a learning progression focused on complex reasoning about biodiversity. Journal of Research in Science Teaching, 46(6), 610-631.
Songer, N. B. (2006). BioKIDS: An animated conversation on the development of curricular activity structures for inquiry science. In R. Keith Sawyer (Ed.), Cambridge handbook of the learning sciences (pp. 355-369). New York: Cambridge.
Standards, N. G. S. (2013). Next generation science standards: For states, by states.
Stevens, S. Y., Delgado, C., & Krajcik, J. S. (2010). Developing a hypothetical multi-dimensional learning progression for the nature of matter. Journal of Research in Science Teaching, 47(6), 687-715.
Stevens, S. Y., Sutherland, L. M., & Krajcik, J. S. (2009). The big ideas of nanoscale science and engineering. NSTA Press.
Stone, C. A., & Zhang, B. (2003). Assessing goodness of fit of item response theory models: A comparison of traditional and alternative procedures. Journal of Educational Measurement, 40(4), 331-352.
Talanquer, V. (2009). On cognitive constraints and learning progressions: The case of "structure of matter". International Journal of Science Education, 31(15), 2123-2136.
Toland, M. D. (2014). Practical guide to conducting an item response theory analysis. The Journal of Early Adolescence, 34(1), 120-151.
Van Dam, N. T., Earleywine, M., & Borders, A. (2010). Measuring mindfulness? An item response theory analysis of the Mindful Attention Awareness Scale. Personality and Individual Differences, 49(7), 805-810.
Wilson, M. (2004). Constructing measures: An item response modeling approach. Routledge.
Wilson, M. (2009). Measuring progressions: Assessment structures underlying a learning progression. Journal of Research in Science Teaching, 46(6), 716-
Wyner, Y., & Doherty, J. H. (2017). Developing a learning progression for three-dimensional learning of the patterns of evolution. Science Education, 101(5), 787-817.

CHAPTER 3 Exploring Student Reasoning about Chemical Bonds from Perspective of Energy and Force in the context of NGSS Classroom

Introduction

The new way of teaching about chemical bonds

Understanding the mechanisms and driving factors that influence the formation of chemical bonds is essential for developing deep, useable understanding of chemistry.
Previous research suggests that students at different levels of science preparation hold multiple inaccurate ideas about why and how chemical bonds form. For example, even after instruction, students still hold on to the idea that chemical bonds "store" energy, which is released when chemical bonds are broken (Barker & Millar, 2000; Boo, 1998). Further, students tend to view different types of bonds, including covalent and ionic, as distinctly different from intermolecular interactions, rather than recognizing that these are different manifestations of the same phenomenon: atoms forming electrical interactions of different magnitudes, leading to increased stability of the system through energy minimization (Taber, 1998a). To build deep, useable understanding of chemical bonding, students need to develop the ability to model and explain bond-forming and bond-breaking processes at the atomic level using ideas related to energy and electrostatic forces (Cooper et al., 2014). However, students in secondary science settings rarely discuss atomic-level mechanisms for bond formation that are built on the fundamental principles of energy minimization and electrostatic attraction. Instead, instruction tends to emphasize heuristics such as the octet rule to explain why certain elements form certain types of bonds (Taber, 1998a). Additionally, in K-12 settings students rarely discuss energy at the atomic and molecular level, and therefore struggle to apply ideas of energy to bond formation processes at the atomic level (Cooper, Klymkowsky, & Becker, 2014). Recently there has been a significant push from the educational community towards building instruction on fundamental scientific principles to help students develop deep understanding of big ideas in science that they can apply when explaining the natural world and solving real-life problems. This type of useable understanding is typically referred to as knowledge-in-use (National Research Council [NRC], 2012; NRC, 2013a; Pellegrino & Hilton, 2012). These efforts resulted in the publication of the Framework for K-12 Science Education (the Framework) and the Next Generation Science Standards (NGSS), both of which are based on years of research on how students best learn ideas in science (NRC, 2012; Standards, 2013). Previous research suggests introducing the phenomenon of chemical bonding in terms of the state of a system of interacting objects in which attractive and repulsive forces balance out, which leads to energy minimization in the system (Nahum, 2007; Nahum, Mamlok-Naaman, Hofstein, & Krajcik, 2007). The Framework builds on this view and suggests using ideas of balance of electric forces and energy minimization as underlying big ideas when explaining various phenomena, including chemical bonding. Specifically, the Framework emphasizes the importance of recognizing energy minimization as the driving force in the formation of chemical bonds by stating that: "Matter in a stable form minimizes the stored energy in the electric and magnetic fields within it; this defines the equilibrium positions and spacing of the atomic nuclei in a molecule (e.g., chemical bonds)" (NRC, 2012, p. 121). It further emphasizes introducing electrical interactions between charged species as the mechanism by which atoms form molecules: "The substructure of atoms determines how they combine and rearrange to form all of the world's substances. Electrical attractions and repulsions between charged particles (i.e., atomic nuclei and electrons) in matter explain the structure of atoms and the forces between atoms that cause them to form molecules (via chemical bonds), which range in size from two to thousands of atoms (e.g., in biological molecules such as proteins)." (NRC, 2012, p. 107). In short, the Framework builds on previous research to suggest radically different ways of teaching ideas related to chemical bonding that are grounded in fundamental principles related to electrical interactions and energy minimization.
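As an illustrative aside (added here for clarity; the notation is not drawn from the Framework or the curriculum materials): for two point charges $q_1$ and $q_2$ at separation $r$, the electrostatic potential energy is given by Coulomb's law, $U(r) = k\,q_1 q_2 / r$. For a bonded pair of atoms, the total interaction energy $E(r)$ combines Coulombic attraction and repulsion, and the equilibrium bond length $r_0$ is the point where

$$\left.\frac{dE}{dr}\right|_{r = r_0} = 0, \qquad F(r_0) = -\left.\frac{dE}{dr}\right|_{r = r_0} = 0,$$

that is, where the attractive and repulsive forces balance and the system's energy is minimized, which is precisely the picture expressed in the Framework statements quoted above.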
However, as discussed below, apart from the scientific content, the Framework also outlines new ways of organizing the learning process to ensure students develop deeper, more meaningful, and life-long understanding of the content.

Learning Progressions and Three-Dimensional Learning

The Framework views learning as a developmental progression designed to help students build and revise their knowledge and skills from elementary to high school (NRC, 2012, p. 11). This notion is built on years of research indicating that deep understanding of big ideas in science develops over time, and learning progressions provide a "road map" for the different routes that students can follow to achieve this understanding (NRC, 2012, p. 26; Duschl et al., 2007; Smith et al., 2006; Alonzo & Gotwals, 2012). The Framework suggests building instruction around the three dimensions of science: disciplinary core ideas (DCIs), scientific and engineering practices (SEPs), and crosscutting concepts (CCCs). Disciplinary core ideas represent a small number of core ideas in a given scientific discipline and aim to help students build deep understanding of science and the ability to explain a wide range of phenomena. Crosscutting concepts serve as lenses used to make sense of a wide range of phenomena and to build coherent understanding of science across disciplines. Finally, scientific and engineering practices represent the authentic practices that scientists and engineers use to generate and revise scientific knowledge (NRC, 2012). The Framework further defines three-dimensional learning (3D learning) as a way to engage in scientific and engineering practices in order to deepen understanding of crosscutting concepts and disciplinary core ideas (NRC, 2012). According to the Framework, engaging in 3D learning helps students build deep, useable understanding of big ideas in science coherently over time (NRC, 2012). While the Framework emphasizes the importance of developing student science proficiency following learning progressions along the three dimensions, it does not provide detailed learning progressions for the three dimensions (DCIs, SEPs, CCCs) across grade levels. Possible general LPs for SEPs and CCCs are outlined in the Framework, but they do not specify grade-band details due to a lack of relevant research. For each component of the DCIs, the Framework describes in somewhat more detail what students should understand by the end of a given grade band (2, 5, 8, 12). The NGSS provides a slightly more detailed description of the possible learning progressions of the three dimensions across grade bands. Development of detailed and validated LPs for the three dimensions was beyond the scope of both the Framework and NGSS and remains one of the major tasks to be accomplished for successful implementation of the new vision of science education.
Validating NGSS-aligned Learning Progressions in Practice

To implement the educational changes called for by the Framework and NGSS, it is essential to develop and validate learning progressions that combine aspects of DCIs, SEPs, and CCCs. The scope of learning progressions can range from a large grain size encompassing multiple grades to a finer grain size focused on exploring the development of student understanding of a specific aspect of a broader LP (Gotwals, 2012; Mohan, Plummer, 2012). Smaller-scale LPs can be more useful for instructional purposes, while large-scale LPs provide a "large-scale map" of the progression of student understanding over a broad span of time (Gotwals, 2012). In this work the smaller-scale approach is used, focusing specifically on exploring how student reasoning about chemical bonding develops during the course of one unit constituting approximately two months of instructional time. This work is a continuation of the previously described 3D learning progression (3D LP) for electrical interactions validated in the context of Unit 1 of an NGSS-aligned curriculum for 9th grade physical science that spans one academic year (see Chapter 2). The curriculum is called "Interactions" and focuses on helping students build 3D understanding of electrical interactions at the macroscopic and atomic-molecular levels to explain a wide range of phenomena, including chemical bonding. The study described in Chapter 2 demonstrated the development of a 3D LP focusing on the following ideas related to explaining electrical interactions: the atomic nature of matter (focused on the DCI of Matter and Its Interactions, sub-idea of Structure and Properties of Matter) and electric forces (focused on the DCI of Motion and Stability: Forces and Interactions, sub-idea of Types of Interactions). The 3D LP also integrated the SEP of Developing and Using Models and the CCC of Cause and Effect. That study provided evidence for the validity of the 3D LP using Unit 1 assessment data only. The present work uses assessment data from Unit 2 of the "Interactions" curriculum, which is focused on building student understanding of chemical bonding from the perspective of energy and force, to construct a finer-grained progression of student understanding related to chemical bonding. Specifically, this work uses Wilson's (2009) approach of building a finer-grain construct map centered on a specific concept and aimed at relating assessment and cognition theories. This work explored the progression of student understanding along the following aspects of the three dimensions: the DCI of HS-PS3 Energy, specifically the element PS3.C Relationship Between Energy and Forces; the DCI of HS-PS1 Matter and Its Interactions, specifically the element PS1.B Chemical Reactions; the SEP of Developing and Using Models; the SEP of Constructing Explanations; and the CCC of Cause and Effect. Student interview data collected before and after completion of Unit 2, together with item response theory analysis of Unit 2 assessment data, are used to provide validity evidence for the 3D construct map describing the progression of student understanding of chemical bonding during the course of the unit.

Contribution to the field of Chemical Education

This work contributes to the field of chemical education in several ways. First, a 3D construct map for chemical bonding is presented that is based on the ideas of energy minimization and balance of electric forces.
The 3D construct map specifies aspects of DCIs, SEPs, and CCCs related to chemical bonding, from the macroscopic to the atomic-molecular scale, for each level of sophistication. Previous studies describe learning progressions for energy (Lee & Liu, 2010; Neumann, Viering, Boone, & Fischer, 2013) and force (Alonzo & Steedle, 2009). There have also been studies exploring student thinking related to intermolecular interactions (Becker, Noyes, & Cooper, 2016), energy at the atomic-molecular scale (Becker & Cooper, 2014), and chemical bonding (Burrows & Mooring, 2015; Taber & Coll, 2002). Previously published learning progression descriptions focus on both content and practice (Songer, Kelcey, & Gotwals, 2009; Gotwals & Songer, 2013), as well as on practice only (Lehrer, Kim, Ayers, & Wilson, 2014; Schwarz et al., 2009; Berland & McNeill, 2010; Osborne, Henderson, MacPherson, Szu, Wild, & Yao, 2016). However, to the author's knowledge, the current work is the first example of a study exploring student thinking about chemical bonding that is built on the fundamental principles of energy and force according to the vision expressed in the Framework and NGSS, specifically focusing on integrating content (DCIs) and practice (SEPs, CCCs). Second, this work demonstrates that the 3D construct map can be used to describe student learning in the context of Unit 2. Finally, this work demonstrates that the 3D construct map can be used to place individual students on a level with 68% confidence, which suggests an immediate and high degree of applicability of the 3D construct map for pedagogical use.

Theoretical Framework

This work uses the construct modeling framework to develop and revise the 3D construct map for chemical bonding (Brown & Wilson, 2011; Wilson, 2005, 2009). A construct in this case represents a specific unobserved (latent) trait being measured; in the context of this study, the construct is student 3D understanding of chemical bonding. The construct modeling approach is an extension of the learning progression vision into the field of assessment because it allows assessment results to be interpreted based on relevant learning and cognition theories (Wilson, 2009; Pellegrino, Chudowsky, & Glaser, 2001). It also allows assessment results to be interpreted more meaningfully by providing information about what students know and can do at each level of proficiency, and by helping guide the instructional process in terms of what supports students need to reach higher proficiency levels on a given construct (Brown & Wilson, 2011). The construct modeling approach therefore provides a framework for defining proficiency in a meaningful way and, as a result, increasing the validity power of the test scores (Brown & Wilson, 2011; Mislevy, 1996). The construct modeling approach consists of four steps and constitutes an iterative process. The first step involves specifying a cognition model, which in this case is the hypothetical 3D construct map for student understanding of chemical bonding. This map combines the proficiencies in DCIs, SEPs, and CCCs described above and describes increasingly sophisticated levels of ability along these three proficiencies as students develop deeper understanding and incorporate new knowledge into their existing knowledge framework. In this study, a construct map for chemical bonding is defined based on unpacking of NGSS performance expectations (PEs) and feedback from disciplinary and pedagogical experts.
It is important to point out that while the validation of the 3D construct map for chemical bonding is carried out in the context of the "Interactions" curriculum, the assessment instrument used to probe the levels of the 3D construct map is aligned to the NGSS PEs and not to the curriculum learning goals. Therefore, the results obtained in this study are generalizable to contexts other than the "Interactions" curriculum. The second step involves designing items to probe the levels of the 3D construct map following the modified evidence-centered design (mECD) methodology (Harris, Krajcik, Pellegrino, & DeBarger, 2019), which is described further in the methods section. The third step involves evaluating the outcome space by analyzing student responses to the items and mapping them to the levels of the construct map, to ensure that scores on the items relate to the levels of the construct map in a meaningful way. Finally, the last step involves choosing a measurement model that relates student responses to calculated ability levels in order to gain additional evidence for the validity of the 3D construct map, thereby allowing interpretation of the assessment results (Brown & Wilson, 2011; Pellegrino et al., 2001). This study focuses on investigating how students integrate ideas of electric forces learned in Unit 1 and ideas of energy learned in Unit 2 to explain chemical bonding, and it suggests ways of interpreting the obtained results.

Methodology

Context: the "Interactions" curriculum

The "Interactions" curriculum was developed according to the principles outlined in the Framework and NGSS. Each unit engages students with relevant natural phenomena in the form of driving questions, with the purpose of developing deeper understanding of electrical interactions over the course of the academic year through 3D learning strategies. The curriculum consists of four units. The first unit focuses on building student understanding of electrical interactions using ideas of electric forces, fields, and charges at the macro and atomic-molecular scales. The second unit brings in ideas related to energy changes in a system when two charged objects interact at the macro and atomic-molecular scales. Chemical bonding is the central phenomenon students explore in Unit 2, whose driving question is "How does a small spark start a huge explosion?". Students explore ideas related to bond formation and bond breaking from the perspective of energy and force, and relate macroscopic observations of phenomena to atomic- and molecular-level mechanisms. The curriculum therefore aims at helping students build 3D understanding of chemical bonding as an extension of the same electrostatic principles that are responsible for the observed attractive and repulsive forces between charged macroscopic objects, and at helping them recognize that energy minimization is the driving force behind both the observed electrostatic attraction between macroscopic objects and atoms forming a bond. Units 1 and 2 represent about two thirds of the curriculum's instructional time. Units 3 and 4 further develop student understanding of electrical interactions to explain a wide range of phenomena, including hydrophobic and hydrophilic interactions (Unit 3) and protein folding (Unit 4). Units 1 and 2 of the "Interactions" curriculum have gone through an external review process by Achieve. Unit 1 received the highest rating, "Example of high quality NGSS design", and Unit 2 received the second highest rating, "Example of high quality NGSS design if improved".
Further, the National Science Teachers Association recognizes "Interactions" as being aligned to NGSS and provides classroom videos demonstrating curriculum use on its official webpage (http://ngss.nsta.org/). These pieces of evidence support the choice of this curriculum for developing and validating the 3D construct map in this study. The curriculum consists of online materials, where all the student activities are located (http://interactions.portal.concord.org/), and paper-based teacher materials that can be accessed online via Google Docs. The curriculum is free and available for anyone to use.

Step 1: Specifying cognition model

Similar to an LP, a level on the 3D construct map can be described as one in a series of comprehensive and developmentally appropriate steps towards more sophisticated application of a given latent construct. The major differences between an LP and a construct map are that a construct map is typically defined at a smaller grain size than a learning progression, and that it specifically focuses on relating assessment to relevant cognition theories (Wilson, 2009). The 3D construct map presented here focuses on the DCI of Energy, specifically the sub-idea of Relationship Between Energy and Forces, as well as the DCI of Matter and Its Interactions (HS-PS1), specifically the sub-idea PS1.B: Chemical Reactions. These sub-ideas are central to Unit 2 of the "Interactions" curriculum. Further, the 3D construct map presented here focuses on the SEPs of Developing and Using Models and Constructing Explanations and on the CCC of Cause and Effect, because those dimensions were most heavily emphasized throughout the curriculum, and the assessments designed to probe the 3D construct map levels focused on these dimensions. The lower anchor was based on students' prior knowledge, characterized from the written assessment and oral interviews with individual students before they started Unit 2. The upper anchor is based on the NGSS PEs focused specifically on energy changes during bond breaking and bond forming processes, as shown below:

HS-PS3-5. Develop and use a model of two objects interacting through electric or magnetic fields to illustrate the forces between objects and the changes in energy of the objects due to the interaction.

HS-PS1-4. Develop a model to illustrate that the release or absorption of energy from a chemical reaction system depends upon the changes in total bond energy.

The 3D construct map and the assessment used to probe its levels focus specifically on phenomena related to electrical interactions; therefore, the aspect of PE HS-PS3-5 related to magnetic interactions is not addressed. Further, as related to PE HS-PS1-4, this study does not focus on evaluating student ability to calculate bond energies. Instead, it focuses on describing and evaluating students' qualitative understanding of energy changes during bond breaking and bond making processes, specifically the idea of adding energy to the system in order to break a chemical bond. The intermediate levels of the LP are defined based on a combination of the logical sequence of the discipline, feedback from disciplinary experts, and literature related to student learning. This process resulted in a hypothetical 3D construct map that was then empirically tested based on interviews with students and IRT analysis of the written assessment. Table 1 provides a description of the levels of the hypothetical NGSS-aligned 3D construct map for chemical bonding.
This construct map represents the cognition model that was defined as the first step of the construct modeling approach (Brown & Wilson, 2011). It is important to point out that the Framework emphasizes the importance of measuring student ability to integrate the three dimensions; while the 3D construct map shows DCIs and SEPs/CCCs in separate columns, the SEP/CCC column is an integrative statement because it refers to the DCIs. The DCIs are listed separately to avoid having to write each of the statements under SEP/CCC for all the DCI sub-ideas.

Table 3.1 Hypothetical 3D construct map for chemical bonding. Chemical bonding includes the DCI sub-ideas of Matter and Its Interactions ("Structure and Properties of Matter", "Chemical Reactions") and Energy ("Relationship Between Energy and Forces"); the SEP/CCC column covers the SEPs "Developing and Using Models" and "Constructing Explanations" and the CCC "Cause and Effect".

Level 3 (Microscopic)
DCI: Relationship between energy and forces
• Energy is associated with either motion (kinetic) or "stored" (potential) at the macroscopic and atomic scales
• Ideas of energy are applied to explain bond breaking/making processes
• Energy changes are related to Coulombic interactions between charges at the macro and atomic-molecular levels
• Chemical bonds are described as resulting from a balance of attractive/repulsive forces leading to energy minimization
• Electric fields are used to explain interactions at a distance
DCI: Chemical Reactions
• Chemical reactions are explained using bond breaking/making processes
• Energy changes are associated with chemical reactions and bond making/breaking processes
• Chemical reactions are described using atoms/molecules
SEP and CCC:
• Student models/explanations are causal and explicitly use ideas of energy and electric force to explain phenomena related to bond breaking and bond making by showing a micro-level mechanism
• Models relate energy changes to changes in forces between interacting atoms to explain why bonds form

Level 2 (Incomplete Microscopic)
DCI: Relationship between energy and forces
• Energy is associated with either motion (kinetic) or "stored" (potential) at the macroscopic scale and, inaccurately, at the microscopic scale
• Inaccurate ideas about the relationship between energy/heat/force
• Ideas of energy are applied to explain bond breaking/making processes, and explanations might relate to electrical interactions between atom components with some inaccuracies
• Energy changes are related to Coulombic interactions between point charges and charged macroscopic objects, possibly with inaccuracies
• Chemical bonds are described as resulting from a balance of attractive/repulsive forces; energy relationships are inaccurate or absent
• Electric fields might be used to explain interactions at a distance
DCI: Chemical Reactions
• Chemical reactions are explained using bond breaking/making processes with some inaccuracies
• Energy changes are associated with chemical reactions, with some inaccuracies
• Chemical reactions are described using atoms/molecules with some inaccuracies
SEP and CCC:
• Student models/explanations are causal and use ideas of energy and electric force to explain phenomena related to bond breaking and bond making by showing a micro-level mechanism with some inaccuracies; need to prompt to elicit ideas of energy
• Models relate energy changes to changes in forces between interacting atoms to explain why bonds form; need to be prompted
Level 1 (Macroscopic)
DCI: Relationship between energy and forces
• Energy is associated with either motion (kinetic) or "stored" (potential) at the macroscopic scale
• Energy is treated as the same as heat/friction/force
• Ideas of energy are applied to explain bond breaking/making processes, but explanations do not relate to electrical interactions between atom components
• Energy changes are related to Coulombic interactions between point charges and charged macroscopic objects, but with some inaccuracies
• Chemical bonds are not described as resulting from a balance of attractive/repulsive forces that leads to energy minimization
• Electric fields might be used to explain interactions at a distance
DCI: Chemical Reactions
• Chemical reactions are recognized (indicators include temperature change, color change, release of gas, precipitate formation, odor)
• Chemical reactions are not explained using ideas related to chemical bonds
• Energy changes are associated with chemical reactions, with some inaccuracies
• Chemical reactions are not described using ideas of atoms/molecules
SEP and CCC:
• Student models/explanations are causal and use ideas of energy when explaining chemical reactions, but at the macroscopic level only; contain inaccuracies
• Inaccurate macro-level mechanism; might need to prompt to elicit ideas of energy
• Models do not relate energy changes to changes in forces between interacting objects in a system

Level 0
DCI: Relationship between energy and forces
• Energy is treated as the same as heat/friction/force
• Ideas of energy are not applied to explain bond breaking/making processes
• Energy changes are not related to Coulombic interactions
• Chemical bonds are not described as resulting from a balance of attractive/repulsive forces that leads to energy minimization
• Electric fields are not used to explain interactions at a distance
DCI: Chemical Reactions
• Chemical reactions are not recognized
• Chemical reactions are not explained using ideas related to chemical bonds
• Energy changes are not associated with chemical reactions
• Chemical reactions are not described using ideas related to atoms/molecules
SEP and CCC:
• Models/explanations are not causal and are based on recollection of facts or observable components only; no mechanism explaining the phenomenon

Step 2: Developing assessment to probe the levels of the 3D construct map

The second step of the construct modeling approach involves developing assessments to probe the various levels of the 3D construct map. This work uses the modified evidence-centered design (mECD) process (Harris et al., 2019) to develop assessments to probe the levels of the hypothetical 3D construct map for chemical bonding. The mECD approach combines elements of evidence-centered design (ECD) (Mislevy & Haertel, 2006) and the construct-centered design (CCD) process (Shin, Stevens, & Krajcik, 2010) to design tasks for measuring knowledge-in-use. The first step of mECD involves identifying and unpacking an NGSS PE to develop a 3D claim that describes what students should be able to do with the corresponding DCIs, SEPs and CCCs. The process of unpacking specifies the aspects of the DCIs, SEPs and CCCs that students should master in order to meet a given NGSS PE.
Unpacking also ensures coherence and alignment between NGSS PEs, assessment, and the 3D construct map levels, and specifies which aspects of the broad NGSS PE are being measured. The next step involves specifying the evidence that shows students have met the requirements of the claim. Claim and evidence combine to form an mECD argument. Finally, assessment tasks are developed for each mECD argument that will provide the evidence necessary to measure the claim. This process is illustrated in Figure 3.1.

Figure 3.1 Summary of the modified evidence-centered design process

An example of the mECD argument for an item to help characterize the level of students' understanding of chemical bonding is summarized in Table 2. The item is designed to provide evidence on whether students are at level 0, 1, 2 or 3 of the 3D construct map shown in Table 1. The mECD argument focuses on the DCI of Matter and Its Interactions (HS-PS1), specifically the element PS1.B: Chemical Reactions; the SEP of Constructing Explanations; and the CCC of Cause and Effect. While the NGSS PE focuses on the SEP of Developing Models and the CCC of Energy and Matter, it is acceptable to change those elements as long as the DCI focus is the same as in the given NGSS PE.

Table 3.2 Example of the mECD process

NGSS PE: HS-PS1-4. Develop a model to illustrate that the release or absorption of energy from a chemical reaction system depends upon the changes in total bond energy. (Note: assessment items designed for this PE focused on qualitative understanding of energy changes associated with bond breaking/bond making processes. Students were not asked to calculate changes in total bond energy.)

Claim: Students will construct an explanation that tracks the energy changes that occur when chemical bonds break and form in a chemical reaction to explain what causes the observed energy/heat absorption or release.

Evidence: Students' explanations will account for energy changes that occur when chemical bonds break and form in a chemical reaction. Explanations will contain the following ideas as appropriate:
1. Chemical reactions involve breaking bonds of the reactants and forming bonds to make new substances as products.
2. Energy is required to break bonds; energy is released when bonds form.
3. Starting a chemical reaction requires adding energy to break bonds of the reactants.
4. Heat and light indicate that energy is being transferred to/from the system.
5. Causal explanations will account for a molecular/atomic-level mechanism and relate it to the observed phenomenon.

Assessment task: Burning is a type of chemical reaction. The video shows a match that is lit by heating it on a hot plate. (Note: the video shows a match on a hot plate that is turned on; after some time the match lights up. Snapshots of the video show the match lighting up after some time on the hot plate, the match burning for a while, and the flame from the match eventually dying out.)
Question: Striking a match across a rough surface will create a spark that sets the match on fire. How can the match in the video light without a spark? Justify your answer.

There were a total of eight items designed to measure 3D understanding of chemical bonding for Unit 2. Each item is open-ended like the one shown in Table 2, contains an aspect of a DCI, an SEP and a CCC, and is designed to measure all three levels of the 3D construct map shown in Table 1. Items were administered as a pre and post Unit 2 test. Two items were used to conduct interviews before and after Unit 2 to obtain qualitative validity evidence for the 3D construct map.
The first interview item, called "Match on the Hot Plate", is the same as the one shown in Table 2. The item assessed the DCI of Chemical Reactions, the SEP of Constructing Explanations, and the CCC of Cause and Effect. The second interview item, called "Atoms Forming a Chemical Bond", assessed the DCI of Relationship Between Energy and Forces, the SEP of Developing and Using Models, and the CCC of Cause and Effect. This item was not exactly the same as in the written assessment, but aligned closely to a similar item. For both items, each answer was scored using a scoring rubric aligned to the 3D construct map levels. The rubric describes scoring rules specific to each item and reflects the ability to apply the DCI, SEP and CCC described in the 3D construct map to make sense of the specific phenomenon in question. The construct map, on the other hand, provides a general description of increasingly sophisticated ways of thinking about chemical bonding from the perspective of energy and force. In terms of alignment, a score of 1 on an item aligns to level 1 of the 3D LP, a score of 2 aligns to level 2, and so on. Table 3 shows the rubric, the corresponding level of the 3D construct map, and a sample answer from the oral interview for the "Match on the Hot Plate" item; Table 4 shows the same for the "Atoms Forming a Bond" item. For all items on the test, including those shown in Tables 3 and 4, both the rubric and the 3D construct map aim to characterize understanding of chemical bonding starting from a basic level with essentially no relevant DCIs present (level 0), transitioning to macroscopic-level understanding (level 1), then to an incomplete microscopic level (level 2), and finally to a complete microscopic level (level 3). For both items, there were no responses at level 3 of the 3D construct map upon completion of Unit 2.

Table 3.3 Sample responses for every 3D construct map level for "Match on the Hot Plate". Question (all levels): Striking a match across a rough surface will create a spark that sets the match on fire. How can the match in the video light without a spark? Justify your answer.

Level 0 (score 0)
3D construct map:
DCI: Chemical Reactions
• Chemical reactions are not recognized
• Chemical reactions are not explained using ideas related to chemical bonds
• Energy changes are not associated with chemical reactions
• Chemical reactions are not described using ideas related to atoms/molecules
SEP and CCC: models/explanations are not causal and are based on recollection of facts or observable components only; no mechanism explaining the phenomenon
Scoring rubric:
DCI: Chemical Reactions
• Match burning is not recognized as a chemical reaction
• No molecular-level explanation for what causes the match to light up from the perspective of bond breaking/forming processes
• No relationship between match burning, chemical bonds and energy at the atomic level
SEP and CCC:
• Explanations focus on observable components only
• No causal mechanism
Sample response: "Heat from the hot plate causes match to light up"
Comment: relevant components of the DCI are not present; the explanation contains only observable components and no causal mechanism to explain what causes the match to light without a spark.
Level 1 (Macroscopic, score 1)
3D construct map:
DCI: Chemical Reactions
• Chemical reactions are recognized (indicators include temperature change, color change, release of gas, precipitate formation, odor)
• Chemical reactions are not explained using ideas related to chemical bonds
• Energy changes are associated with chemical reactions, with some inaccuracies
• Chemical reactions are not described using ideas of atoms/molecules
SEP and CCC: student models/explanations are causal and use ideas of energy when explaining chemical reactions, but at the macroscopic level only; contain inaccuracies
Scoring rubric:
DCI: Chemical Reactions
• Match burning is recognized as a chemical reaction
• No molecular-level explanation for what causes the match to light up from the perspective of bond breaking/forming processes
• No relationship between match burning, chemical bonds and energy at the atomic level
SEP and CCC:
• Explanation uses ideas of energy to explain the match lighting/burning
• No causal mechanism beyond observable components
• Explanation relates heat and energy, might be inaccurate
Comment: relevant DCIs are present; the model provides a macro-level causal mechanism using ideas of energy to explain what causes the match to light up when it is sitting on the hot plate.

Level 2 (Incomplete Microscopic, score 2)
3D construct map:
DCI: Chemical Reactions
• Chemical reactions are explained using bond breaking/making processes with some inaccuracies
• Energy changes are associated with chemical reactions, with some inaccuracies
• Chemical reactions are described using atoms/molecules with some inaccuracies
SEP and CCC: student models/explanations are causal and use ideas of energy when explaining chemical reactions at the microscopic level, with some inaccuracies
Scoring rubric:
DCI: Chemical Reactions
• Match burning is recognized as a chemical reaction
• Molecular-level explanation for what causes the match to light up from the perspective of bond breaking/forming processes, might be inaccurate
• Inaccurate relationship between match burning, chemical bonds and energy at the atomic level
SEP and CCC:
• Explanation uses ideas of energy to explain the match lighting/burning
• Microscopic-level causal mechanism with some inaccuracies
• Explanation relates heat and energy, might be inaccurate
Sample response (from the interview): "The hot plate gives off heat, which is then transferred to the match, causing the molecules in the match to move faster. The heat energy causes atoms in the molecules to rub together faster, which separates molecules in the match to individual atoms and sets the match on fire"
Comment: molecular-level explanation with significant inaccuracies; no explicit mention of bond breaking/bond making processes or of how energy is involved in these processes. The model and explanation state that the atoms are set on fire and are also present in the flame itself.
Level 3 (Microscopic, score 3)
3D construct map:
DCI: Chemical Reactions
• Chemical reactions are explained using bond breaking/making processes
• Energy changes are associated with chemical reactions and bond making/breaking processes
• Chemical reactions are described using atoms/molecules
SEP and CCC: student models/explanations are causal and use ideas of energy when explaining chemical reactions at the molecular level
Scoring rubric:
DCI: Chemical Reactions
• Match burning is recognized as a chemical reaction
• Molecular-level explanation for what causes the match to light up from the perspective of bond breaking/forming processes
• Relationship between match burning, chemical bonds and energy at the atomic level
SEP and CCC:
• Explanation uses ideas of energy to explain the match lighting/burning
• Microscopic-level causal mechanism
• Explanation relates heat and energy
Sample response: no level 3 responses were observed by the end of Unit 2 for this interview item.

Table 3.4 Sample responses for every 3D construct map level for "Atoms Forming a Bond". Question (all levels): Draw a model to explain how two atoms can form a chemical bond using ideas related to atomic structure, electric force and energy.

Level 0 (score 0)
3D construct map:
DCI: Relationship between energy and forces
• Energy is treated as the same as heat/friction/force
• Ideas of energy are not applied to explain bond breaking/making processes
• Energy changes are not related to Coulombic interactions
• Chemical bonds are not described as resulting from a balance of attractive/repulsive forces that leads to energy minimization
• Electric fields are not used to explain interactions at a distance
SEP and CCC: models/explanations are not causal and are based on recollection of facts or observable components only; no mechanism explaining the phenomenon
Scoring rubric:
DCI: Relationship between Energy and Forces
• Basic attractive interactions between opposite charges and repulsive interactions between similar charges might be used to explain bond formation, but with some inaccuracies
• Ideas of energy are not applied to explain bond breaking/making processes
• Energy changes are not related to Coulombic interactions between components of atoms (protons, electrons) to explain bond formation
• Chemical bonds are not described as resulting from a balance of attractive/repulsive forces that leads to energy minimization
• Electric fields are not used to explain bond formation
SEP and CCC:
• Models focus on observable components only
• No causal mechanism
Sample response: "I am not sure how they form a bond. They will stick together somehow, but I am not sure how."
Comment: the model/explanation does not contain any components beyond those provided in the question, and does not use ideas of energy, force and atomic structure to explain bond formation.

Level 1 (Macroscopic, score 1)
3D construct map:
DCI: Relationship between energy and forces
• Energy is associated with either motion (kinetic) or "stored" (potential) at the macroscopic scale
• Energy is treated as the same as heat/friction/force
• Ideas of energy are applied to explain bond breaking/making processes without relating them to electrical interactions between atom components
• Energy changes are related to Coulombic interactions between point charges and charged macroscopic objects, but with some inaccuracies
• Chemical bonds are not described as resulting from a balance of attractive/repulsive forces that leads to energy minimization
• Electric fields might be used to explain interactions at a distance
SEP and CCC:
• Inaccurate macro-level mechanism (charges modeled as point charges and not as parts of atoms); might need to prompt to elicit ideas of energy
• Models do not relate energy changes to changes in forces between interacting objects in a system
Scoring rubric:
DCI: Relationship between Energy and Forces
• Basic attractive interactions between opposite charges and repulsive interactions between similar charges are used to explain bond formation; charges are modeled as point charges, not as parts of atoms
• Ideas of energy are applied to explain bond formation without relating them to electrical interactions between atom components
• Energy is mentioned in the context of potential energy associated with energy being "stored" and kinetic energy associated with motion
• Chemical bonds are not described as resulting from a balance of attractive/repulsive forces between components of atoms that leads to energy minimization
• Electric fields might be used to explain bond formation
SEP and CCC:
• Models do not show an atomic-level causal mechanism; they might relate energy changes to changes in forces between interacting objects in a system, but with some inaccuracies

Sample response #1
Comment: the model/explanation describes components of atoms (protons, electrons) as point charges and provides a basic causal mechanism for attractive interactions between these components as a basis for forming a chemical bond. The model and explanation do not provide a causal atomic-level mechanism of bond formation using ideas of energy, but the explanation makes a distinction between energy associated with motion of atoms (kinetic) and potential energy associated with the bonding state.
From the interview:
Student: atoms usually bond together by touching. Bond is like a bridge, I think it is just air in between.
Interviewer: What makes the atoms stick together in a bond?
Student: The charges, because opposite charges attract.
Interviewer: So, are the atoms in a bond charged?
Student: I think so, I am not sure.
Interviewer: Does energy change in any way when atoms form a bond?
Student: Yes. Say you have kinetic energy when they are moving, then when they are stuck together it's potential energy.

Sample response #2
Student: "Atoms need a third atom to form a bond. They give the extra energy to the third atom through collision, which allows them to form a bond"
Comment: the answer uses ideas of energy to explain bond formation, but does not relate energy changes to electrical interactions between atomic components. No atomic components or point charges are shown. This is also a piece of knowledge that comes directly from the simulation that students did as part of their Unit 2 learning experience.

Sample response #3
Student: the bond forms by adding energy. For those atoms to be able to connect we have to have a third atom that provides energy. So, when this one (third atom) gets pushed up, they attract and then they bond.
Comment: the answer uses ideas related to electrical interactions (attraction between opposite charges) to explain bond formation. Charges are modeled as point charges, not as parts of atoms. The idea of energy is inaccurate: the explanation states that one needs to add energy to form a bond.
The idea that atoms need a third atom to form a bond might also come from the simulation students did in Unit 2, just as for sample response #2.

Level 2 (Incomplete Microscopic, score 2)
3D construct map:
DCI: Relationship between energy and forces
• Energy is associated with either motion (kinetic) or "stored" (potential) at the macroscopic scale and, inaccurately, at the microscopic scale
• Inaccurate ideas about the relationship between energy/heat/force
• Ideas of energy are applied to explain bond breaking/making processes with some inaccuracies
• Energy changes are related to Coulombic interactions between point charges and charged macroscopic objects, possibly with inaccuracies
• Chemical bonds are described as resulting from a balance of attractive/repulsive forces; energy relationships are inaccurate or absent
• Electric fields might be used to explain interactions at a distance
SEP and CCC:
• Student models/explanations are causal and use ideas of electric force to explain phenomena related to bond breaking and bond making by showing a micro-level mechanism with some inaccuracies
• Models relate energy changes to changes in forces between interacting atoms to explain why bonds form; need to be prompted
Scoring rubric:
DCI: Relationship between Energy and Forces
• Energy changes are related to Coulombic interactions between components of atoms (protons, electrons) to explain bond formation, might be inaccurate
• Chemical bonds are described as resulting from a balance of attractive/repulsive forces between components of atoms; the energy minimization idea is inaccurate or needs to be prompted
• Electric fields might be used to explain bond formation
SEP and CCC:
• Models show an atomic-level causal mechanism for bond formation focused on describing the balance of electrical interactions between components of atoms (protons, electrons); the energy minimization idea is not present or inaccurate
Sample response (from the interview):
Student: a bond forms where electrons are attracted to the core of the nucleus, the electrons will attract to the core of each other's atoms
Interviewer: What about the two nuclei?
Student: They do repel each other, so they keep some distance between them. They won't be touching because the cores are repelling each other, but they also won't get too far because the electrons are attracted to the core.
Interviewer: So, with both attractive and repulsive interactions present, why does a bond form?
Student: They get to the point where they are at equilibrium, they are not attracting or repelling, they are close enough to be attracted, but far enough away not to be repelled.
Interviewer: Is there anything else apart from attractive/repulsive interactions that is driving this process?
Student: I am not sure.
Interviewer: Do you think energy is involved in forming a bond?
Student: I am not sure.
Comment: the model and explanation describe an atomic-level causal mechanism for bond formation focused on describing the balance of electrical interactions between components of atoms (protons, electrons); the energy minimization idea is not present even when prompted.
Level 3 (Microscopic, score 3)
3D construct map:
DCI: Relationship between energy and forces
• Energy is associated with either motion (kinetic) or "stored" (potential) at the macroscopic and atomic scales
• Ideas of energy are applied to explain bond breaking/making processes
• Energy changes are related to Coulombic interactions between charges at the macro and atomic-molecular levels
• Chemical bonds are described as resulting from a balance of attractive/repulsive forces leading to energy minimization
• Electric fields are used to explain interactions at a distance
SEP and CCC:
• Student models/explanations are causal and explicitly use ideas of energy and electric force to explain phenomena related to bond breaking/making by showing a micro-level mechanism
• Models relate energy changes to changes in forces between interacting atoms to explain why bonds form
Scoring rubric:
DCI: Relationship between Energy and Forces
• Energy changes are related to Coulombic interactions between components of atoms (protons, electrons) to explain bond formation
• Chemical bonds are described as resulting from a balance of attractive/repulsive forces between components of atoms, driven by energy minimization
• Electric fields might be used to explain bond formation
SEP and CCC:
• Models show an atomic-level causal mechanism for bond formation focused on describing the balance of electrical interactions between components of atoms (protons, electrons) and energy minimization
Sample response: no level 3 responses were observed by the end of Unit 2 for this interview item.

Step 3: Evaluating Outcome Space

The third step involves evaluating the outcome space by analyzing student responses to the items and mapping them to the levels of the construct map to ensure that scores on the items relate to the levels of the construct map in a meaningful way. The hypothetical 3D construct map shown in Table 1 was constructed using the logical sequence of the discipline, relevant research literature and unpacking of NGSS PEs. The "Interactions" curriculum was piloted in the same Midwestern schools a year prior to the data collection described here. During the data collection year, a team of researchers used the scoring rubrics to score student answers directly to the construct map levels and verified that the types of answers students provided on each item were consistent with the 3D construct map levels as well as with the scoring rubrics. Examples of student answers for the interview items, together with the scoring rubric and 3D construct map levels, are provided in Tables 3 and 4.

Supporting levels of the 3D LP using qualitative analysis of student interviews

The interview data was collected in a Midwestern public high school where the "Interactions" curriculum was implemented. See Chapters 1 and 2 for a more detailed description of the sample. Several students from each of the three participating classrooms were interviewed before and after implementation of Unit 2, for a total of 17 students. The students were selected to represent different levels of academic achievement. The items shown in Tables 3 and 4 were used for the interviews, and sample student responses to them are shown in those tables. All items probe ideas related to the three levels of the hypothetical 3D construct map shown in Table 1.
Student interviews were analyzed using the scoring rubric, and each answer was assigned a level on the 3D construct map (Tables 3 and 4). Inter-rater reliability was established in the following manner. One researcher first scored all 17 interviews. Then, two other researchers used the same rubric to score the interviews of three students from each classroom (nine students in total). Once 100% agreement on 3D LP level placement for all nine students was reached among the three scorers, the scoring rubric and the 3D construct map levels were modified accordingly, and the remainder of the interviews were rescored based on this discussion.

Step 4: Measurement Model

IRT analysis for the Unit 2 pre/post assessment was carried out following Toland (2014). The sample of 899 students was modeled using the graded response model (GRM) (Samejima, 1969). See Chapters 1 and 2 for a more detailed description of the sample. A score of "0" was imputed for students who had missing values on any of the items. This approach was deemed appropriate because students were given an unlimited amount of time to finish the assessment; it was therefore safe to assume that if they did not provide an answer for a given item, they did not know it. Pre and post assessment data were combined in model estimation to allow for comparison of the ability distributions on the pre and posttest. The dimensionality and longitudinal invariance study were reported earlier (see Chapter 1). Pre and post measures were highly reliable (pre Unit 2 = 0.823, post Unit 2 = 0.932) and supported by validity evidence (see Chapter 1). There were two theoretical latent dimensions measured on the Unit 2 assessment: student 3D understanding of Energy (more specifically, of the relationship between energy and forces in bond breaking/bond making processes) and student 3D understanding of Chemical Reactions (see Chapter 1). The theoretically suggested latent dimensionality was confirmed by validity evidence based on response process and invariance studies (see Chapter 1). The previous study therefore suggests a two-dimensional latent structure for the Unit 2 assessment instrument. For IRT modeling, however, the interest is only in the overall progression, and since the two dimensions were closely related (the correlation coefficient was 0.784 on the pretest and 0.928 on the posttest) and therefore likely to develop in conjunction, a unidimensional IRT model was used to model students' progression. This also made sense from a theoretical point of view: all items on the test were aimed at measuring student understanding of ideas related to chemical bonding, so even though the two latent constructs were slightly different, it was reasonable to combine them as measures of the same science idea. Based on this assumption, a unidimensional IRT model was used to model the data. The Appendix provides R code for model selection, specification, and estimation using the mirt package (Chalmers, 2012) in RStudio (RStudio Team, 2015).
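For orientation, a minimal sketch of this model specification with the mirt package is shown below. It is consistent with the description above but is not the full analysis code from the Appendix, and the data object unit2_items (the eight items scored 0-2, with pre and post records combined) is hypothetical.

```r
library(mirt)

# unit2_items: hypothetical data frame with one row per response record
# (pre and post combined) and 8 columns of polytomous item scores (0-2)
grm_fit <- mirt(unit2_items, model = 1, itemtype = "graded")

# Item slopes and category thresholds (b1, b2) in the IRT parameterization
coef(grm_fit, IRTpars = TRUE, simplify = TRUE)$items
```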
The results section presents the IRT analysis relevant to validation of the 3D LP.

Results

Supporting the validity of the 3D construct map using qualitative analysis of student interviews

Identifying Key Knowledge and Practices for Each Level of the 3D Construct Map

Qualitative analysis of student interviews served as a rich source of information for obtaining validity evidence for the hypothetical 3D construct map levels. Analysis of student responses supported the hypothesized progression of student understanding reflected in the 3D construct map levels for both interview items. Specifically, at level 0, student answers contain very little information relevant to demonstrating the ability to apply ideas related to bond making and bond breaking processes from the perspective of energy and electric force. The answers focus on reciting back information provided in the question itself and on some observable macroscopic-level components. For example, for the "Match on the Hot Plate" item, level 0 responses usually focus on the idea that the match is set on fire by the heat from the hot plate, without mentioning any mechanistic causal details. Similarly, for the "Atoms Forming a Bond" item, student answers at level 0 do not demonstrate relevant knowledge of bond making/bond breaking processes or of how energy and electric force are involved in these processes.

Level 1 reflects the most diverse range of responses for the "Atoms Forming a Bond" item. In general, student responses to this item reflect macroscopic-level understanding, with various ideas related to electrical interactions and energy being used to explain bond formation. For example, students use basic ideas related to attractive and repulsive interactions between charges to attempt to explain the formation of chemical bonds (see level 1 sample response #1 for "Atoms Forming a Bond"), but do not relate energy and electric force in the context of chemical bonding at the atomic-molecular level. Charges are modeled as point charges and not as parts of atoms at this level. This lack of detail in level 1 models leads to incomplete or inaccurate explanations of phenomena and a lack of microscopic-level details. Additionally, students at this level might recall a computer simulation they studied as part of Unit 2 instruction, in which two atoms could not form a chemical bond until a third atom was introduced into the system; the atoms were able to form a bond after colliding with the third atom and transferring the extra energy to it. Student responses use ideas from this simulation to suggest that two atoms need a third atom, to which they give their extra energy, in order to form a bond (see level 1 sample response #2 for "Atoms Forming a Bond"). Atoms are still modeled as spheres, without mention of the components of the atoms or interactions between components. Finally, the third type of level 1 response for this item reflects a combination of ideas related to electrical interactions between point charges and ideas from the simulation related to introducing a third atom into the system to form a chemical bond. For example, level 1 sample response #3 for "Atoms Forming a Bond" indicates that atoms form a bond via attractive interactions between opposite charges (the charges are modeled as point charges, not as components of atoms), and the third atom gives energy to the other two to form a bond. This response reflects a combination of prior knowledge about attractive interactions between opposite charges and new knowledge from the simulation studied in class, mixed with what is probably the pre-existing inaccurate idea that one needs to add energy to break a chemical bond; hence the student's suggestion that the third atom actually adds energy to help the other two form a chemical bond, instead of taking the excess energy away to allow the bond to form.
For the "Match on the Hot Plate" item, level 1 responses were not as diverse; they generally reflect the ability to track energy transfer in the system at the macroscopic level without providing atomic-level details (see the level 1 sample response for "Match on the Hot Plate").

Level 2 reflects transitional, macro- to molecular-level understanding of chemical bonding from the perspective of energy and electric force. Specifically, student models show atomic- and molecular-level detail, and explanations mention that it takes energy to separate atoms (see the level 2 response for "Match on the Hot Plate"). However, answers at this level do not provide a full causal mechanistic account of how energy is transferred when atoms are separated and of how the energy provided causes bonds to break. Answers at this level also contain some inaccuracies (for example, the sample level 2 answer for "Match on the Hot Plate" mentions that atoms are part of the flame from the match). Similarly, as related to chemical bond formation, answers at level 2 reflect a detailed atomic-level causal mechanistic understanding from the perspective of electrical interactions, including the idea that a chemical bond forms at the distance between two atoms at which the attractive and repulsive interactions between components of the atoms balance out (see the sample response for the level 2 "Atoms Forming a Bond" item). However, the idea of energy and how it is involved in bond making/bond breaking processes is still lagging or contains many inaccuracies at level 2. Students might need to be prompted to elicit ideas of energy, and even when prompted they do not necessarily relate energy changes in the system to the changes in electrical interactions that lead to bond formation.

Finally, at level 3 of the 3D construct map, student models and explanations demonstrate microscopic-level causal mechanistic understanding of bond breaking and bond forming processes from the perspective of energy and force. Upon completion of Unit 2, no responses were identified that would be fully consistent with level 3. Specifically, at this level student models and explanations are expected to demonstrate full causal relationships between changes in the energy of the system and the associated changes in attractive/repulsive interactions between atoms as related to bond forming and breaking, as well as a clear understanding of the differences between heat, force and energy. It is likely that this level of understanding develops as students have additional opportunities to explore more phenomena.

Evidence in Support of Developmental Nature of Student 3D Understanding

While there were no level 3 responses observed in the interviews or in the scoring of the entire student sample on the written pre and post assessments, there were some responses that could be characterized as transitioning between the levels of the 3D LP. Table 5 provides examples of student answers that were considered to fall between levels and explains why. For example, transitioning from level 0 to 1 of the 3D construct map on the "Match on the Hot Plate" item is characterized by providing a more detailed causal account of the phenomenon, but with still few relevant DCIs present (no mention of energy or electrical interactions). For the "Atoms Forming a Bond" item, level 0/1 responses are characterized by referring to the idea of interactions between atoms as a driving factor of forming a bond, but with no clear causal mechanism for the origin of the interactions.
For the "Atoms Forming a Bond" item, the sample response uses the idea of a field to explain bond formation, but does not explain the origin of the field or how it is involved in forming a bond. Further, transitional responses at level 1/2 of the 3D construct map are characterized by providing more atomic- and molecular-level mechanistic details of phenomena, but lack a detailed mechanism relating interactions between atomic components to bond formation, and energy ideas are absent or used inaccurately. For example, for the "Match on the Hot Plate" item, the sample response shows a model indicating that bonds break as a result of the "spark from the hot plate", which causes a chemical reaction and sets the match on fire. Relevant ideas are present, but they are not used in a way that provides a clear causal mechanistic account of why the match lights up, and ideas of energy are completely absent. Similarly, for the "Atoms Forming a Bond" item, level 1/2 transitional responses show atomic-level detail (both sample responses show atomic structure and relevant atomic components), but how the interactions between atomic components are involved in forming a bond is not clear. In sample response #1 the explanation indicates that the interactions are due to the field, but the origin of the field is not explained. In sample response #2 the attractive and repulsive interactions between components of the atoms are accounted for, but the idea of a balance of attractive and repulsive forces is missing, and therefore the mechanism of bond formation is not fully explained; ideas of energy are not related to electrical interactions between components of atoms and are not mentioned in the context of bond formation. Finally, level 2/3 transitional responses reflect a detailed molecular-level mechanism that relates interactions between components of atoms to energy, but contains some inaccuracies. For example, for "Match on the Hot Plate" the mechanism for lighting the match is explained at the molecular level, but the answer indicates that bonds are broken as a result of heat from the hot plate rather than energy. However, the answer further states that energy is released when new bonds form; it is therefore unclear whether the student treats the ideas of heat and energy as equivalent. Further, for the "Atoms Forming a Bond" item, the sample transitional response provides a causal mechanistic account of how interactions between components of atoms and energy are involved in bond formation, with some inaccuracies; for example, the answer indicates that the potential energy is equally high when the atoms are far away from each other and when they are close together. To summarize, transitional responses tend to contain more relevant content (aspects of DCIs), but lack application of that content for explaining phenomena. This reflects the nature of 3D understanding described in the 3D construct map, which is characterized by achieving knowledge-in-use, or the ability to apply DCIs, SEPs and CCCs to explain phenomena. Transitional responses were assigned the lower level on the 3D construct map as the final level for online responses because they did not contain all the aspects consistent with the higher level.

Table 3.5 Sample responses that fall between levels of the 3D construct map

Level 0/1
Sample student answer, "Match on the Hot Plate":
Student: "The friction from it moving across a rough surface causes the heat which makes it set on fire. The hot plate gives the match enough heat to lite"
Comment: the answer does not use ideas of energy to explain why the match lights up, but provides a more detailed mechanistic account of what causes the phenomenon.
Sample student answer, "Atoms Forming a Bond":
Student: the two atoms interact through the field to form a bond.
Interviewer: Where do these fields come from?
Student: all atoms have fields around them.
Comment: the explanation focuses on the idea of a field as a major factor in bond formation; components of atoms or point charges are not shown, no explanation of the origin of the field is present, and there is no explanation of how the field contributes to forming a bond.

Level 1/2
Sample student answer, "Match on the Hot Plate":
Student: "The match lights up because of the chemical reaction. The spark from the hot plate breaks the bonds in the molecules of the match, there is a chemical reaction and the match sets on fire".
Comment: the model and explanation provide an atomic-level mechanism for bond breaking, but do not explain how energy is involved in the bond breaking process and in setting the match on fire.
Sample student answers, "Atoms Forming a Bond":
Sample response #1
Student: atoms are made of electrons, and protons and neutrons in the nucleus. When atoms form a bond, their fields interact.
Interviewer: Where do these fields come from?
Student: They are located around the atoms. When atoms are close enough, their fields interact and form a bond.
Comment: the model and explanation show atoms and indicate components of the atoms, but do not explain how the components interact to form a bond, or the origin of the fields. Energy is not mentioned.
Sample response #2
Student: as atoms get closer, the nuclei repel, and electrons of one atom attract to the protons of the other atoms.
Interviewer: so how do the two atoms form a bond?
Student: probably the attraction between protons and electrons is stronger than repulsion… I am not sure.
Comment: the model and explanation refer to interactions between components of atoms as the driving factor in bond formation, but the idea of a balance of attractive and repulsive forces is missing. Ideas of energy are also missing.

Level 2/3
Sample student answer, "Match on the Hot Plate":
Student: "The heat from the hot plate breaks bonds in the molecules of the match. The new bonds form and the match is set on fire. Energy is released when new bonds form, and fire is indication that energy is being transferred".
Comment: the answer does not use ideas of energy to explain why bonds in the molecules of the match are broken. However, a molecular-level mechanism is present, and energy is used to explain the observed flame as a result of bonds forming and energy being released.
Sample student answer, "Atoms Forming a Bond":
Student: They (atoms) have to get close enough without repelling, and they have to have the energy to make a bond, but I don't remember how. When they are moving towards each other they have potential energy, when they are attracting or repelling. And when they get close enough the potential energy goes down I think because the atoms are attracting to each other more or something…
Interviewer: Why do they repel when they are too close?
Student: because they have electrons, they get too close and they repel. The electrons are negative, they are on the outside (of the spheres shown in the model), and the protons are positive, they are on the inside.
Interviewer: and why would atoms attract when they are far away?
Student: because the electrons attract to the nucleus of the other atom
Interviewer: And what do you mean by balanced here?
Student: they are not repelling or attracting because the electrons are wanting to attract the protons, but at the same time the electrons are repelling from each other.
Interviewer: How do your energy graphs relate to each of the situations?
Student: the potential energy is high when they move towards each other or away from each other because they are either repelling or attracting, which builds up the potential energy. Potential energy is like, it wants to move, but it's not moving yet. Kinetic energy is energy in motion.
Comment: the explanation describes energy changes associated with electrical interactions between components of atoms when prompted, even though the model does not show the components of atoms (protons, electrons). Some inaccuracies are present in the answer; for example, the potential energy is said to be high both when the atoms are far away and when they are close together.

Consistency in Assigning Responses to 3D Construct Map Level for Different Phenomena

Since students were asked to explain more than one phenomenon, it was possible to study their ability to transfer their 3D understanding to different contexts. Specifically, the "Atoms Forming a Bond" item is an example of an abstract phenomenon that students cannot directly observe, which makes it harder to model and explain; this item also contains more complex ideas and requires deeper understanding. On the other hand, the "Match on the Hot Plate" item focuses on a more familiar phenomenon that is directly observed in the video shown to students. This difference in how familiar the phenomena were to students is evident in the levels of the answers provided for the two scenarios in the interview. Table 6 shows the assignment of levels for each student on each interview item. Specifically, on the pretest, 7 students scored at level 0, 8 students at level 1 and 2 students between levels 0 and 1 of the 3D construct map on the "Match on the Hot Plate" item. On the "Atoms Forming a Bond" item, 10 students scored at level 0, 5 students at level 1 and 2 students at level 1/2 of the 3D construct map on the pretest. These results suggest that the abstract "Atoms Forming a Bond" item was somewhat more difficult for students to model and explain. Still, a considerable number of students scored at level 1 or even level 1/2 on the pretest (7 of the 17 students in total), suggesting that many students were able to apply prior knowledge about electric charges gained during Unit 1 to suggest a possible mechanism for chemical bond formation. Overall, the majority of interviewed students demonstrated proficiency between levels 0 and 1 of the 3D LP on the pre-Unit 2 interview.

On the posttest, nobody scored at level 0 on either interview item. On the "Match on the Hot Plate" item, 5 students scored at level 1, 9 students at level 2, 1 student at level 1/2 and 2 students at level 2/3; on the "Atoms Forming a Bond" item, 6 students scored at level 1, 6 students at level 2, 2 students at level 1/2 and 3 students at level 2/3. These results suggest that students are developing macroscopic-level understanding of energy and are starting to make sense of how energy and force might be related to bond breaking/bond making processes, which is mostly consistent with levels 1 and 2 of the 3D construct map. Overall, a clear progression along the levels of the 3D construct map is evident in the student interview analysis.
All students moved up at least one level of the 3D construct map upon completion of Unit 2. Additionally, the 3D construct map level assignments were consistent across the two phenomena, meaning that students overall received the same 3D construct map level for both phenomena upon completion of Unit 2. This suggests that although the contexts of the two interview items were quite different, this did not, to a large extent, affect students' ability to apply their understanding to explain bond making/bond breaking processes.

Table 3.6 Student 3D construct map level for each interview phenomenon, pre- and post-Unit 2

Student | Match on the Hot Plate, pre | Match on the Hot Plate, post | Atoms Forming a Bond, pre | Atoms Forming a Bond, post
A | 1 | 2 | 1 | 2
B | 0 | 1 | 0 | 1
C | 0 | 1 | 0 | 1
D | 0 | 2 | 0 | 2
E | 0/1 | 1 | 0 | 1
F | 1 | 1/2 | 0 | 1
G | 1 | 2/3 | 1/2 | 2
H | 1 | 2 | 1 | 2/3
I | 1 | 2 | 0 | 1/2
J | 1 | 2/3 | 1 | 2/3
K | 0/1 | 2 | 1/2 | 2
L | 0 | 2 | 0 | 1
M | 1 | 2 | 0 | 2
N | 0 | 1 | 1 | 1/2
O | 0 | 1 | 0 | 1
P | 0 | 2 | 0 | 2/3
Q | 1 | 2 | 1 | 2

Supporting the Validity of Levels of the 3D Construct Map Using IRT

In this section, Wright maps resulting from fitting the graded response model (GRM) are used to provide additional validity evidence for the 3D construct map levels. The GRM is a polytomous item model: it is used for items with more than two ordered response categories, like the ones designed for this study. Under the GRM, each response category has its own difficulty parameter (Samejima, 1969). The interpretation of category difficulty under the GRM is the following: a student with ability equal to the difficulty of a given response category has a fifty percent probability of scoring in that category or above, and a fifty percent probability of scoring below it (Samejima, 1969). When looking at the Wright map, we want to see whether the abilities that correspond to the difficulties of the various item response categories are consistent with those theoretically suggested by the rubric and the 3D construct map. Specifically, we expect item difficulties that correspond to lower ability levels to be located in the lower ability region of the Wright map for all items, because respondents of lower ability are more likely to endorse an easier (lower difficulty) response category, which in turn corresponds to a lower level of the 3D construct map. Similarly, item difficulties corresponding to higher ability levels should be located in the higher ability region of the Wright map, because higher ability is related to a higher probability of endorsing a more difficult response category, which corresponds to a higher level of the 3D construct map. If this pattern is consistent for all items on the assessment, then we have evidence for the validity of the hypothetical 3D LP (Wilson, 2005; Wilson, 2009; Doherty et al., 2015).
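In standard notation (a sketch of the usual GRM formulation; the symbols below are illustrative and not taken from the dissertation), the probability that respondent $j$ with ability $\theta_j$ scores in category $k$ or higher on item $i$ is modeled as

\[
P\left(X_{ij} \ge k \mid \theta_j\right) = \frac{1}{1 + \exp\left[-a_i\left(\theta_j - b_{ik}\right)\right]}, \qquad k = 1, \dots, m_i,
\]

where $a_i$ is the item slope and $b_{ik}$ is the category difficulty. When $\theta_j = b_{ik}$, this probability equals one half, which is the fifty-percent interpretation used above.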
The level 0-1 threshold separates level 0 from level 1 of the 3D construct map. The cutoff for level 0-1 is 1.31 on the logit scale and was taken to be the lowest item threshold for level 1. This means that respondents with ability above 1.31 are at level 1 of the 3D construct map, and respondents with ability below 1.31 are at level 0. Further, the cutoff for level 1-2 is 1.88 and has the same interpretation as the level 0-1 cutoff; it was calculated as the median of all item thresholds on the logit scale (Doherty et al., 2015). Since no scores corresponding to level 3 of the 3D construct map were observed, and thresholds for level 3 have therefore not been estimated, the cutoff for level 2-3 cannot be accurately determined. However, the highest threshold for level 2 is 2.60, and the level 3 cutoff is likely located close to or slightly above that value.
As seen in Figure 2, level 1 thresholds are well separated from level 2 thresholds. Specifically, all level 1 thresholds are located in approximately the same ability region and do not overlap any of the level 2 thresholds. In other words, no level 1 threshold falls above the level 1-2 cutoff, and no level 2 threshold falls below it. This suggests that the progression of student understanding predicted by the 3D construct map is supported by the data, providing quantitative validity evidence for the 3D construct map (Doherty et al., 2015; Wilson, 2004).

Figure 3.2 Wright map showing 3D construct map levels for Unit 2 assessment items (pretest and posttest respondent distributions; level 0-1 cutoff = 1.31, level 1-2 cutoff = 1.88)

Evaluating Student Learning Based on the Unit 2 Assessment
Pre- and post-assessment data were combined when fitting the GRM (see Appendix for details) in order to compare how the ability distributions change between pre- and posttest. The Wright Map in Figure 2 shows the distributions of respondents for the pre- and posttest on one graph. Both the pre- and post-Unit 2 distributions contain a substantial number of respondents below 0 on the logit scale. These are respondents with missing data, for whom zeros were imputed at both time points. Respondents who did not provide any answer on the pre- and posttest still participated in the curriculum, as can be seen from their work in Unit 2 saved in the online portal. Therefore, even though they had missing data for the Unit 2 assessment, they were left in the sample to ensure that their data could be used to investigate levels of the 3D construct map. To check the extent of learning associated with Unit 2, a Wald test was conducted to determine whether the increase in the mean between pre- and posttest was statistically significant. The mean increased from -0.006 to 0.516 on the logit scale between the pre- and posttest, and the Wald test showed that this increase was statistically significant (W = 305.1, df = 1, p < 0.001), indicating that learning occurred between the pre- and post-assessments for the entire sample of students. However, to better understand how this learning occurred in terms of student movement along the levels of the 3D construct map, we need to look at the distribution of responses and compare the pre- and post-unit assessments for each level of the 3D construct map.
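One way to quantify this movement, mirroring the logic of the R code later in the Appendix, is to evaluate the empirical cumulative distribution of the ability estimates at the level cutoffs. The vector below is illustrative data, not the study's estimates.

# Sketch: share of respondents in each 3D construct map level, computed
# from ability estimates with the empirical CDF (illustrative data only).
theta_pre <- rnorm(800, mean = 0.9, sd = 0.7)  # hypothetical pretest abilities
pct_pre <- ecdf(theta_pre)

pct_pre(1.31)                  # proportion below the level 0-1 cutoff (level 0)
pct_pre(1.88) - pct_pre(1.31)  # proportion between the cutoffs (level 1)
1 - pct_pre(1.88)              # proportion above the level 1-2 cutoff (level 2)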
Since the respondents who did not provide any answer on the pre- and post-assessment introduce too much noise into the distribution, they were removed from the Wright Map to make visible the spread in learning among the students who did provide answers. This allows drawing more accurate conclusions about student learning upon completion of Unit 2. Figure 3 below shows the Wright Map for the reduced data set of students who provided answers on the pre- and posttest.

Figure 3.3 Wright map with respondents who provided answers on the pre/post Unit 2 test (annotations mark the 3D construct map level cutoffs, the distribution maxima on the pre- and posttest at 1.16 and 1.59, and the average ability levels for the two thresholds at 1.52 and 2.23)

Observe in Figure 3 that the distribution maximum on the pretest is located at 1.16 on the logit scale, which is below the 1.31 cutoff value for level 1. In total, about 80% of respondents on the pretest lie below the level 1 cutoff value of 1.31 (see the R code in the Appendix for the percentage calculation). There is a small peak at 1.59 on the logit scale on the pretest, which is close to the 1.52 average ability value for threshold 1. Overall, about 20% of respondents lie in level 1 of the 3D construct map on the pretest, with essentially no respondents at level 2. Therefore, the majority of respondents lie below level 1 of the 3D construct map on the pretest, with some respondents located around the average threshold value for level 1. On the posttest, the distribution maximum is located at 1.59 on the logit scale. In other words, the small peak at 1.59 on the pretest grows and becomes the maximum of the distribution on the posttest. Overall, about 53% of respondents are located in level 1 of the 3D construct map on the posttest, about 21% fall in level 0, and about 26% fall in level 2. Therefore, there is clear movement of respondents along the levels of the 3D construct map upon completion of Unit 2.
Assigning a 3D Construct Map Level to Individual Students
This section shows how the 3D construct map for chemical bonding can be used to accurately place a student on a level, therefore allowing the validated 3D construct map and the associated assessment to be used as a diagnostic tool in the classroom. To assign a 3D construct map level to each individual student, it is important to take into consideration the measurement error associated with the estimation of each proficiency level. This is especially important for students whose proficiency estimates lie close to the cut points for the 3D construct map levels, or who provide answers consistent with an in-between level assignment, as was observed in the oral interviews. To do this, confidence intervals (CIs) for all proficiency estimates were calculated using one standard error in each direction (see Appendix for the R code). The Wright Maps were further modified by arranging student proficiencies in ascending order, excluding students who had all zeroes on the pre- and/or posttest.[10] The modified Wright Maps for the pre- and posttest are shown in Figures 4 and 5, respectively. The curved black line shows the proficiencies, and the grey band represents the upper and lower interval bounds. The horizontal dashed lines represent the cutoffs for the 3D construct map levels, and the vertical lines show the areas where the confidence intervals overlap the cut points.
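The decision rule applied in the next paragraphs can be sketched in a few lines of R. This is an illustration of the logic, not the Appendix code itself; it assumes theta and se are a proficiency estimate and its standard error as returned by fscores(..., full.scores.SE = TRUE), and it uses the cutoffs estimated above.

# Sketch: assign a 3D construct map level from a +/-1 SE confidence
# interval around a proficiency estimate. Bands that lie entirely within
# one region get a definite level; bands that straddle a cutoff get an
# in-between assignment.
assign_level <- function(theta, se, cut01 = 1.31, cut12 = 1.88) {
  lo <- theta - se  # lower CI bound
  hi <- theta + se  # upper CI bound
  if (hi < cut01) return("level 0")
  if (lo > cut12) return("level 2")
  if (lo > cut01 && hi < cut12) return("level 1")
  if (lo <= cut01 && hi >= cut01) return("level 0/1")
  "level 1/2"
}

assign_level(1.55, 0.10)  # band entirely within level 1 -> "level 1"
assign_level(1.30, 0.20)  # band straddles the 1.31 cutoff -> "level 0/1"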
If a confidence interval falls entirely into one of the 3D construct map regions (for example, the first 777 students on the pretest (Figure 4) and the first 613 students on the posttest (Figure 5)), the student is likely to provide answers consistent with level 0 of the 3D construct map and therefore should be assigned level 0 with a high degree of confidence. Similarly, students 857-896 on the pretest and students 670-790 on the posttest have confidence intervals that fall entirely into level 1 of the 3D construct map, so these students can be assigned level 1. Finally, the confidence intervals for students 838-899 on the posttest fall entirely into level 2, and those students can be assigned level 2 on the 3D construct map. However, sometimes the confidence interval overlaps a cut point between 3D construct map levels. For example, students 778-856 on the pretest and students 614-669 on the posttest have confidence intervals that overlap the level 0-1 cutoff, indicating that they are likely to provide answers consistent with an in-between level assignment. In this case, there is less certainty about the 3D construct map level assignment for these students. Similarly, students 896-899 on the pretest and 791-837 on the posttest have confidence intervals overlapping the level 1-2 threshold, indicating that there is less certainty in placing these students in level 2 of the 3D construct map.
[10] The X axis of the Wright Maps shown in Figures 4 and 5 was truncated to exclude students who had zeroes on the pre and post assessment and to better highlight the graph.

Figure 3.4 Modified Wright map for the pre-Unit 2 test showing student proficiency estimates and standard error bands from lowest to highest (level 0 and level 1 regions with the level 0-1 and 1-2 cutoffs marked)

Figure 3.5 Modified Wright map for the post-Unit 2 test showing student proficiency estimates and standard error bands from lowest to highest (level 0, 1 and 2 regions with the level 0-1 and 1-2 cutoffs marked)

Overall, only 81 students on the pretest and 101 students on the posttest fall in between levels of the 3D construct map, which corresponds to 9% and 11%, respectively. This indicates a high degree of certainty in assigning a 3D construct map level to individual students for the majority of the sample. To be exact, since one standard error was used to calculate the confidence intervals, this corresponds to 68% certainty in assigning a 3D construct map level to an individual student. This provides evidence for the validity of the 3D construct map as a diagnostic tool that allows placing a student on a level with a high degree of accuracy and using the information about what student understanding looks like in terms of the three dimensions (DCI, SEP, CCC) at each level to characterize their science proficiency. To the author's knowledge, this is the first validated 3D construct map that provides this degree of level-assignment certainty and, therefore, applicability in terms of immediate pedagogical use.
Discussion
This work presents a 3D construct map for chemical bonding developed following previous research and principles expressed in the Framework that suggest teaching the concept of chemical bonding from the perspective of the balance of electric forces and energy minimization (NRC, 2012; Taber, 1998; Cooper et al., 2014). The 3D construct map presented here is aligned to NGSS PEs and validated in the context of the NGSS-aligned "Interactions" curriculum.
The curriculum aims to build student understanding of chemical bonding as an extension of the same principles of electrostatic attraction that drive interactions between macroscopic charged objects and the formation of intermolecular interactions that increase the stability of the system through energy minimization. In that regard, the curriculum aims to help students build an integrated understanding of energy and electrical interactions at the macro and atomic-molecular scales. While this approach to teaching chemical bonding has been gaining popularity at the undergraduate level (Cooper & Klymkowsky, 2013), to the author's knowledge this is the first study that shows the development of student understanding of chemical bonding following this approach at the secondary level in the context of an NGSS classroom. In that regard, this study provides valuable takeaways regarding students' 3D understanding of chemical bonding under this instructional approach, which are discussed below.
The first takeaway is that students need careful scaffolding to learn to integrate ideas of energy and electric force to explain chemical bonding. As can be seen in the "Atoms Forming a Bond" interview item, the highest-level responses for the most part did not contain ideas of energy unless prompted, and students felt that they had provided a fully causal account of the formation of a chemical bond by describing the mechanism of balancing attractive and repulsive interactions. Students did not seem to consider energy an important driving factor in the formation of chemical bonds. This is consistent with previous research showing that students struggle to connect ideas related to atomic structure and electrical interactions at the atomic-molecular scale to the associated energy changes (Becker & Cooper, 2014). This difficulty might be due to the fact that ideas of energy are abstract, and students often don't have direct experience observing energy changes associated with electrical interactions, especially at the atomic-molecular level, and therefore cannot make use of ideas of energy productively (Cooper, Klymkowsky, & Becker, 2014). This finding is also evident in the results of student interviews using the "Match on the Hot Plate" item. Specifically, student models and explanations do not show a clear relationship between energy and the bond breaking process in the match at either level 1 or level 2 of the 3D construct map. Additionally, answers at levels 0-2 tend to interchange ideas of heat and energy when describing bond breaking and bond making processes. For example, at level 2 answers tend to recognize that heat is involved in separating molecules into individual atoms, but the details of the bond breaking and bond making processes and the associated energy changes are still missing. This might be due to the fact that students often confuse ideas of heat and energy, presenting heat as a form of energy rather than a manifestation of energy transfer (Jewett, 2008). The "Interactions" curriculum provides a detailed description of these important, subtle differences for teachers as part of the teacher materials. In short, it is clear that students struggle to incorporate the idea of energy when explaining chemical bonding, and this appears to be the key idea required to achieve the type of 3D understanding consistent with the highest level (level 3) of the 3D construct map.
For researchers and educators, the issue becomes: how do we organize instruction to help students understand the importance of using the energy perspective along with electric force, and how do we help students distinguish between ideas of heat and energy? Or is it the case that students develop the ability to integrate these ideas later in the curriculum? Additionally, how do we support teachers in emphasizing these ideas in their classrooms, as opposed to perpetuating traditional approaches to teaching chemical bonding that do not emphasize the importance of energy minimization and the balance of electrical forces? These are all important questions for future research.
The second takeaway is that the idea of the balance of electric forces between the components of interacting atoms seems central for developing a useable conceptual model for explaining why a chemical bond forms. Even when students reason about the mechanism of chemical bond formation in terms of attractive and repulsive interactions between components of atoms, they seem to struggle with the idea of the balance of attractive and repulsive electric forces, and instead explain chemical bonding in terms of the magnitude of the force exerted by components of the atoms upon each other (for example, a student answer suggesting that a chemical bond forms because the attractive forces between nuclei and electrons are stronger than the repulsive forces between electrons). This finding is consistent with previous research showing that students tend to believe that the forces exerted on the electrons by the nucleus are larger than the forces exerted on the nucleus by the electrons (Taber, 1998). In short, it is clear that students struggle to incorporate the idea of the balance of attractive and repulsive interactions when explaining chemical bonding, and this appears to be the key idea required to achieve the type of 3D understanding consistent with level 2 of the 3D construct map.
The third takeaway has to do with suggesting possible ways in which students build useable 3D understanding of chemical bonding. Specifically, it is interesting to see that level 1 reflects the most diverse range of student response types for the "Atoms Forming a Bond" item. For this item, student reasoning can range from applying prior knowledge learned in Unit 1 related to electrical interactions between point charges (see sample response #1) to recalling information from the classroom simulation related to giving excess energy to the third atom in order to form a stable bond (sample response #2). Additionally, some answers contain a combination of prior knowledge and new information, reflected in the use of ideas related to interactions between point charges together with the involvement of the third atom learned in the simulation, but probably misunderstood in light of the common misconception that energy is needed to form bonds (sample response #3). These all seem like very different ideas that students are trying to apply when explaining a very abstract, unobservable process of chemical bond formation. At level 1, student answers do not seem to reflect any relatively permanent mental model used to explain chemical bonding. Rather, it seems that students draw on various ideas they have learned, which they still struggle to connect together, to construct a possible explanation. This makes a lot of sense because these are very challenging ideas, and it takes time to put them together.
At the same time, as students progress towards higher levels of understanding, the types of answers provided seem to reflect a more well-established model used to explain chemical bonding. Specifically, at level 2, all answers reflect student ability to model chemical bonding in terms of the balance between attractive and repulsive interactions between atoms. Therefore, the progression of student understanding along the levels of the 3D construct map seems to reflect a process during which students learn to combine various, often disconnected ideas to explain relevant phenomena. In that regard, this finding is consistent with the notion expressed in the Framework that experts incorporate new knowledge into already established frameworks of understanding, while novices tend to have knowledge that is largely unstructured and disconnected (NRC, 2012). An interesting finding of this study is that moving towards expert-like understanding of DCIs (content), reflected in well-established mental models used to explain phenomena, requires learning to make connections between various ideas so that these previously disconnected ideas can be used productively to make sense of phenomena. In this regard, the fourth takeaway of this study is that transitioning to a higher level of understanding can be characterized by learning to make connections between various ideas to form a more permanent mental model. This conclusion is also supported by evidence from the in-between level responses for the "Atoms Forming a Bond" item, which show the emergence of one general conceptual model in which various ideas are connected and used to explain the phenomena of bond making and bond breaking, while ideas occasionally remain that are not connected to the rest of the framework in a meaningful way (for example, one transitional-level response for the "Atoms Forming a Bond" item states that attractive interactions are stronger than repulsive ones, which is inaccurate, while another states that atoms interact through a field when they form a bond, which is correct but still does not provide a causal account of why atoms form a bond).
Further, the student interview data suggest that students hold a wider range of disconnected ideas for the "Atoms Forming a Bond" phenomenon than for the "Match on the Hot Plate" phenomenon, which is evident in the larger variety of answers provided at level 1 of the 3D construct map for the "Atoms Forming a Bond" item. This might be because "Atoms Forming a Bond" is a very abstract, unobservable phenomenon, as opposed to the fairly familiar, observable "Match on the Hot Plate" phenomenon. Specifically, the "Atoms Forming a Bond" phenomenon requires students to directly apply knowledge of atomic models, which is not knowledge constructed directly through experience, but rather knowledge communicated through previously developed models learned in class. Therefore, the "Atoms Forming a Bond" item elicits prior stored knowledge to a larger extent than the "Match on the Hot Plate" item, for which students construct their explanation primarily from direct observations. This difference in the types of answers provided based on context suggests that at level 1 students do in fact hold various disconnected ideas rather than a well-established mental model. As students move to level 2, the answers are no longer as diverse, indicating the formation of a well-established mental model that students can apply across the two item contexts.
Finally, the last takeaway of this study has to do with the developmental nature of student understanding and the idea that deep, integrated understanding of science takes time and appropriate scaffolding to develop (Smith et al., 2006). Evidence of this is seen in the student interviews, where more higher-level responses are observed by the end of Unit 2, with student answers falling on a spectrum from less to more sophisticated understanding. A similar pattern holds for the analysis of student written responses using IRT (compare Figures 4 and 5 in terms of the number of students who progressed to levels 1 and 2 of the 3D construct map by the end of Unit 2). Further, the fact that none of the students reached level 3 of the 3D construct map by the end of Unit 2 indicates that it takes a long time before students develop the ability to integrate ideas of electric force and energy to explain chemical bonding at the atomic-molecular level. At the same time, the 3D construct map for chemical bonding presented in this work can be used to accurately place students on a level and provides a rich description of what student 3D understanding of chemical bonding looks like at each level of the 3D construct map.
Limitations
This work includes several limitations. First, since Unit 1 of the "Interactions" curriculum was specifically focused on building student understanding of electrical interactions at the macro and atomic-molecular scales, the prior knowledge that students came with, as well as the learning trajectory for Unit 2, was partly determined by what was learned in Unit 1. Therefore, it is possible that some of the student responses observed in the validation process of this construct map might not be observed in a context different from the "Interactions" curriculum. However, the general progression of student ability to integrate ideas of force and energy in the context of chemical bonding should still hold regardless of the curriculum context, because the 3D construct map was built using relevant research literature and NGSS PEs that are not curriculum specific. Second, this work focused on specific aspects of the relevant DCIs, one CCC, and two SEPs. It is possible that the progression of student understanding might be different if a different set of CCCs and SEPs were chosen, or if the relevant DCI aspects were chosen differently. Third, the large number of students with missing data substituted by zeros indicates that the test was overall very difficult for the majority of students. For future work, it will be beneficial to include items that probe the lower levels of the 3D construct map, which will provide a better understanding of how students at the lower end of the ability spectrum make sense of ideas related to chemical bonding from the perspective of energy and force.
APPENDIX
Testing Competing Item Response Theory (IRT) Models
The items on Unit 2 present four ordinal response categories, where each category corresponds to a level of the 3D construct map. Specifically, the 0-, 1-, 2-, and 3-point response categories on each item correspond to 3D construct map levels 0, 1, 2, and 3, as can be seen from the examples of scoring rubrics. Common IRT models for polytomous items are the Graded Response Model (GRM; Samejima, 1969) and the Generalized Partial Credit Model (GPCM; Muraki, 1992). To choose an appropriate IRT model to represent the data in this study, model fits for GRM and GPCM were compared.
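As a rough sketch of how such a comparison runs in the mirt package: the actual model specification used in this study, with pre/post constraints, appears in the R code later in this Appendix, and the simulated data below are illustrative only.

library(mirt)
set.seed(1)
# Simulate ordinal 0-3 responses for 8 items (illustration only; not study data).
resp <- simdata(a = matrix(rep(1.2, 8), ncol = 1),
                d = matrix(c(2, 0, -2), nrow = 8, ncol = 3, byrow = TRUE),
                N = 500, itemtype = 'graded')

grm  <- mirt(resp, 1, itemtype = "graded", verbose = FALSE)  # GRM fit
gpcm <- mirt(resp, 1, itemtype = "gpcm",   verbose = FALSE)  # GPCM fit

anova(grm, gpcm)  # side-by-side log-likelihood, AIC and BIC
M2(grm)           # limited-information fit: M2, RMSEA, CFI/TLI
M2(gpcm)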
To ensure a more accurate representation of the data, and to be able to compare student learning on the pre- and post-assessments, the pre- and post-assessment data were combined to specify the IRT model estimated under GRM and GPCM. Slopes and corresponding intercepts were constrained to be equal on the pre- and post-assessment for each item. This rigid model specification was safe to assume because the dimensionality and longitudinal invariance of the Unit 2 assessment instrument were extensively studied a priori (Chapter 1). The results of that study showed that the Unit 2 assessment scale is two-dimensional and that partial measurement invariance holds over time for the pre- and post-assessments. For the IRT modeling, however, two-dimensional GRM and GPCM models could not be estimated due to the limited number of indicators (items). As a result, in this study the Unit 2 assessment data were modeled using one-dimensional IRT. This was deemed appropriate because the two latent dimensions are highly correlated (0.784 on the pretest; 0.928 on the posttest) and aim to measure the same scientific idea: student understanding of chemical bonding from the perspective of energy and force. The R code is provided later in this Appendix. The results of the IRT model estimation are shown in Table 7.

Table 3.7 Model comparison for GPCM and GRM

Model | LL | # par | AIC | BIC | M2 | df | p value | RMSEA | CFI/TLI
GPCM | -4240 | 51 | 8535 | 8665 | 577 | 109 | <0.001 | 0.0692 | 0.973/0.972
GRM | -4224 | 51 | 8503 | 8633 | 527 | 109 | <0.001 | 0.0654 | 0.976/0.975

A larger log-likelihood value, as well as smaller AIC and BIC values, suggests a better-fitting model (Nering & Ostini, 2011; Toland, 2014). Based on these indexes, GRM is a slightly better-fitting model for this data sample. Further, the M2 goodness-of-fit statistic was used to evaluate overall model fit (Maydeu-Olivares & Joe, 2005). Smaller M2 values also indicate better model fit (Toland, 2014), and, following this guideline, GRM again presents a better-fitting model for the data than GPCM. The p-values for both GPCM and GRM indicate lack of fit. However, lack of fit on the M2 statistic is common when fitting parametric models like GPCM and GRM to real data (Cai, Maydeu-Olivares, Coffman & Thissen, 2006; Toland, 2014). Therefore, additional model fit indexes were used, including RMSEA and CFI/TLI. The cut-off criteria for good and reasonable model fit were <0.06 and <0.08, respectively, for RMSEA, and >0.95 and >0.90, respectively, for CFI/TLI (Hu & Bentler, 1999; Marsh, Hau & Wen, 2004; Van Dam, Earleywine & Borders, 2010). Based on the RMSEA and CFI/TLI values presented in Table 7, GRM and GPCM have similar model fit: RMSEA values for both models are marginally good, and the CFI/TLI indexes represent good model fit. Therefore, based on all of this information, GRM appears to be the more suitable model for the data, and it is used below to evaluate model assumptions and obtain item parameters.
Evaluating GRM Model Assumptions
IRT model assumptions were further evaluated for GRM following Toland (2014). As mentioned above, dimensionality and partial measurement invariance were established for the measurement instrument in the previous study (Chapter 1). The assumption of local independence is tested below. Local independence (LI) assumes that student responses on the test are influenced only by their level on the latent trait continuum of interest.
The LI assumption is very important for IRT analysis because, if violated, item parameters become distorted, including inflated slopes and more homogeneous thresholds across items (Toland, 2014). In the context of NGSS, the assumption of local independence becomes increasingly harder to meet because 3D assessments call for more contextualized, story-based items in which students can use all the information available to them to demonstrate knowledge-application ability (Gorin & Mislevy, 2013). These items often take the form of testlets, as is the case for the Unit 2 assessment instrument here, which makes it especially difficult to meet the assumption of local independence because items within a testlet share more commonalities than items across testlets. This might lead to increased dimensionality and violation of the LI assumption (Gorin & Mislevy, 2013). To evaluate the LI assumption in this study, the Q3 index was used with a cut-off value of |0.2| (Kim, De Ayala, Ferdous & Nering, 2011). This index and cut-off value have an acceptable Type I error rate and are substantially more powerful than the commonly used X2 and G2 local dependence indexes (Chen & Thissen, 1997). Further, it is also recommended that the 0.2 cut-off value be used in a relative way, to determine what counts as a "large" correlation relative to the other residual correlations in the model (Dr. Chalmers, personal communication). Following these guidelines, the Q3 statistic was used to evaluate the local independence assumption. The Q3 matrix is shown in Figure 6; only values above 0.2 are shown. Most residual correlations were below the cut-off value of 0.2 in absolute value, and there were no residual correlations that were unusually high relative to the others. Specifically, the highest correlation value was -0.381, between items 2 and 3 on the pretest. A slightly high residual correlation is not surprising for these items because they belong to the same testlet. However, this correlation is not unreasonably high compared to the other values, and most of the correlations are below the cut-off value of 0.2. Therefore, there is enough evidence to conclude that the assumption of local independence is met.

Figure 3.6 Q3 matrix

Model-Data Fit
Once the IRT model is chosen and the model assumptions are evaluated, it is appropriate to evaluate how well the GRM fits the data and to obtain the item parameters that will be used in validating the levels of the 3D construct map.
Item-level fit. To assess how well the GRM fits each item, the S-X2 item fit statistic for polytomous data was examined (Orlando & Thissen, 2000, 2003). A statistically significant p-value indicates that the model does not fit a given item. Item fit was evaluated using a 1% significance level together with RMSEA values, because evaluating item fit with the S-X2 statistic involves testing multiple hypotheses, and larger samples lead to a greater likelihood of statistically significant results (Stone & Zhang, 2003; Toland, 2014). The S-X2 item fit statistics are shown in Table 8 below. Items 1, 3, 6 and 7 of the pretest and item 5 on the posttest have p-values <0.01, indicating poor model fit for these items. Since larger samples increase the likelihood of statistically significant results, the RMSEA values for these items were also examined. As can be seen from Table 8, all RMSEA values are below 0.06, indicating good model fit. Therefore, the GRM fits each item reasonably well.
Table 3.8 S-X2 item fit statistics

Item | S-X2 | df | RMSEA | p
Q1T1 | 22.788 | 9 | 0.041 | 0.007
Q2T1 | 12.965 | 8 | 0.026 | 0.113
Q3T1 | 26.294 | 10 | 0.043 | 0.003
Q4T1 | 19.952 | 8 | 0.041 | 0.011
Q5T1 | 22.941 | 11 | 0.035 | 0.018
Q6T1 | 29.459 | 11 | 0.043 | 0.002
Q7T1 | 30.422 | 10 | 0.048 | 0.001
Q8T1 | 27.090 | 18 | 0.024 | 0.077
Q1T2 | 20.23 | 18 | 0.012 | 0.320
Q2T2 | 28.509 | 17 | 0.027 | 0.039
Q3T2 | 22.788 | 21 | 0.010 | 0.355
Q4T2 | 19.126 | 18 | 0.008 | 0.384
Q5T2 | 35.283 | 18 | 0.033 | 0.009
Q6T2 | 24.136 | 17 | 0.022 | 0.116
Q7T2 | 37.617 | 21 | 0.030 | 0.014
Q8T2 | 40.457 | 23 | 0.029 | 0.014

Person-level fit. To evaluate the consistency of student reasoning across the different contexts represented in the items, the person fit statistic (Zh) was examined (Drasgow, Levine & Williams, 1985). The Zh distribution across the pre- and post-assessment events for all students is shown in Figure 7 below.

Figure 3.7 Person fit Zh statistics

A value of -1.96 is used as the cut-off for the Zh statistic; students with Zh statistics above -1.96 show regular response patterns (Drasgow et al., 1985; Felt, Castaneda, Tiemensma & Depaoli, 2017). Figure 7 shows that the majority of students are above the cut-off value of -1.96 (dashed line), suggesting that the majority of the sample demonstrates responses consistent with those hypothesized by the 3D construct map levels. This provides evidence towards the validity of the hypothetical 3D construct map levels (Doherty, Draney, Shin, Kim & Anderson, 2015).

R Studio Code

library(mirt)      # For fitting IRT models
library(foreign)   # For importing the SPSS data file
library(WrightMap) # For Wright maps
library(ggplot2)   # For histograms

Model Fit Evaluation: Unit 2 pre/post test

# Items 1-8 represent Unit 2 pretest items; items 9-16 represent Unit 2 posttest items.
# Pre and post test items are identical.

Model Statement

FAmodelU2_1D <- mirt.model('F1 = 1, 2, 3, 4, 5, 6, 7, 8
F2 = 9, 10, 11, 12, 13, 14, 15, 16
CONSTRAIN = (1,9, a1, a2), (2,10, a1, a2), (3,11, a1, a2), (4,12, a1, a2), (5,13, a1, a2), (6,14, a1, a2), (7,15, a1, a2), (8,16, a1, a2), (1,9, d1), (2,10, d1), (3,11, d1), (4,12, d1), (5,13, d1), (6,14, d1), (7,15, d1), (8,16, d1), (1,9, d2), (2,10, d2), (3,11, d2), (4,12, d2), (5,13, d2), (6,14, d2), (7,15, d2), (8,16, d2)
MEAN = F1, F2
COV = F1*F2')

Model Estimation

pre.items <- c("U2Q1T1","U2Q2T1","U2Q3T1","U2Q4T1","U2Q5T1","U2Q6T1","U2Q7T1","U2Q8T1")
post.items <- c("U2Q1T2","U2Q2T2","U2Q3T2","U2Q4T2","U2Q5T2","U2Q6T2","U2Q7T2","U2Q8T2")
all.items <- c(pre.items, post.items)

# GRM model, 1D
modgrmU2_EM_1D <- mirt(U2_all[all.items], FAmodelU2_1D, itemtype="graded", verbose=FALSE, SE=TRUE)
M2(modgrmU2_EM_1D, impute=20, CI=.95)

# GPCM model, 1D
modgpcmU2_EM_1D <- mirt(U2_all[all.items], FAmodelU2_1D, itemtype="gpcm", verbose=FALSE, SE=TRUE)
M2(modgpcmU2_EM_1D, impute=20, CI=.95)

Item analysis with chosen model (GRM)

Model diagnostics

# Residual diagnostics
residuals(modgrmU2_EM_1D, type="Q3", suppress=.2) # Evaluate local independence (LI); only shows pairs with |Q3|>0.2 (possible LI issues)

# Item fit diagnostics
print(item.fit <- itemfit(modgrmU2_EM_1D, fit_stats="S_X2")) # Evaluate item fit (cutoff: p<0.01)

# Person fit diagnostics
person.fit <- personfit(modgrmU2_EM_1D, method="ML") # Evaluate person fit (Zh statistic)
ggplot(person.fit, aes(x=Zh)) +
  geom_histogram(bins=15, colour="black", fill="white") +
  geom_vline(xintercept=-1.96, col="black", linetype="dashed") +
  labs(x="Zh statistic", y="Count") +
  theme_bw(base_size=12) +
  theme_classic() # Histogram of Zh statistics (values above -1.96 indicate good person fit)

Wald Test for significance of the mean

# Is mean 2 equal to mean 1? They are not equal if p < 0.05
(infonames <- wald(modgrmU2_EM_1D))
# Choose the columns to be used in the Wald test
L <- matrix(0, 1, 27)
L[26] <- 1
L[25] <- -1
wald(modgrmU2_EM_1D, L)

Item parameters and thresholds

item.par <- data.frame(coef(modgrmU2_EM_1D, simplify=TRUE)$items) # Item parameters
item.par$T1 <- with(item.par, ifelse(a1>0, -d1/a1, -d1/a2)) # a1 = discrimination; difficulty = (-d/a)
item.par <- item.par[1:8,] # Keep the first 8 rows; the remaining time-two items have parameters equal to time one
item.par$T2 <- with(item.par, ifelse(a1>0, -d2/a1, -d2/a2))
mean.T1 <- mean(item.par$T1) # Mean threshold 1
mean.T2 <- mean(item.par$T2) # Mean threshold 2
t0_1 <- min(item.par$T1) # Cutoff for level 0-1
t1_2 <- median(c(item.par$T1, item.par$T2)) # Cutoff for level 1-2
t2_3 <- max(item.par$T2) # Cutoff for level 2-3

Ability Wright Maps

# Compute factor scores (Y axis for the Wright Map)
AbilityU2Pre_Post <- data.frame(fscores(modgrmU2_EM_1D))
# Add ability scores to the data file
fulldata_U2 <- data.frame(cbind(U2_all, AbilityU2Pre_Post))
# Merge students who have complete data with the full data file to create the reduced sample file
reducedata_U2 <- merge(U2_STUID_allstudentscompletedata, fulldata_U2, by.x="StudentID", by.y="STUID")

# Complete sample data: plot the full-sample Wright Map
wrightMap(with(fulldata_U2, cbind(F1,F2)), matrix(c(item.par$T1, item.par$T2), ncol=2),
          person.side=personDens, cutpoints=c(t0_1, t1_2, mean.T1, mean.T2), min.l=-.9, max.l=2.7)

# Reduced sample data: plot the reduced-sample Wright Map
wrightMap(with(reducedata_U2, cbind(F1,F2)), matrix(c(item.par$T1, item.par$T2), ncol=2),
          person.side=personDens, cutpoints=c(t0_1, t1_2, mean.T1, mean.T2), min.l=-.9, max.l=2.7)

Finding peaks on the reduced sample Wright Map and % of examinees in each level of the 3D LP

# Functions to calculate percentiles for given cut-offs
pct_pre <- ecdf(reducedata_U2$F1)  # Percentile function for the pretest
pct_post <- ecdf(reducedata_U2$F2) # Percentile function for the posttest

# Thresholds for levels 0-1 and 1-2 of the 3D LP
t0_1    # Lowest difficulty 1
t1_2    # Median of difficulty 1 and difficulty 2
mean.T1 # Average difficulty 1
mean.T2 # Average difficulty 2

# Percentage of examinees between thresholds
# Pretest
pct_pre(t0_1) # % of probability density below the level 1 cutoff
pct_pre(t1_2) # % of probability density below the level 2 cutoff
pct_pre(t1_2) - pct_pre(t0_1) # % between the level 1 and level 2 cutoffs on the pretest
pct_pre(mean.T1) # % below average difficulty 1
pct_pre(t1_2) - pct_pre(mean.T1) # % between average difficulty 1 and the level 2 cutoff
(pct_pre(t1_2) - pct_pre(t0_1)) - (pct_pre(t1_2) - pct_pre(mean.T1)) # % between the level 1 cutoff and average difficulty 1
pct_pre(mean.T2) # % below average difficulty 2

# Posttest
pct_post(t0_1) # % of probability density below the level 1 cutoff
pct_post(t1_2) # % of probability density below the level 2 cutoff
pct_post(t1_2) - pct_post(t0_1) # % between the level 1 and level 2 cutoffs on the posttest
pct_post(mean.T1) # % below average difficulty 1
pct_post(t1_2) - pct_post(mean.T1) # % between average difficulty 1 and the level 2 cutoff
(pct_post(t1_2) - pct_post(t0_1)) - (pct_post(t1_2) - pct_post(mean.T1)) # % between the level 1 cutoff and average difficulty 1
pct_post(mean.T2) # % below average difficulty 2
pct_post(mean.T2) - pct_post(t1_2) # % between the level 2 cutoff and average difficulty 2

# Determine density peak values
# Peak values for the pretest
print(pre_peak1 <- density(reducedata_U2$F1[which(reducedata_U2$F1 > 1.5)])$x[which.max(density(reducedata_U2$F1[which(reducedata_U2$F1 > 1.5)])$y)]) # Peak for pretest values above 1.5
print(pre_peak2 <- density(reducedata_U2$F1)$x[which.max(density(reducedata_U2$F1)$y)]) # Pretest peak (larger peak)
print(pre_peak3 <- density(reducedata_U2$F1[which(reducedata_U2$F1 < 0)])$x[which.max(density(reducedata_U2$F1[which(reducedata_U2$F1 < 0)])$y)]) # Smaller peak, pretest values below 0

# Peak values for the posttest
print(post_peak1 <- density(reducedata_U2$F2[which(reducedata_U2$F2 > 1.8)])$x[which.max(density(reducedata_U2$F2[which(reducedata_U2$F2 > 1.8)])$y)]) # Third peak, posttest scores in the level 2 3D LP region
print(post_peak2 <- density(reducedata_U2$F2)$x[which.max(density(reducedata_U2$F2)$y)]) # Posttest peak
print(post_peak3 <- density(reducedata_U2$F2[which(reducedata_U2$F2 < 0)])$x[which.max(density(reducedata_U2$F2[which(reducedata_U2$F2 < 0)])$y)]) # Second peak, posttest values below 0
Percentage of examinees between peak values

# Pretest
pct_pre(pre_peak1)
pct_pre(pre_peak2)
pct_pre(pre_peak1) - pct_pre(pre_peak2)

# Posttest
pct_post(post_peak1)
pct_post(post_peak2)
pct_post(post_peak1) - pct_post(post_peak2)

Ascending Ability Wright Maps

# Create factor scores with standard errors (UB = upper bound, LB = lower bound)
fulldata_with_SE <- cbind(U2_all, data.frame(fscores(modgrmU2_EM_1D, full.scores.SE=TRUE)))
fulldata_with_SE$UBF1 <- fulldata_with_SE$F1 + fulldata_with_SE$SE_F1
fulldata_with_SE$LBF1 <- fulldata_with_SE$F1 - fulldata_with_SE$SE_F1
fulldata_with_SE$LBF2 <- fulldata_with_SE$F2 - fulldata_with_SE$SE_F2
fulldata_with_SE$UBF2 <- fulldata_with_SE$F2 + fulldata_with_SE$SE_F2

# Create variables counting how many students have a CI overlapping each LP level cutoff (pretest)
fulldata_with_SE$LP0_1_F1 <- ifelse(fulldata_with_SE$LBF1 <= t0_1 & fulldata_with_SE$UBF1 >= t0_1, 1, 0)
fulldata_with_SE$LP1_2_F1 <- ifelse(fulldata_with_SE$LBF1 <= t1_2 & fulldata_with_SE$UBF1 >= t1_2, 1, 0)

# Create variables counting how many students have a CI overlapping each LP level cutoff (posttest)
fulldata_with_SE$LP0_1_F2 <- ifelse(fulldata_with_SE$LBF2 <= t0_1 & fulldata_with_SE$UBF2 >= t0_1, 1, 0)
fulldata_with_SE$LP1_2_F2 <- ifelse(fulldata_with_SE$LBF2 <= t1_2 & fulldata_with_SE$UBF2 >= t1_2, 1, 0)

# Find the smallest lower-bound score of F1 (pretest)
LB_LP0_1_pre_stu <- max(fulldata_with_SE$LBF1[which(fulldata_with_SE$LP0_1_F1==1)])
print(LB_L0_pre <- which(sort(fulldata_with_SE$LBF1)==LB_LP0_1_pre_stu)) # Students below level 1 of the LP
LB_LP1_2_pre_stu <- max(fulldata_with_SE$LBF1[which(fulldata_with_SE$LP1_2_F1==1)])
print(LB_L1_pre <- which(sort(fulldata_with_SE$LBF1)==LB_LP1_2_pre_stu)) # Students below level 2 of the LP

# Find the smallest lower-bound score of F2 (posttest)
LB_LP0_1_post_stu <- max(fulldata_with_SE$LBF2[which(fulldata_with_SE$LP0_1_F2==1)])
print(LB_L0_post <- which(sort(fulldata_with_SE$LBF2)==LB_LP0_1_post_stu)) # Students below level 1 of the LP
LB_LP1_2_post_stu <- max(fulldata_with_SE$LBF2[which(fulldata_with_SE$LP1_2_F2==1)])
print(LB_L1_post <- max(which(sort(fulldata_with_SE$LBF2)==LB_LP1_2_post_stu))) # Students below level 2 of the LP

# Find the highest upper-bound score of F1 (pretest)
UB_LP0_1_pre_stu <- min(fulldata_with_SE$UBF1[which(fulldata_with_SE$LP0_1_F1==1)])
print(UB_L0_pre <- which(sort(fulldata_with_SE$UBF1)==UB_LP0_1_pre_stu)) # Students below level 1 of the LP
UB_LP1_2_pre_stu <- min(fulldata_with_SE$UBF1[which(fulldata_with_SE$LP1_2_F1==1)])
print(UB_L1_pre <- which(sort(fulldata_with_SE$UBF1)==UB_LP1_2_pre_stu)) # Students below level 2 of the LP

# Find the highest upper-bound score of F2 (posttest)
UB_LP0_1_post_student <- min(fulldata_with_SE$UBF2[which(fulldata_with_SE$LP0_1_F2==1)])
print(UB_L0_post <- which(sort(fulldata_with_SE$UBF2)==UB_LP0_1_post_student)) # Students below level 1 of the LP
UB_LP1_2_post_student <- min(fulldata_with_SE$UBF2[which(fulldata_with_SE$LP1_2_F2==1)])
print(UB_L1_post <- which(sort(fulldata_with_SE$UBF2)==UB_LP1_2_post_student)) # Students below level 2 of the LP

# Number of people in the overlap region for each level on the pretest
LB_L0_pre - UB_L0_pre # 80 people between levels 0 and 1
LB_L1_pre - UB_L1_pre # 1 person between levels 1 and 2

# Number of people in the overlap region for each level on the posttest
LB_L0_post - UB_L0_post # 57 people between levels 0 and 1
LB_L1_post - UB_L1_post # 48 people between levels 1 and 2

# Sort data by ability score (pretest)
sort_pre <- fulldata_with_SE[order(fulldata_with_SE$F1),]
sort_pre <- data.frame(x=seq(nrow(sort_pre)), F1=sort_pre$F1, lwr=sort_pre$LBF1, upr=sort_pre$UBF1)

# Sort data by ability score (posttest)
sort_post <- fulldata_with_SE[order(fulldata_with_SE$F2),]
sort_post <- data.frame(x=seq(nrow(sort_post)), F2=sort_post$F2, lwr=sort_post$LBF2, upr=sort_post$UBF2)

# Ascending ability Wright map for the pretest
plot(sort_pre$F1, xlab="Persons", ylab="Ability", pch=16, ylim=c(-2, 3.1), xlim=c(558, 900), cex=0.5)
with(sort_pre, polygon(c(x, rev(x)), c(lwr, rev(upr)), col="grey75", border=FALSE))
matlines(sort_pre[,1], sort_pre[,-1], lwd=c(1,1), lty=1, col=c("black","black","black"), type=c("p","l","l"), cex=0.4, pch=16)
abline(h=c(t0_1, t1_2), lty=2, v=c(LB_L0_pre, LB_L1_pre, UB_L0_pre, UB_L1_pre))

# Ascending ability Wright map for the posttest
plot(sort_post$F2, xlab="Persons", ylab="Ability", pch=16, ylim=c(-2, 3.1), xlim=c(558, 900), cex=0.5)
with(sort_post, polygon(c(x, rev(x)), c(lwr, rev(upr)), col="grey75", border=FALSE))
matlines(sort_post[,1], sort_post[,-1], lwd=c(1,1), lty=1, col=c("black","black","black"), type=c("p","l","l"), cex=0.4, pch=16)
abline(h=c(t0_1, t1_2), lty=2, v=c(LB_L0_post, LB_L1_post, UB_L0_post, UB_L1_post))

BIBLIOGRAPHY

Alonzo, A. C., & Gotwals, A. W. (Eds.). (2012). Learning progressions in science: Current challenges and future directions. Springer Science & Business Media.
Barker, V., & Millar, R. (2000). Students' reasoning about basic chemical thermodynamics and chemical bonding: What changes occur during a context-based post-16 chemistry course? International Journal of Science Education, 22(11), 1171-1200.
Becker, N. M., & Cooper, M. M. (2014). College chemistry students' understanding of potential energy in the context of atomic–molecular interactions. Journal of Research in Science Teaching, 51(6), 789-808.
Becker, N., Noyes, K., & Cooper, M. (2016). Characterizing students' mechanistic reasoning about London dispersion forces. Journal of Chemical Education, 93(10), 1713-1724.
Berland, L. K., & McNeill, K. L. (2010). A learning progression for scientific argumentation: Understanding student work and designing supportive instructional contexts. Science Education, 94(5), 765-793.
Boo, H. K. (1998). Students' understandings of chemical bonds and the energetics of chemical reactions. Journal of Research in Science Teaching, 35(5), 569-581.
Brown, N. J., & Wilson, M. (2011). A model of cognition: The missing cornerstone of assessment. Educational Psychology Review, 23(2), 221.
Burrows, N. L., & Mooring, S. R. (2015). Using concept mapping to uncover students' knowledge structures of chemical bonding concepts. Chemistry Education Research and Practice, 16(1), 53-66.
Cai, L., Maydeu-Olivares, A., Coffman, D. L., & Thissen, D. (2006). Limited-information goodness-of-fit testing of item response theory models for sparse 2^p tables. British Journal of Mathematical and Statistical Psychology, 59(1), 173-194.
Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265-289.
Cooper, M., & Klymkowsky, M. (2013). Chemistry, life, the universe, and everything: A new approach to general chemistry, and a model for curriculum reform. Journal of Chemical Education, 90(9), 1116-1122.
Cooper, M. M., Klymkowsky, M. W., & Becker, N. M. (2014).
"Energy in chemical systems: An integrated approach." Teaching and learning of energy in K–12 education. Springer, Cham, 2014. 301-316. Doherty, J. H., Draney, K., Shin, H. J., Kim, J., & Anderson, C. W. (2015). Validation of a learning progression-based monitoring assessment. Manuscript submitted for publication. Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38(1), 67-86. Duschl, R.A., Schweingruber H.A., Shouse A. (Eds.). (2007). Taking science to school: Learning and teaching science in grades K-8. Washington, D.C.: National Academy Press. Felt, J. M., Castaneda, R., Tiemensma, J., & Depaoli, S. (2017). Using person fit statistics to detect outliers in survey research. Frontiers in psychology, 8, 863. Gorin, J. S., & Mislevy, R. J. (2013, September). Inherent measurement challenges in the next generation science standards for both formative and summative assessment. In Invitational research symposium on science assessment. Harris, C. J., Krajcik, J. S., Pellegrino, J. W., DeBarger, A. H. (2019). Designing Knowledge‐In‐ Use Assessments to Promote Deeper Learning. Educational Measurement: Issues and Practice. Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural equation modeling: a multidisciplinary journal, 6(1), 1-55. Jewett JW (2008). Energy and the confused student I: work. Phys Teach 46, 38–43. Kim, D., De Ayala, R. J., Ferdous, A. A., & Nering, M. L. (2011). The comparative performance of conditional independence indices. Applied Psychological Measurement, 35(6), 447- 471. Lee, H. S., & Liu, O. L. (2010). Assessing learning progression of energy concepts across middle school grades: The knowledge integration perspective. Science Education, 94(4), 665-688. Lehrer, R., Kim, M. J., Ayers, E., & Wilson, M. (2014). Toward establishing a learning progression to support the development of statistical reasoning. Learning over time: Learning trajectories in mathematics education, 31-60. Maydeu-Olivares, A., & Joe, H. (2005). Limited-and full-information estimation and goodness- of-fit testing in 2 n contingency tables: A unified framework. Journal of the American Statistical Association, 100(471), 1009-1020. 208 Mislevy, R. J. (1996). Test theory reconceived. Journal of Educational Measurement, 33(4), 379-416. Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence‐centered design for educational testing. Educational Measurement: Issues and Practice, 25(4), 6-20. Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. ETS Research Report Series, 1992(1), i-30. Nahum, T. L. (2007). Teaching the concept of chemical bonding in high-school: Developing and implementing a new framework based on the analysis of misleading systemic factors. Nahum, T. L., Mamlok‐Naaman, R., Hofstein, A., & Krajcik, J. (2007). Developing a new teaching approach for the chemical bonding concept aligned with current scientific and pedagogical knowledge. Science Education, 91(4), 579-603. National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. National Academies Press. National Research Council. (2013a). Education for life and work: Developing transferable knowledge and skills in the 21st century. National Academies Press. Nering, M. 
L., & Ostini, R. (Eds.). (2011). Handbook of polytomous item response theory models. Taylor & Francis. Neumann, K., Viering, T., Boone, W. J., & Fischer, H. E. (2013). Towards a learning progression of energy. Journal of research in science teaching, 50(2), 162-188. Orlando, M., & Thissen, D. (2003). Further investigation of the performance of S-X2: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27(4), 289-298. Osborne, J. F., Henderson, J. B., MacPherson, A., Szu, E., Wild, A., & Yao, S. Y. (2016). The development and validation of a learning progression for argumentation in science. Journal of Research in Science Teaching, 53(6), 821-846. Pellegrino, J. W., & Hilton, M. L. (Eds.). (2012). Education for life and work: Developing transferable knowledge and skills in the 21st century. Washington, DC: The National Academies Press. Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know: The science and design of educational assessment. National Academy Press, 2102 Constitutions Avenue, NW, Lockbox 285, Washington, DC 20055. Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika monograph supplement. 209 Schwarz, C. V., Reiser, B. J., Davis, E. A., Kenyon, L., Achér, A., Fortus, D., ... & Krajcik, J. (2009). Developing a learning progression for scientific modeling: Making scientific modeling accessible and meaningful for learners. Journal of Research in Science Teaching: The Official Journal of the National Association for Research in Science Teaching, 46(6), 632-654. Shin, N., Stevens, S. Y., & Krajcik, J. (2010). Tracking student learning over time using construct-centred design. In Using Analytical Frameworks for Classroom Research (pp. 56-76). Routledge. Smith, C. L., Wiser, M., Anderson, C. W., & Krajcik, J. (2006). FOCUS ARTICLE: implications of research on children's learning for standards and assessment: a proposed learning progression for matter and the atomic-molecular theory. Measurement: Interdisciplinary Research & Perspective, 4(1-2), 1-98. Songer, N. B., Kelcey, B., & Gotwals, A. W. (2009). How and when does complex reasoning occur? Empirically driven development of a learning progression focused on complex reasoning about biodiversity. Journal of Research in Science Teaching: The Official Journal of the National Association for Research in Science Teaching, 46(6), 610-631 Standards, N. G. S. (2013). Next generation science standards: For states, by states. Stone, C. A., & Zhang, B. (2003). Assessing goodness of fit of item response theory models: A comparison of traditional and alternative procedures. Journal of Educational Measurement, 40(4), 331-352. Taber, K. S. (1998a). An alternative conceptual framework from chemistry education. International Journal of Science Education, 20(5), 597-608. Taber, K. S. (1998b). The sharing-out of nuclear attraction: or I can’t think about Physics in Chemistry, International Journal of Science Education, 20 (8), pp.1001-1014. Taber, K. S., & Coll, R. K. (2002). Bonding. In Chemical education: Towards research-based practice (pp. 213-234). Springer, Dordrecht. Toland, M. D. (2014). Practical guide to conducting an item response theory analysis. The Journal of Early Adolescence, 34(1), 120-151. Van Dam, N. T., Earleywine, M., & Borders, A. (2010). Measuring mindfulness? An item response theory analysis of the Mindful Attention Awareness Scale. Personality and Individual Differences, 49(7), 805-810. 
Wilson, M. (2009). Measuring progressions: Assessment structures underlying a learning progression. Journal of Research in Science Teaching, 46(6), 716-730.

CONCLUDING REMARKS

This study is a first example of developing and validating both large and small grain size NGSS-aligned learning progressions in practice. It provides valuable insights into the process of developing learning progressions aligned to specific NGSS performance expectations, including specific DCIs, SEPs and CCCs, and of developing assessment instruments capable of measuring 3D learning of complex NGSS constructs. Further, this study demonstrates the process of obtaining validity evidence for NGSS-aligned LPs and the feasibility of using validated LPs to describe student learning in the context of an NGSS classroom.
The major implication of this study for both future and practicing teachers relates to describing and tracking the development of 3D understanding in practice. Specifically, the current study shows that the three dimensions of NGSS work together when it comes to forming the basis of 3D understanding. The study described in Chapter 1 provides evidence for this assertion from the perspective of assessment development theory and psychometrics. In particular, Chapter 1 demonstrates that following a systematic, evidence-based assessment design process aimed at designing NGSS-aligned tasks that measure student ability to integrate the three dimensions (DCIs, SEPs, CCCs), as suggested by the Framework, results in assessments that demonstrate good psychometric properties and measure one underlying conceptual dimension of interest. The studies described in Chapters 2 and 3 demonstrate how 3D learning can be characterized in practice, and show how student progress towards developing the ability to integrate the three dimensions of NGSS to explain electrostatic phenomena can be measured and characterized.
For teachers and teacher educators, these results indicate that to effectively assess 3D learning, one should not aim to design tasks that measure separate dimensions of NGSS, but rather tasks that integrate the SEPs and CCCs to make sense of DCIs, as suggested by the Framework. Focusing on measuring SEPs or CCCs devoid of context (DCIs) will not allow evaluating students' knowledge-in-use, because knowledge-in-use can only be assessed in a given context (DCIs). On the other hand, focusing on measuring DCIs without including SEPs and CCCs might result in fact-based assessments that don't measure student ability to apply big ideas to make sense of phenomena. Therefore, it is only through the integration of the three dimensions, as suggested by the Framework, that 3D learning can be effectively assessed and characterized. The integration of the three dimensions of NGSS is essential for both assessment and instruction.
For teacher education, the findings described in this work suggest that future teachers should be prepared to organize their science classrooms so that students are provided with opportunities to engage in 3D learning and to develop their ability to integrate the three dimensions of NGSS to explain phenomena. The "Interactions" curriculum provides a good example of instructional settings that reflect the principles of NGSS and the Framework. However, more work needs to be done to develop similar instructional materials that are aligned with the vision of the Framework in different disciplines and across grades.
Additionally, both future and practicing teachers need considerable support in implementing the vision of the Framework in practice. Just as 3D learning is a process of constantly revising one's understanding in light of new evidence, 3D teaching (that is, teaching in an NGSS classroom) also requires constant examination of student ideas in order to find ways to respond to the various questions that students bring up in class and to use these questions to guide their natural curiosity towards developing deep 3D understanding. Just as students develop this type of understanding in a group with peers, teachers should aim to develop extended professional learning communities for sharing ideas and exchanging experience, to support each other and help guide each other towards successful implementation of NGSS in practice.
The work presented here also has important limitations. Specifically, although the validity evidence collected in the contexts of Unit 1 and Unit 2 separately supports the hypothesized progression of student understanding outlined by the 3D LP for electrical interactions and the 3D construct map for chemical bonding, at this point no data are available to draw accurate conclusions as to how the 3D LP and the 3D construct map relate to each other. In other words, no data are available that would allow the development of a common latent ability continuum describing the progression of student understanding of electrical interactions and chemical bonding on the same ability scale. This is the main drawback of the current study. In the future, it would be beneficial to include common linking items on both the written and oral interview assessments in order to develop a common ability scale and study how student understanding of electrical interactions develops during the course of both Unit 1 and Unit 2, and possibly the entire curriculum. Constructing such an overarching 3D LP spanning the entire academic year would make it possible to accurately measure student understanding at any point during the year, to describe in detail what student understanding looks like in terms of the ability to integrate the three dimensions of NGSS, and to provide guidance to educators as to what supports students need in order to reach higher levels of understanding of electrical interactions as they progress towards mastering the NGSS performance expectations described by the 3D LP. This kind of applicability of learning progression research is the ultimate goal that researchers should be aiming for in order to enact the vision of the Framework and NGSS in practice and to ensure significant improvement of the learning process in the science classroom for both students and teachers.