MEASUREMENT INVARIANCE OF A SUMMATIVE ACHIEVEMENT ASSESSMENT OVER TIME: IS STATUS REALLY READY FOR GROWTH?

By

Steven Guy Viger

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Measurement and Quantitative Methods – Doctor of Philosophy

2014

ABSTRACT

MEASUREMENT INVARIANCE OF A SUMMATIVE ACHIEVEMENT ASSESSMENT OVER TIME: IS STATUS REALLY READY FOR GROWTH?

By

Steven Guy Viger

The current study investigates the phenomenon of measurement invariance by examining the construct stability of a summative mathematics achievement instrument over time, using an existing data set. In doing so, the general question of measurement invariance for this particular instrument is addressed, and it is addressed specifically in the context of growth studies. The impetus for the study, as well as the results, are presented in light of the current political context of large-scale K-12 assessment and the shifting of emphasis from status to growth. As the reader will discover, great pressure is placed on results that were not necessarily intended to serve as the metric required by policy. The results and implications are framed in both measurement and practical contexts.

The final product of this tremendous journey is dedicated to my incredible family, those still present and those who have left us. Without your support, motivation, love and patience, the finish line might have continued to elude me. To Richard S. Viger and Richard W. Viger: I finished the game; thank you for pushing me across the goal line.

ACKNOWLEDGMENTS

A heartfelt thank you goes out to the faculty and staff of the Measurement and Quantitative Methods program, especially Drs. Mark Reckase and Kimberly Maier. The wisdom you've shared, the patience, the understanding and the dedication to producing quality students and brilliant researchers forced me to continuously raise my own bar to meet your high standards. Your support was sometimes of the tough love variety, but it was never absent. As a result, I pushed myself further than I thought possible, and I now carry the same high standards into my work and life.

Thank you to the Michigan Department of Education for providing access to data and for support in many other areas. They are an organization dedicated to fulfilling policy with the utmost integrity. I've never worked with a better group of professionals.

To my wife, Andrea Jackson: you have always been my biggest fan, loudest cheerleader and most stubborn supporter. From the moment we met I shared with you my goals, and not once did you show the slightest bit of doubt in my ability to follow through. It is because of you that I was able to press on and keep my eyes on the prize. I am also proud to acknowledge and pay tribute to the critical role my children Caitlyn, Casey and Chase Viger played in the completion of this major goal. Even when you didn't understand what I was doing, you never complained. Whenever I needed a hug or to get my mind off of my studies, you provided the ultimate escape. Perhaps most importantly, I couldn't think of a greater motivation to finish my goals than to provide you with a working example of what hard work and dedication to your dreams will lead to. Never wanting to fail your children is the greatest motivation any parent can have.

To the rest of my family: you never pushed, never rushed, never asked too many questions or inquired as to 'when' I'd be done. You just always assumed I would be done at some point.
I'm so glad that time is now and that I could share my success with you all. You helped make me the man and the scholar I have become. I'm eternally grateful for my loving family. What a great example of the right way to do things and love each other.

Finally, I'd like to acknowledge the support of my friends. Some of you have motivated me just by being who you are, while others have directly pushed and engaged me to finish my goals. Regardless of your style, you knew what made me move and used that to urge me along. Sometimes that meant a 'normal' conversation, sometimes a hug, or even going out of your way to give me private access to a place where I could get my work done in peace. All of my friends fit the bill, but I need to specifically acknowledge Ryan Newton, Colleen Kelly, and Rosalie Kern for their tremendous support through thick and thin.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1: Introduction
    Research Questions
Chapter 2: Literature and Policy Review
    A Paradigm for the Discussion of Validity Evidence
    Policy Changes as an Influence on the Validity Argument
    Measurement Invariance
    Factor Analytic Strategies for Determining Measurement Invariance
    The Standard Invariance Model Testing Sequence
    Configural Invariance
    Metric Invariance
    Scalar Invariance
    Strict Invariance
    Outcome Measures: Fit Indices
    Absolute Fit
    Descriptive Fit
    Comparative Fit Indices
    Akaike Information Criterion (AIC)
    Bayesian Information Criterion (BIC)
    The Sample-Size Adjusted BIC (SABIC)
    Construct Stability, Measurement Invariance and Validity Evidence
    Content Based Evidence
    Factorial Evidence
Chapter 3: Method
    Data
    Instrument
    Analysis
Chapter 4: Results
    Descriptive Statistics/Previous Achievement
    Confirmatory Factor Analysis of One and Four Factor Models by Sample
    Measurement Invariance Tests of One and Four Factor Models
Chapter 5: Discussion
    Related to the original Fall 2009 administration and applied across groups: Does the Rasch model fit the data?
    Do the data fit the linear confirmatory model implied by the blueprint (content strands) for the test?
    Does one of the models fit significantly better than the other model?
    Do the aforementioned models exhibit measurement invariance across groups/study conditions?
    Implications
Chapter 6: Limitations and Future Research
APPENDICES
    Appendix A – Scale Score Distributions (Pre and Post) for each of the Study Groups
    Appendix B – Item Characteristic Curves by Group
    Appendix C – IRT Calibration Values by Group
REFERENCES

LIST OF TABLES

Table 1 – Sample Demographic Characteristics
Table 2 – Assessment breakdown by strands (numerals are item numbers)
Table 3 – Model 1, Configural Invariance
Table 4 – Model 1, Metric Invariance
Table 5 – Model 2, Configural Invariance
Table 6 – Model 2, Metric Invariance
Table 7 – Michigan MEAP Transition Table
Table 8 – Previous Performance (Mean Scale Score)
Table 9 – Previous Performance (Percent Proficient)
Table 10 – Group Level Model Fit (Single Factor Model)
Table 11 – Group Level Model Fit (Blueprint Based/Four Factor Model)
Table 12 – Measurement Invariance Study Results
Table 13 – Outcome Variable Group Differences
Table 14 – IRT Calibration Values by Group

LIST OF FIGURES

Figure 1 – Performance Level Change by Study Condition
Figure 2 – Multigroup IRT Test Characteristic Curves
Figure 3 – Test Information Functions from Multiple Group IRT Run
Figure 4 – Post-test Scale Score Distribution (Grade 8)
Figure 5 – Post-test Performance Level Frequencies (Grade 8)
Figure 6 – Pre-test Scale Score Distribution (Grade 8)
Figure 7 – Pre-test Performance Level Frequencies (Grade 8)
Figure 8 – Post-test Scale Score Distribution (Grade 9)
Figure 9 – Post-test Performance Level Frequencies (Grade 9)
Figure 10 – Pre-test Scale Score Distribution (Grade 9)
Figure 11 – Pre-test Performance Level Frequencies (Grade 9)
Figure 12 – Post-test Scale Score Distribution (Grade 10)
Figure 13 – Post-test Performance Level Frequencies (Grade 10)
Figure 14 – Pre-test Scale Score Distribution (Grade 10)
Figure 15 – Pre-test Performance Level Frequencies (Grade 10)
Figure 16 – Item Characteristic Curves by Group

Chapter 1: Introduction

In standards-based assessments, such as a State Educational Agency's (SEA) summative K-12 achievement test, items are indicators of content standards that, taken as a whole, are used as a mechanism to place students on the continuum underlying a construct (or constructs).
Further, when scores are reported they are often transformed to a desired reporting scale. While these metrics vary widely, all maintain students' relative standing with respect to others as well as to criterion-referenced cut scores. The author contends that it is extremely difficult to truly assess student growth, as students are changing cognitively, physically, and emotionally as a function of their development in ways that are often not measured, are difficult to measure, or at the least have no direct data to which we can link. At the same time, the instruments are often changing as students proceed through their schooling, owing to constantly changing sets of content standards and performance standards. Consequently, the interaction of persons and items that forms the foundation of modern scaling is modeled in the presence of shifting sets of items and persons, with naturally changing expectations or criteria. Taking a more purist approach suggests looking at what can and cannot be under our control as we seek to evaluate construct stability. Because we are not able to control how people change over time, it seems reasonable to observe how they change by holding constant the instrument and its intended underlying construct(s). To link back to the concept of internal structure, allowing time to pass, and hence the students to develop, while holding the actual instrument constant also affords the opportunity to see how what is intended to be measured may or may not change over time. The results of such a factorial/structural evidence analysis can certainly inform whether changes in scores over time on a constant instrument are a function of maturation and/or a change in the underlying construct (possibly represented by changes in the pattern of inter-item correlations on the instrument). Factor analysis is one analytical vehicle that will provide useful information in making such decisions. In the current study, there is an underlying scale purported to measure a specific intended construct. The scale is made up of multiple items (or subscales, or multi-item parcels). In factor analytic terms, the items serve as indicators of the trait or factor in a common factor model. The author makes use of this scale in samples from distinct populations: the original sample and the three samples in which later data were collected. For any such use of scale scores, there is a critical assumption that the scale is measuring the same trait in all of the groups. If that assumption holds, then comparisons and analyses of those scores are acceptable and yield meaningful interpretations. But if that assumption is not true, then such comparisons and analyses do not yield meaningful results. When constructs shift across grades, such as when mathematics assessments move from testing arithmetic skills in third grade to testing pre-algebra and geometry skills in later grades, growth model results may lead to imprecise longitudinal interpretations (Reckase, 2004; Martineau, 2006).
To this end, this study leverages confirmatory factor analysis techniques, as well as the literature on measurement invariance, to examine the factorial stability of a mathematics achievement test given to samples of students one, two, and three instructional years after the intended time of testing, in order to determine the degree to which measurement invariance holds over this grade-based cross-section of students. To the extent that the structure holds over time, this supports the ability to glean similarly interpreted growth data through the use of a parallel form of an assessment in a pre-test/post-test paradigm for growth. To the extent that the structure does not hold over time, this suggests an unintended relationship of other variables to the construct being measured if the paradigm is a simple gains (pre-test/post-test) approach. That is to say, the development and everything else occurring in the passage of time that differentiates the cross-sections of students is also related to the achievement measure, which would invalidate the instrument for its intended use in a gain score approach: it no longer measures the same construct and cannot be scaled together in a meaningful way. Or, at least, as scaled (likely horizontally), the intended inferences would not be supported. The current study will address the following research questions within a measurement invariance paradigm driven by factor analytic strategies.

Research Questions

1. Related to the original Fall 2009 administration and applied across groups:
   a. Does the Rasch model fit the data?
   b. Do the data fit the linear confirmatory model implied by the blueprint (content strands) for the test?
   c. Does one of the models in a. and b. fit significantly better than the other model?
2. Are the measurement models posited in Question 1 invariant to additional years of instruction?
   a. Do the measurement models (unidimensional and multidimensional) exhibit configural invariance across groups, such that both groups associate the same subsets of items with the same constructs?
   b. Do the measurement models (unidimensional and multidimensional) exhibit metric invariance across groups, indicating that, overall, the strength of the relationships between items and their underlying constructs is the same for both groups?
   c. Do the measurement models (unidimensional and multidimensional) hold to the property of strict invariance across groups, suggesting that factor patterns, loadings, intercepts and residual variances are equal across groups?
   d. Does the comparative fit of the unidimensional and multidimensional models change across groups?

The next chapter presents a review of the literature that speaks to the motivation for the current study, as well as the literature on the particular method and measurement paradigm explored. Taken as a whole, the notion of invariance, or measurement stability over time, speaks directly to the validity of the inferences one can support. As such, the way in which a study such as this fits into validity arguments will also be discussed.

Chapter 2: Literature and Policy Review

In this chapter, a brief introduction to validity evidence and the support of inferences is provided to serve as a framework from which the criticality of this study, and others like it, can be deduced when considered in tandem with broad, sweeping K-12 assessment policies.
Put differently, the concept of validity is presented first to establish the context of the evaluation, followed by a review of policy and statutory changes that were created independently of the research literature and, in some cases, independently of AERA/NCME/APA standards. Once that context has been presented, a review of the literature pertinent to the study of measurement invariance, as well as the confirmatory factor analysis strategy used to investigate the measurement invariance phenomenon, is provided.

A Paradigm for the Discussion of Validity Evidence

Generally speaking, validity refers to "the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of tests" (Messick, 1989, 1994, 1995). In practical terms, validity describes how well one can legitimately trust the results of a test as interpreted for a specific purpose. In the world of operational psychometrics, it is the "specific purpose" portion of that definition that tends to vary from location to location, assessment to assessment, proposed use to proposed use. It logically follows that validity is a property of inferences, not instruments. As a result, validity must be established for each intended interpretation. It is because of this philosophical framework, which the author endorses, that it becomes problematic to think of an instrument as valid or not. Validity is not a property of an instrument; it is a property of the inference one makes from the data produced by the instrument (Kane, 2006). As a result, each intended use must be supported by an accumulation of evidence suggesting that the instrument, and its scaling/scoring/reporting mechanism, is valid for that intended use and the subsequent inferences made as a result. Shepard (1997) argues that intended effects and likely side effects are clearly within the responsibility of the test developer. Furthermore, persistent unanticipated effects are also the responsibility of the test developer. Moss (1998) suggests even greater responsibility for the test developer and argues that considerations of test consequences should encompass the anticipated uses of test scores. In other words, test developers are obligated to attempt to maximize positive consequences and minimize negative consequences. Further, test developers should consider the consequences of testing in general rather than only the immediate consequences of using scores from a specific test. For example, Moss argues that testing is reactive with test takers and test users. The administration of a test in a school changes the school, whether information from scores is intentionally used or ignored. How scores are used is likely driven more by policy than by proper measurement and psychometric considerations. As already alluded to, it is the responsibility of the test developer to be proactive in considering not only the immediate intended uses but also, perhaps more forward thinking, the unintended uses and/or consequences.

Policy Changes as an Influence on the Validity Argument

A November 2005 announcement by the United States Department of Education (USED, 2005) encouraged states to propose pilot programs for growth-based accountability models for use in the 2005-2006 and 2006-2007 school years. Seven requirements for the pilot programs were given, with the first three viewed largely as alignment elements and the last four considered foundational elements. The alignment elements were as follows:
1. The accountability model must ensure that all students are proficient by 2013-14 and set annual goals to ensure that the achievement gap is closing for all groups of students.
2. The accountability model must not set expectations for annual achievement based upon student background and school characteristics.
3. The accountability model must hold schools accountable for student achievement in reading/language arts and mathematics.

The foundational elements covered:

4. The accountability model must ensure that all students in the tested grades are included in the assessment and accountability system. Schools and districts must be held accountable for the performance of student subgroups. The accountability model includes all schools and districts.
5. The state's assessment system, the basis for the accountability model, must receive approval through the NCLB (No Child Left Behind) peer review process for the 2005-2006 school year. In addition, the full NCLB assessment system in grades 3-8 and in high school in reading/language arts and mathematics must have been in place for two testing cycles.
6. The accountability model and related state data system must track student progress.
7. The accountability model must include student participation rates in the state assessment system and student achievement on an additional academic indicator.

Following review, the USED proposal review team published a document summarizing cross-cutting issues that influenced their decisions to approve or deny states' proposals (USED, 2006). In particular, the guidance document indicated that states shall: (a) incorporate available years of existing achievement data, instead of relying on only two years of data; (b) align growth timeframes with school grade configuration and district enrollment; (c) make growth projections for all students, not just those below proficient; (d) hold schools accountable for the same subgroups as under the status model; (e) not use wide confidence intervals; (f) not reset growth targets each year; and (g) not average scores between proficient and non-proficient students. Although these issues were noted as influential in the peer review group's decisions, not all proposals approved through the growth model pilot peer review process met all of these conditions (CCSSO Accountability Systems and Reporting Working Group, 2009). Devising a paradigm that addresses all of those issues is challenging at best, and each state faces its own unique data idiosyncrasies. Proposed approaches varied from simple pre-test/post-test designs to elaborate vertical scaling designs, student growth percentiles and other projection methods, regression-based approaches, and standards-based transition tables (Castellanos and Ho, 2012). All of these have strengths and weaknesses, some of which take a toll monetarily and in human resources, in that some require testing above and beyond the current status measures. The findings of the CCSSO working group also suggest that oversight of these practices is not necessarily as tight as it could be; if this somewhat loose approach trickles down to state agencies, which are naturally reactionary to such policy changes and initiatives, then many assumptions could go unchecked and lead to distortions in interpretations of the results down the road.
There is also a parallel push to make assessments more instructionally useful and relevant so that the content specifications that feed both the curriculum and assessment paths can be in sync. The consequences of test score use take on increasing importance in the current era, in which educators are attempting to leverage the information in test scores to improve student learning (Perie, Marion & Gong, 2007). Since then, states have come to realize that the proficiency requirements of 2013-2014 are not likely to be met and have submitted waiver applications to absolve themselves of those requirements, but with the caveat that while the proficiency markers might be reset or relaxed, an enhanced focus on growth must take place that gives credit to all students, not just those bordering the proficiency marker. As a result, it should not come as a surprise that growth modeling is a major topic in the psychometric and educational measurement literature and occupies much of the reworking of ESEA and of waiver applications. Compared with the original uses and purposes of test scores laid out in NCLB in 2001, the amount of utility and information now sought from a measure intended to measure proficiency is enormous. While some may argue that anticipating the intended uses is not the sole responsibility of the test developer (Reckase, 1998), the CCSSO working group pointed out the obvious: the burden does trickle down to the state agencies, who are the responsible party for the content of the assessment. This is important because common growth definitions require at least two substantively and statistically comparable measures of status in order to deduce anything meaningful from the change or difference in measures over time. Therefore, assuming a stable construct without actually confirming that assumption has potentially serious implications if problems are present that invalidate the assumptions around the assessments and the use of the scores. The use of transition tables, based on horizontally scaled assessments with underlying vertically articulated performance standards, offers an alternative way to assess whether or not students are on track toward standards-based proficiency. Unfortunately, since the construct is assumed not to be stable over time, the underlying measures are not useful in determining why a student is no longer (or is now) proficient. The former case is likely to be the target of intervention or instruction, while the latter is useful in helping determine 'what works'. A stable construct measured from indicators drawn from a common domain naturally leads to a range of comparable outcomes. Therefore, when changes over time are noted, it is actually feasible to drill into the measure a bit to determine where the differences occurred. However, before going down that path, it is important that the proper housekeeping has occurred with respect to the measurements. First and foremost, we need a way of ensuring that we are measuring the same thing, a stable construct, over time before we make inferences over time. This becomes an issue of measurement invariance over time.
Measurement Invariance

Mellenbergh (1989), Meredith (1993), and Meredith and Millsap (1992) provided a statistical definition of measurement invariance (MI) in which an observed score is said to be measurement invariant if a person's probability of an observed score does not depend on his or her group membership, conditional on the true score. That is, respondents from different groups, but with the same true score, will have the same observed score. Or, given a person's true score, knowing the person's group membership does not alter the person's probability of obtaining a specific observed score (Wu, Li, and Zumbo, 2006). As such, measurement invariance is a rather blanket term that is used to refer to several different phenomena. From a mechanical standpoint, measurement invariance can refer to the invariance of factor loadings, intercepts, or errors (Meredith, 1993). Unfortunately, in most large-scale assessments the concept of invariance is much more of an assumption than it is a quality of measurement to be empirically investigated. As comparability and bridging studies emerge due to the shifting of assessment modalities from paper/pencil to a digital environment, and as status measures and interim assessments (purposed as learning tools and feedback mechanisms) are used in the development of proxy measures of growth, invariance over subgroups becomes an important issue that too often remains an assumption rather than a matter of empirical evidence. As mentioned, a popular methodology is to use factor analytic methods to determine whether or not structures hold across groups. Confirmatory factor analysis is usually preferred because the purpose of an instrument is clearly to measure something, and we really should know what we wish to measure before the administration. How we define that something is not always clear, nor is the construct, and hence it is not easily articulated. Psychological batteries are often developed using factor analysis paradigms applied over multiple responses and potential indicators of an underlying construct. The factor analytic strategies can be used for the purpose of variable reduction and clarification, to get at the most salient indicators of the hypothesized constructs.

Factor Analytic Strategies for Determining Measurement Invariance

A common paradigm for investigating measurement invariance is in the context of factor analysis, with applications of both exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) often seen in the literature. Factor analysis can inform score validity as well as help in understanding the theoretical nature of constructs (Thompson, 2009). A major use is then in the development of the operational construct and the operational representativeness of the theoretical constructs (Gorsuch, 1983). Taking a more exploratory approach lends itself well to understanding the theoretical nature of the construct in relation to the data. Similarly, Muthén and Muthén (2009) suggest that EFA can be used to explore the dimensionality of a measurement instrument by determining the smallest number of interpretable factors needed to explain the correlation matrix among a set of observed, or measured, variables. As was previously alluded to, there are two discrete classes of factor analysis techniques, exploratory and confirmatory approaches (Thompson, 2009).
EFA approaches come from the family of analytic techniques attributed to Spearman (1904) and are typically applied when the researcher has little or no a priori expectation regarding either the number or the nature of the latent variables, or factors, underlying a measurement instrument. Although it is typical to use a CFA approach when there is a strong hypothesis concerning the structure of the measurement, there is no requirement to declare that model in EFA, as the analysis does not require, nor allow for, these expectations to enter into the calculations. This is not entirely true, though, as some programs require that the number of factors be specified, and by specifying a given rotation method one is allowing (or not allowing) the underlying factors to be correlated. However, if permitted, EFA programs will often extract as many factors as there are indicators. It is then up to the analyst to perform rotations and seek guidance in interpreting the factors before deciding what the model actually demonstrates. Of course, a major distinction is that with the CFA approach the researcher has already declared the number of factors as well as the relationships between the observed indicators and the underlying factors. CFA models require that the researcher provide specific direction with regard to the number of latent variables/factors, the relationships of the measured variables (i.e., items) to those latent variables, and the degree to which the latent variables are correlated. Put simply, researchers without a theory regarding the underlying structure of an instrument cannot use CFA techniques, as they have nothing to confirm (the big 'C' in confirmatory factor analysis). However, this does not preclude researchers with theories from resorting to exploratory techniques should those theories not pan out as intended. Such a practice could lead to capitalizing on chance, in that all possible combinations could potentially be worked out until the best fitting model is brought to light. These concerns are valid in that rotations can yield infinitely many solutions that account for the same variance/covariance matrix yet differ greatly in interpretation. As a result, one's ability to interpret the construct depends not only on a strong understanding of the content (or access to someone who has it) but also on one's point of view on the space occupied by the factor solutions. With so many possible loading patterns, many possibilities will be seen as propitious, but nothing speaks to whether or not the model is 'correct,' as there is no comparison in EFA. CFA has that power in that there are not only model fit indices based on absolute criteria but also indices based on comparative fit. Factor interpretation is a difficult and somewhat subjective endeavor. In the context of principal component analysis or orthogonally rotated factor analysis solutions, the goal is to determine a set of factors where the loadings are strong for some indicators and near zero for the rest, explicitly disallowing the presence of cross-loadings. The strict requirement of zero cross-loadings in CFA has come under scrutiny because this requirement often does not fit the data well and has led to a tendency to rely on the extensive use of model modification indices to find a well-fitting model (Asparouhov & Muthén, 2012).
Browne (2001) suggests that in such cases, the search for a well-fitting measurement model may be better carried out by EFA, in that all of the possible models are simultaneously tested rather than only those specified by the researcher and subsequently tweaked using univariate modification indices, which give projections of model fit if variables are removed from the analysis. While it is true that an inherent weakness of exploratory approaches is that they tend to capitalize on chance, sometimes creating factors based on the weakest of correlations that are considered large due to sample size, the real danger lies in putting too much faith in factors that, regardless of rotation method, are often difficult to interpret substantively (Thompson, 2009). In the CFA framework, proposed models may not be supported by much empirical evidence that would back up their selection in the first place; that is, their interpretation relative to the original theory may be equally difficult in relation to the theoretical model and the empirical research questions. Furthermore, there is no direct index of which is the "correct model," and in many instances one could come up with a model of good fit, perhaps better than already published research, that is of little interpretive value and, more importantly, fails to serve as content evidence within the validity argument for the use of the scores as status and/or growth measures. With this said, carefully designed assessments often adhere to a strict assessment blueprint that is driven heavily by grade-level content standards or curriculum standards. As such, there is a pre-determined structure to the measurement instrument that is implied by the test design and table of specifications. When such structures are imposed on the data, confirmatory factor analysis (CFA) is the form of the factor analytic model that is most appropriate. In invoking this strategy, the covariation among manifest indicators is examined in order to confirm the hypothesized underlying latent constructs, as specified in advance by the researcher and supported substantively by the literature. CFA is a theory-driven technique in which the researcher specifies (1) the number of factors and their inter-correlations, (2) which items load on which factor, and (3) whether errors are correlated. Statistical tests can then be conducted to determine whether the data confirm the theoretical model; thus the model is thought of as confirmatory (Bollen, 1989). A powerful aspect of CFA, leveraged in the context of this study, is that a researcher is able to simultaneously conduct multiple group analyses across time or samples, in order to evaluate measurement invariance/equivalence across those groups. Issues of MI are relevant in longitudinal research and growth studies. When a scale is administered on repeated occasions to the same sample of people, or cross-sectionally, the question of MI involves whether the scale is measuring the same construct on the different occasions. In traditional validity studies, it is common practice to assume that as students mature cognitively, their scores on a given instrument should increase as a function of age.
That difference in mean scores is definitely of interest, but even the finding that scores do significantly increase as a function of age (grade level in the present study) still does not mean that the construct is the same and is reflected in the same way in the underlying scale score. Any discussion of growth based on a repeated measures or parallel forms paradigm, in the absence of a vertical scale, should be done with caution so as not to infer too much. Whatever growth, or lack thereof, does occur is certainly attributable to multiple factors, many of which lack good measurement or data. So, approaching the validity argument from the aspect of structural or factorial validity is another way to get at the appropriateness of the inference before it is made. A central principle of MI is that measures across groups are considered to be on the same scale if the relationships between the indicators and the trait are the same across groups. This statement can be translated into factor analytic terms: given multiple items that make up a scale, if the loadings for those items on the single underlying factor are the same across groups, then measurement invariance is supported. When framed in these factor analytic terms, this property is called factorial invariance, and it represents one approach to the study of MI. The various aspects of MI can be investigated using confirmatory factor analysis models. As will be seen, these models can be supplemented with a model for structured means so as to allow for the study of group differences in means on latent variables. That is, growth can be studied once invariance (or the degree of invariance) has been assessed.

In the context of using CFA or EFA to evaluate the measurement qualities of an instrument, the item-level data are really the data with which we start. Therefore, it is important to start by specifying the data model. The data model here (Equation 1) is one in which nonzero means on the measured and latent variables are assumed:

x = \tau_x + \Lambda_x \xi + \delta    (Equation 1)

where \tau_x represents a vector of intercept terms for the x measured variables, \Lambda_x is the factor loading matrix, \xi is the vector of latent variables and \delta is the vector of errors of measurement for the x measured variables. Further, \kappa is defined as the vector of means on the latent variables. The following covariance model (Equation 2) can be derived from Equation 1 and expresses the population variances and covariances of the measured variables as a function of the parameters in \Lambda_x, \Phi, and \Theta_\delta, which are the matrix of factor loadings, the variance/covariance matrix of the latent variables, and the variance/covariance matrix of the error terms, respectively:

\Sigma = \Lambda_x \Phi \Lambda_x' + \Theta_\delta    (Equation 2)

The mean structure model can also be derived (Equation 3); it expresses the population means of the measured variables as a function of the vector of intercept terms, the factor loadings and the vector of means on the latent variables:

\mu_x = \tau_x + \Lambda_x \kappa    (Equation 3)

Overall, the full model for means and covariances has five parameter matrices: \Lambda_x, \Phi, \Theta_\delta, \tau_x, and \kappa. A model is specified by designating fixed, free, and constrained parameters in these five matrices. This model can be fit to data (sample covariance matrix, S, and sample mean vector, \bar{x}) by obtaining estimates of the parameters such that the resulting implied population covariance matrix and mean vector (\hat{\Sigma} and \hat{\mu}, respectively) are as similar as possible to their sample counterparts. In fact, a critical aspect of model fit is the degree to which the implied and observed matrices are one and the same.
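To make Equations 1 through 3 concrete, the short Python sketch below builds the model-implied covariance matrix and mean vector for a hypothetical one-factor, four-item instrument. The parameter values, and the choice of a single factor with four indicators, are illustrative assumptions only and are not taken from the study data.

import numpy as np

# Hypothetical one-factor, four-item model (illustrative values only).
Lambda = np.array([[1.0], [0.8], [1.1], [0.9]])   # factor loadings (Lambda_x)
Phi    = np.array([[0.6]])                        # latent variance/covariance matrix (Phi)
Theta  = np.diag([0.30, 0.25, 0.40, 0.35])        # error variances (Theta_delta)
tau    = np.array([10.0, 12.0, 9.5, 11.0])        # item intercepts (tau_x)
kappa  = np.array([0.5])                          # latent mean (kappa)

# Equation 2: model-implied covariance matrix of the measured variables.
Sigma = Lambda @ Phi @ Lambda.T + Theta

# Equation 3: model-implied means of the measured variables.
mu = tau + Lambda @ kappa

print(np.round(Sigma, 3))
print(np.round(mu, 3))

Estimation works in the opposite direction: the free elements of these matrices are estimated so that the implied covariance matrix and mean vector reproduce the sample covariance matrix S and sample mean vector as closely as possible.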
Equations 2 and 3 are easily generalized to multiple groups. For instance, the covariance structure is generalized in Equation 4:

\Sigma^{(g)} = \Lambda_x^{(g)} \Phi^{(g)} \Lambda_x^{(g)\prime} + \Theta_\delta^{(g)}    (Equation 4)

Similarly, the mean structure is generalized in Equation 5:

\mu_x^{(g)} = \tau_x^{(g)} + \Lambda_x^{(g)} \kappa^{(g)}    (Equation 5)

where g represents the gth of G populations. The model is specified in terms of the parameter matrices for each group, possibly including equality constraints on selected parameters across groups. The model is fit to the multiple samples simultaneously and in a deliberate order, as more and more constraints are placed on the multiple group models.

The Standard Invariance Model Testing Sequence

While numerous theoretical formulations of measurement invariance have already been posited in this text, this study concerns itself primarily with levels of invariance that tap the psychometric properties of the measures. Little (1997) refers to these degrees of measurement invariance as Category 1, which subsumes the common taxonomy of configural, metric and strict invariance most frequently found in the MI literature (e.g., Horn & McArdle, 1992; Meredith, 1993). The notion of MI usually is raised with reference to a single scale and the question of whether it measures the same trait in different groups. This question can be studied using the multi-sample CFA model, usually with structured means. In the simple MI case, there would be only one factor, and the indicators of that factor would be the scale items (or subscales, parcels, etc.). However, the matrix representation of the models shows that the concepts and procedures apply equally in the case of multiple implied factors (latent variables). The multi-sample CFA model with structured means can be used to investigate MI and to test for group differences in factor/latent variable means. This is achieved by testing a sequence of models, beginning with an unconstrained model and progressively introducing equality constraints on parameters. In doing so, it is preferred that the sequence be determined a priori. The following subsections discuss this progression in detail.

Configural Invariance

Configural (Horn, McArdle and Mason, 1983), weak (Meredith, 1993) or pattern invariance (Millsap, 1997) is considered the lowest, or weakest, level of measurement invariance that can be obtained. This type of invariance refers to the pattern of salient (non-zero) and non-salient (zero or near zero) loadings that define the structure of a measurement instrument. Configural invariance is supported if the specified model, with zero loadings on non-target factors, fits the data well in all groups. Put another way, configural invariance holds when the same items load on the same factors for both groups of interest (e.g., grade 8 vs. grade 10). In the CFA literature, configural models are also used as the baseline model in comparisons of competing models. However, this is interesting in that it could be seen as rather subjective, or possibly over-restrictive, if zero loadings were in fact specified in the model. This is certainly a topic related closely to rotation technique in EFA.

Metric Invariance

Metric (Thurstone, 1947), weak (Meredith, 1993), or factor pattern invariance (Millsap, 1995) is more restrictive than configural invariance. This level of invariance requires that the loadings in a CFA be constrained to be equivalent in each group while permitting the factor variances and covariances to vary across groups. In other words, \Lambda^{(1)} = \Lambda^{(2)} = \cdots = \Lambda^{(G)}.
Given that such statistics rarely demonstrate exact equality, the statement really means that the loadings in one group are proportionately equivalent to the corresponding loadings in the other groups (Bontempo & Hofer, 2007). In order to make such proportional-equivalence statements, the common-factor variances must be freely estimated in all but the first group (or whichever group is chosen as the reference). This is because loadings standardized to the common-factor variance each differ from the corresponding loading in another group by the same proportion; that proportion is the ratio of the factor variances in the two groups. The presence of metric invariance can support a researcher's claim that there are similar interpretations of the factors across groups, but most would recommend a more stringent level of invariance testing to support the claim that the interpretations are equivalent.

Scalar Invariance

Scalar (Steenkamp & Baumgartner, 1998) or strong (Meredith, 1993) invariance is more restrictive than metric/weak invariance because it constrains the intercepts as well as the factor loadings to be equal across groups. In other words, equality constraints across groups are applied to the factor loading parameters and the intercept parameters: \Lambda^{(1)} = \Lambda^{(2)} = \cdots = \Lambda^{(G)} and \tau^{(1)} = \tau^{(2)} = \cdots = \tau^{(G)}. By applying these constraints, we are saying that any observed mean differences at the item level are accounted for by the common-factor mean. If this assumption holds, the comparison of factor means across groups is reasonable.

Strict Invariance

The final set of constraints is the most restrictive, and hence it is associated with demonstrating strict invariance. By saying that groups hold to the principle of strict invariance we are specifying equality constraints on factor loadings, intercepts and errors across groups: \Lambda^{(1)} = \Lambda^{(2)} = \cdots = \Lambda^{(G)}, \tau^{(1)} = \tau^{(2)} = \cdots = \tau^{(G)}, and \Theta^{(1)} = \Theta^{(2)} = \cdots = \Theta^{(G)}. In this paradigm, all parameters except those at the latent variable level are constrained to be equal, so the latent variable (factor) means and covariances can be used in comparisons. A model is said to be identified when, for a given research problem and data set, sufficient constraints are imposed such that there is a single set of parameter estimates yielded by the analysis (Thompson, 2004). Some mechanical steps must also take place to achieve model identification. Specifically, the latent constructs have to be assigned a scale of measurement. One way to accomplish this in a multiple group setting such as this is to fix the factor loading of one item per factor to unity (1.0). These items are then referred to as marker, or reference, items. In the context of multiple group factor analysis paradigms, it is critical that the same item be fixed to unity for all groups examined. The procedures just mentioned should be considered part of the decision sequence in determining the level of invariance between groups.

Outcome Measures: Fit Indices

Fit refers to the ability of a model to reproduce the data (i.e., usually the variance-covariance matrix). A good-fitting model is one that is reasonably consistent with the data; a good-fitting measurement model is required before interpreting the causal paths of the structural model. It should be noted that a good-fitting model is not necessarily a valid model. Models with arguably ridiculous results (e.g., paths that are clearly of the wrong sign) and models with poor discriminant validity or Heywood cases can be "good-fitting" models.
Therefore, parameter estimates must be carefully examined to determine whether one has a reasonable model as well as a good-fitting model. It is important to realize that one might obtain a good-fitting model and yet still be able to improve the model and remove specification error. Of course, having a good-fitting model does not prove that the model is correctly specified. Finally, it should be noted that a model all of whose parameters are statistically significant can still be a poor-fitting model. So, does this mean all hope is lost and this is a meaningless endeavor? Absolutely not; it just means we should be cautious with inferences and avoid overstretching the generalizability of results (as usual). The appropriateness of one fit index compared to another is not a new 'argument' in the literature. Some researchers (e.g., Barrett, 2007) do not believe that fit indices add anything to the analysis and hold that only the chi-square should be interpreted. The primary concern driving the chi-square-only argument is that fit indices allow researchers to claim that a misspecified model is not a bad model. Hayduk, Cummings, Boadu, Pazderka-Robinson, and Boulianne (2007) argue that cutoffs for a fit index can be misleading and subject to misuse, in that they are generally rules of thumb not driven by empirical evidence. Therefore, the author contends that fit indices are useful but, much like the allusion to Messick's validity paradigm, only when used as intended. There is also the potential of "cherry picking" a fit index: computing several fit indices and picking the index, or indices, that best confirm the research hypothesis rather than those that are appropriate given the data and the intended inferences. Choosing not to use a commonly referenced index (like the TLI or the RMSEA) requires justification, especially if one wishes to publish in high quality journals. Others, such as Kenny, Kaniskan, and McCoach (2011), have argued that fit indices should not even be computed for small degrees of freedom models; what is more important in those situations is to locate the source of specification error (Kenny & McCoach, 2011). Bollen and Long (1993) is a great reference that discusses many of the indices mentioned here in detail. A crucial consideration in the choice of a fit index is the penalty it places on complexity. The penalty can be thought of as how much the chi-square needs to change for the fit index not to change. Another crucial consideration is the purpose of the analysis and the main research question: answering the question of whether or not a model fits is different from answering which model fits better, or which model fits better across groups. Here the purposes are twofold. During the establishment of configural invariance, the author must first establish that the baseline models are in fact decent-fitting models. That requires an absolute measure of fit, whereas the comparative fit investigations are best addressed with indices of relative fit.

Absolute Fit

The most common approach is to utilize the chi-square distribution. For models with about 75 to 200 cases, the chi-square test is a reasonable measure of fit. However, for models with roughly 400 or more cases, the chi-square is almost always statistically significant. Chi-square is also affected by the size of the correlations in the model: the larger the correlations, the poorer the fit.
Sometimes the chi-square is more interpretable if it is transformed into a Z value using the following approximation:

Z = \sqrt{2\chi^2} - \sqrt{2\,df - 1}

A problem with this fit index is that there is no universally agreed upon standard as to what constitutes a good or a bad fitting model. Using areas under the standard normal curve does not remediate the sensitivity to sample size. The chi-square test is too liberal (i.e., yields too many Type 1 errors) when variables have non-normal distributions, especially distributions with kurtosis. Moreover, with small sample sizes, there are too many Type 1 errors. Of note is that two very popular fit indices, the TLI and the RMSEA, are largely based on the \chi^2/df concept. The root mean square error of approximation (RMSEA) is currently the most popular measure of model fit; it is now reported in virtually all papers that use CFA or SEM, and some refer to the measure as the "Ramsey." This absolute measure of fit is based on the non-centrality parameter. Its computational formula is:

RMSEA = \sqrt{\dfrac{\chi^2 - df}{df\,(N - 1)}}

where N is the sample size and df the degrees of freedom of the model. If \chi^2 is less than df, then the RMSEA is set to zero. The penalty for complexity is the \chi^2 to df ratio. The measure is positively biased (i.e., tends to be too large), and the amount of the bias depends on the smallness of the sample size and of the df, primarily the latter. MacCallum, Browne and Sugawara (1996) have used 0.01, 0.05, and 0.08 to indicate excellent, good, and mediocre fit, respectively. However, others have suggested 0.10 as the cutoff for poor fitting models. These are definitions for the population. That is, a given model may have a population value of 0.05 (which would not be known), but in the sample it might be greater than 0.10. There is greater sampling error for small df and low N models, especially the former. Thus, models with small df and low N can have artificially large values of the RMSEA. For instance, a chi-square of 2.098 (a value not statistically significant), with a df of 1 and an N of 70, yields an RMSEA of 0.126. For this reason, Kenny, Kaniskan, and McCoach (2011) argue against computing the RMSEA for low df models at all. A confidence interval can be computed for the RMSEA. Ideally the lower value of the 90% confidence interval includes or is very near zero (or no worse than 0.05) and the upper value is not very large, i.e., less than .08. The width of the confidence interval is very informative about the precision of the estimate of the RMSEA. A value less than .08 is generally considered a good fit (Hu & Bentler, 1999).

Descriptive Fit

Incremental (sometimes called relative) fit indices are analogous to R²: a value of zero indicates the worst possible model and a value of one indicates the best possible. In that respect, the model(s) of most interest are essentially placed on a continuum ranging from the null, or independence, model (worst) to the ideal (a perfectly fitting model), with the theoretical frameworks typically falling in between. The Bentler-Bonett Index (1980), or Normed Fit Index (NFI), is credited as being one of the very first measures of incremental fit proposed in the literature. The best model is defined as the model with a \chi^2 of zero and the worst model by the \chi^2 of the null model. Formulaically, the index can be seen as:

NFI = \dfrac{\chi^2_{null} - \chi^2_{model}}{\chi^2_{null}}

Traditionally, a value between .90 and .95 is considered marginal, above .95 is good, and below .90 is considered to indicate a poor fitting model.
A major disadvantage of this measure is that it can never become smaller as more parameters are added to the model. That is, there is a penalty of 0 for complexity; the more parameters added to the model, the larger the index. When comparing models with vastly different specifications, as in this study, the NFI does not perform well and the differences it shows can be quite misleading. One remedy is to use the Tucker-Lewis Index, or Non-normed Fit Index (NNFI), which overcomes the non-penalty problem of the Bentler-Bonett index. The Tucker-Lewis index has such a penalty and leverages the historically preferred practice of examining χ²/df. The TLI is computed as follows:

TLI = [(χ²_null/df_null) − (χ²_model/df_model)] / [(χ²_null/df_null) − 1]

A weakness of the correction is that the index can rise above 1; however, it is capped at 1 for practical purposes. Interpreted just as the Bentler-Bonett index, values closer to 1 indicate better fit. An artifact is that, for a given model, a lower χ²/df (as long as it is not less than one) implies a better fitting model. The penalty for complexity is χ²/df; that is, if that ratio does not change, the TLI does not change. Also worth noting, the TLI depends on the average size of the correlations in the data: if the average correlation between variables is not high, the TLI will not be very high.

Comparative Fit Indices

Akaike Information Criterion (AIC)

The AIC is a comparative measure of fit and so it is meaningful only when two different models are estimated. Lower values indicate a better fit, so the model with the lowest AIC is the best fitting model. There are somewhat different formulas given for the AIC in the literature, but those differences are not really meaningful because it is the difference in AIC between models that matters:

AIC = χ² + k(k − 1) − 2df,

where k is the number of variables in the model and df is the degrees of freedom of the model. Note that k(k − 1) − 2df equals the number of free parameters in the model. The AIC makes the researcher pay a penalty of two for every parameter that is estimated.

Bayesian Information Criterion (BIC)

Whereas the AIC has a penalty of 2 for every parameter estimated, the BIC increases the penalty as sample size increases:

BIC = χ² + ln(N)[k(k + 1)/2 − df],

where ln(N) is the natural logarithm of the number of cases in the sample. (If means are included in the model, k(k + 1)/2 is replaced with k(k + 3)/2.) As can be seen, the BIC places a very high value on parsimony.

The Sample-Size Adjusted BIC (SABIC)

The sample-size adjusted BIC, or SABIC, like the BIC places a penalty for adding parameters based on sample size, but not as high a penalty as the BIC. The SABIC is not given in Amos, but is given in Mplus. Several recent simulation studies (Enders & Tofighi, 2008; Tofighi & Enders, 2007) have suggested that the SABIC is a useful tool in comparing models. Its formula is:

SABIC = χ² + ln[(N + 2)/24][k(k + 1)/2 − df].

With all of the comparative fit indices, the goal is to obtain values as close to 0 as possible. Essentially, when comparing two competing models, the one with the lower comparative fit index provides the better explanation of the data, especially when referencing those indices that penalize or adjust for model complexity and increasing sample size.
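To make the formulas in this section concrete, the following is a minimal sketch in Python, written for this discussion only and not part of the study's analysis code, that computes the indices just described from a model's χ², degrees of freedom, sample size N, and number of observed variables k. The function names are mine, and the example values at the end are purely illustrative.

```python
import math

def rmsea(chisq, df, n):
    """RMSEA: sqrt((chisq - df) / (df * (N - 1))), floored at zero when chisq < df."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

def nfi(chisq, chisq_null):
    """Bentler-Bonett Normed Fit Index."""
    return (chisq_null - chisq) / chisq_null

def tli(chisq, df, chisq_null, df_null):
    """Tucker-Lewis Index (Non-normed Fit Index)."""
    ratio_null = chisq_null / df_null
    return (ratio_null - chisq / df) / (ratio_null - 1.0)

def aic(chisq, k, df):
    """AIC as given in the text: chisq + k(k - 1) - 2*df."""
    return chisq + k * (k - 1) - 2 * df

def bic(chisq, k, df, n):
    """BIC as given in the text: chisq + ln(N) * [k(k + 1)/2 - df]."""
    return chisq + math.log(n) * (k * (k + 1) / 2 - df)

def sabic(chisq, k, df, n):
    """Sample-size adjusted BIC: chisq + ln((N + 2)/24) * [k(k + 1)/2 - df]."""
    return chisq + math.log((n + 2) / 24) * (k * (k + 1) / 2 - df)

# Illustrative only: chi-square = 2210.48 on df = 1224 with N = 1423 gives RMSEA of about 0.024.
print(round(rmsea(2210.48, 1224, 1423), 3))
```

As noted above, the comparative indices (AIC, BIC, SABIC) acquire meaning only when two or more candidate models are compared on the same data.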
Construct Stability, Measurement Invariance and Validity Evidence

What should be evident thus far is that, depending on the theoretical viewpoint being used to assess the comparability/interchangeability arguments, the methodology employed could focus on equating, test specification matches, alignment, measurement invariance in the spirit previously reviewed, or, most likely, a convoluted combination of all of the above. Regardless of how one wishes to frame the issue, it is always a validity question, as it speaks directly to the intended inferences or uses of scores and of the assessments that produce those scores. The heart of assessment is the construct and the generally conceived notion that one is measuring what one targets to measure for the intended uses and purposes. That is truly a massive set of assumptions requiring multiple inputs to assess. As such, there is guidance available that is not meant to be exhaustive in any respect but that helps to appropriately categorize many of the analyses responsible psychometricians and test developers already undertake in the process of accumulating validity evidence for a large-scale assessment. Messick (1989, 1995) identifies five sources of evidence to support construct validity: content, response process, internal structure, relations to other variables, and consequences. These are not different types of validity; they are best thought of as categories of evidence that can be collected to support the construct validity of inferences made from assessment scores. Evidence should always be sought from several different sources to support any given interpretation, and strong evidence from one source does not negate the need to seek evidence from other sources. While accumulating evidence, one should specifically consider two threats to validity: inadequate sampling of the content domain (construct underrepresentation) and factors exerting nonrandom influence on scores (bias, or construct-irrelevant variance). I will return to the issue of inadequate sampling of the content domain in a later section; the second factor, however, is a major source of concern in the proposed study. Nonrandom influence on scores is especially difficult to contain and identify. In terms of apportioning variance, a frame of reference or theoretical viewpoint must be stated clearly, or else the task becomes rather difficult, as sources of bias may be considered relevant or irrelevant under particular uses and purposes. Depending on the intended use, to what extent does teaching a student actually contribute construct-irrelevant variance? Is the delivery of curriculum actually a nonrandom influence on scores? Such questions may at first make you chuckle; when is teaching a bad thing? However, the concept might not be so odd when trying to conceptualize growth. If an instrument is intended to measure whether a student has reached a certain achievement level, as measured by a standardized test designed to tap content taught up until a certain point in time, then it could be argued that any activity after that designated point in time could potentially contribute construct-irrelevant variance. In that context, it may also call into question whether a repeated-measures, test-retest paradigm using a parallel form, or the same form, of the assessment as a measure of growth would find supportive evidence to validate the instrument's use, and the inferences made, in that context.
Aside from policy considerations surrounding the use of measures for growth, there are also assumptions regarding instrument content that drive the appropriate measure and method of measuring and determining growth, and these may or may not be considered. This issue actually returns us to the concern of inadequate content sampling, an issue of test development and design.

Content Based Evidence

For the purpose of making decisions on a student's status measure, whether via a proficient/not-proficient or a multi-level designation based on that score, the practice of creating a table of test specifications based on content standards and creating a content weighting scheme of what needs to be assessed is typically the beginning. Items are then written to the table of specifications and reviewed by content experts who are able to judge the appropriateness of the items as well as the depth of knowledge (DOK) and to ensure there is no initial hint of potential bias toward one or more subgroups. Items are then field tested, the best items are selected based on a carefully balanced blend of psychometric qualities and alignment to the test blueprint and table of specifications, and those items form the foundation of an operational assessment. Following the first operational assessment, it is common to delay the reporting of results for a short time. This allows stakeholders and appropriate personnel to articulate achievement level descriptors (ALDs), which are the first step in bridging performance on the assessment with the holistic expectation linked to the ALDs. Following this, standard setting meetings take place in which cut scores are recommended. Standard setting takes many forms, but at the heart of most commonly used methods are procedures that leverage performance data on items and total scores. The recommendations are then taken to the appropriate approving bodies for the purpose of becoming policy. All of this is a very long and deliberate process that takes years to accomplish and, if done properly, results in a strong assessment for its intended uses and purposes. Up to this point, it is important to note that the only intended use and purpose supported is use as a status measure of achievement, in as much as the instrument still abides by the predefined blueprints and tables of specifications developed during the initial scaling and used for standard setting, where the score can be applied toward the criterion-referenced targets and a performance level derived. The language contained within the ALD becomes the operational definition of the intended inference for which validity evidence has been accumulated, and this is a status measure only.

Factorial Evidence

Test development processes support the accumulation of content-related evidence toward the construct validity argument and suggest an internal structure (Messick, 1995; Cook & Beckman, 2006). Reliability and factor analysis data are generally considered evidence of internal structure. That is to say, scores intended to measure a single construct should yield homogeneous results, whereas scores intended to measure multiple constructs should demonstrate heterogeneous responses in a pattern predicted by the constructs. Just as constructs can be defined by blueprints and tables of specifications, they can also be implicitly assumed and defined by a choice of measurement model (e.g., Rasch, 2PL, 3PL, GPCM).
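As a purely illustrative sketch of the distinction among these measurement models, the snippet below evaluates dichotomous item response functions under the Rasch and 2PL specifications. The item parameter values are invented for illustration, and the logistic form here omits the 1.7 scaling constant sometimes used to approximate the normal ogive.

```python
import numpy as np

def irf_2pl(theta, a, b):
    """Two-parameter logistic item response function: P(correct | theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)

# Hypothetical item difficulties (not estimates from any real instrument).
difficulties = [-1.0, 0.0, 1.2]

# Rasch specification: every item shares a single, common discrimination.
rasch_probs = [irf_2pl(theta, a=1.0, b=b) for b in difficulties]

# 2PL specification: each item is allowed its own discrimination.
twopl_probs = [irf_2pl(theta, a=a, b=b)
               for a, b in zip([0.6, 1.0, 1.8], difficulties)]
```

Under the Rasch specification the implied item response curves differ only in location; the 2PL relaxes exactly that equal-discrimination constraint, and the 3PL and GPCM relax further assumptions (guessing and polytomous scoring, respectively).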
These IRT-based models look precisely at the all-important person-by-item interaction that the blueprint itself does not address, nor is purposed to address. When data are scaled using one of the unidimensional IRT models, the assumption is that there is one underlying construct measured by the collection of operational (scored) items on the assessment. Interestingly, the specification of a Rasch model implies that the items measure the construct equally well, in that the discrimination parameters are assumed to be equal. It is important to point out that internal consistency should be seen as a necessary but not sufficient condition for homogeneity or unidimensionality in a sample of test items. Essentially, that conception of reliability assumes that unidimensionality exists in the sample of test items (Tavakol & Dennick, 2011; Green, Lissitz & Mulaik, 1977), and of course the fitting of a unidimensional IRT model makes the same presumption. Furthermore, systematic variation in responses to specific items among subgroups expected to perform similarly (i.e., DIF) suggests a flaw in internal structure, whereas confirmation of predicted differences provides supporting evidence in this category. Dimensionality is a characteristic of the interaction of persons and items. In the context of the proposed study, if students on the first occasion of an assessment consistently answer a question one way and on subsequent administrations answer another way, regardless of other responses, this will weaken (or support, if this was expected) the validity of the intended interpretations with respect to the desire to generalize the analysis. In this context, a lack of DIF can be considered an associated supporting measure, but not a necessary precondition, for measurement invariance. The current study does not propose to examine DIF directly, but the author introduces it here as an analogue to the invariance issue. DIF is essentially a lack of invariance at the item level, which in the IRT paradigm plays out as different item parameters and in factor analytic terms leads to a lack of equality in factor loadings, that is, a failure of metric invariance. With regard to relations to other variables, correlation with scores from another instrument or outcome with which correlation would be expected, or a lack of correlation where it would not be expected, supports interpretation consistent with the underlying construct. This idea can be extended to other variables believed to account for variability, where a lack of a significant relationship is one way to accumulate evidence for the construct validity of inferences via the correlation with scores or variables that theoretically should (or should not) be related to the score of interest. Both of these ideas, support for the internal structure and the relationship to other variables, become crucial when the focus of assessment scores (i.e., the intended uses) switches to supporting not only status but also growth. Of course, one should never forget that assessment scores should be instructionally relevant, given that an obvious, yet unintended, consequence of assessment is that curriculum decisions can become somewhat guided by the potential content of a high-stakes assessment (Perie et al., 2007). One can only reason, rationally, that the practice of tying assessment scores to evaluation and accountability would tend to get the attention of those being evaluated and held accountable.
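Although DIF is not examined in this study, a minimal sketch of one common item-level screen, a logistic regression of item responses on the total score, a group indicator, and their interaction, may help make the analogue above concrete. Everything in the snippet is a hypothetical illustration: the simulated data, variable names, and the choice of screen are assumptions introduced here, not procedures used in the study.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

group = rng.integers(0, 2, n)                     # 0 = first occasion, 1 = later occasion (hypothetical)
total = rng.normal(30, 8, n)                      # matching variable: total raw score
p = 1.0 / (1.0 + np.exp(-(total - 30) / 8))       # response probability tied only to total score (no DIF built in)
item = (rng.random(n) < p).astype(int)            # simulated responses for a single item

# Predictors: total score, group membership, and their interaction.
X = sm.add_constant(np.column_stack([total, group, total * group]))
fit = sm.Logit(item, X).fit(disp=0)
print(fit.params)  # large group or interaction coefficients (relative to their SEs) would flag uniform or nonuniform DIF
```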
In growth modeling, the correlation or relation of the construct to the variable of time (which is confounded with instruction: its type, kind, quality, and so on) is a potential threat to, or enhancing piece of, the validity argument; as has been discussed, this is buried in the intended (whether implicit or explicit) uses of the assessment scores. That is to say, it is an issue of measurement invariance over time. In a factor analytic paradigm, we need at least to be able to assume configural invariance, such that the data collected at each point in time decompose into the same number of factors, with the same items associated with each factor (Meredith, 1993). If that hypothesis holds, there is support for the assumption that participants belonging to different groups conceptualize the constructs in the same way (Riordan & Vandenberg, 1994). It is this specific issue, whether participants in the different groups specific to this study (i.e., students exposed to differing levels of instruction and additional years of cognitive growth) conceptualize the mathematics achievement construct in the same way, that is examined here within the confines of the measurement invariance paradigm.

Chapter 3: Method

This study utilizes original census data from the statewide administration of an 8th grade mathematics assessment, as well as data collected as part of a special study designed to examine off-level testing behavior of students. The instrument under investigation is the base form for that assessment, containing the collection of item responses for all students tested under standard conditions. It was administered initially in Fall 2009 to then 8th grade students as their summative assessment (technically assessing 7th grade content). The same instrument was administered to subsamples of then-current 8th, 9th, and 10th grade students in the Spring of 2011. This creates four sets of student responses: the initial census sample tested "on grade," a sample one grade of instruction above the intended level, a sample two grades of instruction above the intended level, and a sample three grades of instruction above the intended level. For purposes of the current study, the intended population will be referred to as the control condition and the other groups as treatment groups 1-3 for the 8th, 9th, and 10th grade samples respectively.

Data

During the Fall 2009 administration, 118,891 then eighth grade students sat for the assessment in question over a two-week period in October. This sample is considered a census in that everyone tested under standard conditions received the same standard form of content, the only difference being unique sets of field test items embedded at the same locations across the various forms. In as much as this is considered a census, no attempt is made to compare it to a larger group. Treatments 1-3 are convenience samples obtained from schools volunteering to take part in a study designed to investigate the viability of off-grade testing. They were not drawn to be representative of the state 8th grade testing population, but some degree of similarity was desired. It is important to note that the current study is a re-analysis of pre-existing data sets. The data sets were provided to the author as deidentified longitudinal student profiles already linking previous assessment performance, demographics, and data from the study of off-grade testing.
While this was desirable in that the data were completely anonymous to the researcher, it also prevented many follow-up questions that would have been helpful after the analysis was complete. Some of these issues are elaborated further in the discussion and limitations sections. Table 1 presents the sample characteristics in terms of gender, ethnicity, special education designation, limited English proficiency classification, and eligibility for free/reduced lunch (used here as a proxy for economically disadvantaged status) for all of the samples.

Table 1 - Sample Demographic Characteristics

                    Control            Treatment 1       Treatment 2       Treatment 3
                    Fall 2008          Spring 2009       Spring 2010       Spring 2011
                    8th Grade          8th Grade         9th Grade         10th Grade
                    Count    Percent   Count   Percent   Count   Percent   Count   Percent
Total Sample        118891   100       1423    100       547     100       644     100
Gender
  Female            58645    49.3      721     50.7      276     50.5      347     53.9
  Male              60246    50.7      702     49.3      271     49.5      297     46.1
Ethnicity
  1                 1125     0.9       6       0.4       7       1.3       4       0.6
  2                 3025     2.5       0       0         4       0.7       6       0.9
  3                 21643    18.2      344     24.2      144     26.3      150     23.3
  4                 5132     4.3       113     7.9       22      4         28      4.3
  5                 86431    72.7      901     63.3      366     66.9      455     70.7
  6                 1395     1.2       21      1.5       4       0.7       1       0.2
  7                 80       0.1       0       0         0       0         0       0
  8                 60       0.1       0       0         0       0         0       0
  9                 0        0         38      2.7       0       0         0       0
SES
  Non-ED            66552    56        605     42.5      282     51.6      392     60.9
  ED                52339    44        818     57.5      265     48.4      252     39.1
LEP
  Non-LEP           114962   96.7      1370    96.3      536     98        629     97.7
  LEP               3929     3.3       53      3.7       11      2         15      2.3
Spec. Ed.
  Non-SE            106215   89.3      1326    93.2      529     96.7      615     95.5
  SE                12676    10.7      97      6.8       18      3.3       29      4.5

Instrument

The assessment contained 51 operational multiple-choice items. Michigan is a fall-testing state, so the fall testing encompasses the previous year's content expectations only. The content standards referred to are the Michigan Mathematics Grade Level Content Expectations (GLCEs) in effect during the testing related to this study (http://www.michigan.gov/documents/MathGLCE_140486_7.pdf). The expectations are divided into strands with multiple domains within each. In practice, the skills and content addressed in these expectations are woven together into a coherent mathematics curriculum. The domains in each mathematics strand are broader, more conceptual groupings. In several of the strands, the "domains" are similar to the "standards" in Principles and Standards for School Mathematics from the National Council of Teachers of Mathematics. For this assessment, five strands are eligible for assessment: Numbers and Operations (N), Algebra (A), Measurement (M), Data and Probability (D), and Geometry (G). For this particular grade and content area in this particular year, the test blueprint did not include Measurement (M) items. Therefore, only items belonging to the A, D, G, and N strands appeared on the assessment, with the following item counts (out of 51), respectively: 22, 6, 9, and 14. What is clear is that the Numbers and Operations and Algebra items are most heavily weighted in the blueprint. The strands are mutually exclusive categories. Scores are reported at the strand level, but not as a scale score; a raw score as well as the total possible are given. Scale scores are based on item response theory scaling (specifically under the Rasch model) and represent a unidimensional scaling of all of the operational items together to form the underlying θ; the scale score is a linear transformation of that value. Table 2 lists the breakdown of items by strand.
Table 2 - Assessment breakdown by strands (numerals are item numbers)

Strand A: 1, 2, 3, 4, 11, 13, 14, 15, 16, 17, 18, 24, 27, 28, 29, 33, 34, 44, 46, 47, 48, 49
Strand D: 19, 20, 21, 22, 50, 51
Strand G: 12, 35, 36, 37, 38, 39, 40, 41, 42
Strand N: 5, 6, 7, 8, 9, 10, 23, 25, 26, 30, 31, 32, 43, 45

The blueprints underwent formal alignment procedures using the methodology developed by Norman Webb (http://www.michigan.gov/documents/Alignment_Analysis_of_Grades_3-8_Mathematics_Standards_and_the_MEAP_165665_7.pdf). While other models exist, this is a popular approach that state K-12 testing programs have used to satisfy the requirements for demonstrating alignment with the state standards in the given content area. Furthermore, careful item construction procedures are followed such that items are commissioned from individuals with proper credentials to serve as content experts. The items are then vetted in an initial review with state department-level content leads, at which point they are either accepted, denied, or denied with revision requests. Those items that survive move on to an initial review by referent groups, commissioned by both the MDE and the development contractor, to determine: 1) Is the item reflective of the intended content standard? 2) Is the item written to the appropriate level of cognitive complexity (DOK, as defined by the Webb procedures in this case)? 3) Does the item contain any language/text/symbols/images that would unfairly advantage or disadvantage any subgroup of the intended population? 4) Is the content of the item appropriate for the grade level? Items, at this point, can be cleared for field testing immediately, cleared for field testing following revision, or marked as do not use (DNU). Items that survive are then field tested, where a large array of statistics are calculated for each item; the results are then taken back to the appropriate referent groups for further review, to be used in combination with expert judgment to deem whether an item should be marked for further revision and additional field testing, marked ready for operational use, or marked do not use. As alluded to in the review of the literature and policies, it typically takes 12-18 months and a minimum of $2,000 to produce an item used in operational assessment. It is crucial for the reader to understand this because it provides a very multifaceted, and very real, sense of the perspective that those in high-stakes assessment must take. That is, an enormous amount of faith is put into the test blueprint and specifications, understandably so, given the time, money and effort put into the many activities surrounding them. They hold the key to the validity arguments; if the blueprint is flawed or the construct is flawed, the measurement cannot be valid for the intended inferences.

Analysis

As mentioned previously, a big assumption of CFA is that one has a model one wants to confirm. It is not uncommon to start with an assessment blueprint as a confirmation approach (Thompson, 2004). However, blueprints really define the intended content of an item as it relates to a curriculum or content standard. The interaction of a person with an item involves much more than an intended content standard. Factor analysis deals with the interaction of the person with the measurement device, so it may not necessarily be fruitful to consider a model built entirely on item specifications as a solid baseline.
Therefore, I propose to use an additional baseline model, which is the Rasch model for dichotomous responses (Rasch, 1960; Wright & Stone, 1979). This was the model used to scale the original data and provide the scale scores in question. The model consists of a single latent construct measured by the collection of 51 operational items on the assessment, each with a unique error component. The second model is based on the assessment blueprint and references the content strands of Table 2. There are four latent variables/factors, each measured by the observed variables referenced in that table. Each of those measured variables has a unique error component associated with it, and the four latent factors are assumed to be correlated. The instrument used in this study does not utilize a partial credit model, and as a result all of the items are dichotomously scored. Because of this, multiple-group CFA measurement models with binary indicators require a different parameterization, which in turn requires modifications to the aforementioned procedures (Jöreskog & Moustaki, 2001; Millsap & Yun-Tein, 2004; Muthén & Asparouhov, 2002). Essentially, each item on the measure is connected to its respective construct through a latent continuous response variable. This variable is cut by m − 1 threshold parameters, where m represents the number of item score categories. Analyses are then based on a matrix of tetrachoric correlations. The latent response variables require additional scaling factors in order to assess group differences in the common factor mean and variance. To identify the model, the following steps must be taken: (1) the intercept parameters for all latent response variables must be fixed to 0 in the first group; and (2) uniqueness variances need to be fixed to unity in the first group. As with any standard multiple-group confirmatory factor analysis, additional constraints are necessary in order to place the common-factor mean and variance on the same metric across groups. The two most commonly referenced approaches for achieving this end are presented by Millsap and Yun-Tein (2004) and by Muthén and Asparouhov (2002). The Millsap and Yun-Tein approach requires that the first m − 1 thresholds be constrained across all groups and that a second threshold or uniqueness (in the case of binary items, there would be no second threshold) be constrained for one reference item in each group. Similarly, the Muthén and Asparouhov approach requires that thresholds and loadings be constrained in a reduced model and that tests of selected items be conducted against a full model in which thresholds and loadings for those items are freed, while maintaining model identification by fixing the specific variance to unity for the selected items.
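The following short sketch (illustrative only; the loading and threshold values are invented and this is not the study's estimation code) evaluates the probit item response function implied by this latent-response formulation under a theta-style parameterization in which the residual variance is fixed to 1. It also shows the commonly used correspondence to IRT-style parameters: discrimination proportional to the loading, and difficulty equal to the threshold divided by the loading.

```python
from math import erf, sqrt

def probit_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_correct(eta, loading, threshold):
    """P(y = 1 | eta) for a binary indicator: Phi(loading * eta - threshold),
    with the residual variance of the latent response variable fixed to 1."""
    return probit_cdf(loading * eta - threshold)

# Hypothetical loading and threshold for a single item (not estimates from this study).
lam, tau = 0.8, 0.3

# IRT-style reparameterization in the probit metric.
a = lam            # discrimination is proportional to the factor loading
b = tau / lam      # difficulty is the threshold divided by the loading

for eta in (-2, -1, 0, 1, 2):
    assert abs(p_correct(eta, lam, tau) - probit_cdf(a * (eta - b))) < 1e-12
```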
However, it is important to note that the documentation of these paradigms in the Mplus user's guide (Muthén & Muthén, 1998-2012) contains an important discussion of critical differences: the progression takes place in a different sequence when continuous variables are considered versus categorical (here, dichotomous) variables. This is for a couple of reasons, chiefly the unique and somewhat simplified statistical properties of binary variables, in addition to the fact, true here, that many of the measurement models based on these binary outcomes are themselves part of an IRT-based solution that leads to item characteristic curves (referred to in Muthén and Muthén as item probability curves). As such, the constraining of thresholds and factor loadings takes place in tandem, since these parameters represent the IRT difficulty/scale location parameters and discrimination/scaling factor parameters, respectively. As a result, there are fewer steps than one would typically see in an invariance study using continuous variables; with continuous variables the invariance progression typically goes through four iterations in which various parameters are constrained, making the multi-group models more stringent as the progression occurs. The steps involved depend on the particular paradigm chosen and the parameterization schema selected. It is recommended that, when underlying IRT models are assumed and when most, if not all, observed variables are categorical, the weighted least squares estimator with the Theta parameterization (versus the Delta parameterization, the Mplus default) be preferred. The key issue is that for categorical outcomes, the measurement parameters of interest are truly the factor loadings and the threshold parameters. Under the Delta parameterization, scale factors are also considered, whereas the Theta parameterization adds the ability to examine residual variances in addition to the other parameters. Variances for the continuous latent response variables (factors) are estimated, but residual variances for the observed categorical indicators are not estimated. The parameters that are estimated and fixed can be found in Tables 3-6 for both Model 1 (Rasch) and Model 2 (the blueprint-based approach; table of specifications). They are presented in the sequence used for measurement invariance testing.

Table 3 - Model 1, Configural Invariance Parameter Constraints

Control Group
  Loadings 1-51       Constrained to be equal within group per Rasch requirements
  Thresholds 1-51     Free
  Residuals 1-51      Fixed to 1
  Factor means        Fixed to 0 for factor 1
  Factor variances    Fixed to 1 for factor 1
Treatment Group
  Loadings 1-51       Constrained to be equal within group per Rasch requirements
  Thresholds 1-51     Free
  Residuals 1-51      Fixed to 1
  Factor means        Fixed to 0 for factor 1
  Factor variances    Fixed to 1 for factor 1

Table 4 - Model 1, Metric Invariance Parameter Constraints

Control Group
  Loadings 1-51       Held equal across groups
  Thresholds 1-51     Free
  Residuals 1-51      Fixed to 1
  Factor means        Fixed to 0 for factor 1
  Factor variances    Fixed to 1 for factor 1
Treatment Group
  Loadings 1-51       Held equal across groups
  Thresholds 1-51     Free
  Residuals 1-51      Fixed to 1
  Factor means        Fixed to 0 for factor 1
  Factor variances    Fixed to 1 for factor 1

The procedures followed for Model 2 (blueprint based) are similar to those presented in Tables 3 and 4. The differences come with the addition of multiple underlying factors and the correlations among those factors.
Similar to Model 1, the configural and metric invariance tests are presented in sequence.

Table 5 - Model 2, Configural Invariance Parameter Constraints

Control Group
  Loadings 1-51       Constrained to be equal within groups and factors, per Rasch requirements
  Thresholds 1-51     Free
  Residuals 1-51      Fixed to 1
  Factor means        Fixed to 0 for factor 1
  Factor variances    Fixed to 1 for factor 1
Treatment Group
  Loadings 1-51       Constrained to be equal within groups and factors, per Rasch requirements
  Thresholds 1-51     Free
  Residuals 1-51      Fixed to 1
  Factor means        Fixed to 0 for factor 1
  Factor variances    Fixed to 1 for factor 1

Table 6 - Model 2, Metric Invariance Parameter Constraints

Control Group
  Loadings 1-51       Held equal across groups, within factors
  Thresholds 1-51     Free
  Residuals 1-51      Fixed to 1
  Factor means        Fixed to 0 for factor 1
  Factor variances    Fixed to 1 for factor 1
Treatment Group
  Loadings 1-51       Held equal across groups, within factors
  Thresholds 1-51     Free
  Residuals 1-51      Fixed to 1
  Factor means        Fixed to 0 for factor 1
  Factor variances    Fixed to 1 for factor 1

In the literature review section regarding fit indices, multiple types with differing purposes were presented. What is clear in the literature is that some indices are more appropriate than others and that there is a great degree of correlation among the indices. The rationale for one being better than another, given an appropriate situation, is based primarily on theoretical arguments and partially on the pride of the author. Additionally, an important consideration must necessarily be whether the analysis software and estimation paradigm are able to produce the desired metrics. Typically, the program (as is true in the case of Mplus) will disallow the calculation of an inappropriate index. The TLI, CFI and RMSEA are used in the current study for all of the named rationales. These indices are useful for determining model fit, which is only part of the research questions. As pointed out in literature reviews on the topic of goodness-of-fit indices (e.g., Cheung & Rensvold, 2002), most researchers take a market-basket approach to the use of the indices, as none is known to be "true and accurate" and many have limitations depending on the structure and nature of the data. For the purpose of determining whether students have in fact shown substantively significant growth, two approaches were taken. The first makes several assumptions, the most critical being measurement invariance of the instrument over time. That is, the test performance of students who had already been measured with a parallel form of the assessment during the correct time period was scored during the experimental period using the same raw-to-scale-score conversion table used during the initial scaling of the instrument. So there is a definite assumption in place that the IRT parameters are invariant over time. Nevertheless, the initial administration of the assessment as well as the follow-up conducted during the treatment administration allowed two estimates of each student's scale score, as well as estimates of the student's performance level and sub-performance level. Specifically, I applied the agency's transition table (see Table 7) to the pre and post measurements to determine whether the students in the various treatments had at least achieved the amount of growth expected given instruction and the increased rigor of performance and content expectations.
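As an aside, the band structure of the transition table (Table 7, reproduced below) can be expressed compactly: remaining at the same sub-performance level is scored M, moving up or down one or two sub-levels is I or D, and moving three or more sub-levels is SI or SD. The sketch below encodes that rule for illustration only; the sub-level labels follow the reconstruction of Table 7, and the function name is introduced here rather than taken from the agency's materials.

```python
# Sub-performance levels ordered lowest to highest, following Table 7.
SUB_LEVELS = [
    "NP-Low", "NP-Mid", "NP-High",   # Not Proficient
    "PP-Low", "PP-High",             # Partially Proficient
    "P-Low", "P-Mid", "P-High",      # Proficient
    "Adv-Mid",                       # Advanced
]

def transition_category(pre_level, post_level):
    """Performance level change category implied by the band structure of Table 7."""
    diff = SUB_LEVELS.index(post_level) - SUB_LEVELS.index(pre_level)
    if diff == 0:
        return "M"                          # Maintain
    if diff > 0:
        return "I" if diff <= 2 else "SI"   # Improvement / Significant Improvement
    return "D" if diff >= -2 else "SD"      # Decline / Significant Decline

print(transition_category("NP-Mid", "PP-High"))  # 'SI': moved up three sub-levels
```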
Although the use of the IRT raw-to-scale-score tables and the transition tables is outside the scope of the intended use of these measures, the author believes it provides a useful context within which to discuss the various results.

Table 7 - Michigan MEAP Transition Table
NOTE: SI = Significant Improvement, I = Improvement, M = Maintain, D = Decline, SD = Significant Decline

                                       Grade X + 1 MEAP Achievement
                           Not Proficient     Part. Proficient   Proficient          Advanced
Grade X MEAP Achievement   Low   Mid   High   Low   High         Low   Mid   High    Mid
Not Proficient   Low       M     I     I      SI    SI           SI    SI    SI      SI
Not Proficient   Mid       D     M     I      I     SI           SI    SI    SI      SI
Not Proficient   High      D     D     M      I     I            SI    SI    SI      SI
Part. Proficient Low       SD    D     D      M     I            I     SI    SI      SI
Part. Proficient High      SD    SD    D      D     M            I     I     SI      SI
Proficient       Low       SD    SD    SD     D     D            M     I     I       SI
Proficient       Mid       SD    SD    SD     SD    D            D     M     I       I
Proficient       High      SD    SD    SD     SD    SD           D     D     M       I
Advanced         Mid       SD    SD    SD     SD    SD           SD    D     D       M

As discussed in the introduction, the current study is not meant to create a new methodology, nor is the author introducing a methodology that has never been used. The purpose of this study is to step back and examine how well some basic assumptions hold up when a measure is used as a status index before we go forward and assume that the measure is also appropriate for use as a measure of growth. I believe the extent to which the IRT model holds over time, in some ways, addresses the extent to which it may be appropriate to generalize these measures to a pre-test/post-test type of situation. Similarly, the degree to which the content- or blueprint-based model holds over time is informative in that it speaks to the extent to which these subscales hold their meaning over time.

Chapter 4: Results

In this section, findings of the measurement invariance analyses performed on the MEAP Grade 8 Mathematics assessment are described in detail. The overall goals of this study were to: (1) evaluate the fit of both a single-factor (Rasch) model and a four-factor (blueprint-based test design) model in the original administration data, and (2) determine the extent to which those models are invariant to additional years of instruction. That is, measurement invariance paradigms are evaluated on the focus groups of the study. To this end, the results are sequenced as follows: (1) descriptive statistics of past performance for the groups of data referenced, (2) confirmatory factor analysis results for the proposed models, and (3) measurement invariance tests for the groups in question.

Descriptive Statistics/Previous Achievement

As all of the study participants had previously been administered a parallel and equated form of this assessment in October of their 8th grade year, previous performance was available and is presented in tabular form for comparison purposes. Tables 8 and 9 express prior performance in two different ways. In Table 8, mean scale scores and standard deviations are presented to give an indication of central tendency and variability differences among the groups, compared to the original census administration of the particular form of interest. The distributions of student scale scores (pre and post) are given in Appendix A. As can be seen, the study conditions were more variable and were centered on different means. Follow-up paired t-tests on mean differences (invoking the Scheffé procedure to control for inflated familywise Type I error) revealed that all treatments were significantly different from the assumed census population value. Furthermore, they were significantly different from each other.
Table 8 - Previous Performance (Mean Scale Score)

              N        Mean Scale Score   Scale Score SD
Census        118851   818.20             27.70
Treatment 1   1423     814.96             29.39
Treatment 2   547      824.91             33.72
Treatment 3   644      821.04             28.75

Additionally, ordinal performance levels (1 = Advanced, 2 = Proficient, 3 = Partially Proficient and 4 = Not Proficient) are presented along with the percent proficient (the sum of the percentages in performance levels 1 and 2, used for accountability and reporting purposes). This metric is interesting in how it differs from Table 8: here, only Treatment 2 showed significant proportional differences from the rest of the groups, with a significantly lower proportion of proficient students compared to the census and the other treatments. These results highlight somewhat of a paradox in communicating results: while there are significant mean differences, the categorical placement of these students in terms of their pass/fail status was remarkably similar. This is the influence of the criterion-referenced cut-score placement.

Table 9 - Previous Performance (Percent Proficient)

              N        % PL1   % PL2   % PL3   % PL4   % Proficient
Census        118851   42.7    31.8    18.5    7.0     74.5
Treatment 1   1423     34.4    39.6    21.3    4.8     74.0
Treatment 2   547      45.9    23.4    21.8    9.0     69.3
Treatment 3   644      34.6    40.4    20.8    4.2     75.0

Confirmatory Factor Analysis of One- and Four-Factor Models by Sample

Prior to submitting the data to multigroup invariance testing, the fit of the two measurement models (single factor and four factor) was evaluated for each treatment sample and for the control sample in a CFA framework. The extent to which each of the models fit was examined using Mplus v. 7.11 (L. Muthén & B. O. Muthén, 2013). WLSMV estimation with a probit link and the Theta parameterization was used to estimate all models (L. Muthén & B. Muthén, 2013). WLSMV provides weighted least squares parameter estimates using a diagonal weight matrix, with standard errors and a mean- and variance-adjusted chi-square test statistic that use a full weight matrix (B. Muthén, du Toit, & Spisic, 1997). Model fit was evaluated with the CFI, TLI, and RMSEA. For the CFI and TLI, values above .95 indicate good fit; for the RMSEA, a value less than .06 is considered to indicate good fit.

Table 10 - Group-level Model Fit (Single Factor Model)

Group            N        Chi-square   DF     # free parameters   TLI     CFI     RMSEA
Census           118851   121565.100   1224   102                 0.951   0.953   0.029
Grade 8 Study    1423     2210.484     1224   102                 0.966   0.967   0.024
Grade 9 Study    547      1558.031     1224   102                 0.975   0.976   0.022
Grade 10 Study   644      1679.457     1224   102                 0.956   0.957   0.024

For the single factor model (Table 10), all chi-square tests were significant (p < .0001); however, due to the binary nature of the variables and the WLSMV estimation utilized, the chi-squares are not trustworthy as global goodness-of-fit indices. The CFI, TLI and RMSEA indicate adequate fit of the single factor model in all of the groups.

Table 11 - Group-level Model Fit (Blueprint Based/Four Factor Model)

Group            N        Chi-square   DF     # free parameters   TLI     CFI     RMSEA
Census           118851   112980.015   1218   108                 0.954   0.956   0.028
Grade 8 Study    1423     2190.486     1218   108                 0.966   0.968   0.024
Grade 9 Study    547      1547.218     1218   108                 0.976   0.977   0.022
Grade 10 Study   644      1657.152     1218   108                 0.957   0.959   0.024

The results in Table 11 for the four-factor model also suggest adequate fit across all study groups. Of note is the small gain in fit (which depends on the index) obtained from the additional parameters of the four-factor model.
Although both models had adequate fit, the comparative gain was small.

Measurement Invariance Tests of One- and Four-Factor Models

In the discussion of invariance test procedures in previous sections, multiple levels of invariance were identified. Specifically, tests of the degree to which the configural, metric, scalar and strict parameterizations of measurement invariance hold across groups (in order from least to most restrictive) were outlined in Tables 3-6. When maximum likelihood estimation is used, all of these tests of invariance are possible. However, one of the challenges of using binary data in a confirmatory factor analysis paradigm is the reliance on alternative estimation procedures, such as the weighted least squares procedure invoked in Mplus. This technique requires that scale factors or residual variances be allowed to vary across groups; the metric invariance test constrains these to be equal, so there is an obvious disconnect. As a result, only the configural and scalar approaches are feasible within Mplus (Muthén & Muthén, 2013). For binary variables using weighted least squares estimation and the Theta parameterization, the configural setting has factor loadings and thresholds free across groups, residual variances fixed at one in all groups, and factor means fixed at zero in all groups; the metric of a factor is set by freeing all factor loadings and fixing the factor variance to one in all groups. The scalar setting has factor loadings and thresholds constrained to be equal across groups, residual variances fixed at one in one group and free in the other groups, and factor means fixed at zero in one group and free in the other groups. Again, the metric of a factor is set by freeing all factor loadings within a group and fixing the factor variance to one; the factor variance is fixed at one in one group and free in the other groups. Table 12 presents the results of the invariance studies. Models 1 and 2, the single factor and blueprint based models respectively, were introduced previously as the models of interest.

Table 12 - Measurement Invariance Study Results

Paradigm     Model                  Chi-square   DF     P-value   TLI     CFI     RMSEA
Configural   Single Factor/Rasch    5331.530     3672   p<.001    0.967   0.969   0.023
Scalar       Single Factor/Rasch    5677.130     3770   p<.001    0.964   0.964   0.024
Configural   4 Factor (blueprint)   5279.638     3654   p<.001    0.968   0.969   0.023
Scalar       4 Factor (blueprint)   5598.375     3740   p<.001    0.964   0.965   0.024

As can be seen from Table 12, the invariance testing revealed adequate fit for all of the models and paradigms put forth for analysis. The fit indices are all within acceptable limits and are similar to those found when the same models were posited within groups, before the between-group equality constraints were applied. The results indicate that configural invariance is supported between groups: the specified model, with zero loadings on non-target factors, fits the data well in all groups, and the same items load on the same factors for all of the groups considered. The finding that scalar invariance holds across the groups subsumes the assumptions of configural invariance (the same pattern of loadings applies) but also goes a step further: scalar invariance suggests that the factor loadings and thresholds can be considered identical across groups.
In multi-group confirmatory factor analysis terms, this suggests that the inter-item tetrachoric correlation matrices for all groups are statistically equivalent.

Chapter 5: Discussion

The current study was a journey whose overall goal was to evaluate the degree to which measurement invariance held across time for a high-stakes mathematics achievement instrument; put differently, the degree to which the measurement device would produce the same set of composite latent measurements over time, given an assumed increase in instructional time for each of the study conditions, in order to determine the degree to which the instrument might be sensitive to instruction. This is important given the increased emphasis being placed on growth measures in K-12 accountability systems. Policy makers will need to determine whether they wish to measure growth via a measure that is static and not sensitive (i.e., invariant) over the typical time trajectory used in such high-stakes decisions. In order to reach the conclusions brought forth so far and in the paragraphs to follow, the author proceeded to: (a) fit the model separately in each group; (b) fit the model in all groups allowing all parameters to be free; (c) fit the model in all groups holding factor loadings equal, to test the invariance of the factor loadings; and (d) fit the model in all groups holding factor loadings and intercepts equal, to test the invariance of the intercepts. By following this prescribed sequence I was able to address all of my research questions, which I will now step through sequentially.

Related to the original Fall 2009 administration and applied across groups: Does the Rasch model fit the data?

According to Table 10, the single factor measurement model (parameterized to replicate the Rasch model) appeared to fit the data well in each of the groups. With the sample sizes in this study it was not surprising to discover that the chi-square tests were all significant, so alternative model fit indices were referenced, as appropriately recommended in the literature. Perhaps what was surprising is that the best fit was not for the census population; rather, the study condition group data appeared to fit the model a bit better than the larger group. Without first establishing that this model fit well in the groups and in the pooled groups, it would be untenable to look at differences between groups, let alone proceed to test the same measurement configuration. Of course, this study looked at two models specifically, and invariance was of interest for both. Therefore, the next research question addressed was: Do the data fit the linear confirmatory model implied by the blueprint (content strands) for the test?

As was the case with the single factor model, Table 11 indicates that all of the groups showed adequate fit to the blueprint based model. Again, the poorest fit appeared in the larger census population. The differences are negligible, and there is no valid test to determine whether the differences in absolute fit differ across the four non-nested groups, but the groups did share a common model parameterization.

Does one of the models fit significantly better than the other model?

The model fit indices in each of the groups, and in the pooled groups, showed remarkable similarity across the two models. In fact, in terms of parsimony, it seems that little is gained by adding the additional parameters needed to represent the blueprint based test design.
Therefore, it seems that while the content strands serve as usable reporting categories for grouping items together by content specification, they do little to improve the fidelity of the measurement as a whole. Given that the single factor model will provide a more reliable underlying construct, it would make more sense to go with the single factor representation as a matter of model parsimony.

Do the aforementioned models exhibit measurement invariance across groups/study conditions?

In this study, due to the dichotomous nature of the indicator variables and the estimation method employed by Mplus, it was not feasible to proceed to tests of strict and metric invariance, as the IRT nature of the underlying models also does not make it possible to impose those constraints in the same way. The study found support for configural and scalar invariance of both models across the groups (see Table 12). The additional constraint of factor loading equality produced a significant difference test against the configural model. Taken together, it appears that this measure is invariant across these groups in both model configurations. Therefore, the suggestion is that one can treat these as parallel measures, and mean differences and comparisons on the latent variables are permitted.

Implications

The development of parallel assessments that take curriculum into account is expensive, and it is likely that many agencies will attempt to use a more cost-effective test-retest strategy for computing growth. Such a simple approach carries many assumptions, the most important being that the latent trait(s) or scores produced by the measurement must hold a high degree of generalizability across time; at the least, there should be an indication that the structure of the assessment, as intended, is sensible for all of the groups being assessed. The degree to which the configural and scalar invariance assumptions hold either inhibits or enables the types of generalizations one would wish to make in a gain-score paradigm. The results of this study seem to indicate that it is feasible to at least use this particular instrument as a pre-test and post-test measure, and that differences noted are true differences in student ability rather than an artifact of the underlying inter-item correlations manifesting themselves in different structures over time. In fact, what has been found in the current study is that the tetrachoric correlation matrix, in which the correlations among the latent variables underlying the items (as expressed by the binary indicators) are captured, is consistent over time and appears not to be sensitive to instruction or continued learning. In essence, the conclusion is that the instruments are invariant over time and, therefore, the latent factors can be compared across groups.

Table 13 - Outcome Variable Group Differences

           N      Mean Scale Score   Scale Score SD   % Proficient
Grade 8    1423   814.96             29.39            65.5
Grade 9    547    824.91             33.72            76.6
Grade 10   644    821.04             28.75            79.3

Table 13 depicts the performance of the groups on the outcome measure. With the exception of the grade 9 versus grade 10 comparison on percent proficient, all other differences are statistically significant. In addition, if the student scores are submitted to the performance level change matrix presented in Table 7, the results are as presented in Figure 1.
Perhaps the biggest surprise in that result set was the finding that, after three additional years of instruction, there were still nearly 10% of 10th grade students whose scores had decreased markedly from when they originally sat for the assessment as incoming 8th grade students. The finding for the 9th grade students is similar. Is this a function of decay? Is the decay from lack of use of the skill set? It would have been interesting to explore these students' course-taking activity following grade 8 to determine which, if any, took a very minimal approach to furthering themselves in the mathematical areas. While those results are troubling, what is even more shocking is that at the end of 8th grade, the students who were only five months removed from their initial assessment largely failed to show growth. More than 70% of these students exhibited significant decline, decline, or merely maintained their standing along the continuum. One thing that can be confirmed is that all of these 8th grade students were enrolled in pre-algebra or algebra at the time of this study. That is, all were on either a standard or advanced curriculum; no students were being instructed off grade level.

Figure 1 - Performance Level Change by Study Condition (percent of students in each performance level change category — SD, D, M, I, SI — for the Grade 8, Grade 9, and Grade 10 study groups)

Figure 1 presents some pretty dire results, especially when combined with the finding of invariance across conditions. Supposedly, these are real differences: the declines represent a true decline in student standing on the same underlying construct measured at the initial testing session. However, although the Mplus results suggested invariance, other sources of information suggest otherwise. To further investigate the invariance of the instrument and its sensitivity to the inherent differences between groups, instructional and otherwise, additional analyses were conducted to explore the phenomenon. An interesting artifact of the approach taken with Mplus on the single factor model is that the model is actually a two-parameter logistic item response theory model. Therefore, the Mplus program, for the single factor model, suggested that there was invariance of the 2PL IRT model across the groups. A slightly different approach to fitting the 2PL model to multiple groups was taken to determine whether the same finding held; that is, is there invariance to the extent that we can be comfortable we are measuring the same thing? To accomplish this, a multiple-group run of the two-parameter model was carried out using the IRTPRO application, version 2.1.1 (SSI, 2011). The original census population anchors the parameter estimation, with the other groups placed on the same scale via the concurrent run. Figure 2 presents the test characteristic curves for the four groups; in the figure, group 1 is the 8th grade, group 2 the 9th grade, group 3 the 10th grade, and group 4 the original census group. As can be seen, while the chart suggests similar performance for the study conditions, it also shows a rather dramatic difference in TCCs between the study conditions and the original census.

Figure 2 - Multigroup IRT Test Characteristic Curves (test characteristic, i.e., expected score, plotted against theta for groups 1 through 4)

Test information curves for the groups are also presented, along with their reciprocal standard error curves (see Figure 3).
These provide a slightly different picture in that they reflect a combination of the ability distribution of each group and its alignment with the item difficulties of the assessment. Figure 3 is based on the notion that the census scale is the appropriate parameterization. What is suggested is that the test information function for all of the groups is centered just below the origin of the theta scale (approximately -0.5 on the logit scale). Scaling in IRTPRO is accomplished by assuming a theta distribution with a mean of 0 and a standard deviation of 1; therefore, the assessment is centered just below the average value, which was the calibration goal.

Figure 3 - Test Information Functions from the Multiple Group IRT Run

Appendices B and C present the IRT calibration results for all 51 items for each of the groups in two ways: Appendix B contains the item characteristic curves by group, whereas Appendix C is a tabular presentation of the item parameters by group (i.e., the data driving the charts in Appendix B). There are several places where parameters depart from a consensus value; table cells highlighted in yellow denote items departing from expectation. In this case, it seems reasonable to assume that item difficulty, if anything, would decrease over time rather than increase.
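For readers who wish to reproduce curves like those in Figures 2 and 3 from tabulated 2PL item parameters such as those in Appendix C, the following is a minimal sketch; the parameter arrays here are random placeholders, not the estimates from this study, and the function names are mine.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL probability of a correct response for each item at each theta value."""
    return 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))

def tcc(theta, a, b):
    """Test characteristic curve: expected raw score at each theta (Figure 2 analogue)."""
    return p_2pl(theta, a, b).sum(axis=1)

def test_information(theta, a, b):
    """Test information under the 2PL: sum over items of a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return (a[None, :] ** 2 * p * (1.0 - p)).sum(axis=1)

theta = np.linspace(-3, 3, 121)

# Placeholder parameters for a 51-item test (NOT the Appendix C estimates).
rng = np.random.default_rng(1)
a = rng.uniform(0.5, 2.0, 51)
b = rng.normal(0.0, 1.0, 51)

expected_score = tcc(theta, a, b)
standard_error = 1.0 / np.sqrt(test_information(theta, a, b))  # reciprocal SE curve, Figure 3 analogue
```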
These deviations could be an artifact of the scaling technique, but the trace curves in Appendix B do suggest some large item-level differences that are likely cancelled out, much as differential item functioning can cancel across items so that no obvious bias appears at the test level even though it exists at the item level. Purifying the measurement instrument is going to be key to obtaining truly invariant measures over time. Each content area should probably also be expanded a bit by adding more items to the content strands and testing those out as intact tests.

I believe the type and level of inferences that will be required of growth modeling in the future will go well beyond the current norm and will stretch validity inferences thinner than ever. To me, that is why it is imperative that we take a step backward before we rush forward and make too many assumptions. These are not arbitrary test scores. They relate to student standing, to school standing, to district standing, and to teacher standing, and to a great extent they bear on the future livelihood of the students as well as of those whose jobs depend on evaluations based largely on assessment data.

APPENDICES

Appendix A

Scale Score Distributions (Pre and Post) for each of the Study Groups

Figure 4 - Post-test Scale Score Distribution (Grade 8)
Figure 5 - Post-test Performance Level Frequencies (Grade 8)
Figure 6 - Pre-test Scale Score Distribution (Grade 8)
Figure 7 - Pre-test Performance Level Frequencies (Grade 8)
Figure 8 - Post-test Scale Score Distribution (Grade 9)
Figure 9 - Post-test Performance Level Frequencies (Grade 9)
Figure 10 - Pre-test Scale Score Distribution (Grade 9)
Figure 11 - Pre-test Performance Level Frequencies (Grade 9)
Figure 12 - Post-test Scale Score Distribution (Grade 10)
Figure 13 - Post-test Performance Level Frequencies (Grade 10)
Figure 14 - Pre-test Scale Score Distribution (Grade 10)
Figure 15 - Pre-test Performance Level Frequencies (Grade 10)

Appendix B

Item Characteristic Curves by Group

Figure 16 - Item Characteristic Curves by Group (one panel per item, VAR2 through VAR52, each showing the category 0 and category 1 trace lines against theta for Groups 1 through 4)
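As a bridge between Appendix B and Appendix C, the following minimal sketch shows how a single panel of Figure 16 can be regenerated from the tabulated values in Table 14, using Item11 as an example because its census parameters differ sharply from those of the study groups. The logistic 2PL form with no guessing parameter is assumed, and only the correct-response (category 1) curves are drawn.

import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(-3, 3, 121)

# 2PL (a, b) parameters for Item11 as reported in Table 14 (Appendix C).
item11_params = {
    "Group 1 (Grade 8)":  (1.07, -0.88),
    "Group 2 (Grade 9)":  (0.99, -1.00),
    "Group 3 (Grade 10)": (1.05, -1.01),
    "Group 4 (Census)":   (0.38, 3.11),
}

for label, (a, b) in item11_params.items():
    # Probability of a correct response under the 2PL model.
    prob = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    plt.plot(theta, prob, label=label)

plt.xlabel("Theta")
plt.ylabel("Probability of correct response")
plt.title("Item11 item characteristic curves by group (values from Table 14)")
plt.legend()
plt.show()

Looping the same block over the remaining rows of Table 14 would regenerate the full set of panels summarized in Figure 16.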
Appendix C

IRT Calibration Values by Group

Table 14 - IRT Calibration Values by Group (a = 2PL discrimination parameter, b = 2PL difficulty parameter)

Item    Census a  Census b  Gr8 a  Gr8 b  Gr9 a  Gr9 b  Gr10 a  Gr10 b
Item1   0.68  -0.14  0.59  -0.13  0.78  -0.48  0.44  -0.72
Item2   0.37   0.87  0.44   0.97  0.2    1.83  0.43   2.27
Item3   1.46  -1.01  1.56  -1.18  1.73  -1.43  1.42  -1.4
Item4   0.78   0.96  0.91   0.13  0.91   0.24  1.2    0.43
Item5   1.55  -0.37  1.53  -0.55  1.41  -0.67  1.25  -0.81
Item6   0.53   1.07  0.53   0.92  0.68   1.23  0.47   1.57
Item7   1.15  -0.08  0.96  -0.17  1.15  -0.28  1.24  -0.31
Item8   0.9    0.24  0.98   0.1   1.03  -0.39  0.8   -0.43
Item9   0.76   0.2   0.78   0.28  0.59   0.58  0.9    0.45
Item10  1.77   0.03  1.9   -0.27  1.51   0     1.49   0.4
Item11  0.38   3.11  1.07  -0.88  0.99  -1     1.05  -1.01
Item12  0.89  -0.78  1.18   0.25  0.79   0.42  0.82   0.72
Item13  0.52   2.01  0.49   2.78  0.7    2.45  1      1.55
Item14  1.38  -0.44  0.73  -0.24  0.64  -0.22  0.78  -0.17
Item15  1.01   1.31  0.6    1.22  0.91   1.09  0.81   1.33
Item16  1.2    0.12  1.77  -0.61  1.58  -0.84  2.19  -0.75
Item17  1      0.07  1.22   0.18  1.16   0.17  1.39   0.25
Item18  0.88  -0.31  1.93  -0.6   1.99  -1     2     -1.13
Item19  1.47  -0.5   1.09  -0.21  1.05  -0.55  1.03  -0.82
Item20  0.46   0.56  1.15  -0.68  1.48  -0.82  1.16  -0.96
Item21  1.29  -0.87  1.77  -0.53  1.71  -0.5   1      0.2
Item22  1.41  -0.73  0.49   0.7   0.69   0.44  0.41   1
Item23  1.97  -0.46  1.63  -0.82  1.76  -0.78  2.18  -0.84
Item24  0.69   1.18  1.79  -0.76  2.05  -0.76  1.93  -0.81
Item25  1.97  -0.32  0.58   0.75  0.64   0.93  0.87   0.68
Item26  2.1   -0.51  1.06  -0.12  0.79  -0.11  1.02  -0.14
Item27  1.53  -0.74  0.56   1.48  0.7    1.2   0.81   0.88
Item28  1.56  -1.03  2.28  -0.61  2.2   -0.45  2.35  -0.54
Item29  0.77   0.15  1.41  -0.05  1.46  -0.07  1.28   0.01
Item30  1.25  -0.28  2.24  -0.47  2.13  -0.43  2.3   -0.47
Item31  1.55  -0.95  2.51  -0.69  2.43  -0.53  2.22  -0.6
Item32  1.78  -0.41  1.95  -0.77  2.17  -0.79  2.23  -0.76
Item33  1.58  -0.86  2.07  -0.99  1.85  -1.02  1.95  -1.09
Item34  1.59  -0.71  1.03  -0.01  1.15  -0.02  1.24   0.06
Item35  1.22  -0.08  1.42  -0.4   2.1   -0.4   1.62  -0.52
Item36  0.94   0.43  1.79  -0.92  1.94  -0.88  2.28  -0.78
Item37  1.44  -0.7   2.01  -0.49  1.99  -0.27  2.26  -0.34
Item38  1.62  -0.35  2.02  -0.8   1.84  -0.77  2.25  -0.76
Item39  1.21  -0.37  1.8   -0.65  1.86  -0.54  2.12  -0.66
Item40  0.57   0.45  1.26  -0.05  1.66  -0.04  1.48  -0.07
Item41  0.47   2.05  1.08   0.26  0.97   0.18  1.39   0.22
Item42  0.16   4.63  1.64  -0.75  1.98  -0.62  2.33  -0.65
Item43  0.65   1.12  1.6   -0.35  2.25  -0.32  2.29  -0.36
Item44  0.99   0.84  1.32  -0.34  1.46  -0.32  1.71  -0.55
Item45  0.83  -0.14  0.54   0.61  0.93   0.33  1      0.01
Item46  0.85   0.13  0.53   1.79  0.7    1.15  0.7    1.31
Item47  0.76   0.82  0.25   2.89  0.37   1.59  0.36   1.85
Item48  0.56   0.22  0.75   0.6   0.96   0.64  0.98   0.43
Item49  1.44  -0.32  1.17   0.55  1.14   0.47  1.15   0.69
Item50  0.61   0.82  2     -0.39  2.06  -0.36  1.54  -0.73
Item51  1.14   0.08  1.36   0.9   1.3    1.04  1.2    1.07