This is to certify that the dissertation entitled

AN INVESTIGATION OF USING COLLATERAL INFORMATION TO REDUCE EQUATING BIASES OF THE POST-STRATIFICATION EQUATING METHOD

presented by Sungwom Ngudgratoke has been accepted towards fulfillment of the requirements for the Ph.D. degree in Measurement and Quantitative Methods.

Major Professor's Signature

Date

AN INVESTIGATION OF USING COLLATERAL INFORMATION TO REDUCE EQUATING BIASES OF THE POST-STRATIFICATION EQUATING METHOD

By

Sungwom Ngudgratoke

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Measurement and Quantitative Methods

2009

ABSTRACT

AN INVESTIGATION OF USING COLLATERAL INFORMATION TO REDUCE EQUATING BIASES OF THE POST-STRATIFICATION EQUATING METHOD

By

Sungwom Ngudgratoke

In many educational assessment programs, the use of multiple test forms developed from the same test specification is very common, because requiring different examinees to take different forms of the same test makes it possible to maintain the security of the test. When multiple test forms are used, it is necessary to make the assessment fair to all examinees by using a statistical procedure called "equating" to adjust for differences in the test forms. If equating is successfully carried out, equated scores are comparable as if they were from the same test form.

Two commonly used observed score equating methods that use the Non-Equivalent Groups with Anchor Test (NEAT) design to collect equating data are the chain equating (CE) method and the post-stratification equating (PSE) method. It has been documented that the CE method produces smaller equating biases than the PSE method when two groups of examinees differ greatly in abilities. Therefore, the CE method has been used more widely in practice, even though the PSE method is more theoretically sound than the CE method. The larger equating biases are due to the fact that the anchor test score fails to remove unintended differences between groups of examinees.

Aiming to reduce equating biases of the PSE method, this study used collateral information about examinees, rather than a single variable such as the anchor test score or the anchor test true score, as a new way to construct synthetic population functions. Collateral information used in this study included the anchor test score, sub-scores, and examinees' demographic variables. This study investigated two different methods of using such collateral information about examinees to improve equating results of the PSE method: the propensity score method (Rosenbaum & Rubin, 1983) and the multiple imputation method (Rubin, 1987).
Both simulation data and empirical data were used to develop the equating function and to explore whether it was feasible to use collateral information to reduce equating biases under different conditions, including test length, group differences, and missing data treatment. The results from the simulation data show that sub-scores, or sub-scores combined with other collateral information in the form of propensity scores, had the potential to reduce equating biases for long tests when there were group differences in abilities. For the multiple imputation method, however, it was the demographic variables that had the potential to reduce equating biases.

Copyright by
Sungwom Ngudgratoke
2009

ACKNOWLEDGEMENTS

My deepest gratitude goes to Dr. Mark D. Reckase. His wisdom inspired me to come to Michigan State University to pursue my doctoral degree. Without his guidance, support, and insightful comments, this work would not have been possible. I would like to thank the members of my dissertation committee, Dr. Richard T. Houang, Dr. Kimberly S. Maier, and Dr. Alexander Von Eye, for their helpful suggestions on this study. I also would like to thank the Royal Thai government for giving me financial support for my graduate study at MSU. I would like to thank Mike Sherry and Dipendra Subedi for their comments on early versions of my dissertation. Finally, I wish to thank my parents, grandmother, brother, and sister for their love, patience, and encouragement.

TABLE OF CONTENTS

LIST OF TABLES ... ix
LIST OF FIGURES ... x

CHAPTER I INTRODUCTION
1.1 Background ... 1
1.2 Research Questions ... 5

CHAPTER II LITERATURE REVIEW
2.1 Collateral Information ... 8
2.2 Equating ... 9
2.3 The Equipercentile Equating Function in Observed-Score Equating ... 10
2.4 The Non-Equivalent Groups with Anchor Test Design (NEAT) ... 11
2.5 The Importance of the Synthetic Population ... 13
2.6 Presmoothing the Score Distribution ... 14
2.7 Post-Stratification Equating Method (PSE) ... 16
2.8 Subscore Estimation ... 20
2.9 Propensity Score ... 22
2.10 Multiple Imputation Method ... 24
2.10.1 Missing Data Mechanism ... 26
2.10.2 Introduction to the EM Algorithm ... 27
2.11 The Statement of the Research Problem and Its Solution ... 28
2.12 The Goal of This Study and the Evaluation Indices ... 30

CHAPTER III RESEARCH METHOD
3.1 Research Design ... 32
3.1.1 Test Length ... 32
3.1.2 Ability Differences between Groups of Examinees ... 33
3.1.3 Missing Data Treatment ... 33
3.2 Data Simulation Procedure ... 34
3.2.1 Item Parameter Generation ... 35
3.2.2 θ Parameter Generation ... 37
3.2.3 Item Response Generation ... 39
3.3 Missing Data Generation ... 40
3.3.1 Pseudo Test Data and Missing Data Generation for the 60-Item Test ... 41
3.3.2 Pseudo Test Data and Missing Data Generation for the 40-Item Test ... 41
3.4 Analytic Strategies for the Proposed Methods ... 43
3.4.1 Prediction of Score Frequencies of Missing Data ... 43
3.4.1.1 The Propensity Score Approach to the Prediction of Missing Data ... 44
3.4.1.2 The Multiple Imputation Approach to the Prediction of Missing Data ... 46
3.4.2 Test Equating Procedure ... 47
3.4.2.1 Sub-score Estimation ... 48
3.4.2.2 Place the Estimated Sub-scores on the Same Scale ... 49
3.4.2.3 Estimate Propensity Scores ... 49
3.4.2.4 Construct Synthetic Population Functions ... 50
3.4.2.5 Equate Test Scores ... 51
3.4.2.6 Evaluation Criteria ... 52
3.5 Real Data Analysis ... 53

CHAPTER IV RESULTS
4.1 The Results from Simulation Data ... 61
4.1.1 The Propensity Score Method: Standard Errors of Equating ... 61
4.1.2 The Propensity Score Method: Equating Biases ... 66
4.1.3 The Propensity Score Method: Predictions of Score Frequencies ... 69
4.1.4 The Propensity Score Method: Freeman-Tukey (FT) Residuals ... 78
4.1.5 The Multiple Imputation Method: Standard Errors of Equating ... 78
4.1.6 The Multiple Imputation Method: Equating Biases ... 83
4.1.7 The Multiple Imputation Method: Predictions of Score Frequencies ... 88
4.1.8 The Multiple Imputation Method: Freeman-Tukey (FT) Residuals ... 96
4.2 The Results from Empirical Data ... 96
4.2.1 The Propensity Score Method: Standard Errors of Equating ... 97
4.2.2 The Propensity Score Method: Equating Biases ... 100
4.2.3 The Multiple Imputation Method: Standard Errors of Equating ... 102
4.2.4 The Multiple Imputation Method: Equating Biases ... 104

CHAPTER V SUMMARY AND DISCUSSION
5.1 Background ... 108
5.2 Study Design ... 111
5.3 Results from the Simulation Study ... 113
5.4 Results from the Empirical Data Study ... 117
5.5 Discussion ... 118
5.6 Implications ... 128
5.7 Limitations ... 128
5.8 Future Direction ... 129

APPENDICES ... 132
Appendix A: FT Residuals for the Propensity Score Method ... 133
Appendix B: FT Residuals for the Multiple Imputation Method ... 149
Appendix C: WinBUGS Code for Test Form 1 ... 165

REFERENCES ... 167

LIST OF TABLES

Table 1: Non-Equivalent Groups with Anchor Test (NEAT) Design ... 12
Table 2: Correlation Coefficients Among θ Parameters from WinBUGS (Test Form 1) ... 37
Table 3: Correlation Coefficients Among θ Parameters from WinBUGS (Test Form 2) ... 38
Table 4: Average θ Parameters from WinBUGS ... 39
Table 5: Descriptive Statistics for Scores on the Simulated Test Form 1 ... 42
Table 6: Descriptive Statistics for Scores on the Simulated Test Form 2 ... 43
Table 7: Distributions of Empirical Items ... 54
Table 8: Test Score Performance of Six Countries ... 55
Table 9: Descriptive Statistics for the Empirical Data ... 55
Table 10: Correlations Between the Operational Test Score and Sub-Scores ... 56
Table 11: Descriptive Statistics for Empirical Data (Group Differences) ... 56
Table 12: Correlations Between the Operational Test Score and Sub-Scores (Group Differences) ... 56

LIST OF FIGURES

Figure 4.1a: PS Standard Errors of Equating: Long Test and No Group Differences ... 63
Figure 4.1b: PS Standard Errors of Equating: Long Test and Group Differences ... 63
Figure 4.1c: PS Standard Errors of Equating: Short Test and No Group Differences ... 64
Figure 4.1d: PS Standard Errors of Equating: Short Test and Group Differences ... 64
Figure 4.2a: PS Equating Biases: Long Test and No Group Differences ... 67
Figure 4.2b: PS Equating Biases: Long Test and Group Differences ... 67
Figure 4.2c: PS Equating Biases: Short Test and No Group Differences ... 68
Figure 4.2d: PS Equating Biases: Short Test and Group Differences ... 68
Figure 4.3a: PS Chi-Square Statistics: Long Test and No Group Differences ... 71
Figure 4.3b: PS Chi-Square Statistics: Long Test and Group Differences ... 71
Figure 4.3c: PS Chi-Square Statistics: Short Test and No Group Differences ... 72
Figure 4.3d: PS Chi-Square Statistics: Short Test and Group Differences ... 72
Figure 4.4a: PS Likelihood Ratio (LR) Chi-Square Statistics: Long Test and No Group Differences ... 76
Figure 4.4b: PS Likelihood Ratio (LR) Chi-Square Statistics: Long Test and Group Differences ... 76
Figure 4.4c: PS Likelihood Ratio (LR) Chi-Square Statistics: Short Test and No Group Differences ... 77
Figure 4.4d: PS Likelihood Ratio (LR) Chi-Square Statistics: Short Test and Group Differences ... 77
Figure 4.5a: MI Standard Errors of Equating: Long Test and No Group Differences ... 81
Figure 4.5b: MI Standard Errors of Equating: Long Test and Group Differences ... 81
Figure 4.5c: MI Standard Errors of Equating: Short Test and No Group Differences ... 82
Figure 4.5d: MI Standard Errors of Equating: Short Test and Group Differences ... 82
Figure 4.6a: MI Equating Biases: Long Test and No Group Differences ... 85
Figure 4.6b: MI Equating Biases: Long Test and Group Differences ... 85
Figure 4.6c: MI Equating Biases: Short Test and No Group Differences ... 86
Figure 4.6d: MI Equating Biases: Short Test and Group Differences ... 86
Figure 4.7a: MI Chi-Square Statistics: Long Test and No Group Differences ... 90
Figure 4.7b: MI Chi-Square Statistics: Long Test and Group Differences ... 90
Figure 4.7c: MI Chi-Square Statistics: Short Test and No Group Differences ... 91
Figure 4.7d: MI Chi-Square Statistics: Short Test and Group Differences ... 91
Figure 4.8a: MI Likelihood Ratio (LR) Chi-Square Statistics: Long Test and No Group Differences ... 94
Figure 4.8b: MI Likelihood Ratio (LR) Chi-Square Statistics: Long Test and Group Differences ... 94
Figure 4.8c: MI Likelihood Ratio (LR) Chi-Square Statistics: Short Test and No Group Differences ... 95
Figure 4.8d: MI Likelihood Ratio (LR) Chi-Square Statistics: Short Test and Group Differences ... 95
Figure 4.9a: PS Standard Errors of Equating: No Group Differences ... 99
Figure 4.9b: PS Standard Errors of Equating: Group Differences ... 99
Figure 4.10a: PS Equating Biases: No Group Differences ... 101
Figure 4.10b: PS Equating Biases: Group Differences ... 101
Figure 4.11a: MI Standard Errors of Equating: No Group Differences ... 103
Figure 4.11b: MI Standard Errors of Equating: Group Differences ... 103
Figure 4.12a: MI Equating Biases: No Group Differences ... 106
Figure 4.12b: MI Equating Biases: Group Differences ... 106

CHAPTER 1

INTRODUCTION

1.1 Background

Equating is a statistical procedure that is used to adjust scores on test forms
(e.g., X and Y) so that scores on the forms can be used interchangeably (Kolen & Brennan, 2004), as if the scores were from the same test form. Without equating, examinees taking the easier test form would have an unfair advantage. When test score equating is performed, standard errors of equating should be estimated to quantify equating errors (AERA, APA, & NCME, 1999). Accurate equating results not only facilitate test score interpretations but also enhance fair comparisons across individuals, states, and countries.

The Non-Equivalent Groups with Anchor Test (NEAT) design is commonly used in equating practice because the anchor test score (A) can adjust for preexisting differences between examinees. In this design, the two operational tests to be equated, X and Y, are given to two samples of examinees from potentially different test populations (referred to as P and Q). In addition, an anchor test, A, is given to both samples from P and Q. In observed score equating, when equating data are collected through the NEAT design, test score equating can be performed using a variety of observed score equating methods, such as the Tucker method, the Levine observed score method, the Levine true score method, the post-stratification method, the Braun and Holland linear method, and the chain equipercentile method. Among these methods, the post-stratification equating (PSE) method (von Davier, Holland, & Thayer, 2004), which is also called the "frequency estimation method" (Kolen & Brennan, 2004), and the chain equipercentile (CE) method are two important methods commonly used in practice (Holland, Sinharay, von Davier, & Han, 2007).

The role of an anchor test (A) in the PSE method is not only to remove differences between P and Q but also to estimate score frequencies of the designed missing data in the NEAT design, so that the synthetic population functions required to derive comparable scores using the equipercentile equating function can be constructed (Braun & Holland, 1982). The PSE method is based on a strong theoretical foundation that centers on the generalization of the equating function linking X-scores to Y-scores (Harris & Kolen, 1990; Kolen, 1992), making it more appealing than the CE method. More specifically, the equating function is computed for a single population. When P and Q differ greatly in abilities, we do not know for what population the equating function is computed. The PSE method defines the target population in the form of synthetic population functions (T), which are mixtures of both P and Q. In contrast, the CE method is considered less sound in its theoretical foundation because it does not define any synthetic population functions.

Even though the PSE method is more theoretically sound than the CE method, it produces less favorable equating functions, and researchers prefer the CE method to the PSE method. Braun and Holland (1982) noted that the PSE and CE methods give different results. The CE method is preferred over the PSE method because it produces smaller equating biases. Holland, Sinharay, von Davier, and Han (2007) and Wang, Lee, Brennan, and Kolen (2008) found that when groups differ in abilities, the PSE method produces larger equating biases but smaller standard errors of equating than does the CE method. This might be because the anchor test fails to remove the bias to which the nonequivalence of P and Q can lead. Bias due to preexisting differences between groups that cannot be removed by the anchor test precludes valid interpretations and fair uses of test scores.
To reduce equating bias when P and Q differ greatly in abilities, the propensity score (Rosenbaum & Rubin, 1983) can be a desirable method to augment the PSE method (Livingston, Dorans, & Wright, 1990). In the equating context, the propensity score is the estimated conditional probability that a subject will be assigned to a particular test form, given a vector of observed covariates (e.g., demographic variables and anchor test scores). Covariates used to estimate examinees' propensity scores are called "collateral information," which is available information about examinees in addition to their item responses (Mislevy & Sheehan, 1989). Any examinees with equal propensity scores are homogeneous in terms of the covariates. Propensity scores computed from both demographic variables and anchor test scores may be intuitively advantageous because score frequencies of missing data may be better estimated; thus, synthetic population functions might be more precisely estimated, resulting in a more accurate equating function. Because more covariates have the potential to handle missing data in the NEAT design, less equating bias is expected. Therefore, the propensity score method may be an equating method alternative to the method that uses a special anchor test (Sinharay & Holland, 2007) and to the method that uses the anchor test true score (Wang & Brennan, 2009). Sinharay and Holland's method uses an anchor test composed of a large number of medium-difficulty items, and it is appropriate only for equating that uses an external anchor test, because the special anchor test construction may not meet the test specification well (Sinharay & Holland, 2007). However, it has been found that when groups differ greatly, using a few demographic variables in combination with anchor test scores does not improve equating accuracy (Paek, Liu, & Oh, 2008). Therefore, it is necessary to find more collateral information about examinees that is available from the test to adjust for group differences.

The variable that is promising in this regard is the subscore, which is a score on a subsection of the test. For example, a test measuring mathematics proficiency may contain subsections such as algebra, functions, geometry, and number and operation, and a subscore is the score assigned to a subsection of the test. Subscore reporting usually provides more detailed diagnostic information about examinees' performance that may be useful, for example, in making individual instruction placement and remediation decisions (Tate, 2004) and in formatively supporting teaching and learning (Dibello & Stout, 2007). In equating, subscores are expected to give accurate equating functions, because high correlations between operational test scores and scores on subtests make it feasible to compute missing data on the operational test through an existing missing data treatment method such as the multiple imputation method (e.g., Rubin, 1987; Schafer, 1997). Therefore, the combination of anchor test scores, subscores, and demographic variables is worth investigating to see whether it could improve the PSE equating method. In this study, demographic variables, the anchor test score, and subscores are called collateral information.

To improve the accuracy of the PSE equating results, this study proposed two different approaches to using this collateral information to equate test scores with the PSE method: the propensity score method (Rosenbaum & Rubin, 1983) and the multiple imputation method (Rubin, 1987; Schafer, 1997).
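To make the first of these approaches concrete, the following minimal sketch (in Python) estimates propensity scores by logistic regression and stratifies examinees on them. All data, sample sizes, and covariate names here are hypothetical placeholders rather than the study's actual variables, and scikit-learn's logistic regression simply stands in for any standard estimation routine.

```python
# A minimal, self-contained sketch of the propensity score step.
# All covariate values here are simulated placeholders; in the study the
# covariates would be the anchor score, estimated subscores, and
# demographic variables, and "form" would record which operational form
# (X from population P, or Y from population Q) an examinee actually took.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

covariates = np.column_stack([
    rng.normal(10, 3, n),    # anchor test score (hypothetical scale)
    rng.normal(15, 4, n),    # a subscore (hypothetical scale)
    rng.integers(0, 2, n),   # a binary demographic indicator
])
form = rng.integers(0, 2, n)  # 0 = old form (P), 1 = new form (Q)

model = LogisticRegression().fit(covariates, form)
propensity = model.predict_proba(covariates)[:, 1]  # P(new form | covariates)

# Stratify examinees into, e.g., five propensity strata; these strata play
# the role the anchor-score levels play in the traditional PSE method.
strata = np.digitize(propensity, np.quantile(propensity, [0.2, 0.4, 0.6, 0.8]))
```

Stratifying on the estimated propensity score, rather than on the anchor score alone, is what allows several pieces of collateral information to act as a single conditioning variable.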
These two methods were used to fill in the missing data in the NEAT design. For the propensity score method, demographic variables, subscores, and the anchor test score were combined into an examinee propensity score, which replaced the anchor test score in the PSE method. For the multiple imputation method, demographic variables, subscores, and the anchor test score were used as covariates to fill in the missing data.

1.2 Research Questions

Using both real and simulated data, this study explored the feasibility of using combinations of demographic variables, anchor test scores, and subscores in two different ways to increase the precision of the PSE method. The main research question concerns the potential of subscores combined with demographic variables and an anchor test score to improve the accuracy of the PSE equating results. In this study, the "traditional PSE method" refers to the PSE method that uses the anchor test score in the PSE equating process (von Davier, Holland, & Thayer, 2004); this method is the same as the frequency estimation method (Kolen & Brennan, 2004). The PSE method that replaces the anchor test score with the anchor test true score (Wang & Brennan, 2009) is called the "modified PSE method" in this study, to contrast the original PSE method with the modified PSE method, even though Wang and Brennan originally called it the "modified frequency estimation method." The more specific research questions are as follows:

1. How accurate are the predicted score frequencies of missing data when the proposed methods are used to compute missing data?
2. How comparable are the predicted score frequencies of missing data produced by the proposed methods and those produced by the traditional and the modified PSE methods?
3. How do the proposed methods influence equating bias and standard errors of equating?
4. How comparable are the proposed methods to the traditional and modified PSE methods in terms of equating bias and standard errors of equating?

CHAPTER II

LITERATURE REVIEW

This study explored the benefits of using collateral information as an alternative approach to constructing the synthetic population functions that are required for equating test scores using the post-stratification equating (PSE) method. The PSE method is an equating method that uses the non-equivalent groups with anchor test (NEAT) design. The PSE method and the chain equipercentile equating (CE) method are two important equipercentile-function-based equating methods that use the NEAT design. This study focuses on the PSE method only, because it is developed on a more sound theoretical foundation than the CE method. Improvement of the PSE method is needed because it has been shown in the equating literature that, even though it is based on a strong theoretical foundation, it produces larger equating biases than the CE method, especially when groups differ greatly in abilities. In this regard, this study used collateral information as an alternative approach to improve equating results of the PSE method. This study explored the use of collateral information in the PSE method in two different ways, which will be explained in the next section.
This chapter reviews concepts and methodologies related to equating and the development of the PSE method, and the existing measurement and statistical developments that were used to develop the approaches to enhancing PSE equating results in this study. Specifically, there are 12 related sections in this chapter, outlined as follows:

Section 2.1 Collateral information
Section 2.2 General idea and the importance of test score equating
Section 2.3 Equipercentile equating function
Section 2.4 Basic idea of the non-equivalent groups with anchor test (NEAT) design
Section 2.5 Importance of the synthetic population function
Section 2.6 Log-linear presmoothing technique and its importance in test equating
Section 2.7 Post-stratification equating (PSE) method and synthetic population functions
Section 2.8 Method of sub-score estimation
Section 2.9 Propensity scores and the logistic regression approach to propensity score estimation
Section 2.10 Multiple imputation method
Section 2.11 Statement of the research problem
Section 2.12 Goal of this study and the evaluation indices

2.1 Collateral Information

In test score equating, only test scores are involved in the equating process. However, when groups of examinees differ greatly in terms of abilities, it is desirable that collateral information about examinees be included in the equating process to reduce biases due to the group differences (Livingston, Dorans, & Wright, 1990; Kolen, 1990). Collateral information is available information about examinees in addition to their item responses (Mislevy & Sheehan, 1989). Familiar examples include demographic variables and educational variables such as opportunity-to-learn variables and grades received. Collateral information used in this study includes subscores, demographic variables, opportunity-to-learn variables, and the anchor test score.

2.2 Equating

The use of multiple test forms of the same test is a common practice in many large-scale assessment programs because of security issues. For example, administering different forms of the same test to different groups of test takers ensures that a large portion of the items in the item bank will not be exposed to examinees. However, when multiple forms are used, it is possible that one test form could be harder than another. The existence of test forms with unequal difficulty raises a question regarding the fairness of testing, which is a major concern for most testing programs. To make assessments fair to all examinees, it is therefore necessary to adjust for unintended difficulty differences that are left imbalanced across test forms by using a statistical adjustment called equating. Equating is a statistical method used to produce scores that can be used interchangeably (Kolen & Brennan, 2004), and one of its advantages is that, when equating is successfully done, it produces comparable scores that are fair to test takers who take different test forms of unequal difficulty.

Equating and linking are two different terms that express how scores on different test forms are transformed to each other. Even though they have different meanings, both of these terms are best understood among other score transformation methods such as anchoring, calibration, statistical moderation, scaling, and prediction (Linn, 1996). While linking is a generic name for score transformation, equating is the most demanding type of linking (Linn, 1996).
Transforming scores from one test form to scores on another test form cannot be called equating if it does not satisfy the requirements of equating. That is, tests to be equated should meet the requirements stated by Lord (1980) and by Dorans and Holland (2000). Broadly speaking, tests being equated to each other should measure the same construct, have equal reliability, and also satisfy three requirements: symmetry, equity, and population invariance. Standard 4.11 of the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) requires that when test score equating procedures are used to produce comparable scores, detailed technical information about the equating method and the data collection method should be provided, and indices measuring the uncertainty in the estimated equating function should also be estimated and reported. In practice, the accuracy of equating is commonly assessed through standard errors of equating (von Davier, Holland, & Thayer, 2004; Kolen & Brennan, 2004), which reflect the degree of sampling error in equating functions. Equating biases are also indices used to assess the quality of equating.

2.3 The Equipercentile Equating Function in Observed-Score Equating

The derivation of the equipercentile equating function is detailed in this section because it is used by many observed score equating methods, such as the PSE and the CE methods. Test score equating methods can be divided into two different categories: observed score equating and item response theory (IRT) equating. This study focuses on observed score equating methods. The most important component of observed score equating methods is the equipercentile equating function (von Davier, Holland, & Thayer, 2004). The equipercentile equating function is developed by identifying scores on the new form (Y) that have the same percentile ranks as scores on the old form (X). For example, to find a Form Y equivalent of a Form X score, one starts by finding the percentile rank of the Form X score and then finds the Form Y score that has the same percentile rank. Algebraically, the equipercentile equating function for converting Y scores to X scores, e_X(y), is obtained by (Kolen & Brennan, 2004)

e_X(y) = F^(-1)[G(y)],

where x and y are, respectively, particular values of X and Y, F^(-1) is the inverse of the cumulative distribution function F of X, and G is the cumulative distribution function of Y.
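To illustrate the definition just given, the sketch below computes a discrete equipercentile conversion by evaluating the two cumulative distributions at each score point and interpolating linearly. It is a simplified illustration with simulated placeholder scores; it omits the percentile-rank continuization and smoothing steps that operational equating (Kolen & Brennan, 2004) would apply.

```python
# A minimal sketch of e_X(y) = F^{-1}[G(y)] for integer-scored forms.
# The score data are simulated placeholders, and the simple linear
# interpolation below stands in for the percentile-rank continuization
# described in Kolen and Brennan (2004).
import numpy as np

def ecdf_at(scores, points):
    """Empirical cumulative distribution evaluated at each score point."""
    scores = np.asarray(scores)
    return np.array([(scores <= p).mean() for p in points])

def equipercentile(y_scores, x_scores, points):
    G = ecdf_at(y_scores, points)  # cumulative distribution of Y
    F = ecdf_at(x_scores, points)  # cumulative distribution of X
    # For each Y score point, find the X score with the same cumulative
    # probability, interpolating linearly between adjacent X scores.
    return np.interp(G, F, points)

points = np.arange(61)  # possible scores 0..60 on a 60-item test
rng = np.random.default_rng(1)
x = rng.binomial(60, 0.60, 5000)  # old-form scores (easier form)
y = rng.binomial(60, 0.55, 5000)  # new-form scores (harder form)
equated = equipercentile(y, x, points)  # e_X(y) at each Y score point
```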
2.4 The Non-Equivalent Groups with Anchor Test Design (NEAT)

A wide range of test equating designs can be used for collecting equating data, but the non-equivalent groups with anchor test (NEAT) design (von Davier, Holland, & Thayer, 2004), which is also called the common-items non-equivalent groups design (Kolen & Brennan, 2004), is the most frequently used in practice. This is because, when groups are nonequivalent, some information is needed to adjust for group differences, and typically that information is scores on the anchor test. In the NEAT design (see Table 1), the two operational tests to be equated, X and Y, are given to two samples of examinees from potentially different test populations (referred to as P and Q). In addition, an anchor test, A, is given to both samples from P and Q. Samples from P and Q that take the test at different administrations are generally self-selected and thus might differ in systematic ways. One well-known systematic difference between P and Q is a difference in abilities. Adjustments are needed to compensate for such differences using an appropriate anchor test, which can be either an internal or an external test. An internal anchor test is part of X and Y, while an external anchor test is used only to adjust for group differences and is not used for scoring the test. It is recommended that the anchor test be proportionally representative of the two tests in content and statistical characteristics. When groups differ greatly, all equating methods tend to produce large equating errors because the anchor test fails to adjust for the group differences. In practice, the degree of precision of equating results can be assessed through estimates of standard errors of equating (SEE).

Table 1. Non-Equivalent Groups with Anchor Test (NEAT) Design

Population   Sample   X              A          Y
P            1        Observed       Observed   Not observed
Q            2        Not observed   Observed   Observed

It is suggested that, to enhance the performance of the anchor test in producing more accurate equating results, anchor tests should be specially constructed so that their correlations with the tests to be equated are maximized (Sinharay & Holland, 2006, 2007). More specifically, the anchor test can be constructed by embedding items of moderate difficulty. Doing so relaxes the requirement of equal distributions of statistical characteristics between the anchor tests and the operational tests. This newly suggested anchor test construction provides an alternative guideline for constructing the anchor test, as it is an approach for enhancing the precision of the test equating function.

2.5 The Importance of the Synthetic Population

The designed missing data that are part of the NEAT design make available a variety of equating methods that use the NEAT design. Because X is never observed for examinees in Q and Y is never observed for examinees in P (see Table 1), different types of missing data treatments are required to handle these missing data. For these reasons, there are several different methods of test score equating under the NEAT design (Holland & Dorans, 2006), such as the Tucker method and the Levine method. However, the two commonly used observed score equating methods that use the NEAT design are the chain equipercentile (CE) method and the post-stratification equating (PSE) method, which is also known as the frequency estimation method (Kolen & Brennan, 2004). These two methods differ in the way the anchor test score is used to produce an equating function and in the assumptions made to handle the missing data arising when the NEAT design is used to collect equating data (Holland, Sinharay, von Davier, & Han, 2007).

Unlike the CE method, the PSE method is considered a strong equating method because it is more sound in terms of its theoretical foundation (Harris & Kolen, 1990; Kolen, 1992), making this equating method more appealing than the CE method. Braun and Holland (1982) noted that there are theoretical problems with the CE method that center on the definition of the equipercentile equating function. That is, equipercentile relationships are defined for a particular group of examinees, as in the PSE method; however, the equipercentile relationship between X and Y for the CE method is not defined for a particular group. More specifically, the PSE method uses synthetic population functions, defined as a weighted mixture of P and Q, as an important tool to equate test scores, while the CE method does not define any synthetic population function (Kolen, 1992).
The reason for employing synthetic population functions was detailed by Braun and Holland (1982). The basic reason for this development is that the NEAT design uses "two samples of P and Q," which sometimes are nonequivalent groups; they differ to some degree depending, for example, on how well they are sampled. However, an equating function for the NEAT design is typically viewed as being defined for a single population. To obtain a "single" population for defining a single equating relationship, P and Q must therefore be combined (Kolen & Brennan, 2004). Given the appeal of the synthetic population function, this study focuses on the PSE method. To obtain a single population for defining an equating relationship, Braun and Holland (1982) used the target population (T), or the synthetic population function, which is the weighted mixture of P and Q. T is an important ingredient of the PSE method for performing equating. T is a mixture of both P and Q,

T = wP + (1 - w)Q,

where w is the weight given to P. The weight can be any number ranging from 0 to 1, but in Holland and Rubin (1982) the usual w is the proportion of the sample size from P relative to the total sample size (P + Q), defined by

w = N_P / (N_P + N_Q),

where N_P and N_Q represent the sample sizes of P and Q, respectively.

2.6 Presmoothing the Score Distribution

In computing equipercentile equating functions, estimates of the population score distributions can be used in place of the observed sample distributions. The estimated score distributions are typically much smoother than the distributions observed in the sample (Livingston, 1993). Therefore, the estimated distributions are often described as smoothed, and the process is often referred to as smoothing. The PSE method uses smoothed distributions in computing the equating relationship. Log-linear models (Holland & Thayer, 2000) are a class of smoothing models that offer the user a flexible choice in the number of parameters to be estimated from the data. The log-linear smoothing model is detailed as follows. Assume there is a random variable X that defines test form X, with possible values x_0, ..., x_J, where j indexes the possible score values, and a corresponding vector of observed score frequencies n = (n_0, ..., n_J)' that sums to the total sample size N. The vector of population score probabilities p = (p_0, ..., p_J)' is said to satisfy a log-linear model if

log_e(p_j) = a + u_j + b_j β,

where the p_j are assumed to be positive and to sum to one, b_j is a row vector of constants referred to as score functions, β is a vector of free parameters, u_j is a known constant that specifies the distribution of the p_j when β = 0, and a is a normalizing constant that ensures that the probabilities sum to one. When u_j is set to zero, the log-linear model used to fit a univariate distribution is

log_e(p_j) = a + Σ_{i=1}^{I} β_i (x_j)^i.

The terms in this model can be defined as follows: the (x_j)^i are score functions of the possible score values of test X (e.g., x_j, x_j^2, ..., x_j^I), and the β_i are free parameters to be estimated in the model-fitting process. The value of I determines the number of moments of the actual test score distribution that are preserved in the smoothed distribution. For example, if I = 4, then the smoothed distribution preserves the first, second, third, and fourth moments (mean, variance, skewness, and kurtosis) of the observed distribution.
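The univariate model above can be fit as a Poisson regression of the score frequencies on polynomial score functions, since log-linear models for frequencies are equivalent to Poisson log-linear regressions. The sketch below, with simulated placeholder data and statsmodels standing in for specialized equating software, fits the I = 4 model; by the moment-matching property of maximum likelihood for log-linear models, the fit preserves the first four moments.

```python
# A minimal sketch of univariate log-linear presmoothing: the frequencies
# n_j are modeled as Poisson counts with log mean polynomial in the score,
# which matches the model above with u_j = 0. The data are simulated
# placeholders.
import numpy as np
import statsmodels.api as sm

scores = np.arange(61)  # possible scores 0..60
raw = np.random.default_rng(2).binomial(60, 0.55, 3000)
freq = np.bincount(raw, minlength=61)  # observed frequencies n_j

I = 4
t = scores / scores.max()  # rescale to [0, 1] for numerical stability
design = sm.add_constant(np.column_stack([t**i for i in range(1, I + 1)]))

fit = sm.GLM(freq, design, family=sm.families.Poisson()).fit()
p_smoothed = fit.fittedvalues / fit.fittedvalues.sum()  # smoothed p_j
```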
Similarly, the bivariate distribution of the scores on two tests (e.g., X and Y) is given by

log_e(p_jk) = a + Σ_{i=1}^{I} β_xi (x_j)^i + Σ_{h=1}^{H} β_yh (y_k)^h + Σ_{g=1}^{G} Σ_{f=1}^{F} β_gf (x_j)^g (y_k)^f,

where p_jk is the joint score probability of the score pair (x_j, y_k), that is, score x_j on test X and score y_k on test Y. This model produces a smoothed bivariate distribution that preserves I moments in the marginal (univariate) distribution of X, H moments in the marginal (univariate) distribution of Y, and a number of cross moments (G ≤ I, F ≤ H) in the bivariate X-Y distribution.

2.7 Post-Stratification Equating (PSE) Method

The process of equating test scores using the PSE method is composed of two major steps. The first step is to estimate the score frequencies of the missing data by invoking conditional assumptions, so that score frequencies of the missing data are obtained for constructing synthetic population functions. Anchor test scores are usually used as the conditioning variable for the PSE method (von Davier, Holland, & Thayer, 2004). So in this study, the traditional PSE method refers to the PSE method that uses anchor test scores. The second step is to use the derived synthetic population functions to equate test scores on the new form (Y) to scores on the old form (X) using the equipercentile equating function. Equated scores are scores on the new form (Y) that have the same percentile ranks as scores on the old form (X).

In order to create T for X and Y (T_X and T_Y), score distributions from both populations must be known; but, as seen in Table 1, scores on X for population Q and scores on Y for population P are unavailable due to the characteristics of the NEAT design (known as designed missingness). Therefore, some statistical assumptions must be invoked to obtain the score distributions for the missing parts to be used for constructing the synthetic population. Test equating methods that use the NEAT design employ different untestable statistical assumptions about the use of anchor test data to predict the scores on the designed missing parts. The post-stratification equating (PSE) method assumes that the conditional distributions of X given the anchor test score (A) are the same across populations, and similarly for the distributions of Y given A, which is expressed by

f(x | A, P) = f(x | A, Q)     (1)

and

f(y | A, Q) = f(y | A, P).     (2)

Let f_XP be the marginal distribution of X for population P, f_XQ the marginal distribution of X for population Q, f_YQ the marginal distribution of Y for population Q, and f_YP the marginal distribution of Y for population P. Then the distributions for the synthetic population for Forms X and Y are

f(x) = w_X f_XP + (1 - w_X) f_XQ     (3)

f(y) = (1 - w_Y) f_YP + w_Y f_YQ.     (4)

The quantities f_XQ in (3) and f_YP in (4) are usually missing (unobserved) but can be obtained as follows by using assumptions (1) and (2):

f_XQ = Σ_a f(x, A = a | Q) = Σ_a f(x | A = a, P) h_aQ     (5)

f_YP = Σ_a f(y, A = a | P) = Σ_a f(y | A = a, Q) h_aP     (6)

where h_aQ and h_aP are the marginal distributions of A for populations Q and P, respectively. The expressions in (5) and (6) can be substituted into (3) and (4), correspondingly, to provide expressions for the synthetic population as follows:

f(x) = w_X f_XP + (1 - w_X) Σ_a f(x | A = a, P) h_aQ     (7)

f(y) = (1 - w_Y) Σ_a f(y | A = a, Q) h_aP + w_Y f_YQ     (8)

Then equating scores on X to scores on Y based on the synthetic population functions can be carried out using the equipercentile equating function mentioned earlier. This equating function is analogous to the equipercentile relationship for the random groups equipercentile equating function (Kolen & Brennan, 2004).
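Equations (5) and (7) amount to reweighting the conditional distribution of X given A estimated in P by the anchor distribution observed in Q. The following minimal sketch implements that step for integer-scored tests; the function name and inputs are illustrative, not the dissertation's actual code.

```python
# A minimal sketch of equations (5) and (7): estimate f(x | A = a, P) from
# the sample from P, weight it by the anchor distribution h_aQ observed in
# Q to get the unobserved f_XQ, and mix. All inputs are integer score
# vectors; n_x and n_a are the numbers of items on X and A.
import numpy as np

def synthetic_f_x(x_P, a_P, a_Q, n_x, n_a, w_X):
    # Joint frequency table of (X, A) in population P.
    joint = np.zeros((n_x + 1, n_a + 1))
    np.add.at(joint, (np.asarray(x_P), np.asarray(a_P)), 1)
    # f(x | A = a, P): normalize each anchor-score column (0 if empty).
    col = joint.sum(axis=0, keepdims=True)
    cond_P = np.divide(joint, col, out=np.zeros_like(joint), where=col > 0)
    # h_aQ: marginal anchor distribution in Q; then equation (5).
    h_aQ = np.bincount(a_Q, minlength=n_a + 1) / len(a_Q)
    f_XQ = cond_P @ h_aQ
    # Equation (7): synthetic population distribution for X.
    f_XP = np.bincount(x_P, minlength=n_x + 1) / len(x_P)
    return w_X * f_XP + (1 - w_X) * f_XQ
```

The analogous computation with the roles of P and Q reversed gives f(y) in equation (8), and the two synthetic distributions are then passed to the equipercentile function.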
Even though the PSE is theoretically a promising method, under general realistic conditions, the PSE equating relationship does not correspond to the relationship for the 18 CE method (Braun & Holland, 1982). Moreover, it produces larger equating biases than does the CE method (e.g., Wang, Lee, Brennan, & Kolen, 2008; Holland, von Davier, Sinhary, & Han, 2008). The reason for this shortcoming might be because the PSE method employs the less reasonable missing data assumption when compared to the CE method (Holland, Sinharay, von Davier, & Han, 2008) which does not require any assumption about missing data. It is later found that there is more evidence revealing that using scores on the anchor test to make the conditional assumption as a way to deal with the missing data assumption of the PSE method is less reasonable. By using anchor test true scores to replace anchor test scores, the result of the PSE method is however much improved and more accurate than that of the CE (Wang & Brennan, 2009). This method is called the modified fi'equency estimation method. Therefore, it comes to understand that when attempting to improve the equating result of the PSE method it is better to use a good variable as a conditional variable to predict frequencies of missing data in the NEAT design and anchor test true score is a promising choice. Another attempt proposed by Holland and Sinharay (2007) is to construct an anchor test with items with medium difficulty, but their method is appropriate for an external anchor test only. However, the use of anchor test true score may not be sufficient to remove biases when P and Q differ greatly. This study proposes to use subscores combined with anchor test scores in two different ways to handle missing data and group differences. The two methods will be explained in the next sections. Using both subscores and anchor test scores in this study is based on the idea that using more information to predict score frequencies of missing data is expected to increase the accuracy of equating results. By using the combination of 19 subscores and anchor test score, the prediction of score frequencies of missing data could be improved because high correlations between operational test scores and subscores ’ could have a potential to increase the accuracy of the prediction of missing data. This method is also viable when a number of good examinee demographic variables are not available to compute the propensity score (Rosenbaum & Rubin, 1982) which is a recommended conditional variable to be used to handle group differences (Livingston, Dorans, & Wright, 1992). Although using examinees’ demographic variables combined into examinees’ propensity score to adjust for group differences is recommended, a large number of demographic variables is recommended because a smaller set of demographic variables used to compute propensity scores could not add much value to the equating results (Paek, Liu, &Oh, 2008) 2. 8. Subscore Estimation There has been much interest in assisting students in determining which of the skills within a particular domain of knowledge needs improvement and numerous testing programs report subscale scores defined by the test design. Most of achievement tests have subsections and a subscore is the score assigned to a subsection of the test. 
Subscore reporting usually provides more detailed diagnostic information about examinees’ performance that may be useful, for example, in making individual instruction placement and remediation decisions (Tate, 2004) and in formatively supporting teaching and learning (Dibello & Stout, 2007). Tests with multiple subsections imply a mulidimensional structure of tests. Using scores on subtests may provide additional information about examinee performance rather than using only total test scores. To estimate and report subscores of a test, 20 sophisticated approaches such as multidimensional item response theory (MIRT) can be used. Alternatively, subscores can also be estimated using the classical test theory (CTT) where subscale scores are estimated from number-correct responses. The Haberman method (2008) of subscore estimation, which is based on CTT, was adopted in this study because it produced estimates of subscores that that were highly correlated with estimates from the MIRT approach (Haberman, & Sinharay, 2008). Moreover, it is straightforward and does not require much computation time. The methodological approach to subscore estimation is illustrated and detailed in Haberman (2008) and Sinharay, Haberman, and Puhan (2007). The Haberman method of subscore estimation is typically a regression of true subscore on both observed score and observed total score, and the linear regression of true subscore r“. on the observed subscore SX and the observed total score Sz is estimated by HEY I SXaSZ) = E(SX)+/5’(TX ISX 'SZ)[SX -E(SX)I +.3(TX ISZ 'SX)[SZ —E(Sz)] , where AUX iSX 'SZ) ___ 0(TX)[P(SX’TX)_P(TX’SZ)p(SXaSZ)I otsx)r1-p2[1—p2l=a+flx' 2.10 Multiple Imputation Method Missing data often occurs due to factors beyond the control of the researcher. Missing data may be planed. For example, they are part of the research design which is 24 similar to missing data arising due to the NEAT design. Missing data can create biases in parameter estimates that can lead to generalization problems. When missing data is serious, valid inferences regarding a population of interest cannot be made. This occurs when missing data is not missing at random. For example, it is missing in a manner which makes the sample different from the population from which it was drawn. There are several methods developed to handle missing data such as listwise deletion, pairwise deletion, and imputation of missing data (replace the missing data with estimated scores). The multiple imputation method (Rubin, 1987) has increasingly gained interested to researchers in various fields because it has been shown to produce unbiased parameter estimates (Schafer & Graham, 2002). In the multiple imputation method, missing values for any variable are predicted using existing values from other variables (covariates). The predicted values are called “imputes”, and are substituted for the missing values, resulting in a full data set called an “imputed data set.” This process is performed multiple times producing multiple imputed data sets (hence the term “multiple imputation”). The results from m imputed data sets are analyzed using standard statistical analyses and the results from m complete data sets are combined to produce inferential results. It is recommended that small number of m (e. g., 5 imputations) is adequate for multiple imputation (Fichman and Cummings, 2003) but larger is better when fraction of missing data is large (Schafer, 1997). Currently, multiple imputation procedures are more accessible to researchers. 
One can impute missing data using software such as SAS. The SAS V9.lsoftware ”(SAS Institute, 2003) has a procedure “proc mi that enables one to impute missing data easily. 25 2.10.1 Missing data mechanism Data are missing for many reasons. For example, participants dropped out from a longitudinal study, died, or refused to answer surveys. In some cases, missing data is a result of a research design itself. The data collection that uses the NEAT design is an example of a design that creates missing data. Missing data are problematic because most statistical procedures require a value for each variable. When a data set is incomplete, the data analyst has to decide how to deal with it. Causes of missing data fit into three categories, which are based on the relationship between the missing data mechanism and missing and observed values. The first is missing completely at random (MCAR). MCAR means that the missing data mechanism is unrelated to the values of any variables. The second is missing at random (MAR). MAR means that the missing values are related to either observed covariates or response variables. When missing data is MCAR or MAR, the missing mechanism is ignorable and the best method to use to impute data is the multiple imputation method that uses maximum likelihood (ML). The third is not missing at random (NMAR). NMAR means that missing values depend on missing values themselves. When missing data is NMAR, the missing data mechanism is non-ignorable. When equating data are collected using the NEAT design, examinees taking X will have missing values on Y, and examinee taking Y will have missing values on X. Missing data of the NEAT design is said to be missing by design. Conventionally, it has been assumed that this missing data mechanism is MAR (Holland & Rubin, 1982). That is, score distributions of two different groups of examinees are assumed to be identical when the anchor test score is held constant. When missing data mechanism is MAR, the 26 anchor test score can be used to fill in missing values. This assumption is feasible when there are no group differences in terms of ages or abilities. However, when groups differ greatly in abilities, this assumption is likely to fail. In the literature, certain background information has been recommend for matching observations when the anchor test score fails to reduce equating biases due to group differences (Livingston, Dorans, & Wright, & Dorans, 1992). 2.10.2 Introduction to the EM algorithm The Expectation Maximization (EM) algorithm is a very general iterative algorithm for ML estimation in complete-data problems. Analyses performed using the EM algorithm assumes that missing data are MAR (Little & Rubin, 2002). Basically, the EM algorithm employs iterative processes in which initial estimates of missing data values are obtained. Basically, the EM algorithm employs these steps: (I) replace missing values by using estimated values, (2) estimate parameters, (3) re-estimate the missing values, assuming the new parameter estimates are correct, (4) re-estimate parameters, and so fourth, iterating until convergences. Each iteration of EM consists of an E step (expectation step) and an M step (maximization step). Each step has a direct statistical interpretation. Specifically, the E step finds the conditional expectation of the missing data given the observe data and current estimated parameters, and then substitutes these expectations for the missing data. 
One of the advantages of the EM algorithm is that it can be shown to converge reliably. However, when there is a large fraction of missing information, its rate of convergence can be painfully slow.

2.11 The Statement of the Research Problem and Its Solution

It has been shown that the anchor test fails to remove equating biases when groups differ greatly in abilities. To reduce the equating biases produced by the PSE method, the anchor test score can be replaced with the anchor test true score (Wang & Brennan, 2009). Although the anchor test true score is an interesting choice in that it can reduce the equating biases of the PSE method, it is not clear whether it would produce accurate equating results when the samples from P and Q differ greatly in abilities. Group differences may arise for several reasons. For example, two samples from P and Q taking the test at different administrations are "generally self selected and thus might differ in systematic ways" (Petersen, Kolen, & Hoover, 1993). One of the well-known systematic differences between P and Q is ability differences, and therefore adjustments are needed to compensate for such differences by using an appropriate anchor test. It was found that when groups differ greatly in ability, all equating methods that use the NEAT design produce larger equating biases and standard errors of equating, because the distributions of the scores on the anchor test in the two groups are not the same (von Davier, 2003). Larger equating biases and standard errors of equating occur when the anchor test score does not adjust well for group differences (Holland & Sinharay, 2007), implying that the conditional assumptions about missing data do not hold in this case. These findings imply that using only the anchor test true score may produce less accurate equating when groups differ greatly in ability. The possible reason is that a single piece of examinee information (e.g., the anchor test score) may not be enough to adjust for group differences.

In order to obtain more accurate equating results when groups differ greatly, it is suggested that multiple pieces of examinee information, such as demographic variables, be used together with scores on the anchor test to make the conditional distribution assumptions more appropriate. Such information may be combined in the form of the propensity score (Rosenbaum & Rubin, 1983), where an examinee's propensity score is the conditional probability that the examinee will be assigned to a particular test form, given a vector of observed covariates. Paek, Liu, and Oh (2008) found that using a small number of demographic variables did not add much value to improving equating results. It has been suggested that it would be useful to choose variables that distinguish the two groups of examinees when establishing examinees' propensity scores (Livingston, Dorans, & Wright, 1990). Variables that are of interest include variables related to opportunity to learn (Kolen, 1990).

This study proposed the use of subscores combined with the anchor test score and demographic variables to improve the equating results of the PSE method. Subscores are increasingly attractive to researchers, educators, and policy makers for diagnostic purposes, but their promise has not been explored in the equating context. More accurate equating results might be obtained by using subscores, because high subscore-to-total-score correlations can produce more accurate score frequencies of missing data, which are needed for constructing synthetic population functions and deriving equating functions.
More accurate equating results might be obtained by using subscores because high subscore-to-total-score correlations can produce more accurate score frequencies for the missing data needed to construct the synthetic population functions and derive the equating functions. The use of additional collateral information about examinees, defined as information available about examinees in addition to their test scores, is an alternative to the traditional PSE method, which uses the anchor test score only. Once collateral information is available, it can be used to compute the score frequencies of the data missing by the NEAT design in two different ways. The first method combines the information into examinees' propensity scores, which replace the anchor test score in computing those score frequencies. The second method uses the observed collateral information to impute the missing data with the existing multiple imputation method. This study examined whether these two implementations improve PSE equating results.

2.12 The Goal of This Study and the Evaluation Indices

The goal of this study is to evaluate the effectiveness of using collateral information about examinees in the PSE method in two respects: the accuracy of the predicted score frequencies of the missing data, and the accuracy of equating in terms of equating biases and standard errors of equating. This investigation used both simulation data and empirical data. The research questions are as follows.

The first question is whether the additional use of collateral information by the proposed methods (the propensity score method and the multiple imputation method) improves the synthetic population functions. Assessing whether it improves the synthetic population functions amounts to assessing the improvement in the prediction of the score frequencies of the missing data by the proposed methods. This can be evaluated through the agreement between the observed score frequencies of the full data (pseudo-test data) and the predicted score frequencies of the missing data. This study adopted the agreement indices of Holland et al. (2008): the Pearson chi-square statistic and the likelihood ratio chi-square statistic.

The second research question concerns the performance of the proposed methods in terms of the accuracy of the equating functions. The investigation evaluates the equating biases and standard errors of equating. The criterion equating function used for computing the equating bias is the PSE equating function obtained from the pseudo-test data, or full data, which are the generated data without missing values.

The third research question concerns the relative performance of the proposed methods in establishing accurate equating functions. Given that the anchor test score and the anchor test true score have been used as the conditioning variables under the missing data assumptions of the traditional PSE method and the modified PSE method, respectively, this investigation compares the equating accuracy indices (equating biases and standard errors of equating) produced by the proposed methods with those produced by the other methods. The method that produces the smallest equating biases and standard errors of equating is the most appropriate for test score equating.
The criterion equating function used for computing the equating bias is the PSE equating function obtained by equating test scores using the pseudo-test data.

CHAPTER III
RESEARCH METHOD

The data analysis of this study has two parts: simulation and empirical data analyses. The simulation data were used to evaluate whether it is feasible to use collateral information about examinees to improve the accuracy of the post-stratification equating (PSE) method in terms of equating biases and standard errors of equating. Three simulation factors were investigated: ability differences, test length, and missing data treatment. For the real data analyses, group differences and missing data treatment were investigated. This chapter details the research design, data generation, and equating procedure for the simulation study. For the empirical data analysis, descriptive statistics and information about the empirical data set chosen for this study are presented. The data analyses for the empirical data were the same as those for the simulation data.

3.1 Research Design

This study examined the feasibility of using collateral information about examinees (sub-scores, the anchor test score, and examinees' demographic variables) in two different ways (the propensity score stratification method and the multiple imputation method) in an effort to improve the accuracy of the PSE method. The simulation factors were manipulated as follows.

3.1.1 Test Length

This study investigated two test lengths, 60 and 40 items, representing long and short tests respectively. Details of how the tests were generated are presented in the next section.

3.1.2 Ability Differences between Groups of Examinees

Previous studies found that group differences have substantial effects on both standard errors of equating and equating biases. This study investigated two conditions of ability differences between the two groups of examinees. In the first condition there were no ability differences between the groups. In the second condition the two groups differed greatly in ability: the group of examinees taking the second, or new, test form (the Q population) was more proficient than the group taking the old form (the P population). How the degree of group difference was manipulated is described in the next section.

3.1.3 Missing Data Treatment

The two missing data treatments used in this study were the propensity score stratification method and the multiple imputation method. Once the collateral information was obtained, it was arranged into eight different sets to investigate which set yielded the best equating results for the PSE method. As noted previously, this study proposed using collateral information about examinees in the PSE method to compute the score frequencies of the missing data; these frequencies are needed for equating test scores with the PSE method. It was of interest to examine which types of collateral information provided the greater improvement in equating accuracy. Therefore, for each of the eight design conditions, the following sets of collateral information were investigated.
• Anchor test score only (A), which is the traditional PSE method
• Anchor test true score only (T), which is the modified PSE method
• Demographic variables only (D)
• Sub-scores only (S)
• Anchor test score and demographic variables (A&D)
• Anchor test score and sub-scores (S&A)
• Sub-scores and demographic variables (S&D)
• Anchor test score, sub-scores, and demographic variables (ALL)

Crossing these eight collateral information sets with the eight design conditions (two test lengths × two ability-difference conditions × two missing data treatments) yielded 64 conditions (8 × 8 = 64). One hundred data sets were replicated for each condition for the propensity score method, but 20 data sets were replicated for the multiple imputation method.

3.2 Data Simulation Procedure

This study equated scores on two test forms (X and Y). Tests with multiple subsections imply a multidimensional structure, so it was reasonable to use a compensatory multidimensional item response theory (MIRT) model (e.g., Reckase, 1997) to generate the X and Y scores. To generate examinees' multiple test scores, or sub-scores, which are sums of correct responses within each subsection, the item parameters first had to be generated; item responses for the five subtests were then generated with the compensatory MIRT model. The item parameter generation procedure is explained in the next section. The probability of a correct response to item i by examinee j under the compensatory MIRT model can be expressed as

P(X_i = 1 \mid \boldsymbol{\theta}_j) = c_i + (1 - c_i)\,\frac{\exp(1.7\,\mathbf{a}_i'\boldsymbol{\theta}_j - b_i)}{1 + \exp(1.7\,\mathbf{a}_i'\boldsymbol{\theta}_j - b_i)},

where X_i is the score (0, 1) on item i (i = 1, ..., n), \mathbf{a}_i is the vector of item discrimination (slope) parameters, b_i is the scalar difficulty parameter for item i, c_i is the scalar guessing parameter for item i, and \boldsymbol{\theta}_j is the vector of trait parameters for person j (j = 1, ..., N).

3.2.1 Item Parameter Generation

It was assumed in this study that each of the two test forms to be equated had five content areas (subsections). Item responses for the two test forms (X and Y) were simulated using a 5-dimensional IRT model. There were 60 items in the long test condition and 40 items in the short test condition. For the long test condition, 45 items were operational test items (9 per section) and 15 items were anchor test items. For the short test condition, 30 items were operational test items (6 per section) and 10 items were anchor test items. Item parameters were generated separately for the long and short test conditions. Specifically, the slope (a), difficulty (b), and guessing (c) parameters for the compensatory MIRT model were generated using WINGEN2 (Han & Hambleton, 2007) such that a ~ LN(0, .2), b ~ N(0, 1), and c ~ BETA(8, 32), where LN(μ, σ) denotes a log-normal distribution with mean μ and standard deviation σ of the logarithm, and BETA(α, β) a beta distribution with parameters α and β. Form X and Form Y item parameters were generated with the same procedure; that is, the two sets of item parameters (for X and Y) were generated separately.

The item parameters produced by WINGEN2 had a complex structure, meaning that items measured the multiple dimensions approximately equally. This is inconsistent with item specifications in real practice, where items are usually purposely developed to measure only one dimension. Therefore, modifications were made to produce an approximately simple structure: each item was allowed to load dominantly on a single dimension only, by fixing the a-parameters corresponding to the other dimensions at .01.
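As an illustrative Python sketch (the study itself used WINGEN2 and SAS), the following draws item parameters from the distributions above, applies the simple-structure modification, and evaluates the compensatory MIRT probability; the seed, dimension assignments, and array names are assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    n_items, n_dims = 60, 5

    # Draw a complex-structure slope matrix, then force an approximate simple
    # structure: each item keeps its slope on one dominant dimension, and the
    # slopes on the remaining dimensions are fixed at .01.
    a = rng.lognormal(mean=0.0, sigma=0.2, size=(n_items, n_dims))
    b = rng.normal(0.0, 1.0, size=n_items)
    c = rng.beta(8, 32, size=n_items)
    dominant = np.repeat(np.arange(n_dims), n_items // n_dims)  # items 1-12 -> dim 1, etc.
    simple_a = np.full_like(a, 0.01)
    simple_a[np.arange(n_items), dominant] = a[np.arange(n_items), dominant]

    def mirt_prob(theta, a, b, c):
        """Compensatory MIRT (3PL-type) probability of a correct response."""
        z = 1.7 * theta @ a.T - b            # examinees x items
        return c + (1.0 - c) / (1.0 + np.exp(-z))

    theta = rng.multivariate_normal(np.zeros(n_dims), np.eye(n_dims), size=3)
    print(mirt_prob(theta, simple_a, b, c).round(3))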
For example, in the long test condition, items 1 to 12 had higher a-parameters on the first dimension than on dimensions 2 to 5, achieved by replacing the a-parameters for dimensions 2 to 5 with .01. Similarly, items 13 to 24 had higher a-parameters on the second dimension than on the other dimensions, achieved by replacing the a-parameters for dimension 1 and dimensions 3 to 5 with .01.

After the item parameters were generated, the operational tests (X and Y) and the anchor test (A) were constructed as follows. For the long test condition, the first 9 items from each of the five subsections were chosen as operational test items, so there were 45 items for X and 45 items for Y; the last 3 items from each of the five subsections were chosen as anchor test items, giving 15 anchor test items. For the short test condition, the first 6 items from each of the five subsections were chosen as operational test items, so there were 30 items for X and 30 items for Y; the last 2 items from each of the five subsections were chosen as anchor test items, giving 10 anchor test items. To obtain common item parameters for the anchor test, the Form Y generating parameters for the common items were replaced with the Form X generating parameters. This procedure was used for both the long and short test conditions.

3.2.2 θ Parameter Generation

This study did not simulate examinees' demographic variables, because the joint distribution of demographic variables and abilities was unknown. Therefore, demographic variables from the real data were used and merged with the simulated test data, as if the simulated examinees had those demographic variables. As a result, the sample sizes in this study could not be varied and were fixed at 1,361 and 1,266 for test forms 1 and 2, respectively. The five vectors of theta estimates were obtained by calibrating the empirical data with a multidimensional IRT model. Specifically, test form 1 and test form 2 were separately fitted to a 5-dimensional item response theory model using the software WinBUGS 1.4 (Spiegelhalter, Thomas, Best, & Lunn, 2003) to obtain five ability estimates per examinee. The WinBUGS code used is in Appendix C; it was modified from code written by Bolt and Lall (2003). The correlation coefficients among the estimates of the θ parameters for test forms 1 and 2 are in Table 2 and Table 3, respectively. These correlation matrices are roughly similar, indicating that the covariance structures of the test form 1 and test form 2 data are roughly comparable.

Table 2. Correlation coefficients among θ parameters from WinBUGS (test form 1)

        θ1     θ2     θ3     θ4     θ5
θ1    1.00
θ2     .09   1.00
θ3     .55    .15   1.00
θ4     .16    .10    .12   1.00
θ5     .01    .31    .14    .54   1.00

Table 3. Correlation coefficients among θ parameters from WinBUGS (test form 2)

        θ1     θ2     θ3     θ4     θ5
θ1    1.00
θ2     .21   1.00
θ3     .37    .03   1.00
θ4     .05    .24    .27   1.00
θ5     .19    .01    .10    .47   1.00

As presented in Table 4, the averages of the five ability estimates for each test form were close to zero, a consequence of the WinBUGS parameterization, which sets the means of the estimates to zero. The five vectors of theta estimates for test form 1 and the five vectors of theta estimates for test form 2 were used along with the generated item parameters described above to generate item responses for the condition with no group differences.
However, for the condition with group differences in ability, 0.5 was added to the five vectors of ability estimates for test form 2, meaning that examinees taking test form 2 were more proficient in all five dimensions than examinees taking test form 1. A mean ability difference of .5 was used because small differences would not allow the value of collateral information for reducing equating biases to be investigated; differences of about .4 or .5 in standard deviation units of θ have been used in the literature (e.g., Holland & Sinharay, 2007).

Table 4. Average θ parameters from WinBUGS

                θ1      θ2      θ3      θ4      θ5
Test form 1    0.00    0.00    0.01    0.00    0.00
Test form 2   -0.01    0.01   -0.01    0.01    0.02

3.2.3 Item Response Generation

The simulation data were generated using the SAS 9.1 software (SAS Institute, 2003). Specifically, the probability of a correct response to item i by simulated examinee j was computed with the compensatory multidimensional item response theory model (e.g., Reckase, 1997). A response vector of dichotomous item scores for each examinee was obtained by generating, for each item, a uniform random number (between 0 and 1) and comparing it with the probability that an examinee of that ability level passes the item. If the computed probability exceeded the random number, the item was scored correct (1); otherwise, it was scored incorrect (0).

This study did not generate examinees' demographic variables but used the demographic variables from the empirical data set. After the item response data were generated, they were merged with the demographic variables from the real data: the examinees' demographic variables were linked to their simulated test scores by matching on the estimates of their θ values from WinBUGS. Therefore, every simulated data set has the same demographic values, and the sample sizes were not varied. The generated test form X and test form Y data were called "pseudo-test data" and had no missing data.

To check that the data were acceptably generated, one generated data set was analyzed with an exploratory factor analysis and a confirmatory factor analysis using Mplus 5.2 (Muthén & Muthén, 2008), a structural equation modeling program. The exploratory factor analysis indicated that a 5-factor model was slightly better than a 6-factor model; for example, the BIC was 100,083.801 for the 5-factor model and 100,370.907 for the 6-factor model. Moreover, a confirmatory factor analysis of the simple structure of the test showed that the 5-factor model fit the data very well, as indicated by a small, non-significant chi-square statistic (χ² = 692.484, df = 670, p = .2658). This evidence indicated that the 5-dimensional data were reasonably generated, and it was therefore reasonable to use this data generation procedure to generate the test scores used for equating in this study.
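A minimal Python analogue of the response-generation rule described above (the study implemented it in SAS; the function and argument names are assumptions):

    import numpy as np

    def generate_responses(theta, a, b, c, rng):
        """Score each item 1 if a uniform draw falls below the MIRT probability.

        theta: examinees x dimensions; a: items x dimensions; b, c: per item.
        """
        z = 1.7 * theta @ a.T - b
        p = c + (1.0 - c) / (1.0 + np.exp(-z))   # correct-response probabilities
        u = rng.random(p.shape)                  # one uniform number per response
        return (u < p).astype(int)               # 1 = correct, 0 = incorrect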
3.3 Missing Data Generation

There were two types of simulation data in this study: complete test data and incomplete test data. Each type was used to address different research questions. The data used to address research questions 1 and 2 are the pseudo-test data (Holland, Sinharay, von Davier, & Han, 2007), which are the complete test data. A pseudo-test is constructed by pretending that each examinee has scores on both test forms, that is, that each examinee took both test form X and test form Y. This type of data was proposed and used to test the assumptions of various equating methods that use the NEAT design (Holland, Sinharay, von Davier, & Han, 2007). It was later used in the TEDS-M project to investigate the effectiveness of the balanced incomplete block design used to collect the project's international assessment data.

3.3.1 Pseudo-Test Data and Missing Data Generation for the 60-Item Test

Under the MIRT model, the Form X and Form Y generating item parameters were used to generate item responses X for P and item responses Y for P, so that each examinee in P had 105 item responses (45 items for test form X, 45 items for test form Y, and 15 anchor test items). Similarly, the Form X and Form Y generating item parameters were used to generate item responses X for Q and item responses Y for Q, so that each examinee in Q had 105 item responses. The data for P and Q were then merged and called the complete data. A completed data set was called a pseudo-test and was used as the criterion for comparisons: for example, the observed frequencies of score X for Q and the observed frequencies of score Y for P were used as the true frequencies, and the criterion equating functions used for computing biases were obtained by equating scores on the pseudo-tests (completed data). An incomplete test data set reflecting the missing data of the NEAT design was created from the pseudo-test data by deleting the first 45 item responses (X) from Q and the last 45 item responses (Y) from P; the 15 anchor test items were kept in both forms.

3.3.2 Pseudo-Test Data and Missing Data Generation for the 40-Item Test

For the 40-item test conditions, pseudo-test data and test data with missing values were generated by the same procedure as for the 60-item test conditions. Specifically, the generating item parameters for test forms X and Y were used to generate item responses X for P and item responses Y for P, and the same sets of generating item parameters were used to generate item responses X for Q and Y for Q. P and Q were then merged, pretending that each examinee had a score X and a score Y, so that each examinee had 70 item responses (30 items for test form X, 30 items for test form Y, and 10 anchor test items). To generate test data from the NEAT design, the first 30 item responses (X) were deleted from Q and the last 30 item responses (Y) were deleted from P; the remaining 10 anchor test items were kept in both forms.
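The deletion step can be sketched as follows (illustrative Python; the column layout and names are assumptions, shown for the 60-item case):

    import numpy as np

    def make_neat_incomplete(pseudo, group, n_oper=45):
        """Create the NEAT missing-by-design pattern from pseudo-test data.

        pseudo: responses laid out as [X items | Y items | anchor items];
        group:  'P' or 'Q' per examinee. Q loses its X responses, P its Y.
        """
        data = pseudo.astype(float).copy()               # floats so NaN marks missing
        data[group == "Q", :n_oper] = np.nan             # delete X responses for Q
        data[group == "P", n_oper:2 * n_oper] = np.nan   # delete Y responses for P
        return data                                      # anchor columns stay observed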
The mean (X̄), standard deviation (SD), minimum (Min.), and maximum (Max.) of the simulated test form X and Y data are presented in Tables 5 and 6, respectively. As shown in these tables, when there were group differences, the mean for test form 2 was greater than that for test form 1; for example, the mean of the 60-item test form 2 was 41.66 (SD = 2.05), compared with 35.62 (SD = 2.00) for test form 1. When there were no group differences, the scores on test forms 1 and 2 were similarly distributed.

Table 5. Descriptive statistics for the simulated test form 1

                               X̄      SD     Min.    Max.
No Ability      60 items    35.62    2.01   30.47   41.34
Differences     40 items    24.66    1.45   20.52   28.91
Ability         60 items    35.62    2.00   29.57   41.53
Differences     40 items    24.66    1.43   20.75   29.14

Table 6. Descriptive statistics for the simulated test form 2

                               X̄      SD     Min.    Max.
No Ability      60 items    36.99    2.05   29.89   44.00
Differences     40 items    23.31    1.40   18.40   28.64
Ability         60 items    41.66    2.05   34.50   48.25
Differences     40 items    26.48    1.44   21.77   31.01

3.4 Analytic Strategies for the Proposed Equating Methods

This study had two parts. The first part evaluated the prediction of the missing data from the combination of sub-scores and the anchor test score. The second part evaluated the equating results obtained with the two methods proposed in this study. Accordingly, this section comprises two subsections: the first explains the procedure for predicting the missing data and how the prediction performance of the collateral information was assessed, and the second describes how test scores were equated with the two proposed methods.

3.4.1 Prediction of Score Frequencies of Missing Data

There were two proposed ways of using sub-scores to equate test scores. The first uses the examinee's propensity score as the stratification variable in the PSE equating process, replacing the anchor test score with the propensity score. The second uses a multiple imputation method to compute the missing scores directly. When applied to equating, the two methods use different strategies to estimate the score frequencies of the missing data required for constructing the synthetic population functions.

3.4.1.1 The Propensity Score Approach to the Prediction of Missing Data

Holland, Sinharay, von Davier, and Han (2007) provided an approach to predicting the score frequencies of the missing data of the NEAT design, and their strategy was adopted in this study. The procedure is as follows. First, a logistic model is used to estimate the propensity score (Z) of each examinee, with the collateral information as predictor variables; the sub-score estimation and propensity score estimation are detailed in Sections 3.4.2.1 and 3.4.2.3, respectively. A loglinear model is then used to presmooth the bivariate distribution of (X, Z) in P and the bivariate distribution of (Y, Z) in Q, preserving the first four moments of X and Y and the covariances of X with Z and of Y with Z. The presmoothed bivariate probabilities are denoted

p_{xz} = P\{X = x, Z = z \mid P\} \quad \text{and} \quad q_{yz} = P\{Y = y, Z = z \mid Q\}.

These bivariate probabilities are used to form the marginal distributions of Z in P and Q:

h_{zP} = \sum_x p_{xz} \quad \text{and} \quad h_{zQ} = \sum_y q_{yz}.

The conditional probability P\{X = x \mid Z = z, P\} is then computed as the ratio p_{xz} / h_{zP}. The estimated conditional probabilities are used to obtain the predicted score probabilities for X in Q:

f_{xQ} = \sum_z p_{xz} \left( h_{zQ} / h_{zP} \right).

By similar reasoning, the predicted score probabilities for Y in P are

f_{yP} = \sum_z q_{yz} \left( h_{zP} / h_{zQ} \right).

The predicted frequencies of X in Q and of Y in P are, respectively, N_Q f_{xQ} and N_P f_{yP}, where N_Q and N_P are the sample sizes of Q and P. The focus of this section is to assess the agreement of N_Q f_{xQ} with the observed frequencies of X in Q (n_{xQ}) and the agreement of N_P f_{yP} with the observed frequencies of Y in P (n_{yP}).
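The post-stratification prediction just described can be sketched directly from the two presmoothed bivariate tables (illustrative Python; the array names are assumptions):

    import numpy as np

    def predict_missing_frequencies(p_xz, q_yz, n_P, n_Q):
        """Post-stratification prediction of the NEAT design's missing frequencies.

        p_xz[x, z] = P{X=x, Z=z | P};  q_yz[y, z] = P{Y=y, Z=z | Q}.
        Returns predicted frequencies of X in Q and of Y in P.
        """
        h_zP = p_xz.sum(axis=0)                    # marginal of Z in P
        h_zQ = q_yz.sum(axis=0)                    # marginal of Z in Q
        f_xQ = (p_xz * (h_zQ / h_zP)).sum(axis=1)  # f_xQ = sum_z p_xz h_zQ / h_zP
        f_yP = (q_yz * (h_zP / h_zQ)).sum(axis=1)  # f_yP = sum_z q_yz h_zP / h_zQ
        return n_Q * f_xQ, n_P * f_yP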
The criteria used for this investigation are the Pearson chi-square statistic (χ²), the likelihood ratio chi-square statistic (G²), and Freeman-Tukey (FT) residuals. In each formula below, n_i denotes an observed frequency and m_i the corresponding predicted frequency:

Pearson chi-square statistic: \chi^2 = \sum_i \frac{(n_i - m_i)^2}{m_i},

Likelihood ratio chi-square statistic: G^2 = 2 \sum_i n_i \log(n_i / m_i),

Freeman-Tukey residual: \mathrm{FT}_i = \sqrt{n_i} + \sqrt{n_i + 1} - \sqrt{4 m_i + 1}.

These three statistics are often used to measure the closeness of fitted frequencies to observed frequencies in discrete score distributions (Holland & Thayer, 2000). χ² and G² summarize the overall closeness between the observed and predicted frequencies, whereas the FT residuals assess the closeness at each score point. The FT residuals were also used to assess the rounding effect of the multiple imputation method at each score point.

3.4.1.2 The Multiple Imputation Approach to the Prediction of Missing Data

When the NEAT design is used to collect equating data, missing data occur because of the design itself: scores on test form X for Q and scores on test form Y for P are never observed. The multiple imputation method implemented as "proc mi" in SAS 9.1 (SAS Institute, 2003) was used in this study to impute these missing data. The procedure is appropriate for the NEAT design because the missing data mechanism generated by the NEAT design is assumed to be missing at random (MAR; Rubin, 1976), an ignorable mechanism; in other words, the subpopulations the two groups represent are assumed to have the same target score distribution when the anchor test score is held constant (Holland & Rubin, 1982). The EM algorithm in SAS 9.1 assumes that the missing data mechanism is MAR, and when the mechanism is ignorable the EM algorithm is appropriate for imputing the missing data. In this study, the EM algorithm implemented through proc mi was used to compute test score X for population Q and test score Y for population P. Specifically, 20 simulated data sets were used, each with 5 imputed data sets; five imputations were chosen because five are considered adequate in multiple imputation (Rubin, 1996; Schafer, 1997; Fichman & Cummings, 2003). The imputed values were then rounded: any imputed value below 0 was set to 0, and any value above the maximum score point was set to the maximum score point. The effect of rounding was trivial at the low and high ends of the score scale, as shown in the results section. For imputation i (i = 1, ..., 5) of data set j (j = 1, ..., 20), once X for Q and Y for P were predicted, the predicted score frequency distributions of X for Q and of Y for P were obtained directly; they are denoted f_{xQ} and f_{yP}, respectively. As in Section 3.4.1.1, the chi-square statistic, the likelihood ratio chi-square statistic, and the FT residuals were computed to assess the agreement between the predicted score frequencies and the observed (true) score frequencies. These statistics were averaged across the 5 imputations and 20 data sets, and the resulting averages are reported.
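A short Python sketch of the three agreement indices defined above (illustrative; the small constant guarding zero frequencies is an assumption):

    import numpy as np

    def agreement_indices(n, m, eps=1e-10):
        """Pearson chi-square, likelihood ratio G^2, and Freeman-Tukey residuals.

        n: observed frequencies; m: predicted frequencies (same length).
        """
        n = np.asarray(n, dtype=float)
        m = np.asarray(m, dtype=float)
        chi2 = np.sum((n - m) ** 2 / (m + eps))
        g2 = 2.0 * np.sum(np.where(n > 0, n * np.log((n + eps) / (m + eps)), 0.0))
        ft = np.sqrt(n) + np.sqrt(n + 1.0) - np.sqrt(4.0 * m + 1.0)
        return chi2, g2, ft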
3.4.2 Test Score Equating Procedure

This study focused on the PSE method, and the procedure for equating test scores with the methods of this study followed the steps below. The steps are based on the PSE equating procedure (von Davier, Holland, & Thayer, 2004; Kolen & Brennan, 2004), with small modifications so that the propensity scores could be included in the PSE framework. The modified steps are:

1. Estimate sub-scores using a CTT model.
2. Place the sub-score estimates on the same scale.
3. Estimate propensity scores using a logistic regression model.
4. Construct the synthetic population functions.
5. Equate the test scores using the equipercentile method.

When the multiple imputation method was used, step 3 was not necessary; it was replaced by the multiple imputation approach to score frequency estimation, and step 4 is less complicated under the multiple imputation method. The following sections give more detail on the equating procedure.

3.4.2.1 Sub-score Estimation

This study generated tests with five subsections. The sub-score estimation used here is based on classical test theory (CTT) and was adopted from Haberman (2008) and Sinharay, Haberman, and Puhan (2007). The Haberman method estimates the true sub-score by a regression on both the observed sub-score and the observed total score: the linear regression of the true sub-score \tau_X on the observed sub-score S_X and the observed total score S_Z is estimated by

L(\tau_X \mid S_X, S_Z) = E(S_X) + \beta(\tau_X \mid S_X \cdot S_Z)\,[S_X - E(S_X)] + \beta(\tau_X \mid S_Z \cdot S_X)\,[S_Z - E(S_Z)],

where

\beta(\tau_X \mid S_X \cdot S_Z) = \frac{\sigma(\tau_X)\,[\rho(S_X, \tau_X) - \rho(\tau_X, S_Z)\,\rho(S_X, S_Z)]}{\sigma(S_X)\,[1 - \rho^2(S_X, S_Z)]}

and

\beta(\tau_X \mid S_Z \cdot S_X) = \frac{\sigma(\tau_X)\,[\rho(S_Z, \tau_X) - \rho(\tau_X, S_X)\,\rho(S_X, S_Z)]}{\sigma(S_Z)\,[1 - \rho^2(S_X, S_Z)]}.

This method of true sub-score estimation gives weight to both the total score and the sub-score, and it provides a better approximation of the true sub-score than the observed sub-score alone (Haberman, 2008).

3.4.2.2 Placing the Estimated Sub-scores on the Same Scale

The true sub-score estimates from different test forms may have different scales and different meanings, especially when the two groups of examinees differ greatly in ability. It is therefore necessary to adjust the estimated sub-scores using information that is common across the two groups. In this study, the true sub-score estimates were adjusted by a covariate adjustment technique, with the anchor test score as the covariate. Specifically, the jth sub-score for the ith examinee, L(\tau_X \mid S_X, S_Z)_{ij}, was adjusted as

L_{aij} = L(\tau_X \mid S_X, S_Z)_{ij} - \beta_j (A_i - \bar{A}),

where \beta_j is the regression weight of sub-score L on the anchor test score (A), \bar{A} is the mean anchor test score, and A_i is the anchor test score of examinee i.

3.4.2.3 Estimating Propensity Scores

Examinees' propensity scores (Z) were estimated with a logistic regression model. The outcome variable was the test form (F = 0 if the form was X, and 1 otherwise), and the covariates were the five sub-scores, the demographic variables, and the anchor test score. An examinee's estimated propensity score is the predicted probability of the examinee being assigned to test form Y. The estimated propensity scores were then divided into 21 strata.
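An illustrative Python sketch of this step (scikit-learn stands in for the study's actual software, and the equal-frequency cut rule for the 21 strata is an assumption; the text does not specify how the strata were formed):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def estimate_propensity_strata(covariates, form_is_Y, n_strata=21):
        """Fit P(form = Y | covariates) and cut the scores into strata.

        covariates: examinees x predictors (five sub-scores, demographics,
        anchor score); form_is_Y: 1 if the examinee took Form Y, else 0.
        """
        model = LogisticRegression(max_iter=1000).fit(covariates, form_is_Y)
        z = model.predict_proba(covariates)[:, 1]   # estimated propensity scores
        # Equal-frequency strata (an assumed choice, not stated in the text).
        edges = np.quantile(z, np.linspace(0, 1, n_strata + 1))
        strata = np.clip(np.searchsorted(edges, z, side="right") - 1,
                         0, n_strata - 1)
        return z, strata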
3.4.2.4 Constructing the Synthetic Population

The synthetic population (T) is a mixture of P and Q, T = wP + (1 - w)Q, where w is the weight given to P. When the propensity scores (Z) were used to construct the synthetic population, the two assumptions of the post-stratification equating (PSE) method (von Davier, Holland, & Thayer, 2004; Holland & Dorans, 2006) were modified as follows:

1. The conditional distribution of X given Z over T, f(X = x | Z = z, T), is the same for any T of the form T = wP + (1 - w)Q.
2. The conditional distribution of Y given Z over T, f(Y = y | Z = z, T), is the same for any T of the form T = wP + (1 - w)Q.

Using assumptions 1 and 2, the score distributions of X and Y over T were estimated by

f(x)_T = w_X f_{xP} + (1 - w_X) \sum_z f(x \mid Z = z, P)\, h_{zQ},

f(y)_T = (1 - w_Y) \sum_z f(y \mid Z = z, Q)\, h_{zP} + w_Y f_{yQ},

where h_{zQ} = P(Z = z \mid Q) and h_{zP} = P(Z = z \mid P).

When the multiple imputation method was used to estimate the missing data, the propensity score (Z) was not needed, because multiple imputation uses the collateral information to compute the missing data for P and Q directly. In addition, the PSE assumptions above were not necessary for the multiple imputation method, since the synthetic population functions were constructed directly from the imputed X and Y scores. Specifically, the procedure "proc mi" in SAS was used to impute the missing data, with the EM algorithm, because missing at random is assumed for equating data collected with the NEAT design. Proc mi imputed the examinees' total scores X or Y, with the collateral information about the examinees as the conditioning variables. Once the missing X and Y scores were filled in, the score distributions of X and Y for the target population T were estimated by

f(x)_T = w_X f_{xP} + (1 - w_X) f_{xQ} \quad \text{and} \quad f(y)_T = (1 - w_Y) f_{yP} + w_Y f_{yQ}.

Note that f(x) and f(y) are smoothed distributions obtained by presmoothing X and Y separately with the log-linear model (Holland & Thayer, 2000); the first five moments of the score distributions were preserved to remove irregularities in the distributions of the imputed values.

3.4.2.5 Equating Test Scores

With f(x) and f(y) the distributions of the X and Y test scores, the equating that transforms an X raw score to a Y raw score was carried out using the equipercentile function

\mathrm{Equi}_Y(x) = G^{-1}(F(x)),

where F and G are the cumulative distribution functions of X and Y on the synthetic population.
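A compact Python sketch of this transformation on discrete score distributions (illustrative; it inverts G by linear interpolation rather than the continuization used in kernel equating, and the toy inputs are assumptions):

    import numpy as np

    def equipercentile(f_x, f_y):
        """Map each X score to the Y score with the same percentile rank.

        f_x, f_y: (pre)smoothed score probabilities over 0..n_X and 0..n_Y.
        """
        F = np.cumsum(f_x)                  # CDF of X on the target population
        G = np.cumsum(f_y)                  # CDF of Y on the target population
        y_points = np.arange(len(f_y), dtype=float)
        # Equi_Y(x) = G^{-1}(F(x)), approximated by interpolating G's inverse.
        return np.interp(F, G, y_points)

    f_x = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
    f_y = np.array([0.05, 0.15, 0.3, 0.3, 0.2])
    print(equipercentile(f_x, f_y))         # equated Y value for each X score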
[Figure 4.6c. MI equating biases: short test and no group differences. Equating bias plotted against score for methods A, ALL, A&D, S&D, D, S, T, and S&A, with a no-bias reference line.]

[Figure 4.6d. MI equating biases: short test and group differences. Equating bias plotted against score for the same methods.]

Figure 4.6c presents the equating biases when the test length was 40 and there were no group differences. As seen in the figure, the PSE method that uses demographic variables (D) and the method that uses the anchor test true score (T) underestimated the equating function, as evidenced by negative equating biases. The method that uses the combination of the anchor test score and demographic variables (A&D) overestimated the equating function, as indicated by positive equating biases. The PSE method that uses sub-scores (S) and the PSE method that uses sub-scores combined with demographic variables (S&D) had the smallest equating biases, close to zero. The PSE method that uses all collateral information (ALL) and the PSE method that uses sub-scores combined with the anchor test score (S&A) had comparable equating biases. The ALL, S, S&A, and S&D methods had comparable equating biases at score points from 14 to 40. The traditional PSE method (A) had small negative biases, larger than those produced by ALL, S, S&A, and S&D, although at the low end of the score scale the traditional PSE method (A) had the smallest equating biases.

Figure 4.6d shows the equating biases when the test length was 40 and there were group differences. As in Figure 4.6b, when the groups differ in ability all equating methods tended to overestimate the equating function, as evidenced by larger positive equating errors compared with those in Figure 4.6c. In this condition, the modified PSE method (T) and the PSE method that uses demographic variables (D) had the smallest equating biases.

4.1.7 The Multiple Imputation Method: Prediction of Score Frequencies of Missing Data

One objective of this study was to investigate the efficiency of the multiple imputation method in predicting the score frequencies of the missing data. As noted previously, better predicted score frequencies of the missing data result in smaller equating biases for the PSE method, so the method that best predicts those frequencies should have the smallest equating biases. As in the propensity score method section, the two statistics used to measure the closeness between the predicted frequencies and the true (simulated) frequencies were the Pearson chi-square statistic and the likelihood ratio (LR) chi-square statistic; smaller chi-square and LR statistics indicate closer agreement. Figures 4.7a-4.7d show the chi-square statistics, and Figures 4.8a-4.8d present the LR statistics. In these figures, the goodness-of-fit statistics are presented separately for the two populations (P and Q). For example, a chi-square statistic for P shows how well a given equating method predicted the score frequencies of the missing data for examinees in population P, and a chi-square statistic for Q shows the same for population Q.

Figure 4.7a compares the chi-square statistics of the different equating methods when the test length was 60 and there were no group differences. The PSE method that uses all collateral information (ALL), the PSE method that uses sub-scores (S), the method that combines sub-scores and demographic variables (S&D), the modified PSE method (T), and the method that combines the anchor test score and sub-scores (S&A) had comparable chi-square statistics, smaller than those of the other methods. The traditional PSE method (A) had total chi-square statistics larger than the modified PSE method (T), and the PSE method that uses demographic variables (D) had the largest chi-square.

Figure 4.7b compares the chi-square statistics when the test length was 60 and there were group differences. When the groups differ in ability, nearly all equating methods had larger chi-square statistics, except for the method that uses demographic variables (D), and the increases were more evident for the sample from population P, the lower-ability population. All equating methods also predicted the score frequencies of the missing data better for Q than for P, as indicated by smaller chi-square statistics for Q.
In other words, all equating methods predicted the frequencies better for the higher-ability population than for the lower-ability population. The method that combines the anchor test score and demographic variables (A&D) had the largest total chi-square, while the modified PSE method (T) and the PSE method that uses demographic variables (D) had smaller total chi-squares than the other methods. This result is consistent with the equating bias results, in which the T and D methods had the smaller equating biases.

[Figure: total chi-square statistics for populations P and Q by equating method.]

Appendix A. FT Residuals for the Propensity Score Method

[Each figure in this appendix plots FT residuals against score for populations P and Q.]

Figure A.1 No Group Differences, 45 Missing Items, Anchor Test Score (A)
Figure A.2 No Group Differences, 45 Missing Items, All Collateral Information (ALL)
Figure A.3 No Group Differences, 45 Missing Items, Anchor Test Score and Demographic Variables (A&D)
Figure A.4 No Group Differences, 45 Missing Items, Subscores and Demographic Variables (S&D)
Figure A.5 No Group Differences, 45 Missing Items, Demographic Variables (D)
Figure A.6 No Group Differences, 45 Missing Items, Subscores (S)
Figure A.7 No Group Differences, 45 Missing Items, Anchor Test True Score (T)
Figure A.8 No Group Differences, 45 Missing Items, Subscores and Anchor Test Score (S&A)
Figure A.9 Group Differences, 45 Missing Items, Anchor Test Score (A)
Figure A.10 Group Differences, 45 Missing Items, All Collateral Information (ALL)
Figure A.11 Group Differences, 45 Missing Items, Anchor Test Score and Demographic Variables (A&D)
Figure A.12 Group Differences, 45 Missing Items, Subscores and Demographic Variables (S&D)
Figure A.13 Group Differences, 45 Missing Items, Demographic Variables (D)
Figure A.14 Group Differences, 45 Missing Items, Subscores (S)
Figure A.15 Group Differences, 45 Missing Items, Anchor Test True Score (T)
Figure A.16 Group Differences, 45 Missing Items, Subscores and Anchor Test Score (S&A)
Figure A.17 No Group Differences, 30 Missing Items, Anchor Test Score (A)
Figure A.18 No Group Differences, 30 Missing Items, All Collateral Information (ALL)
Figure A.19 No Group Differences, 30 Missing Items, Anchor Test Score and Demographic Variables (A&D)
Figure A.20 No Group Differences, 30 Missing Items, Subscores and Demographic Variables (S&D)
Figure A.21 No Group Differences, 30 Missing Items, Demographic Variables (D)
Figure A.22 No Group Differences, 30 Missing Items, Subscores (S)
Figure A.23 No Group Differences, 30 Missing Items, Anchor Test True Score (T)
Figure A.24 Group Differences, 30 Missing Items, Subscores and Anchor Test Score (S&A)
Figure A.25 Group Differences, 30 Missing Items, Anchor Test Score (A)
Figure A.26 Group Differences, 30 Missing Items, All Collateral Information (ALL)
Figure A.27 Group Differences, 30 Missing Items, Anchor Test Score and Demographic Variables (A&D)
Figure A.28 Group Differences, 30 Missing Items, Subscores and Demographic Variables (S&D)
Figure A.29 Group Differences, 30 Missing Items, Demographic Variables (D)
Figure A.30 Group Differences, 30 Missing Items, Subscores (S)
Figure A.31 Group Differences, 30 Missing Items, Anchor Test True Score (T)
Figure A.32 Group Differences, 30 Missing Items, Subscores and Anchor Test Score (S&A)

Appendix B. FT Residuals for the Multiple Imputation Method

[Each figure in this appendix plots FT residuals against score for populations P and Q.]
Figure B.1 No Group Differences, 45 Missing Items, Anchor Test Score (A)
Figure B.2 No Group Differences, 45 Missing Items, All Collateral Information (ALL)
Figure B.3 No Group Differences, 45 Missing Items, Anchor Test Score and Demographic Variables (A&D)
Figure B.4 No Group Differences, 45 Missing Items, Subscores and Demographic Variables (S&D)
Figure B.5 No Group Differences, 45 Missing Items, Demographic Variables (D)
Figure B.6 No Group Differences, 45 Missing Items, Subscores (S)
Figure B.7 No Group Differences, 45 Missing Items, Anchor Test True Score (T)
Figure B.8 No Group Differences, 45 Missing Items, Subscores and Anchor Test Score (S&A)
Figure B.9 Group Differences, 45 Missing Items, Anchor Test Score (A)
Figure B.10 Group Differences, 45 Missing Items, All Collateral Information (ALL)
Figure B.11 Group Differences, 45 Missing Items, Anchor Test Score and Demographic Variables (A&D)
Figure B.12 Group Differences, 45 Missing Items, Subscores and Demographic Variables (S&D)
Figure B.13 Group Differences, 45 Missing Items, Demographic Variables (D)
Figure B.14 Group Differences, 45 Missing Items, Subscores (S)
Figure B.15 Group Differences, 45 Missing Items, Anchor Test True Score (T)
Figure B.16 Group Differences, 45 Missing Items, Subscores and Anchor Test Score (S&A)
Figure B.17 No Group Differences, 30 Missing Items, Anchor Test Score (A)
Figure B.18 No Group Differences, 30 Missing Items, All Collateral Information (ALL)
Figure B.19 No Group Differences, 30 Missing Items, Anchor Test Score and Demographic Variables (A&D)
Figure B.20 No Group Differences, 30 Missing Items, Subscores and Demographic Variables (S&D)
Figure B.21 No Group Differences, 30 Missing Items, Demographic Variables (D)
Figure B.22 No Group Differences, 30 Missing Items, Subscores (S)
Figure B.23 No Group Differences, 30 Missing Items, Anchor Test True Score (T)
Figure B.24 Group Differences, 30 Missing Items, Subscores and Anchor Test Score (S&A)
Figure B.25 Group Differences, 30 Missing Items, Anchor Test Score (A)
Figure B.26 Group Differences, 30 Missing Items, All Collateral Information (ALL)
Figure B.27 Group Differences, 30 Missing Items, Anchor Test Score and Demographic Variables (A&D)
Figure B.28 Group Differences, 30 Missing Items, Subscores and Demographic Variables (S&D)
Figure B.29 Group Differences, 30 Missing Items, Demographic Variables (D)
Figure B.30 Group Differences, 30 Missing Items, Subscores (S)
Figure B.31 Group Differences, 30 Missing Items, Anchor Test True Score (T)
Figure B.32 Group Differences, 30 Missing Items, Subscores and Anchor Test Score (S&A)

APPENDIX C: WinBUGS Code for Test Form 1

model {
  # Five-dimensional compensatory two-parameter logistic model
  for (j in 1:N) {
    for (k in 1:T) {
      logit(p[j,k]) <- a1[k]*theta[j,1] + a2[k]*theta[j,2] + a3[k]*theta[j,3]
                       + a4[k]*theta[j,4] + a5[k]*theta[j,5] + d[k]
      response[j,k] ~ dbern(p[j,k])      # item responses are supplied as data
    }
    # Prior for the person parameters
    theta[j,1:5] ~ dmnorm(mu[1:5], tau[1:5,1:5])
  }
  # Priors for the discrimination and difficulty parameters
  # (a1[1] is fixed at 0 for identification)
  a1[1] <- 0
  a2[1] ~ dnorm(0, .5)
  a3[1] ~ dnorm(0, .5)
  a4[1] ~ dnorm(0, .5)
  a5[1] ~ dnorm(0, .5)
  d[1]  ~ dnorm(0, .5)
  for (k in 2:T) {
    a1[k] ~ dnorm(0, .5)
    a2[k] ~ dnorm(0, .5)
    a3[k] ~ dnorm(0, .5)
    a4[k] ~ dnorm(0, .5)
    a5[k] ~ dnorm(0, .5)
    d[k]  ~ dnorm(0, .5)
  }
}

list(N = 1361, T = 62, mu = c(0, 0, 0, 0, 0),
     tau = structure(.Data = c(1,0,0,0,0,
                               0,1,0,0,0,
                               0,0,1,0,0,
                               0,0,0,1,0,
                               0,0,0,0,1), .Dim = c(5,5)),
     response = structure(.Data = c(
       # 1361 x 62 matrix of 0/1 item responses (not reproduced here)
     ), .Dim = c(1361, 62)))

REFERENCES

Agresti, A. (1990). Categorical data analysis. New York: John Wiley & Sons.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.

Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27(6), 395-414.

Braun, H. I., & Holland, P. W. (1982). Observed-score test equating: A mathematical analysis of some ETS equating procedures. In P. W. Holland & D. B. Rubin (Eds.), Test equating. New York: Academic Press.

Chen, H., Yan, D., Hemat, L., Han, N., & von Davier, A. A. (2008). KB & Loglin software v. 3.0 [Computer software]. Princeton, NJ: Educational Testing Service.

Clauser, B. E., Margolis, M. J., & Case, S. M. (2006). Testing for licensure and certification in the professions. In R. L. Brennan (Ed.), Educational measurement (4th ed.). New York: American Council on Education/Praeger.

von Davier, A. A. (2003). Notes on linear equating methods for the non-equivalent groups design (ETS RR-03-24). Princeton, NJ: Educational Testing Service.

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test equating. New York: Springer.

D'Agostino, R. B., & Rubin, D. B. (2006). In D. B. Rubin (Ed.), Matched sampling for causal effects. Cambridge: Cambridge University Press.

Dorans, N. J., & Holland, P. W. (2000). Population invariance and equitability of tests: Basic theory and the linear case. Journal of Educational Measurement, 37, 281-306.

Fichman, M., & Cummings, J. N. (2003). Multiple imputation for missing data: Making the most of what you know. Organizational Research Methods, 6(3), 282-308.

Haberman, S. J. (2008). When can subscores have value? Journal of Educational and Behavioral Statistics, 33(2), 204-229.

Haberman, S. J., & Sinharay, S. (2008). Subscores based on multidimensional item response theory. Princeton, NJ: Educational Testing Service.

Han, K. J., & Hambleton, R. (2007). WINGEN2: Windows software that generates IRT model parameters and item responses [Computer software]. Amherst, MA.

Holland, P. W., Sinharay, S., von Davier, A. A., & Han, N. (2007). An approach to evaluating the missing data assumptions of the chain and post-stratification equating methods for the NEAT design. Journal of Educational Measurement, 45(1), 17-43.

Holland, P. W., & Rubin, D. B. (1982). Test equating. New York: Academic Press.

Holland, P. W., & Sinharay, S. (2007). Is it necessary to make anchor tests mini-versions of the tests being equated or can some restrictions be relaxed? Journal of Educational Measurement, 44(3), 249-275.

Holland, P. W., & Thayer, D. T. (2000). Univariate and bivariate loglinear models for discrete test score distributions. Journal of Educational and Behavioral Statistics, 25(2), 133-183.

Horton, N. J., Lipsitz, S. R., & Parzen, M. (2003). A potential for bias when rounding in multiple imputation. The American Statistician, 57(4), 229-232.

Kolen, M. J. (1990). Does matching in equating work? A discussion. Applied Measurement in Education, 3(1), 97-104.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking (2nd ed.). New York: Springer.

Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). New Jersey: John Wiley & Sons.

Livingston, S. A., Dorans, N. J., & Wright, N. K. (1990). What combination of sampling and equating methods works best? Applied Measurement in Education, 3(1), 73-95.

Lord, F. M. (1950). Notes on comparable scales for test scores (RB-50-48). Princeton, NJ: Educational Testing Service.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. New Jersey: Lawrence Erlbaum Associates.

Luellen, J. K. (2007). A comparison of propensity score estimation and adjustment methods on simulated data. Unpublished doctoral dissertation, The University of Memphis, Memphis, TN.

Mislevy, R. J., & Sheehan, K. M. (1989). The role of collateral information about examinees in item parameter estimation. Psychometrika, 54, 661-679.

Mosteller, F., & Youtz, C. (1961). Tables of the Freeman-Tukey transformations for the binomial and Poisson distributions. Biometrika, 48, 433-440.

Muthén, B., & Muthén, L. (2008). Mplus 5.1 [Computer software]. Los Angeles, CA: Muthén & Muthén.

Paek, I., Liu, J., & Oh, H. (2008). An investigation of propensity score matching on linear/nonlinear observed score equating methods in a large scale assessment. Paper presented at the annual meeting of the National Council on Measurement in Education, March 25-27, New York.

Petersen, N. S. (2008). A discussion of population invariance of equating. Applied Psychological Measurement, 32(1), 98-101.

Reckase, M. D. (1997). A linear logistic multidimensional model for dichotomous item response data. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 271-286). New York: Springer.

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.

Roussos, L. A., Templin, J. L., & Henson, R. A. (2007). Skills diagnosis using IRT-based latent class models. Journal of Educational Measurement, 44(4), 293-312.

Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: John Wiley & Sons.

SAS Institute. (2003). SAS/STAT software [Computer software]. Cary, NC: SAS Institute, Inc.

Schafer, J. L. (1997). Analysis of incomplete multivariate data. New York: Chapman and Hall.

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147-177.

Schmidt, W. H., et al. (2007). The preparation gap: Teacher education for middle school mathematics in six countries (MT21 report). East Lansing, MI: Michigan State University.

Sinharay, S., Haberman, S., & Puhan, G. (2007). Subscores based on classical test theory: To report or not to report. Educational Measurement: Issues and Practice, 26(4), 21-28.

Spiegelhalter, D., Thomas, A., Best, N., & Lunn, D. (2003). WinBUGS 1.4 [Computer software].

Tate, R. L. (2004). Implications of multidimensionality for total score and subscore performance. Applied Measurement in Education, 17(2), 89-112.

Wang, T., Lee, W., Brennan, R. L., & Kolen, M. J. (2008). A comparison of the frequency estimation and chained equipercentile methods under the common-item non-equivalent groups design. Applied Psychological Measurement, 32(8), 632-651.

Wang, T., & Brennan, R. L. (2009). A modified frequency estimation equating method for the common-item non-equivalent groups design. Applied Psychological Measurement, 33(2), 118-132.

Zimowski, M., Muraki, E., Mislevy, R., & Bock, D. (2002). BILOG-MG 3 [Computer software]. Lincolnwood, IL: Scientific Software International, Inc.