EXAMINING LOCAL ITEM DEPENDENCE EFFECTS IN A LARGE SCALE SCIENCE ASSESSMENT BY A RASCH PARTIAL CREDIT MODEL

by

Jean Weiqin Yan

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology, and Special Education

1996

ABSTRACT

EXAMINING LOCAL ITEM DEPENDENCE EFFECTS IN A LARGE SCALE SCIENCE ASSESSMENT BY A RASCH PARTIAL CREDIT MODEL

by Jean Weiqin Yan

Frequently in a science assessment, several items are generated from the same scenario. These context-dependent items are traditionally analyzed as independent items. However, the potential local item dependence effects among these items may cause a biased estimation of the examinees' abilities in science literacy. The purpose of this study was to investigate the local item dependence effects on testlets in the tryout version of the Michigan High School Proficiency Test in Science by the Rasch partial credit model. Cluster sampling combined with stratified sampling was used in the tryout, in which school was the cluster unit and population density was the stratum unit. Data were analyzed in five different configurations to study the relationships between context-dependent items at the individual item level and at the testlet level. The major findings of the study were:

1. Context-dependent items correlated more closely within-context than across-context for most original testlets.

2. Local dependence effects can be controlled, and a better fit for item calibration can be obtained, by employing the Rasch partial credit model for some, but not all, original testlets.

3. There is no significant difference between the partial credit model and the dichotomous model in average person measures.

4. It seems that an implicit factor other than local item dependence affects the misfit original testlets.

5. Truly statistically independent items should be analyzed independently, whether they belong to a context or not. Additional costs will occur if one treats context-dependent items as testlets in a large-scale assessment because the partial credit model is more complex than the dichotomous model. More money, time, technology, and human resources will be involved.

Copyright by Jean Weiqin Yan 1996

ACKNOWLEDGMENTS

I once joked that this list of acknowledgments would be longer than those in the Oscar Academy Awards, for so many people have contributed to the completion of this dissertation. To me, the experience of doctoral study and dissertation writing was invaluable and unforgettable for my professional development. Today's accomplishment is primarily due to the unfailing support, guidance, and encouragement of my respected advisor, Dr. William Mehrens. Throughout his busy schedule, Dr. Mehrens carefully scrutinized my manuscript many times and provided immediate and insightful suggestions, comments, and advice. His wisdom, open-mindedness, and rich experience in teaching, psychometrics, and education policy have been very precious to me throughout my doctoral study and will be, I believe, in the years to come. I am deeply indebted to Dr. Benjamin Wright for his profound interest and substantial help in my study. Dr. Wright is not on my dissertation committee, but he has done as much as, if not more than, the committee members.
Not only did he "rescue" me from the dead end of the research, but he also led me to a new direction and guided me step by step through the research process via long-distance communication. Without his expertise in rating scale analysis and his encouragement, this study simply could not have been completed. I wish to express my gratitude to the members of the dissertation committee: Dr. Steve Raudenbush, who recommended Dr. Wright to me, for his expertise in educational statistics and incisive criticism; Dr. Edward Smith for his thorough understanding of the Michigan science curriculum and the science assessment framework, and for his constructive and detailed comments and suggestions on the design of the study and the writing of this work; and Dr. Frederick Ignotavich for his expertise in education administration. Special thanks go to my employer, Dr. Diane Smolen, at the Michigan Educational Assessment Program in the Michigan Department of Education for her permission to use the Michigan High School Proficiency Test tryout data and for her consideration in adjusting my workload so that I had time to finish this project. I sincerely appreciate Dr. Leonard Bianchi, Dr. Lindson Feun, Dr. Richard Houang, Dr. Mike Linacre, Dr. Robert Sykes, Dr. Richard Smith, and Ms. Wen-Ling Yang for their professional judgment in educational measurement and statistics, and their valuable suggestions and comments to improve the quality of the study. All of them helped me without any reservation during the process of this study. As for my colleagues, mentors, and dear friends Ms. Jan Hunt-Kost and Dr. Catherine Smith, I can never say enough "Thank you." Armed with their professional knowledge, both of them showed great interest in this study and contributed their precious time to edit my dissertation meticulously. Their continuous support, encouragement, and advice motivated me in my study and work. Last but not least, I would like to thank my family, my relatives, and all my friends in China, the United States, and other parts of the world. Their unselfish love, deep faith, high expectations, and true understanding of my life pursuit inspired me to overcome countless obstacles in the past years to reach this milestone.

TABLE OF CONTENTS

Chapter                                                      Page

LIST OF TABLES .......................................... x
LIST OF FIGURES ......................................... xi

CHAPTER 1
INTRODUCTION ............................................ 1
    The Problem ......................................... 1
    Purpose of the Study ................................ 6
    Significance of the Study ........................... 7
    Research Hypotheses ................................. 11
    Two Scoring Scales of IRT Rasch Models .............. 12
    Structure of the Study .............................. 15

CHAPTER 2
LITERATURE REVIEW ....................................... 17
    Concepts of Testlets ................................ 17
    Characteristics of Testlets ......................... 22
    Testlet Construction and Development ................ 24
    Evaluation of Applications of Testlet Assessment .... 29
    Local Item Dependence Effects ....................... 37
    Summary ............................................. 50

CHAPTER 3
METHODOLOGY ............................................. 53
    Overview ............................................ 53
    Testing Materials ................................... 54
        Science Assessment Framework .................... 54
        The Test ........................................ 56
        Tryout Design ................................... 58
    Data ................................................ 59
    Sampling Procedures ................................. 60
    Item Scoring ........................................ 61
    Original Testlets vs. Random Testlets and
        Reformed Testlets ............................... 62
    Research Hypotheses ................................. 63
    Calibration Models .................................. 64
        The Dichotomous Model ........................... 64
        The Partial Credit Model ........................ 65
    Estimation Measures ................................. 68
        Phi Coefficient ................................. 68
        Person Ability Measure .......................... 69
        Testlet Measure ................................. 71
        Local Item Dependence Measure ................... 72
        Person Separation Ratio Indices ................. 74
    Data Analysis ....................................... 76
    BIGSTEPS Computer Software .......................... 79
    Summary ............................................. 80

CHAPTER 4
RESULTS AND DISCUSSIONS ................................. 82
    Phi Correlation Coefficient Results ................. 83
    Testlet Measures Results ............................ 87
    Verification of Local Dependence Effects ............ 97
    Mean Person Measures Results ........................ 101
    Person Separation Indices Results ................... 104
    Average Category Measures Results ................... 106
    Summary ............................................. 110

CHAPTER 5
CONCLUSIONS AND RECOMMENDATIONS ......................... 113
    Summary of the Study ................................ 113
    Summary of the Results by Hypothesis ................ 115
    Conclusions ......................................... 117
    Limitations ......................................... 120
    Generalizability .................................... 121
    Recommendations for Future Research ................. 122

APPENDICES .............................................. 124
    A: Examples of Partial Credit Scoring ............... 124
    B: Sample Testlet in the MHSPT in Science ........... 125
    C: Michigan School Stratum Classification ........... 126
    D: Item Code Sheet for Tryout Form 22 ............... 127
    E: Tables and Figures ............................... 128

LIST OF REFERENCES ...................................... 176

LIST OF TABLES

Table                                                        Page

 1  Michigan Science Proficiency Test Form Configuration .... 58
 2  Number of Schools and Students Sampled in Science
    Tryout for Each Stratum ................................. 61
 3  Data Configurations of Science Items .................... 79
 4  Match-up of the Analyses with Their Corresponding
    Hypotheses ............................................. 83
 5  Mean Phi Coefficients for Items within Different
    Testlets by Form ....................................... 84
 6  Summary of Mean Item Correlation for the Testlet ........ 85
 7  Comparison of Original Testlet Steps and
    Context-Dependent Items on Error and Fit by Form ........ 128
 8  Student Responses to Testlet 3, Form 23 ................. 92
 9  Student Responses to Testlet 4, Form 23 ................. 93
10  Comparison of Random Testlets and Independent Items
    on Error and Fit by Form ............................... 128
11  Degrees of Freedom for the Context-Dependent Items ...... 148
12  Discrepancies for Testlets in the Tryout Forms .......... 152
13  CIs for One-Way ANOVA for Context-Dependent Items ....... 153
14  Summary of Measured (Non-Extreme) Person Fit by Form .... 155
15  Person Separation Ratios for Different Configurations
    by Form ................................................ 157
16  Reliabilities of Person Separation for Different
    Data Configurations .................................... 106
17  Comparisons of Average Measures for Original and
    Random Testlets by Form ................................ 169
18  Ranges for Average Measures for Original and
    Random Testlets by Form ................................

LIST OF FIGURES

Figure                                                       Page

1  Classification of Testlets ............................... 172
2  An Example of a 2-Level, 3-Item, 4-Outcome
   Hierarchical Testlet ..................................... 173
3  An Example of a 3-Level, 3-Item Linear Testlet ........... 173
4  MHSPT Assessment Framework in Science .................... 174
5  CIs of ln(infit MNSQ) for Original Testlets .............. 175
6  Frequency Distribution of ln(infit MNSQ) for
   Original Testlets ........................................ 175

CHAPTER 1

INTRODUCTION

The Problem

Traditional educational measurement theories assume that multiple-choice (MC) test items are not correlated with each other when examinees' abilities are controlled; each item is analyzed independently and dichotomously. Consequently, the unit of analysis is the item itself. However, in many testing situations, such as a short story in a reading comprehension test, a table in a mathematics test, or an investigation in a science test, a context is established and students are asked a series of questions related to that context. Wainer and Kiely (1987) called a set of these context-dependent items a "testlet" and defined it as "a group of items related to a single content area that is developed as a unit and contains a fixed number of predetermined paths that an examinee may follow (p. 190)." For example, on the Michigan High School Proficiency Test in Science (tryout version, 1995), one testlet on life science had six context-dependent items, four of which were multiple-choice items and the remaining two constructed-response questions. In this example, a genetic disease was described and students were asked to identify the information about the gene presented in the pedigree and draw conclusions about it. Then the students identified the scientist who contributed to the explanation of the disease and the probability of an unborn baby getting the disease given the parents' health condition. Finally, a hypothetical situation was given, and the students had to answer questions based on the pedigree they had drawn and provide scientific reasons for their answers. These items were scored independently, even though they were related to the same context.

The immediate problem with conventional scoring methods under these circumstances is that the item response theory (IRT) assumption of local independence may be violated. In IRT, the assumption is that, for a subpopulation of examinees at a given ability level $\beta$ on a latent trait scale, the items are statistically independent of each other, i.e., $P(x_1=1, x_2=1 \mid \beta) = P(x_1=1 \mid \beta)\,P(x_2=1 \mid \beta)$, where $x_1$ is the score on item 1 and $x_2$ is the score on item 2. Thus, the probability of answering one item correctly, $P(x_1=1 \mid \beta)$, does not affect the probability of the examinee's answering the other item correctly, $P(x_2=1 \mid \beta)$. When the items are statistically dependent, i.e., the probability of answering one item correctly depends on how one performs on the other, the equation does not hold: $P(x_1=1, x_2=1 \mid \beta) \neq P(x_1=1 \mid \beta)\,P(x_2=1 \mid \beta)$. The rationale for the assumption of local independence is that the trait value should provide all the related information about the examinee's knowledge and that the contribution of each item to the test can be evaluated independently of all other items.
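The factorization above is easy to see numerically. The following minimal sketch (not from the dissertation; all probabilities are invented for illustration) simulates one group of examinees at a single ability level and compares the joint proportion answering both items correctly with the product of the marginals, first for independent items and then for a pair driven by a shared context:

```python
import numpy as np

rng = np.random.default_rng(0)

def check_factorization(x1, x2):
    """Compare the joint proportion-correct with the product of the
    marginals for a group of examinees at (roughly) one ability level."""
    p1, p2 = x1.mean(), x2.mean()   # marginal P(x1=1), P(x2=1)
    p12 = (x1 * x2).mean()          # joint P(x1=1, x2=1)
    return p12, p1 * p2

n = 10_000

# Locally independent items: each response is drawn separately.
x1 = rng.random(n) < 0.6
x2 = rng.random(n) < 0.5
print(check_factorization(x1, x2))   # joint is close to the product

# Context-dependent items: a shared "understood the scenario" event
# drives both responses, so the joint exceeds the product.
u = rng.random(n) < 0.7              # grasped the common context
x1 = u & (rng.random(n) < 0.85)
x2 = u & (rng.random(n) < 0.72)
print(check_factorization(x1, x2))   # joint noticeably > product
```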
One of the measurement implications of local item dependence is that there is an effect on the test information obtained, because the test information function $I(\beta)$ has an inverse relationship with the standard error of measurement (SEM) of the ability estimates at level $\beta$: $SEM(\hat{\beta}) = 1/\sqrt{I(\beta)}$. The information estimate for a test is the sum of all the individual item information estimates, $I(\beta) = \sum_{i=1}^{L} I_i(\beta)$, where $L$ is the number of items. The point is that this additive relationship is based on the assumption of local independence. When items are interdependent, the standard error of measurement of the test changes, depending on the direction of the correlation between items. Consequently, the test information calculated by summing the $I_i(\beta)$, assuming local independence, will be an over- or underestimate of the true information (Thissen, Steinberg, & Mooney, 1989; Yen, 1993).

As to the direction of bias, Anastasi (1961) stated: "Were the items in such a group to be placed in different halves of the test, the similarity of the half scores would be spuriously inflated, since any single error in understanding of the problem might affect items in both halves (p. 121)." Guilford (1936) made a similar point: "Interdependent items tend to reduce the reliability. Such items are passed or failed together and this has the equivalent result of reducing the length of the test (p. 147)."

Theoretically, large correlations between residuals may imply a second trait in the ability estimation. Rosenbaum (1988) compared item response distributions when local independence was conditional between, but not within, item "bundles" (testlets) under two sets of IRT assumptions. One set was traditional IRT; the other was less restrictive on local independence, allowing dependence among pairs of items that shared the same context. He proved a theorem that, at every level of ability, the standard error of measurement under a positively correlated bundle was at least as large as that from a conventional IRT model having the same item characteristic curves (ICCs). He also found that positive dependence within bundles increased the SEM along the ability continuum. He suggested that, other things being equal, it is preferable not to use bundles of positively dependent items, since doing so may cause a larger SEM.

Thissen, Steinberg, and Mooney (1989) used a multivariate logistic latent trait model (Bock, 1972) to examine the violation of the local independence assumption with computerized adaptive test (CAT) data. They compared the results of a 4-testlet, 22-item test when the items were analyzed first as independent items and then as testlets. The results showed that, when testlet items were analyzed independently, the test information obtained was deceptively high. When those items were analyzed as testlets, the concurrent validity was slightly but significantly higher than that of the independently analyzed items. They concluded that the appearance of more information was "fooled" by the excess correlation among items within the testlet and that the testlet scores appeared to be at least as valid as the individual item scores.
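The inflation Thissen et al. describe can be reproduced in a few lines. In the sketch below (not part of the study; the response probabilities and the size of the shared context effect are invented), the nominal test information for four Rasch items at a fixed ability level, $\sum_i p_i(1-p_i)$, understates the actual variance of the testlet score once a common context shock pushes the four response probabilities up or down together:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000                            # examinees, all at one ability level
p = np.array([0.7, 0.6, 0.5, 0.4])    # correct-response probabilities

# Nominal test information for Rasch items, assuming local independence;
# this also equals the variance of the total score under independence.
nominal_info = np.sum(p * (1 - p))

# Simulate a positively dependent testlet: a shared context effect
# nudges all four response probabilities up or down together.
shared = rng.normal(0, 0.15, size=(n, 1))     # common context shock
probs = np.clip(p + shared, 0.01, 0.99)
x = rng.random((n, 4)) < probs
total = x.sum(axis=1)

# Positive within-testlet covariance inflates the score variance, so
# the true SEM is larger than the nominal 1/sqrt(information).
print(f"nominal (independence) variance: {nominal_info:.3f}")
print(f"observed variance of testlet score: {total.var():.3f}")
```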
Yen (1993) used 3PL and 2PPC models to study multiple-choice tests of the Comprehensive Test of Basic Skills, Fourth Edition (CTBS/4; CTB Macmillan/McGraw-Hill, 1989) and the performance assessment data of a state education assessment program. Item information and discrimination estimates obtained on the testlet scale and on the item scale for reading and math tests were compared. It was found that testlet analysis did result in a larger SEM, but this could be seen as a reflection of reality. However, in many cases there was not much difference in parameter estimates when items were scaled as testlets or as independent items.

It seems that, for context-dependent items, using the item as the unit of analysis may produce erroneous results, because some items may be more strongly correlated within a context than between contexts. These high correlations, which are context-specific rather than test-specific, result in biased measurement of the common factor between contexts (Thissen et al., 1989). The information curve in IRT and the high reliability index in classical test theory are misled by the excess item correlations within a testlet because context-dependent items may themselves be statistically dependent. An alternative is to analyze these items together as a unit.

Purpose of the Study

The purpose of the study was to explore the local item dependence effect when context-dependent items in the Michigan High School Proficiency Test in Science were analyzed as independent items and as testlets. In addition, originally independent items in the same test were randomly formed into testlets to conduct a concurrent validity analysis for the testlet effect. Both the traditional dichotomous rating scale and the partial credit scale in the IRT Rasch models (Wright and Masters, 1982) were used. The computer software BIGSTEPS (Linacre and Wright, 1995; version 2.6) used here was designed to conduct Rasch measurement from the responses of a set of persons to a set of items.

If the results of the testlet-based analysis are not significantly different from those of the item-based analysis, there is not enough evidence to reject the statement that context-dependent items within the testlets can be analyzed as individual items. The assumption of local independence will still hold. Consequently, it will not make a difference whether these context-dependent items are analyzed independently or as testlets. In general, the item-based analysis is easier to conduct and less expensive, because dichotomous scoring is a conventional approach and the scoring process has been established in the industry. Higher costs would occur for the testlet-based analysis because the scoring process and the scoring model are more complex and, therefore, more time, coding, computer
As a result, these items should be analyzed as testlets with partial credit models. It is expected that the testlet analysis approach would provide an alternative in data analysis to control or alleviate the effect of the violation of the local independence assumption when local item dependence is indeed present. Significance of the Study Few studies have paid attention to the measurement Characteristics of testlets, even though they have existed as 8 an item format almost as long as tests themselves. In the last decade, there has been growing interest in treating a set of context-dependent items as the unit of analysis in educational measurement research. One main reason that test developers are using larger tasks as the fundamental units of tests and further shifting their focus to this field is that, besides the testlet characteristics to be described later, modern tests serve more purposes than before. A test result may now be used not only for achievement assessment, diagnosis, placement, or admission purposes, but also as an important reference to policy making and education budgeting practices. The same amount of testing time and information are used to achieve more goals than before. Furthermore, researchers have experimentally projected that testlets as units of analysis can solve some of the measurement problems that could not be overcome by item-based analysis (Ebel, 1951; Wainer & Kiely, 1987; Rosenbaum, 1988; Thissen et al, 1988, 1989; Haladyna, 1992; Yen, 1984a, 1993). Studies and discussions about testlets so far have been limited to applications of testlet concepts (Szeberényi & Tigyi, 1987; Wainer et a1, 1990, 1991, 1992), construction and development of testlets (Engelhart, 1942; Gerberich, 1956; Gronlund, 1965; Biggs & Collis, 1982; Mehrens & Lehmann, 1984; Collis et a1, 1986; Haladyna, 1991) and measurement precision (Cureton, 1965; Cattell & Burdsal, 1975; Wainer et a1, 1990; Sireci et a1, 1991; Ercikan, 1993). Studies on the effect of loss of local independence mostly 9 used IRT two—parameter (2PL) or three—parameter (3PL) polytomous :models (Rosenbaum, 1988; Thissen et a1, 1989, Donoghue, 1993, Yen, 1993). A hidden problem in using a 2PL or 3PL model is that these models are sample dependent and results can vary from sample to sample because they do not have sufficient statistics and thus their mathematical formulas cannot converge. Consequently, the models cannot separate person parameter from item parameters. (Wright, 1992). An outstanding property of the Rasch. model is that it has sufficient and necessary statistics that can separate person parameter from item parameter, and make it possible to construct the linear and objective measurement. More discussion about sufficient statistics for the IRT models will be presented later in Chapter 3. Wilson (1988) used the family of Rasch models (dichotomous, partial credit, and rating scale) to study the local item dependence effect with an example of “superitems' (testlets) in the Structure of the Learning Outcome program. The results showed that the rating scale model calibration provided. no evidence of the ‘violation. of the local item dependence assumption. Dependencies between items were adequately summarized by the dichotomous model item difficulties. On the other hand, the partial credit model calibration showed. that one of the five testlets studied demonstrated a local item dependence effect. However, the sample size was very small in Wilson's study (1988) . 
The data were collected from only 30 students in the 9th and 10th grades, which is not comparable with a large-scale assessment program.

Masters's (1982) Rasch partial credit model was originally developed to analyze multiple-category items, and it has remained this way for most studies of this model. For multiple-choice item analysis, it has been used for foil analysis to gain more information. Other uses have included theoretical exploration, such as the multidimensionality issue (De Ayala, 1991) and the necessary and sufficient conditions to equate the estimates from dichotomous and partial credit models (Huynh, 1994). However, most comparisons were at the item level, not at the testlet level. Wilson and Iventosch (1988) conducted a study at the testlet level, but the items were performance-based and the research was experimental with small samples. So far, studies have found that the partial credit model added more detailed information to the dichotomous model and provided the opportunity to observe the local dependence between items within a testlet when the situation occurred.

A review of the literature on this topic indicates that there have been no studies examining the local dependence due to the testlet effect in any large-scale, high-stakes state assessment program using Masters' partial credit model for MC items. This study attempts to do so. (Studies done with 2PL or 3PL partial credit models are not the focus of the discussion here, which does not mean that they are not important. Rather, the intent is to concentrate on the main models of interest under study and to avoid the complexity and issues inherent in 2PL and 3PL partial credit models.) In addition, the study will explore the curriculum impact on item analysis. Sometimes the context for constructing a testlet makes perfect sense in the curriculum, but it does not affect the analysis of scoring scales psychometrically. The results for the testlets in the newly developed Michigan High School Proficiency Test in Science will provide a real-life example of applying an alternative item analysis method to a large-scale, high-stakes assessment program. The study will also explore other techniques that can be used in item analysis, so that its methods and results can contribute to the item analysis field.

Research Hypotheses

Based on the purpose and rationale of the study, the following research hypotheses are proposed to study the local item dependence effect.

1. For context-dependent items,
(a) the average item correlations within an original testlet are larger than the average correlations with items from other testlet configurations;
(b) when they are analyzed as a testlet by the Rasch partial credit model, they produce a better testlet fit statistic than when they are analyzed as individual items by the Rasch dichotomous model;
(c) when they are analyzed as a testlet by the Rasch partial credit model, they produce better person fit statistics than when they are analyzed as individual items by the Rasch dichotomous model;
(d) when they are analyzed as a testlet, the measurement errors are smaller than when they are analyzed as individual items. In other words, the person separation reliability is higher for testlet-based analysis than for item-based analysis.
2. For independent items,
(a) when they are analyzed as a testlet by the Rasch partial credit model, the testlet fit statistics are the same as the item fit statistics when they are analyzed as individual items by the Rasch dichotomous model;
(b) person fit statistics stay the same regardless of whether the items are analyzed as random testlets or as individual items;
(c) the reliability of the person separation ratio is the same for both testlet-based analysis and item-based analysis.

3. When context-dependent items in the original testlets of the same tryout form are decomposed and reformed into the same number of new testlets, each with one item from each original testlet, as if they were in different contexts,
(a) the average correlations between items within a reformed testlet are smaller than the average correlations between items within an original testlet;
(b) person fit statistics estimated with the reformed testlets are not as good as those estimated with the original testlets.

Two Scoring Scales of IRT Rasch Models

The purpose of any test theory is to describe how inferences can be made from examinees' test scores or item responses about unobservable characteristics that are measured by tests. These characteristics are referred to as traits or abilities. Since they are not directly measurable, they are called latent traits or abilities. With item response theory, test developers usually assume that a single latent trait is responsible for the item responses on a test if the test is designed to measure that trait. An item response model specifies a relationship between the observable examinee test performance and the unobservable trait or ability assumed to underlie performance on the test. The relationship is described by a mathematical formula which explains how examinees at different ability levels on the trait scale should respond to an item. Graphically, this relationship is reflected by the item characteristic curve (ICC), the key concept of IRT. Basically, an ICC plots the probability of responding correctly to an item as a function of the latent trait underlying performance on the test items. This knowledge allows one to compare the performance of examinees who have taken different tests. It also permits one to apply the results of an item analysis to groups with different ability levels.

Different item response models are constructed through the specific assumptions one is willing to make about the test data set under study. For this study, two models in the family of Rasch models (i.e., one-parameter models) were used: the dichotomous model (DM) and the partial credit model (PCM). The family of models was named after Georg Rasch, a Danish mathematician, who formulated this approach in the 1950s and 1960s. It is a method for obtaining objective, fundamental measures from stochastic observations of ordered category responses (Linacre and Wright, 1995). The family of Rasch models is suitable for testlet analysis, as it has well-developed and interpretable polytomous extensions that embody the assumed item/category dependence and that make inter-model comparisons relatively easy by having identical sufficient statistics for the person ability parameters.

The dichotomous model assumes that there are only two levels or categories of performance for an item, such as right/wrong, yes/no, or pass/fail. It provides a way to place persons and items on a scale with a clear probabilistic interpretation of distance on the scale.
Items scored in this way can be considered "one-step" items. If an examinee completes the step, 1 point is awarded; otherwise, 0. That is, responding to an item correctly means completing a step. This scoring method is widely used in multiple-choice tests. The model was used here whenever items in the data were analyzed independently.

The partial credit model (PCM) is an extension of the DM and handles data that involve more than one step within an item. For example, writing assessments frequently score examinees at different writing levels. The PCM's basic observation is the number of steps that an examinee accomplishes in an item. If, for example, an item has 3 steps, an examinee can get a score of x = 0, 1, 2, or 3 points. More examples of partial credit scoring are provided in Appendix A. It can be seen that the basic measures in the PCM are the step difficulties within an item. The assumption for the PCM is that the step difficulties are not equally spaced among the performance levels. For example, in Example 1 of Appendix A ((9.0/0.3) − 5 = ?), Step 2 (30 − 5 = 25) is much easier than Step 1 (9.0/0.3 = 30). In addition, the number of steps across items in a test does not have to be the same. Theoretically, steps in a PCM item should be ordered and answered in order: one needs to complete Step 1 before moving on to Step 2. In this study, the "steps" were the items in a testlet. The mechanism of awarding partial credit within an item was borrowed here to award partial credit to a testlet, in that the items in a testlet were analogous to the steps in an item and the testlet was analogous to a conventional MC item. The total raw score for a testlet was treated as the testlet score and was used for the testlet analysis. Details are presented in Chapter 3.
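As a concrete illustration of the PCM mechanics just described, here is a small sketch (mine, not the author's; it mirrors the category probabilities given later as Eqs. (21)-(22), and the step difficulties are invented) that computes the probability of each possible testlet score for one person:

```python
import numpy as np

def pcm_probabilities(beta, deltas):
    """Masters-style partial credit category probabilities.

    beta   : person ability (logits)
    deltas : step difficulties delta_1..delta_m for one testlet (logits)
    Returns P(X = 0), ..., P(X = m) for the testlet score X.
    """
    # Cumulative sums sum_{j<=k} (beta - delta_j); the k = 0 term is 0.
    steps = np.concatenate(([0.0], np.cumsum(beta - np.asarray(deltas))))
    expo = np.exp(steps)
    return expo / expo.sum()

# Hypothetical 3-step testlet: step 2 easier than step 1, step 3 hardest.
probs = pcm_probabilities(beta=0.5, deltas=[-0.2, -1.0, 1.4])
print(np.round(probs, 3))            # probabilities of scores 0..3
print((probs * np.arange(4)).sum())  # expected testlet score
```

With a single step (m = 1), the same formula reduces to the dichotomous model, which is what makes the inter-model comparison in this study straightforward.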
Structure of the Study

In the first chapter, the problem of local item dependence, the measurement issues in testlet analysis, the purpose of the study, the significance of the study, the research hypotheses, and two scoring models in the family of Rasch models have been introduced. In Chapter 2 the author reviews the literature on the concepts of testlets, characteristics of testlets, construction and development of testlets, application of testlet concepts, and research on the local independence assumption in IRT. Chapter 3 is the methodology chapter, in which the testing materials, the data, the sampling procedures, the research hypotheses, item scoring, testlet categories, calibration models, estimation measures, the data analyses, and the computer program of this study are the foci. In Chapter 4 the results of the different measures described in Chapter 3 are reported and discussed. In the final chapter a summary of the study and the results by hypothesis are furnished. Also presented are the conclusions, limitations, generalizability of the study, and recommendations for future research.

CHAPTER 2

LITERATURE REVIEW

There are six sections in this chapter. The first two sections cover concepts and characteristics of testlets. In the third section, testlet construction and development are discussed. The fourth and fifth sections are devoted to the application of testlet concepts and to measurement precision, especially when the assumption of local independence is violated. The focus is on theoretical development, assumptions, and characteristics. Finally, the literature relevant to the present study is summarized.

Concepts of Testlets

The problem of violating local independence with context-dependent items, and the consequent estimation bias, invites a review of the structure of context-dependent items, which was discussed extensively a few decades ago (Ebel, 1951; Anastasi, 1961; Gronlund, 1965; Mehrens & Lehmann, 1984). Ebel called context-dependent items "interpretive test exercises" and predicted that this format would be highly promising. In his Writing the Test Item, Ebel (1951) defined the interpretive test exercise as follows:

"The interpretive test exercise consists of an introductory selection of material followed by a series of questions calling for various interpretations. The material to be interpreted may be a selection of almost any type of writing (news, fiction, science, poetry, etc.), a table, map, chart, diagram, or illustration; the description of an experiment or a legal problem; even a baseball box score or a portion of a music composition. The questions on this material may be based on explicit statements in the material, on inferences, explanations, generalizations, conclusions, criticisms, and on many other interpretations (p. 241)."

Gronlund (1965), following Ebel, used the same name but a less specific definition:

"An interpretive exercise consists of a series of objective items based on a common set of data. The data may be in the form of written materials, tables, charts, graphs, maps, or pictures. The series of related test items may also take various forms but are most commonly of the multiple-choice or alternative-response variety (p. 161)."

Nevertheless, Gronlund demonstrated extensively the forms and uses of the interpretive exercise to measure the complex achievement of an examinee, such as the ability to recognize assumptions, inferences, and relevance of information, to apply principles, and to interpret experimental findings. Mehrens and Lehmann's (1984) definition of the interpretive exercise was similar to Gronlund's but emphasized that the introductory material should be identical for all students:

"The interpretive exercise consists of either an introductory statement, pictorial material, or a combination of the two, followed by a series of questions that measure in part the student's ability to interpret the material. All test items are based on a set of materials that is identical for all students (p. 295)."

What was different was that Mehrens and Lehmann presented the interlinear exercise as a format in the context-dependent literature. In their definition, an interlinear exercise was "somewhat of a cross between the essay question (the student is given some latitude of free expression in that he decides what is to be corrected and how it is to be corrected) and the objective item (the answer can be objectively scored) (p. 295)." For example (words struck through in the original exercise are shown here in brackets):

Example 1. "Harry was [alright] all right at [grammer] grammar, but he didn't excel at [speling] spelling."

"The researchers [are of the opinion] believe [that this] the test often produces biased results [a great number of times owing to the fact that] because subjects [exhibit a tendency to] misinterpret the questions."

It should be pointed out that all the definitions above include the pictorial form as a medium for presenting material to examinees. The pictorial form is considered to fit very well for younger children and for children with some reading deficiencies. It is a unique tool for directly measuring an examinee's ability to interpret graphs, maps, tables, and even cartoons.
In some cases, pictorial material presents and explains far more precisely, simply, and effectively than does text material.

Other terms that have been used for context-dependent items include "superitems" (Cureton, 1965), "application test" (Szeberényi and Tigyi, 1987), "item bundle" (Rosenbaum, 1988), and "item set" (Haladyna, 1992). Szeberényi and Tigyi defined an "application test" as follows:

"The test consists of a description of an experiment, including data presented in tables or figures interspersed with built-in multiple-choice questions (p. 73)."

Rosenbaum's definition of "item bundle" was:

"An item bundle is a small group of multiple-choice items that share a common reading passage or graph, or a small group of matching items that shares distractors (p. 349)."

Haladyna's definition of a testlet was the simplest one:

"A context-dependent item set consists of an introductory stimulus and a set of related test items (p. 21)."

The term "testlet" was first introduced by Wainer and Kiely (1987) as "a group of items related to a single content area that is developed as a unit and contains a fixed number of predetermined paths that an examinee may follow (p. 190)." This definition differed from the previous ones in that it clearly spelled out the nature of the information selection as "a single content area" and emphasized its development "as a unit." This implied that the items generated from that content area should be analyzed together as a unit. Secondly, it identified the logical relationship between items. It may also be inferred that the testlet concept covers several different forms of context-dependent items. This more inclusive definition has been widely accepted and therefore will be used hereafter in this study. Wainer and Kiely (1987) expected that using the testlet as the unit of analysis could ease some of the observed and prospective difficulties associated with most of the current algorithmic methods of test construction, specifically for computerized adaptive tests.

There are two ways of classifying testlets: by content form and by logical relationship (see Figure 1). The content form consists of four categories of testlets. The "pictorial form" bases its stimulus for questioning on pictures, maps, graphs, figures of data, photographs, art works, and the like. The "interlinear form" consists of a single passage with a number of denotations that provide an opportunity for questioning, such as grammar error analysis in writing tests. The "interpretive exercise" uses a stimulus to set the stage for interpretive questions. The "problem-solving scenario" contains a problem and questions aimed at various steps in the solution of the problem (Haladyna, 1992).

The logical method classifies testlets into two categories, linear and hierarchical. By Wainer and Kiely's (1987) definition, each item is embedded in a pre-developed testlet, carrying its own context with it. If the paths through a testlet lead examinees to successive items of greater or lesser difficulty, depending on their previous responses, and culminate in a series of ordered score categories, it is called a hierarchical testlet (Figure 2). In Figure 2, Item 2 is supposed to be an item of medium difficulty. If it is answered correctly, the student is presented with a more difficult item (Item 3); otherwise, Item 1 follows. At Level II, the final outcome for answering Item 3 correctly is Outcome A, while an incorrect answer results in Outcome B. The same process holds for Item 1: if the examinee answers the item correctly, Outcome C will be the result; otherwise, Outcome D will be the measurement score.
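The branching just described is simple enough to express as a routing table. The sketch below is purely illustrative (the data structures and the administer function are hypothetical, not from Wainer and Kiely); it walks one examinee through the Figure 2 testlet and returns the ordered outcome category:

```python
# Routing for the hierarchical testlet in Figure 2: start at the medium
# item (2); a correct answer routes up to the harder item (3), an
# incorrect answer routes down to the easier item (1); outcomes A-D
# are the ordered score categories.
ROUTES = {2: {True: 3, False: 1}}
OUTCOMES = {
    (3, True): "A", (3, False): "B",
    (1, True): "C", (1, False): "D",
}

def administer(answers):
    """Walk one examinee through the testlet.

    answers: dict mapping item number -> bool (answered correctly?).
    """
    item = 2                               # Level I: medium item
    item = ROUTES[item][answers[item]]     # Level II: routed item
    return OUTCOMES[(item, answers[item])]

print(administer({2: True, 3: False}))   # -> "B"
print(administer({2: False, 1: True}))   # -> "C"
```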
If a testlet contains a single path of several items that is administered to all examinees, it is called a linear testlet (see Figure 3). In this case, all examinees are exposed to the same items without discrimination. Depending on the purpose of the test, the two forms may be combined to construct mixed formats of testlets. Nevertheless, in most cases testlets are constructed in the linear form; hierarchical forms are more often used in adaptive tests.

Characteristics of Testlets

One major characteristic of a testlet is that it can be adapted to all types of tests, such as mathematical problem solving, scientific problem solving, statistical reasoning, essays, performance-type activities, and higher-order thinking. Because of this compatibility, testlets provide an effective setting that allows the test developer to present relatively complex topics and to ask meaning-construction questions. Usually, in the one-item or independent-question format, a test developer can ask only simple and straightforward questions, and the essence of the item is in the stem. One has very limited room to provide necessary background information or "raw material" with which an examinee can show his or her abilities to interpret, synthesize, organize, and evaluate in solving a problem. Various item forms and modes of presentation make the testlet a popular format because it is not only effective and flexible in providing a whole picture of a problem, but also in assessing different aspects of an examinee's knowledge of a topic. Thus, this format provides a more coherent measure of a larger set of skills than is ordinarily possible with an item-based format. Frequently, it is found that test developers and test takers have different perceptions of a problem, which makes many examinees perform unsatisfactorily. Testlets reduce ambiguity by providing a common ground of information more detailed than that of independent items, and by controlling the amount of factual information given to the examinees. Further, the format allows the test builder to provide guidance through a complex problem by suggesting, with the judicious use of subproblems, a path toward the solution of a larger question. These suggestions and subproblems can provide both instructional help and an explicit framework for awarding partial credit through polytomous scoring procedures (Wainer, Kaplan, & Lewis, 1992).

However, despite its wide application, the testlet has its own special problems. First, it is very difficult and time-consuming to develop testlets of high quality, especially those dealing with complex topics. It is not uncommon for original passages to be revised numerous times to satisfy the specifications of content, level of difficulty, and the outcomes of assessment required for use in real tests. Secondly, it takes considerably longer to administer testlets than to administer independent multiple-choice items, because testlets require comprehensive interpretation ability. Since a testlet usually tests multiple abilities of an examinee, understanding the problem becomes essential. Thirdly, it may require that an examinee possess comprehensive reading ability. Often a testlet of moderate length is at least as long as a lengthy independent multiple-choice item.
Lastly, because of the time factor, the number of items for a given testlet is restricted to a certain degree, which may cause a reduction in the reliability of the test (Mehrens & Lehmann, 1984).

Testlet Construction and Development

Structures of testlets have changed considerably with the development of testing and measurement. Two frequently used forms in the early development of testlets are option-sharing and alternative-response items. The following examples show their formulations.

Example 2.

Directions: The numbers preceding the paired items in the exercise below refer to the corresponding numbers on the answer sheet. Considering each pair from the standpoint of quantity, blacken space
A, if the item at the left is greater than that at the right;
B, if the item at the right is greater than that at the left;
C, if the two items are of essentially the same magnitude.

[Diagram: two inclined planes, Plane I and Plane II, with labeled points F, G, H, L, M, N, and O.]

Two spheres, X and Y, of equal masses and radii are placed on two inclined planes, as shown in the diagram. Neglect friction and air resistance, and assume that potential energy is measured from the level of points L, M, N, and O.

70. Potential energy of X at F - Potential energy of Y at H.
71. Potential energy of X at M - Potential energy of Y at N.
72. Potential energy of X at M - Potential energy of X at L.
73. Kinetic energy of X on rolling to L - Kinetic energy of X on falling to M.
74. Kinetic energy of X on rolling to L - Kinetic energy of Y on falling to O.
75. Work done on X in raising it from M to F - Work done on X in moving it from L to F.
76. Work done on X in raising it from M to F - Work done on Y in raising it from N to H.*

* Other items of the series involved comparisons with respect to acceleration, time, loss or gain in potential or kinetic energy, power, force, mechanical advantage, and mechanical efficiency. The exercise as a whole requires the application of numerous principles of mechanics. (Engelhart, 1942, p. 110)

In the next example, the item stem is followed by several sentences the pupil is expected to classify according to their degree of causal relationship to the common stem.

Example 3.

Directions: In the following examples, the first part is followed by several OTHER parts. Your job is to find out if the first part is a direct cause or an indirect cause, or if it is not a cause, of the other parts that follow it.
Emphasized the importance of concrete objects as instructional materials in the education of young children. 68. Developed a system for analyzing the interaction of students and teacher. Example 2 is in pictorial form. The graph and description of conditions to solve the problem are presented at the beginning of the problem. The examinee is supposed to match each of the following seven items to any one of the earlier mentioned conditions. Example 3 tests an examinee's ability to understand cause—effect relationships. The stem is very short, one simple sentence, but the directions are relatively long. The alternative responses in this testlet 27 were “direct," “indirect," or “no relationship.” Example 4 starts with the options and is followed by three questions sharing the same options. It can be seen that the alternative response form requires directions for each testlet, which is run: efficient in the test construction. While MC items, however, do not need directions to set up conditions, they do require more space and. more ‘higher—order thinking skills to solve the problems (see Example 5 on the next page). The main differences between constructing testlets and traditional MC item writing reside in the selection of appropriate introduction material and construction of items relating to that material. Strategically, the two parts should be developed simultaneously, since selecting the introduction material is similar to selecting the topics for individual items and the introductory material is crucial to the quality control of the testlet. 28 E J 5' i .1 . J J J . 1 . i E E 1. . J E !' 1M Year Republican Democratic Progressive 1904 336 140 1908 321 162 1912 8 435 88 1916 254 277 1920 404 127 1924 382 136 13 1928 444 87 1932 59 472 1936 8 523 1940 82 449 1944 99 432 1. Which party held the presidency during 1926? 1) Republican 2) Democratic 3) Progressive 4) The table does not tell 2. In what year was the Republican victory the most decisive? 1) 1904 2) 1924 3) 1928 4) 1936 3. Which of these statements about Democratic party strength is supported by the table? 1) The Democrats won easy victories in both 1912 and 1916. 2) The Democrats have been by far the strongest political party since 1904. 3) Democratic party strength Ihas been slowly‘ increasing since 1932. 4) Democratic party“ strength. has been slowly' decreasing since 1936. 4. Between which two consecutive elections was there the greatest increase in the number of Democratic electoral votes? 1) 1908 and 1912 2) 1912 and 1916 3) 1928 and 1932 4) 1932 and 1936 5. The percentage of the electoral votes received by the Democrats was the largest in what year? 1) 1944 2) 1936 3) 1928 4) 1912 (Ebel, 1951, p.243). 29 Evaluation of Applications of Testlet Assessment Discussion of the testlet was mostly limited to its form and construction in the early literature. Issues of its application have emerged in recent studies. Szeberényi and Tigyi (1987) described their employment of the testlet (they called it an "application test") as a problem-solving exercise tool for teaching and assessment of competence in a medical biology class. The typical structure of their testlet was somewhat similar to that of a scientific paper. The objectives of the experiments presented in the testlet were summarized in a short introduction with a brief description of methods. Experimental data were presented in the text, in a table or in pictorial form. 
A typical test contained 4-6 testlets, each with 10-15 MC items, and was concluded by a discussion of the results. An important feature of their testlet test was that it was an open-book examination. Students were allowed to. use any source of information (textbook, lecture notes, research papers, etc.) to eliminate assessing sheer factual knowledge from the test and to guarantee testing problem—solving skills to some extent. As a result, a test usually took three hours to finish. Szeberényi and Tigyi (1987) stated that their experience of 12 years in using testlets was very successful. They thought that testlets were valuable tools to assess higher levels of the cognitive domain at different levels of difficulty and could be used for teaching. Factual knowledge in a testlet was necessary but not sufficient to solve the problems. As for 30 students' feedback, the majority of students liked testlets as learning aids and accepted them as a form of examination. Wainer and Lewis (1990) investigated three different applications of testlet assessment and described psychometric models that they considered to be most suitable for each application. One application was drawn from Using Baysian Decision Theory to Design a Computerized Mastery Test (Lewis and Sheehan, 1988) , which employed the Test of Seismic Knowledge developed by ETS for architectural certification. Since it was a "pass-fail" test, the study focused on testlet difficulty in the region around the decision point. The item pool consisted of 110 items. Sixty percent of the items dealt with physical and technical aspects of seismic knowledge (Type 1 items), and 40% covered economic, legal, and perceptual concepts (Type 2 items). The goal of the study was to create testlets that could be interchanged randomly while retaining unbi asness and measurement accuracy (the degree to which the selected testlets varied with respect to the average likelihood of a particular number— right score). The item pool was divided into 10-item testlets, with each testlet balanced for content and equal in average difficulty and discrimination. The testlets were constructed by cross-classifying the item pool by item type and estimated item difficulty. After testlet selection, the experts in the subject field edited. the final version. The Validity of the testlet interchangeability assumption was 31 evaluated by determining the degree to which the six selected testlets varied with respect to the average likelihood of a particular number-right score. Likelihoods were evaluated at five different points on the latent proficiency scale which corresponded to five important decision points surrounding the anticipated cutscore. This validity check shows that, for examinees near the cutscore, the average number-right score has about the same probability regardless of which testlet was administered. After completion of a testlet presented to an examinee, a pass or fail decision was made by a statistical determination. It was expected that the number-right score approach carried all the information necessary to implement the Baysian decision process that was employed in the application. The tests allowed test developers to simultaneously maximize the probability of classifying individuals and minimize the amount of testing. The second application, conducted by Thissen, Steinberg, and. Mooney (1989), used traditional reading comprehension items as linear testlets and applied an adapted IRT model in a testlet-level analysis. 
Items were from IRT scored computerized adaptive tests and were used to study possible violation of the local independence assumption when several items shared the same stem. In the formulation, Thissen et a1. (1989) considered the examinees' responses to m questions relating to the same passage as a polytomous response and "then scored it either 0, 1, 2, . . ., or m, depending upon how 32 many of m questions an examinee answered correctly. They compared the results of a 22-item test where the items were first treated as independent items with the results from four testlets grouped by four passages by these items. The reading passages varied from one to six paragraphs and were followed by three to eight questions about the content. In addition, the authors evaluated the concurrent validity of these four testlets' scores with that of 54 other independently scored items in the same test. The Thissen et a1. study used a testlet response model proposed by Bock (1972) for responses of two or more nominal categories for each passage. The model required conditional independence Zbetween testlets only, not within them” The testlets were formed linearly and administered linearly. The traditional 3-PL IRT model was used to score the passage items as if they were independent. The results showed that the 3-PL scoring appeared to provide substantially more information over most values of the latent trait, especially at the positive side of its continuum. Hewever, the concurrent validity study with the statistical program LISREL (Joreskog and sorbom, v. 7, 1984) showed that the four testlets' scores were slightly but significantly superior to the 3-PL scores (xfln=8.8, p<.003) with an external criterion, the raw score on a simultaneously administered 54-item test.u—(§km)z)]"”, (14) where Pm is the estimated probability of a person with a score of r responding in step k to testlet i of the last iteration. The person fit statistic is t.=(v,‘,”—1)(3/q,,)+(q,,I3), (15) where Va is weighted mean square, q: is the standard deviation of the weighted mean square, and tn is the standardized weighted mean square for person 11. W Taking the first derivative of Eq.(11) above with respect to 6.7, one gets a}. N ... . —=—Sy+zzm' n=1IN; 3:1, 000' k, 0.0, mil (16) 3661' nk=i N: where 813:2}:50 is the number of persons completing step j in u=lj=l testlet i. 275.». is the probability of person n completing at i=1 least j steps in testlet i, and N nu 22m is the number of persons expected to complete at '72 least j steps in testlet i. In other words, it is the expected value of $11. Symbolically, the expected value for step difficulty (dn) in testlet i is N nu E(dij)=227tm'k. (l7) nk=j Setting Eq. (16) to 0, and solving for 6:], we will get the estimate of testlet step parameter, dij. The standard error of do is M—l nu nu SE (du) =[ )2 Nr( 2P... -(21.P,,~,)’)]'”2 (18) r i=1 =1 L where N: is the number of persons with score r, M=2m,. . i=1 The formula for testlet fit is :.-=(v,!’3-1)(3/q,.)+(q,.I3), (19) where V1 is the weighted mean square, q: is the standard deviation of the weighted mean square, and ti is the standardized weighted mean square for testlet i. Detailed derivation of Eq. (19) is done subsequently in the Local Dependent Item Measure section. For the simplicity of this study, the testlets do not take response patterns into consideration, and students' raw scores on the items within a testlet are summed up to a single number-right score. 
Local Dependent Item Measure

To assess the local dependence effect, dichotomously scored items are first calibrated with the Rasch dichotomous model as individual items and then by the partial credit model as testlets. The difficulties obtained from both calibrations are compared for their estimated values, calibration errors, and item/testlet fits. The item fit statistics are calculated as follows (Wright & Masters, 1982):

observed response: $x_{ni}$;

expected value of $x_{ni}$: $E_{ni} = \sum_{k=0}^{m_i} k\,\pi_{nik}$, (20)

where

$\pi_{nik} = \exp\Big(\sum_{j=0}^{k}(\beta_n - \delta_{ij})\Big) \Big/ \Psi_{ni}$, (21)

and

$\Psi_{ni} = \sum_{k=0}^{m_i} \exp\Big(\sum_{j=0}^{k}(\beta_n - \delta_{ij})\Big)$; (22)

variance of $x_{ni}$: $W_{ni} = \sum_{k=0}^{m_i} (k - E_{ni})^2\,\pi_{nik}$; (23)

kurtosis of $x_{ni}$: $C_{ni} = \sum_{k=0}^{m_i} (k - E_{ni})^4\,\pi_{nik}$; (24)

score residual: $y_{ni} = x_{ni} - E_{ni}$; (25)

standardized residual: $z_{ni} = y_{ni} / W_{ni}^{1/2}$; (26)

standardized residual squared: $z_{ni}^2$; (27)

score residual squared: $y_{ni}^2 = W_{ni}\,z_{ni}^2$; (28)

unweighted mean square (the outfit statistic): $u_i = \sum_{n=1}^{N} z_{ni}^2 / N$, where N is the number of persons in the sample; (29)

weighted mean square: $v_i = \sum_{n=1}^{N} W_{ni} z_{ni}^2 \Big/ \sum_{n=1}^{N} W_{ni} = \sum_{n=1}^{N} y_{ni}^2 \Big/ \sum_{n=1}^{N} W_{ni}$; (30)

and finally, the standardized weighted mean square (the infit statistic): $t_i = (v_i^{1/3} - 1)(3/q_i) + (q_i/3)$, which has a mean of 0 and variance 1. (31)

$q_i$ is the SD of the weighted mean square, $v_i$. In the formula, it is

$$q_i = \left[\sum_{n=1}^{N}\big(C_{ni} - W_{ni}^2\big) \Big/ \Big(\sum_{n=1}^{N} W_{ni}\Big)^2\right]^{1/2}. \qquad (32)$$

Similarly, the person fit statistic can be obtained in this manner.

The information-weighted fit statistic ($v_i$) obtained from the computer program BIGSTEPS has an expected value of 1. Values substantially less than 1 indicate dependence in the data; values substantially greater than 1 indicate noise. More about the fit statistics will be discussed in Chapter 4.

Person Separation and Test Reliability

In classical testing theory, an observed variance is composed of two components. That is,

Observed variance ($\sigma_o^2$) = True variance ($\sigma_t^2$) + Error variance ($\sigma_e^2$),

and the reliability is obtained by the following:

$$\text{Reliability } (\rho) = \frac{\sigma_t^2}{\sigma_o^2}. \qquad (33)$$

One example of this kind of reliability is the coefficient α. A problem with classical reliability is that it depends on the population measured and on the measuring instrument. Because of this population dependence, one has to specify the instrument and the population it applies to whenever one speaks of reliability.

In IRT Rasch models, "true" variance is the "adjusted" variance (i.e., observed variance adjusted for measurement error). Error variance is a mean-square error (derived from the model) inflated by misfit to the model encountered in the data (Wright, 1996). Because the intention of most tests is to identify individual differences, indices of separation of persons on the ability continuum have been developed to see how well a particular test separates the persons in a particular sample. One such index is the person separation index ($G_p$), which is the number of statistically different performance strata that the test can identify in the sample. The index is the ratio of the adjusted SD ($SA_p = (\text{obs. } SD_p^2 - MSE_p)^{1/2}$) to the root mean square error ($RMSE_p$). In the formula,

$$G_p = \frac{SA_p}{RMSE_p}, \qquad (34)$$

where $SA_p$ is the sample SD adjusted for measurement error, $RMSE_p$ is the root mean square measurement error, and p indexes the persons, p = 1, ..., N. For example, a separation index of 3.5 means that, if samples like the one tested were repeatedly tested, the ability estimates on the ability continuum could be consistently separated into roughly 3 strata by the test. In other words, $G_p$ gives the sample standard deviation in standard error units. The person separation index provides an alternative way to examine the internal consistency of a test.
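As an illustration of Eq. (34), the following minimal Python sketch (hypothetical code, not from the study) computes the person separation index and the associated separation reliability from a vector of person measures and their standard errors; the variable names and simulated values are assumptions.

import numpy as np

def person_separation(measures, se):
    """measures: person ability estimates (logits); se: their standard errors."""
    obs_var = np.var(measures)            # observed variance of the measures
    msep = np.mean(se ** 2)               # mean square measurement error
    adj_var = max(obs_var - msep, 0.0)    # "true" (adjusted) variance
    G = np.sqrt(adj_var) / np.sqrt(msep)  # separation index, Eq. (34)
    R = adj_var / obs_var                 # separation reliability (cf. the
    return G, R                           # relations given just below)

measures = np.random.default_rng(1).normal(0.0, 1.0, 500)
se = np.full(500, 0.6)                    # constant SE, for illustration only
G, R = person_separation(measures, se)
# By construction R equals G**2 / (1 + G**2), the relationship developed below.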
Some consider it easier to interpret than the reliability coefficient. When Eq. (34) is squared, it becomes the ratio of the sample variance adjusted for measurement error to the mean of the sample measurement error variance:

$$G_p^2 = \frac{SA_p^2}{MSE_p}. \qquad (35)$$

Eqs. (34) and (35) imply that the larger the person separation, the smaller the measurement error and the more precise an estimate is. The reliability of person separation, then, is the ratio of the adjusted sample variance to the observed variance. Mathematically, it is

$$R_p = \frac{SA_p^2}{SD_p^2} = 1 - \frac{MSE_p}{SD_p^2}. \qquad (36)$$

This reliability is analogous to KR-20, Cronbach's α, and the generalizability coefficient in the sense of classical testing theory. The relationship between the reliability of person separation and the classical reliability (ρ) is

$$\text{reliability } (\rho) = \frac{G_p^2}{1 + G_p^2}, \qquad (37)$$

or

$$G_p = \sqrt{\frac{\rho}{1 - \rho}}. \qquad (38)$$

These indices are used here to examine Hypotheses 1(d) and 2(c).

Data Analysis

To test the different hypotheses in this study, three things are done with the data. First, items within each original testlet are scored twice, once as independent items and once as a testlet. For all the science tryout forms, the testlet items are located in the same positions. They are:

Original Testlet 1: 11, 12, 13, 14;
Original Testlet 2: 28, 29, 30, 31;
Original Testlet 3: 45, 46, 47, 48;
Original Testlet 4: 50, 51, 52, 53.

Second, an additional sixteen independent items in the same test form are randomly selected from the 30 independent MC items to form another four hypothetical testlets. The rationale for these random testlets is to see if there is a local dependence effect on the truly independent items when they are analyzed as a testlet. It is equivalent to running a concurrent validity study. One set of context-dependent items is analyzed in its original configuration, the other set of independent items from the same tryout form is analyzed in a hypothetical configuration, and the results of these two sets are compared in terms of testlet statistics and person estimates to see whether there is a dependence effect in the original testlets. If there is no significant difference between the two sets of estimates in person and/or item or testlet parameters, then one may infer that the null hypothesis of no local item dependence effect among context-dependent items within a testlet holds. These random testlets are first scored as individual items and then scored as testlets. The items composing the random testlets are truly randomly selected from the context-independent items in the same form. Since there are no items in common in any two forms, the same number of items is chosen for the simplicity of the analysis. The random testlets for Forms 20-29 are:

Random Testlet 1: 1, 8, 24, 38;
Random Testlet 2: 2, 9, 25, 40;
Random Testlet 3: 3, 18, 21, 41;
Random Testlet 4: 4, 20, 37, 43.

Third, the original testlets are broken up and reformed into 4 new testlets (similar to a Latin square design). The purpose is similar to that of the random testlets: examining local dependence effects with items from different original testlets. The items in these reformed testlets are scored twice, as in the original testlets. The reformed testlets for Forms 20-29 are:

Reformed Testlet 1: 11, 28, 45, 50;
Reformed Testlet 2: 12, 29, 46, 51;
Reformed Testlet 3: 13, 30, 47, 52;
Reformed Testlet 4: 14, 31, 48, 53.

According to the design, each kind of testlet configuration is analyzed twice, as sketched below.
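As a concrete illustration of the five data configurations just laid out, here is a minimal Python sketch (hypothetical code; the 0-based column indices mirror the 1-based item positions above, and the simulated response matrix is an assumption).

import numpy as np

original = [[10, 11, 12, 13], [27, 28, 29, 30],
            [44, 45, 46, 47], [49, 50, 51, 52]]
random_t = [[0, 7, 23, 37], [1, 8, 24, 39],
            [2, 17, 20, 40], [3, 19, 36, 42]]
reformed = [list(cols) for cols in zip(*original)]  # Latin-square-style regrouping

def number_right(X, groups):
    """Sum 0/1 item scores within each group to a polytomous score."""
    return np.column_stack([X[:, g].sum(axis=1) for g in groups])

X = np.random.default_rng(2).integers(0, 2, size=(1000, 60))
configs = {
    "context-dependent items": X[:, sum(original, [])],   # scored as items
    "independent items":       X[:, sum(random_t, [])],   # scored as items
    "original testlets":       number_right(X, original), # scored 0..4
    "random testlets":         number_right(X, random_t),
    "reformed testlets":       number_right(X, reformed),
}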
The first time, the items are analyzed as individual items by the dichotomous model, regardless of whether they are context-dependent or independent. The second time, testlet scores are calculated for each testlet and then analyzed with the partial credit model. The configurations of the different testlets and other sets of items are demonstrated below in Table 3. By the unidimensionality property of IRT, testlets are expected to correlate with each other as little as possible at a given level on the ability continuum. Therefore, it is assumed that when a testlet is used as the unit of analysis, the correlations between testlets at a given ability level should be small.

Table 3. Data Configurations of Science Items

Data Configuration        Original        Random          Reformed        Context-dependent   Independent
                          Testlets        Testlets        Testlets        Items               Items
Description               Testlets of     Testlets of     Testlets of     Items used to       Independent items
                          context-        independent     items from      form the original/  from the same tryout
                          dependent items items from the  different       reformed testlets   form (items used to
                          as designed     same tryout     original                            form the random
                                          form            testlets                            testlets)
# of Testlets             4               4               4               --                  --
# of Items in a Testlet   4               4               4               --                  --
Total # of Items          16              16              16              16                  16
Dichotomous Model         --              --              --              Yes                 Yes
Partial Credit Model      Yes             Yes             Yes             --                  --

BIGSTEPS Computer Software

The computer program used in the parameter estimation and data analysis is BIGSTEPS (Linacre & Wright, 1995, version 2.6). The program is specifically designed to facilitate item analysis and scoring of psychological tests within the framework of IRT Rasch models. The program can analyze scores on both dichotomous and polytomous scales. Items may be grouped together or divided into subsets of one or more items that use the same scoring scale.

According to the program's user's guide, person measures and item calibrations are reported in logits. "A logit (log-odds unit) is a unit of interval measurement which is well-defined within the context of a single homogeneous test" (Linacre & Wright, 1995, p. 89). Mathematically, the logit is defined by

$$\lambda = \ln\!\left(\frac{\pi}{1-\pi}\right), \qquad (39)$$

where π is the probability given by the modeled process. This is the unit with which Rasch measures can be compared on a uniform standard scale.

Summary

Research data and the experimental methodology were described in this chapter. The first tryout data from the newly developed Michigan High School Proficiency Test in Science were used. The test was designed to test students' abilities in using, reflecting on, and constructing scientific knowledge. For each tryout form, only the context-dependent MC items within the testlets and an additional 16 randomly selected independent items were used in this study, since the testlet effect is the focus of the study. The constructed-response questions were not included in the research because they were hand-scored by different scoring rubrics, which may introduce interrater errors and, for the same rater, errors over time, and would make the study too complex to handle.

Cluster sampling in combination with stratified sampling was used in the tryout. Schools were used as the sampling unit. The sampling frame included all Michigan 11th graders in public schools. There were 10,074 students from 72 schools who actually took the tryout. Ten tryout forms were spirally bundled into 6 groups, 3 or 4 forms in each group. Each tryout school received only one group of forms.
No items overlapped between forms, but each form was administered to two randomly equivalent groups of 11th grade students. All the MC items were scored dichotomously. Testlet scores were computed as number-right scores within a testlet. The Rasch dichotomous model was used when items were analyzed independently, and the Rasch partial credit model was used for the testlet analysis. All context-dependent items were first analyzed independently and then as testlets. Sixteen additional randomly selected MC items were formed into 4 random testlets and were analyzed the same way as the original testlets. The original testlets were also reconfigured into 4 reformed testlets and were analyzed accordingly. Different statistics to measure item correlation, test reliability, person and item/testlet fit, and measurement/calibration errors were described for these analyses. It is expected that the results of the estimates will provide information about whether local item dependence has any impact on the parameter estimation. The computer program BIGSTEPS was used to run the analyses. The software was designed specifically for data analysis with the Rasch models. The estimates are reported in logits, a unit of interval measurement that allows measures to be compared on a uniform standard scale.

CHAPTER 4
RESULTS AND DISCUSSIONS

As described in Chapter 3, to carry out the data analysis plan, data were organized and analyzed in five different ways. Four original testlets, 4 random testlets, 4 reformed testlets, 16 context-dependent items, and 16 independent MC items that formed the random testlets were treated as though they were five different tests in each form. In essence, each form has 32 items (16 context-dependent items and 16 independent items) in total used in the analyses. According to the plan, different statistics were computed for the data. Phi correlation coefficients (φ), testlet measures, person separation indices, and person ability measures were all computed. In addition, a one-way ANOVA and average category measures were also calculated to provide an overall data description for each step in a testlet. Table 4 below summarizes these analyses and relates them to their respective hypotheses. Results of these analyses are presented in the following sections. Discussion is often mixed with the reporting of results in order not to lose continuity. The chapter concludes with a summary.

Table 4. Match-up of the Analyses with Their Corresponding Hypotheses

Analysis                     Research Hypothesis
φ Coefficient                H1(a), H3(a)
Testlet Measure              H1(b), H2(a)
One-way ANOVA                Verification of fit statistics obtained from the partial credit model for local dependence
Person Ability Measure       H1(c), H2(b), H3(b)
Person Separation Indices    H1(d), H2(c)
Average Category Measure     Overall data description for each category (i.e., step) in a testlet

Phi Correlation Coefficient Results

The phi correlation coefficient (φ) is usually used to examine the linear relationship between two distinct dichotomously scored variables (e.g., male/female, smoking/non-smoking). The multiple-choice items in this study are dichotomous, so the φ coefficient is appropriate. By Hypotheses 1(a) and 3(a), if the context-dependent items are generated from the same context, the average within-context item correlations should be larger than the average across-context item correlations.
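Since the Pearson correlation of two dichotomous (0/1) variables equals the phi coefficient, the within-testlet mean φ can be computed directly. The following minimal Python sketch (hypothetical code, not the study's own) illustrates the calculation; the simulated response matrix and column indices are assumptions.

import numpy as np

def mean_within_phi(X, cols):
    """Average phi over all item pairs within one testlet.
    X: (n_persons, n_items) array of 0/1 scores; cols: item column indices."""
    r = np.corrcoef(X[:, cols], rowvar=False)   # pairwise phi matrix
    upper = r[np.triu_indices(len(cols), k=1)]  # unique pairs only
    return upper.mean()

X = np.random.default_rng(3).integers(0, 2, size=(1000, 8))
print(mean_within_phi(X, [0, 1, 2, 3]))  # hypothetical 4-item testlet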
To test these hypotheses, φ coefficients were calculated for all the original, random, and reformed testlets for all tryout forms. The mean coefficients for each testlet, by form, are listed in Table 5.

Table 5. Mean φ Coefficients for Items within Different Testlets, by Form

Form   Type       Tlet.1   Tlet.2   Tlet.3   Tlet.4   Mean
20     Original   .1203    .1169    .1731    .1180    .1321
       Random     .1761    .0790    .1298    .0725    .1144
       Reformed   .1799    .0681    .1258    .0644    .1096
21     Original   .1473    .0500    .3433    .1944    .1838
       Random     .0437    .0364    .0250    .0526    .0394
       Reformed   .1451    .1168    .1243    .0733    .1149
22     Original   .1838    .1292    .0747    .0934    .1203
       Random     .1287    .0300    .0492    .1484    .0891
       Reformed   .1378    .0673    .0500    .1890    .1110
23     Original   .1440    .1778    .0656    .2758    .1658
       Random     .0900    .0814    .1192    .0433    .0835
       Reformed   .1074    .0975    .1513    .0991    .1138
24     Original   .1126    .1376    .0620    .1714    .1209
       Random     .0866    .1175    .1170    .1474    .1171
       Reformed   .0431    .0835    .1247    .2429    .1236
25     Original   .1579    .1049    .1155    .0423    .1052
       Random     .1269    .0772    .1027    .0674    .0936
       Reformed   .0356    .1038    .0793    .0622    .0702
26     Original   .1421    .1243    .1824    .0980    .1367
       Random     .1455    .0972    .0941    .1209    .1144
       Reformed   .0758    .0422    .1516    .1809    .1126
27     Original   .1314    .0760    .2954    .1043    .1518
       Random     .0771    .0960    .1296    .0987    .1004
       Reformed   .0133    .0285    .1636    .2496    .1138
28     Original   .2306    .0318    .2487    .0970    .1520
       Random     .1510    .0585    .1215    .0730    .1010
       Reformed   .0366    .1230    .1044    .1275    .0979
29     Original   .2059    .1589    .0914    .1859    .1605
       Random     .1099    .0654    .0689    .1616    .1015
       Reformed   .1162    .1498    .1340    .0929    .1232
Mean   Original   .1576    .1107    .1652    .1381    .1429
       Random     .1136    .0739    .0957    .0986    .0955
       Reformed   .0891    .0881    .1209    .1382    .1091

As shown in the table, out of the 40 original testlets, only one (Testlet 3, Form 21) had an average φ coefficient above .30, which is relatively high for item correlation. Five testlets had mean coefficients between .20 and .30, more than half of the testlets (23) obtained moderate mean coefficients between .10 and .20, and the remaining 11 testlets had mean coefficients less than .10. For the random testlets, twenty-three had mean φ coefficients less than .10, seventeen had mean coefficients between .10 and .20, and no testlets had mean coefficients greater than .20. For the reformed testlets, only Testlets 4 in Forms 24 and 27 had mean φ coefficients above .20 (φ=.2429 and .2496, respectively). Half of them (20) were between .10 and .20, and the remaining eighteen were under .10. As the data in Table 5 indicate, thirty-one of the original testlets and 27 of the reformed testlets had mean φ coefficients larger than those of the random testlets. The summary is in Table 6 below.

Table 6. Distribution of Mean φ Coefficients by Testlet Type

Mean φ Coef.   Original Testlets   Random Testlets   Reformed Testlets
> .30          1                   0                 0
.21 - .30      5                   0                 2
.11 - .20      23                  17                20
.00 - .10      11                  23                18
Total          40                  40                40

Marginal mean coefficients for all forms by testlet (column means) and for all testlets by form (row means) were also calculated. For each marginal value, the mean coefficient for the original testlets is higher than that of either the random or the reformed testlets, except in Form 24, where the reformed testlet mean is slightly, but not significantly, higher than the original testlet mean. Between the reformed and random testlet means, coefficient values vary irregularly. In some cases, random testlets have higher mean coefficients; other times, vice versa.
This outcome is not surprising, however, because the contents of the reformed testlets are no longer related to the same context, and they are almost equivalent to the random testlets in the sense of testlet construction. Overall, the results strongly suggest that context-dependent items do have higher correlations within-context than across-context or independent items do, which implies that local dependence may exist in some original testlets.

In summary, for the original testlets (ref. Hypothesis 1(a)), the results showed that, if the context-dependent items were generated from the same context, the average within-context item correlations were larger than the average across-context item correlations for a majority (29) of the original testlets. On the other hand, eleven reformed testlets (ref. Hypothesis 3(a)) had average within-context phi correlations larger than those of their corresponding original testlets. The remaining reformed testlets obtained smaller average within-context correlations than their corresponding original testlets.

Testlet Measures Results

This section discusses the results for Hypotheses 1(b) and 2(a). Hypothesis 1(b) states that when context-dependent items are analyzed as a testlet by the Rasch partial credit model, the testlet calibration produces a better fit statistic than when these items are analyzed individually by the Rasch dichotomous model. Hypothesis 2(a) states that if the items are independent, then the testlet fit statistics should be the same as the item fit statistics. One rationale for using testlets as the unit of analysis is to determine whether the calibration errors are smaller when treating the context-dependent items in a testlet as a whole than when treating these items individually (i.e., ignoring the context effect), as well as determining whether such scaling produces better fits of testlet and/or person estimates.

The User's Guide to BIGSTEPS (Linacre & Wright, 1995) states that "INFIT is an information-weighted fit statistic, which is more sensitive to unexpected behavior affecting responses to items near the person's ability," and "MNSQ is the mean-square infit statistic with expectation 1. Values substantially below 1 indicate dependence in your data; values substantially above 1 indicate noise" (p. 82).

In the same manual, it is explained that when the value of the infit mean square (MNSQ) statistic is, say, less than .8, or the standardized MNSQ is less than -2 SDs, there are redundant items, and the test developers need to investigate whether the test has similar items, whether one item answers another, or whether an item correlates with other variables; that is, there are local dependence effects. When the infit MNSQ is larger than, say, 1.2, or its standardized MNSQ is greater than +2 SDs, it may mean different things, such as biased items, qualitatively different items, or curriculum interaction. In these cases, one needs to investigate the areas related to the problems (Linacre & Wright, 1995, p. 95).

By Eq. (30), the infit MNSQ is the sum of squares of the differences between the observed and expected scores divided by the sum of variances on item i over N persons. In the formula,

$$v_i = \sum_{n=1}^{N} W_{ni} z_{ni}^2 \Big/ \sum_{n=1}^{N} W_{ni} = \sum_{n=1}^{N} y_{ni}^2 \Big/ \sum_{n=1}^{N} W_{ni} = \sum_{n=1}^{N} (x_{ni} - E_{ni})^2 \Big/ \sum_{n=1}^{N} W_{ni},$$

where $v_i$ is the weighted mean square; i = 1, 2, ..., L indexes the items; n = 1, 2, ..., N indexes the persons; $W_{ni} = \sum_{k=0}^{m_i}(k - E_{ni})^2 \pi_{nik}$ is the variance of the observed score $x_{ni}$; k = 1, 2, ..., $m_i$ indexes the item steps; and $y_{ni} = x_{ni} - E_{ni}$
is the residual; $E_{ni} = \sum_{k=0}^{m_i} k\,\pi_{nik}$ is the expected value of $x_{ni}$; $z_{ni} = y_{ni}/W_{ni}^{1/2}$ is the standardized residual; and $\pi_{nik} = \exp\big(\sum_{j=0}^{k}(\beta_n - \delta_{ij})\big)/\Psi_{ni}$ is the probability of person n responding in step k to item i.

With the Rasch partial credit model, the smaller the discrepancy between the observed score and the expected score, the larger the variance of $x_{ni}$. In the infit MNSQ formula, this means smaller residuals ($y_{ni} = x_{ni} - E_{ni}$). In other words, the formula will have a smaller numerator and a bigger denominator. As a result, $v_i$ will be less than 1 when the numerator is smaller than the denominator. Usually we expect an orderly pattern of responses; in other words, we want to see the observed value close to the expected value. However, when responses to an item are excessively orderly, that is, when the observed scores are nearly or exactly identical to the expected scores, we may begin to suspect potential local dependence effects (Wright & Masters, 1982, p. 104). This would happen when problems like those mentioned earlier occur. An example of possible dependence is presented later in this section.

Table 7 (see Appendix E) displays the results of the testlet fit statistics for the original testlets and the item fit statistics for the context-dependent items that form these testlets. In Table 7, seventeen out of 40 original testlets have 1 to 4 misfit items within a context when the items are analyzed individually, but when they are analyzed as testlets, they produce a very good testlet fit. Consider Original Testlet 3 in Form 22 and Original Testlet 4 in Form 27, for example: when the items in those testlets are analyzed as individual items, all of the context-dependent items have misfit values beyond ±2 SDs (all 4 items have the "*" sign in col. 6). However, the items produce a proper testlet fit when they are analyzed as testlets (infit=1.03 for Testlet 3 in Form 22 and infit=.95 for Testlet 4 in Form 27). In addition, the standard errors of the estimates for the original testlets are uniformly .04 logit, while the standard errors for the context-dependent items are larger, between .07 and .09 logit. These results mean that, for those context-dependent items, the testlet-based analyses are statistically more appropriate than the item-based analyses for examining students' abilities in the areas of interest.

For another 20 testlets, each also has 1 to 4 misfit context-dependent items when analyzed individually, but the testlet-based analysis still results in misfit calibrations (indicated by the "*" sign in the table). Thirteen of these testlets have infit values substantially less than 1 (i.e., standardized infit < -2 SDs), implying that there may be local dependence effects in the items of those testlets or in the testlets themselves. This finding is a little surprising because these testlets are supposed to be independent of each other by design or by model control. It seems that some factor other than local dependence is affecting the item and testlet calibration. Another 7 testlets have infit values substantially greater than 1 (i.e., standardized infit > +2 SDs). For instance, Original Testlet 3 in Form 23 has misfit values for all its context-dependent items, and the resulting infit MNSQ for the testlet (1.22) shows noise in the data this time. This means students may have performed unexpectedly, away from their expected scores. This outcome suggests that test developers need to look at the testlet construction and the content or quality of the items.
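To make the mechanics of Eqs. (20)-(31) concrete, here is a minimal Python sketch (hypothetical code, assuming person measures and testlet step difficulties have already been estimated; all values shown are illustrative) that computes the infit and outfit mean squares for one testlet under the partial credit model.

import numpy as np

def pcm_probs(beta, delta):
    """Category probabilities for one person on one testlet (Eqs. 21-22).
    delta: step difficulties d_i1..d_im; returns probabilities for k = 0..m."""
    cum = np.concatenate([[0.0], np.cumsum(beta - delta)])  # sum_{j<=k}(beta - d_j)
    e = np.exp(cum)
    return e / e.sum()

def infit_outfit(x, beta, delta):
    """x: (N,) observed testlet scores 0..m; beta: (N,) person measures."""
    m = len(delta)
    k = np.arange(m + 1)
    P = np.array([pcm_probs(b, delta) for b in beta])  # (N, m+1) probabilities
    E = P @ k                                          # expected scores, Eq. (20)
    W = P @ (k ** 2) - E ** 2                          # score variances, Eq. (23)
    z2 = (x - E) ** 2 / W                              # squared std. residuals
    outfit = z2.mean()                                 # unweighted MNSQ, Eq. (29)
    infit = ((x - E) ** 2).sum() / W.sum()             # weighted MNSQ, Eq. (30)
    return infit, outfit

rng = np.random.default_rng(4)
beta = rng.normal(0, 1, 1000)
delta = np.array([-0.5, 0.0, 0.4, 0.9])   # hypothetical step difficulties
# simulate responses from the model itself, so both statistics should be near 1
x = np.array([rng.choice(5, p=pcm_probs(b, delta)) for b in beta])
print(infit_outfit(x, beta, delta))

Because the simulated data are generated from the model, both mean squares hover near their expectation of 1; dependence in real data would push the infit below 1, noise above it.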
By the definition of the fit statistics, Testlet 3 in Form 23 demonstrates one extreme (i.e., $v_i$ greater than 1). The testlet is an earth science problem which requires students to know the relationships between the ocean, coastal plateau, and mountain range. It is a relatively difficult testlet (difficulty measure = .98 logit). If a student were not clear about these relationships, the person would have a small probability of answering an item correctly. The items themselves are well written, with no signs of bias or trickery, but for the two more difficult items (item #46's b=1.39 and item #48's b=1.19 logits), the percentage of students choosing a wrong option is larger than the percentage choosing the right one (see Table 8 for detailed percentages). For item #46, the correct answer is option A. The percentage of students choosing A was only 28%, compared with 35% who chose the wrong option, D. The situation is similar for item #48: the percentage of students choosing the right answer, C, was 31%, while the percentage choosing the wrong answer, D, was 35%. In addition, the average correlation among all 4 items is very small (r=.0656).

Table 8. Students' Responses to Testlet 3, Form 23

Item #   Option A   Option B   Option C   Option D
45       9.6%       13.6%      55.0%✓     18.6%
46       27.9%✓     22.5%      11.3%      35.0%
47       10.4%      15.5%      34.2%      36.6%✓
48       9.8%       20.8%      30.8%✓     35.4%

✓ indicates the correct answer.

Large infit MNSQs (values substantially above 1.0) indicate large discrepancies between the observed scores and expected scores, implying students did not perform at their ability levels. These large discrepancies are considered "noise" in the item analysis. Usually one would suspect the item quality in this kind of situation. In this case, however, one may have to examine whether there is an interaction of science dimensions within the testlet to seek possible reasons for the poor performance. Nevertheless, "noise" in the item analysis does not have any relationship to local dependence. It is presented here to demonstrate the other side of the infit statistic (i.e., values greater than 1.0). It also shows that large discrepancies between observed scores and expected scores do happen even though items are from the same context.

Testlet 4 in Form 23 provides an example of possible dependence. The testlet presented a diagram of the movement of carbon in the atmosphere and on the surface of Earth, and asked students to answer 4 questions based on the diagram. It was a relatively easy testlet (difficulty measure = -.82 logit), and most students chose the right answers to the items (see Table 9 for detailed percentages). Looking at the item statistics, it seems that the distractors for three of the four items were not very effective because they attracted few students. By examining the item contents closely, we can see that if a student can answer item #52 (a concept item) correctly, he or she can answer items #50, #51, and #53 fairly easily. Consequently, the observed and expected score differences will be very small.

Table 9. Students' Responses to Testlet 4, Form 23

Item #   Option A   Option B   Option C   Option D
50       6.5%       79.5%✓     7.0%       3.1%
51       11.1%      11.0%      28.0%      47.3%✓
52       10.8%      66.0%✓     8.6%       11.1%
53       7.3%       79.9%✓     4.5%       4.7%

✓ indicates the correct answer.

As described in this section, small residuals imply possible local dependence. The average item correlation of this testlet (r=.2750) helps support the suspicion. This correlation is very high in this test, compared with the grand average correlation (r=.1429).
When a situation like this holds, the infit statistic, $v_i$, will be very small (because the residuals, $y_{ni}$, will be very small). For this testlet in particular, the infit MNSQ is .76, which indicates that possible local dependence may exist among the items.

In summary, the statistics in Table 7 show that, for 17 of the 40 original testlets, some of the context-dependent items were problematic when analyzed individually but produced good fit when analyzed as testlets. This provides strong evidence that the partial credit model is more appropriate for these items. However, for another 20 original testlets, each also had 1 to 4 misfit items when the items were analyzed individually, and the final testlet fit statistics still showed misfit. Thirteen of these 20 testlets indicate possible local dependence, which suggests further investigation of the individual items in these testlets regarding their contents, item construction, or item quality. Across the forms, there are only 2 original testlets (Testlet 1 in Form 21 and Testlet 2 in Form 23) where the fit statistics are within the normal range regardless of which scoring model is used. Therefore, it would not matter whether the items in these testlets are analyzed independently or as testlets.

The strangest case is Testlet 3 in Form 26. All 4 of its items fit perfectly when analyzed individually, but the testlet fit is not acceptable (infit MNSQ=.88, less than -2 SDs). The reason for this outcome is unknown to the author. The only inference that can be made is that these items may be truly independent and should be analyzed independently, even though they are from the same context.

An analysis was also run for the random testlets and the independent items that form the random testlets (see Table 10 in Appendix E). The results are similar to those of the original testlets. Out of 40 random testlets, 15 had from 1 to 4 misfit items when these items were analyzed as individual items, but they obtained very proper fit when analyzed as testlets. Another 54 items, distributed across 24 random testlets, obtained misfit results no matter which model was used. Of these 24 misfit testlets, 16 show local dependence and 7 indicate noise in their data. Again, 2 random testlets (Testlet 4 in Form 21 and Testlet 3 in Form 29) obtained misfit when analyzed as testlets but had very good fit for each item when analyzed as independent items. In addition, no random testlet shows proper fit under both scoring models, which ideally should be the case for these developer-designed independent items.

The outcome of misfit items converting into proper-fit testlets that are related to no specific context is interesting and, at the same time, a little disturbing. Theoretically, the developer-designed independent items should behave as statistically independent. However, the results for these 15 misfit-items-to-fit-testlets show that the items are actually better off when analyzed as testlets. One needs to see whether there are local dependence effects in these items or whether the results arise merely from random error. The results of the random testlet analyses indicate that these labeled "independent" items may not be really statistically independent, even though they were designed to be so. Some items may be related to each other or to a common factor statistically, and more study is needed on these items.
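A small helper makes the reading of these fit values explicit. This sketch (hypothetical code) simply encodes the rules of thumb quoted from the BIGSTEPS manual earlier in this chapter; the .8 and 1.2 cutoffs are the illustrative values mentioned there, not fixed rules.

def classify_infit(mnsq, low=0.8, high=1.2):
    """Rough screening of an infit MNSQ value per the manual's guidance."""
    if mnsq < low:
        return "possible local dependence / redundancy"
    if mnsq > high:
        return "noise (unexpected responses)"
    return "acceptable fit"

for v in (0.76, 0.95, 1.22):   # e.g., values discussed above
    print(v, "->", classify_infit(v))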
One difference between the random testlets and the original testlets in the fit statistic analyses is that the range of the independent item standard errors (.07-.14) is larger than that of the context-dependent items in the original testlets (.07-.09). This suggests that student performance varies more for these independent items than for those context-dependent items, which further suggests that the context may have an impact on student ability estimation as well as on testlet calibration.

Regarding the hypotheses tested in this section, it may be concluded that for the context-dependent items (ref. H1(b)), mixed results have been obtained. More than 40% (17) of the original testlets demonstrate a better fit when they are analyzed as testlets. Half (20) of the original testlets show misfit under both models. Only 5% (2) of them obtain good fit both as individual items and as testlets. For the independent items (ref. H2(a)), the testlet fit statistics are not the same as the item fit statistics. Sixty items in 15 random testlets obtained a better fit when they were analyzed as (hypothetical) testlets. Another 34% of the items (54) show misfit whether these items are analyzed as testlets or as items. The results contradict the intent of the test development, in that these items were supposed to be locally independent. It is suspected that there may be an implicit factor affecting item calibration. For Hypothesis 3(b), the person fit statistics estimated by the reformed testlets are not significantly different from the person fit statistics estimated by the original testlets.

Verification of Local Dependence Effects

One way to verify whether the context-dependent items demonstrate dependence on each other when they are analyzed individually is to first check the variance homoscedasticity of the item fit statistics and then conduct a one-way ANOVA to compare the means of the fit statistics regressed on testlets. The fit statistic discussed in the last section is a weighted mean square with degrees of freedom equal to the number of students responding to an item minus 1. In this study, the degrees of freedom are relatively large for all forms since the test is large-scale. Consequently, the null hypothesis of local item independence within an original testlet would be easily rejected even if the dependence effect were very small.

An alternative is to conduct a one-way ANOVA to verify whether the item fit statistics obtained by the Rasch partial credit model truly indicate local dependence between context-dependent items within a testlet. In this ANOVA, the natural log of the infit statistic is the outcome variable and the testlet is the classification variable. If the confidence interval (CI) of its estimate includes 0 (because the expected value of the infit is 1, so ln(E(infit)) should be 0), it can be inferred that there is not enough evidence to show that items within a testlet are dependent.

Under normality and random sampling assumptions, the test statistic for a population variance equal to a pre-determined value is

$$\frac{\nu s^2}{\sigma^2} \sim \chi^2_{\nu}, \qquad (40)$$

where ν, equal to n-1, is the degrees of freedom of the chi-square distribution, n is the number of examinees responding to the item, and $s^2$ is some mean square, equal to SS/ν, where SS is a sum of squares. (In this study, $s^2$ is the weighted mean square of a context-dependent item.) Thus,

$$E(s^2) = \sigma^2, \qquad (41)$$

and

$$\mathrm{var}(s^2) = \frac{2\sigma^4}{\nu}. \qquad (42)$$

Further, if we take the natural log of $s^2$, we get

$$E[\ln(s^2)] = \ln(\sigma^2), \qquad (43)$$

and

$$\mathrm{var}[\ln(s^2)] = \frac{2}{\nu}. \qquad (44)$$
Consequently, because the term $\sigma^2$ is "logged out," if the degrees of freedom (df) for all the context-dependent items are the same, then the comparison between the infit statistics will not be biased. Otherwise, some adjustment may be needed. Table 11 in Appendix E lists the df's for all context-dependent items. Values in Table 11 show that the majority of discrepancies between the highest and lowest df's within a testlet are between 1 and 4 out of about 1,000 students. Two testlets (Testlet 4 in Forms 21 and 28) have somewhat larger differences in df's: 9 for Form 21 and 14 for Form 28, respectively. Table 12 (see Appendix E) lists all the discrepancies in df's. We may assume that the small differences in df within a testlet are negligible because the infit statistic is a weighted mean square (i.e., variance is taken into account) and the sample size is large (1,000 or so).

A one-way ANOVA was then conducted for each form. The results are shown in Table 13 (see Appendix E). The graph of confidence intervals (CI) is displayed in Figure 5 (see Appendix E). As stated earlier, the expected value of the infit statistic is 1 and its natural log is 0. It can be seen from Figure 5 that 35 out of 40 testlet statistics include 0 in their CIs across the forms. Two testlets (Testlet 3 in Form 23 and Testlet 4 in Form 25) have values above 0 (indicating noise), and three testlets (Testlet 4 in Form 23, Testlet 3 in Form 27, and Testlet 1 in Form 28) have values below 0 (indicating local dependence). The omnibus F statistics in Table 13 help support this evidence. Of the ten forms, 7 have nonsignificant F tests, indicating that all their testlet CIs may include 0 and their infit statistics are within the normal range. Forms 21, 23, and 28 have significant F tests, implying that some of their testlets may have misfit statistics. The large SDs for some testlets in the table also show that these testlets would have wide confidence intervals. Figure 5 displays the outcome graphically. Figure 6 shows the point estimates of ln(infit MNSQ) for all testlets. The majority (31) of the estimates fall between -.05 and +.05, very close to 0, which provides evidence to support that the testlet-based analysis produces appropriate fit statistics for the majority (30) of the original testlets in this test when a CI is built for each testlet.

Mean Person Ability Measures Results

It is acknowledged that the real purpose of any data analysis method in education is to measure person abilities as precisely as possible. Hypothesis 1(c) in Chapter 3 stated that when the context-dependent items are analyzed as the original testlets, the person measure will have a better fit than when these items are analyzed individually. Hypothesis 2(b) proposed that since the independent items are not linked to a particular context, the person fit statistic will stay the same regardless of whether the items are analyzed individually or as testlets. For Hypothesis 3(b), because the reformed testlets are not context specific, it is hypothesized that the person fit statistics will not be as good as those of the original testlets.

Table 14 in Appendix E presents the results for mean person ability measures for the different data configurations. In the table, the first column is the data configuration. The second column is the mean of the estimated person measures for the examinees in each data configuration in each tryout form. The estimates are in logits.
For most forms, the original testlets have slightly lower mean person measures than the context-dependent items do, except in Form 26. In addition, the values vary between -.50 and .50 logits, right around the midpoint of 0 on the ability continuum. Only the independent-item data configuration for Forms 24 and 26 and the random testlets in Forms 24, 26, and 27 have mean measures greater than .50 logit. Most of the time, these measures do not differ much for most forms no matter how the context-dependent items are analyzed: individually or as testlets.

Column 3 is the infit mean square (MNSQ) for the mean person measure. It is the average of the infit mean squares associated with the responses of the sample, and it has an expected value of 1.0. Values in Column 3 show that, regardless of the type of data configuration, no infit MNSQ statistic has a value substantially below 1.0. The lowest value is .92 and the highest is 1.0, which indicates that on average there is not enough evidence of unexpected behavior affecting responses to items or testlets near students' ability levels.

Outfit in Column 4 is an outlier-sensitive fit statistic. Its MNSQ is the mean-square outfit statistic with an expectation of 1.0. As with the infit statistic, values substantially less than 1.0 indicate dependency, while values substantially greater than 1.0 indicate the presence of unexpected outliers. In this sample, the outfit MNSQ statistics range from .94 to 1.10, which indicates that the data fit the model relatively well.

One thing that has to be explained here is the phrase "data fit the model." Usually in statistical analyses, researchers test whether a model fits the data, because the model is designed to imitate the data and has to be faithful to the data as much as possible; otherwise, another model is used. The Rasch model used here, however, is not designed to fit any data. Instead, it is developed to define measurement. As Wright (1992) pointed out: "The Rasch model is a statement, a specification of the requirements of measurement - the kind of statement that appears in Edward Thorndike's work, in Thurstone's work, in Guttman's work" (p. 197). Therefore, "... the Rasch model is theory centered: data must fit, else get better data" (p. 200). As a result, the phrase "data fit the model" is used in this study.

In summary, regarding the hypotheses discussed in this section, the conclusions are the following. For the context-dependent items, there is no significant difference in person fit whether the items are analyzed individually or as testlets (ref. H1(c)). For the independent items, the person fit statistics stay the same regardless of which model is used (ref. H2(b)). For the reformed testlets, even though the testlets are not context-specific, they nevertheless still produce person fit as proper as that of the original testlets (ref. H3(b)).

Person Separation Indices Results

It is hypothesized (Hypotheses 1(d) and 2(c)) in this study that, when items are context-dependent, they will produce smaller measurement errors when they are analyzed as testlets than when they are analyzed as individual items; otherwise, if the items are independent, it does not matter which scoring model is used. In this section, person separation indices are examined to test these hypotheses. In addition, the person separation ratio index also provides an alternative way of examining the reliabilities of the different data configurations.
In Table 15, RMSE is the root mean square standard error computed over the persons or over the items. The computer program BIGSTEPS computes two kinds of RMSE: model RMSE and real RMSE. Model RMSE is computed on the assumption that the data fit the model and that all misfit in the data is merely a reflection of the stochastic nature of the model. Real RMSE (col. 3) is computed over the persons or items on the basis that misfit in the data is due to departures in the data from model specifications (Linacre & Wright, 1995). Columns 4 (adjusted standard deviation) and 5 (separation ratio) were described earlier in Chapter 3. By Eq. (34), Column 5 is equal to Column 4 divided by Column 3.

Values in Table 15 show that, regardless of item configuration, all but 3 person separation ratios range from 1.00 to 1.60. Recall that testlets are much larger units than items are and, more importantly, they take any local dependence effect into account. When a test consisting of larger units, such as the testlets here, obtains separation ratios similar to those of a test consisting of smaller units, such as single items, one can infer that the testlet-based analysis produces better fit statistics for person estimation than the item-based analysis does, because the former has relatively smaller measurement errors.

Table 16 lists the reliabilities of person separation for the different data configurations. It can be seen that, across all tryout forms, the mean reliability of person separation for the original testlets was .62, while the results for the other types were .66 for the random testlets, .68 for the reformed testlets, .63 for the context-dependent items, and .60 for the independent items. The reliability of person separation for the original testlets was very competitive with that of the context-dependent items, considering that the latter ignores the within-testlet structure, so its real reliabilities may be only a proportion of the values appearing in the table. The results imply that, for the items in these forms, using the original testlet configuration would give at least as good a reliability estimate as analyzing the context-dependent items individually.

Therefore, for Hypothesis 1(d), it can be inferred that when items are context-dependent, the person separation ratios are not statistically different whether the items are analyzed as testlets or as individual items. When the items are independent (ref. H2(c)), the reliability of the person separation ratio is the same for both the testlet-based analysis and the item-based analysis. Overall, the testlet-based analysis indicates an implicitly higher test reliability than the item-based analysis does, because the former takes local item dependence effects into account when they are present in the data.

Table 16. Reliabilities of Person Separation for Different Data Configurations

Form   Original   Random     Reformed   Context-dep.   Indep.
       Testlets   Testlets   Testlets   Items          Items
20     .60        .70        .65        .62            .62
21     .62        .53        .72        .67            .43
22     .65        .67        .68        .64            .61
23     .61        .68        .69        .63            .59
24     .65        .72        .66        .63            .66
25     .53        .63        .60        .53            .57
26     .63        .66        .69        .65            .63
27     .62        .69        .71        .63            .61
28     .59        .67        .67        .61            .62
29     .65        .66        .70        .67            .61
Mean   .62        .66        .68        .63            .60

Average Category Measure Results

In partial credit models, when observations are ordinal, it is implicitly assumed that the higher the category level, the greater the latent ability demonstrated. Consequently, "more able" students would perform better on average and achieve higher scores than "less able" students.
The average category measures presented in this section are not aimed at a particular hypothesis; rather, they provide descriptive, not inferential, statistics for the sample under study. The average category measure estimates the average ability of all students who reach a particular score category of a testlet. The purpose of this index is to investigate whether each category is properly scored as intended. It is expected that the average category measure increases along the variable in the correct rank order: the higher the category number, the more latent ability is evidenced. In this study, the total number of categories in a testlet is the maximum number of score points of the testlet, including 0. For example, a score of 3 points means a student is in category 3 of the testlet.

Table 17 in Appendix E presents the results of the average category measures (also called average measures for simplicity) for the original testlets and the random testlets, and their infit statistics for each category. Values of the average measures in Table 17 show that students' average abilities at the different score categories for the original testlets are similar to those for the random testlets across all 10 tryout forms, with most of them ranging within ±2.0 logits. The next column of the same table contains the infit MNSQ, the ratio of the observed residual sum of squares due to ratings at a specific score (e.g., $x_{ni} = x$) over the expected residual sum of squares. When the data fit the model, the modeled variance approximates the residual sum of squares; differences are diagnostic of misfit. This infit MNSQ summarizes the agreement of responses within each category. It has an expectation of 1.0 and can range from 0 to ∞. Values substantially greater than 1.0 indicate improbable category use (e.g., some students obtain scores that do not match their abilities). Values substantially less than 1.0 indicate overly predictable category use (e.g., students choose the same options for all items).

In Table 17, some testlet categories have infit MNSQs substantially larger than 1.0, implying abnormal observations for some students' performance. For example, in Form 23, Category 4 of Original Testlet 3 has an infit MNSQ of 1.76, which means some students who scored 4 points on the testlet performed unexpectedly well. On the other hand, Category 3 of Original Testlet 4 in the same form shows an overwhelmingly low infit value (.67). This suggests that some students may have made obvious choices (e.g., choosing eye-catching options as correct answers) or selected the same options for all items in the testlet rather than using their higher-order thinking skills.

Another finding in this table is that there is no pattern, within or between the original testlets and the random testlets, regarding when overprediction or improbable observations would occur. For instance, in Form 22, Random Testlet 3 shows high infit values (e.g., 1.24 to 1.97) for 3 of its 5 categories, while in Random Testlet 4 of the same form, the category values are substantially low (.72 to .85). The same thing happens in the original testlets. In Form 23, Original Testlet 4, except for Category 0, where the infit measure is normal (.96), the other categories manifest substantially low infit MNSQs (.67 to .76). Original Testlet 3 in Form 29 shows the opposite situation, where the infit values range from 1.13 to 1.31 across its categories, suggesting that some students who should have reached one category actually went to another category, or vice versa.
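The computation behind Table 17's average measures is straightforward. The following minimal Python sketch (hypothetical code with simulated data) averages the person measures of the examinees at each testlet score category and checks the expected ordering.

import numpy as np

def average_category_measures(theta, score, max_score):
    """theta: (N,) person measures; score: (N,) testlet scores 0..max_score.
    Returns the mean ability of examinees at each score category."""
    return np.array([theta[score == k].mean() if np.any(score == k)
                     else np.nan for k in range(max_score + 1)])

rng = np.random.default_rng(5)
theta = rng.normal(0, 1, 1000)
# crude simulated scores that increase with ability, for illustration only
score = np.clip(np.round(2 + 1.2 * theta + rng.normal(0, 0.8, 1000)),
                0, 4).astype(int)
avg = average_category_measures(theta, score, 4)
# a properly ordered testlet should show avg increasing with the category
print(np.all(np.diff(avg[~np.isnan(avg)]) > 0), avg)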
Results for the average measures are also presented in terms of the range of categories. In Table 18 (in Appendix E), the ranges of the random testlets are almost uniformly larger than those of the original testlets. The few exceptions are Testlet 2 in Form 21 and Testlet 3 in Form 22, where the ranges of the original testlets are slightly larger than those of the random testlets. One possible explanation for the narrower range of the original testlets may be that, although items within an original testlet are not closely correlated with each other, they are not as difficult when tested together as a whole unit as the items of the random testlets, which are tested in different places in the test.

Summary

Different analyses were conducted to examine the differences between the testlet-based scale and the item-based scale. It was found that the context-dependent items overall correlate more closely within an original testlet than with items outside that testlet. There is clear evidence that local item dependence may exist in some of the original testlets.

A good proportion (40%) of the context-dependent items demonstrate better fit for testlet calibration when they are analyzed as testlets. This suggests that these items show misfit, either as local dependence or as noise, if analyzed individually. The Rasch partial credit model is the better model for controlling these errors for these items. However, another 50% of the original testlets (20) cannot reach proper fit by either model, which leads to the suspicion that there may be some other implicit factors, such as interactions of science dimensions between those testlets, that affect testlet calibration.

Analyses of the supposedly independent items found that a considerable number of items (60) have a better fit when they are analyzed as testlets, even though no specific context was developed for the testlets. An additional 54 items (in 24 random testlets) obtain misfit no matter which model is used. The results demand further study of these developer-designed independent items.
It was 112 found that when items are context—dependent, the person separation ratios are not statistically different as to whether items are analyzed as testlets or as individual items. When the items are independent, not much difference is presented as to which model is better than the other either. Overall, the :nesults indicate that employing the testlet- based analysis could obtain a test reliability that more truly reflects its nature than the item-based analysis does because the former takes local item dependence effects into account when they are present in the data. Average category’ measures provided estimates of the average abilities of the examinees reaching a certain score level of a testlet. It was intended to check for any improbable category 'use or over prediction. The average category measures for each original and random testlet were computed and compared. It was found that the two kinds of testlets performed similarly for all tryout forms, and there was no pattern as to which type of testlets would more likely have improbable observations or over predictions. However, the ranges of the categories within an original testlet were not as wide as those of the random testlets. CHAPTER 5 SUMMARIES AND CONCLUSIONS There are six sections in this last chapter of the study. First, a very brief summary of the study is presented. Then a summary of the results by hypothesis follows. Third, conclusions are made based on the results of the study. Fourth, limitations of the study are discussed. Fifth, generalizability of the study is pursued. In the final section, a few recommendations for further research are proposed. Summary of the Study The issue of local item dependence has received increasing attention in the past decade due to progress in the area of IRT item analysis, and more importantly, the increasingly high-stake assessments administered at the different levels of education. Literature indicates that the testlet concepts have been widely applied in regular classroom testing, computerized adaptive testing, and non-traditional, non-IRT scoring. It has been found that although item—based parameter estimation for the context-dependent items appear to provide more information over most levels of the latent trait continuum, 113 114 this extra gain in information may be "fooled" by the excess within-context correlation among the items. This situation is especially true when the assumption of local independence is violated. It has been suggested that one should use testlets to manage the local item dependence problem. The purpose of this study was to explore the local item dependence effect when context—dependent items in the Michigan High School Proficiency Test in Science were analyzed as independent items and as testlets. The family of the Rasch models (partial credit and dichotomous models) were applied to testlets in a large-scale assessment program, and to estimate the person ability measures and the test reliabilities, testlet/item calibrations, and testlet/item fit statistics when the potential violation of the assumption of local independence is controlled by the testlet. The first tryout data from the newly-developed Michigan High School Proficiency Test in Science (1995) were used. The test was designed to examine students' abilities in using, reflecting, and constructing scientific knowledge. Using science was further divided into using life, using physical and using earth. Reflecting and constructing were embedded across all three content areas. 
There were ten forms in total for the tryout. Every form had four testlets; each testlet consisted of four multiple-choice items and one or two constructed-response questions. Only the multiple-choice questions were used in the study, to avoid the inter-rater reliability problem and other issues related to the hand-scoring of constructed-response questions. In addition, only the context-dependent items and an additional 16 independent multiple-choice items in the same form were used in the analysis.

Cluster sampling in combination with stratified sampling was used in the tryout to ensure that the sample was representative of the population. The sampling frame included all Michigan 11th grade students, including alternative education and special education students. There were 10,074 students from 72 schools who took the science tryout test. All ten forms in the tryout were used in this study. Data were analyzed in five different configurations: the individual context-dependent items, the original testlets, the reformed testlets, the individual independent items, and the random testlets. The statistical methods of phi coefficient, testlet measure, one-way ANOVA, person ability measure, person separation indices, and average category measure were used in the analysis.

Summary of the Results by Hypothesis

Mixed results were generated from the data analyses in this study. They are presented in the order of the research hypotheses.

For context-dependent items:

1a. If the context-dependent items were generated from the same context, the average within-context item correlations were larger than the average across-context item correlations for a majority (29) of the original testlets.

1b. More than 40% (17) of the original testlets demonstrated a better fit when they were analyzed as testlets. Half (20) of the original testlets showed misfit under both models. Only 5% (2) of them obtained good fit both as individual items and as testlets.

1c. No matter how the data were organized, whether analyzed as individual items or as testlets, the person fit statistics generated by the Rasch dichotomous model were as good as those from the Rasch partial credit model.

1d. The person separation ratios were not statistically different whether items were analyzed as testlets or as individual items. However, the nonsignificantly different person separation ratios between the testlet-based analysis and the item-based analysis indicate that the former had smaller measurement errors than the latter, because the former has a larger unit of analysis and takes the local item dependence into account.

For independent MC items:

2a. When the items were analyzed as a testlet by the Rasch partial credit model, the testlet fit statistics were not the same as the item fit statistics obtained when the items were analyzed individually by the Rasch dichotomous model. Sixty items in 15 random testlets obtained a better fit when they were analyzed as (hypothetical) testlets. Another 34% of the items (54) showed misfit both when analyzed as testlets and when analyzed as items. The results are contradictory to the intention of the test development, in that these items should be context independent. It is suspected that there may be an implicit factor affecting item calibration.

2b. The person fit statistics for the independent items configuration and the random testlets configuration were not significantly different.

2c.
Summary of the Results by Hypothesis

Mixed results were generated from the data analyses in this study. They are presented in the order of the research hypotheses.

For context-dependent items:

1a. When the context-dependent items were generated from the same context, the average within-context item correlations were larger than the average across-context item correlations for a majority (29) of the original testlets.

1b. More than 40% (17) of the original testlets demonstrated a better fit when they were analyzed as testlets. Half (20) of the original testlets showed misfit under both models. Only 5% (2) of them obtained good fit both as individual items and as testlets.

1c. No matter how the data were organized, whether they were analyzed as individual items or as testlets, the person fit statistics generated from the Rasch dichotomous model were as good as those from the Rasch partial credit model.

1d. The person separation ratios were not statistically different whether items were analyzed as testlets or as individual items. Even so, the testlet-based analysis had smaller measurement errors than the item-based analysis, because the former used a larger unit of analysis and took the local item dependence into account.

For independent MC items:

2a. When the items were analyzed as a testlet by the Rasch partial credit model, the testlet fit statistics were not the same as the item fit statistics obtained when the items were analyzed individually by the Rasch dichotomous model. Sixty items in 15 random testlets obtained a better fit when they were analyzed as (hypothetical) testlets. Another 34% of the items (54) showed misfit both when analyzed as testlets and when analyzed as items. These results contradict the intention of the test development, in that these items should be context-independent. It is suspected that an implicit factor may be affecting the item calibration.

2b. The person fit statistics for the independent-items configuration and the random-testlets configuration were not significantly different.

2c. The person separation reliability was the same for the testlet-based analysis and the item-based analysis.

For the reformed testlets:

3a. When the context-dependent items in the original testlets were reconfigured into the same number of new testlets, each containing one item from each original testlet (i.e., reformed testlets), their mean correlations were not all smaller than those of the original testlets. Eleven of them had mean within-context phi correlations larger than those of their corresponding original testlets. The remaining reformed testlets obtained smaller average within-context correlations than their corresponding original testlets.

3b. The person fit statistics estimated from the reformed testlets were not significantly different from the person fit statistics estimated from the original testlets.
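The fit and separation quantities cited in these results are the standard Rasch diagnostics reported by BIGSTEPS; stated generically (after Wright and Masters, 1982), with $E_{ni}$ and $W_{ni}$ the modeled mean and variance of response $x_{ni}$, the standardized residual and the two mean-square fit statistics for unit $i$ are

$$z_{ni} = \frac{x_{ni} - E_{ni}}{\sqrt{W_{ni}}}, \qquad \text{Outfit}_i = \frac{1}{N}\sum_{n=1}^{N} z_{ni}^2, \qquad \text{Infit}_i = \frac{\sum_{n} W_{ni}\, z_{ni}^2}{\sum_{n} W_{ni}}.$$

Both mean squares have expectation 1 when the data fit the model. The person separation ratio compares the error-corrected spread of the person measures with their average measurement error,

$$G = \frac{SD_{\text{adj}}}{RMSE}, \qquad R = \frac{G^2}{1 + G^2},$$

where $R$ is the corresponding reliability. A larger unit of analysis that absorbs local dependence can therefore change $G$ through the error term even when the mean measures are essentially unchanged.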
Conclusions

Based on the results of this study, the following eight conclusions are made.

1. Context-dependent items correlated more closely within-context than across-context for most original testlets in this study, which provides some evidence that local item dependence does exist within a context.

2. Where there is a local item dependence effect in the context-dependent items, the IRT assumption of local independence may be violated for some of them. Under this circumstance, it would be theoretically preferable to use the Rasch partial credit model. Evidence in this study showed that such a local dependence effect can be controlled, and a better fit for item calibration obtained, by employing the model for some, but not all, original testlets.

3. Caution must be exercised in any revision of the misfit testlets. Often only one or two misfit items cause the misfit of the whole testlet. When the problematic item(s) are not highly correlated with the other items in the context, the test developers need only eliminate or revise the bad item(s) instead of discarding the whole testlet. This conclusion may be more meaningful to test developers than to curriculum specialists or teachers. Very often during testlet development an item is found to be problematic in measurement or for other concerns, such as ethnic or gender bias. As a result, the whole testlet is discarded because of the underlying assumption that a testlet is a complete piece whose parts cluster together closely and should not be separated; if one part goes wrong, the whole work is terminated. The results from this study imply that when context-dependent items are not highly correlated with each other, deleting the problematic item may not significantly affect the remaining part of the testlet. Therefore, one can keep the technically sound items and revise or eliminate the bad item, or replace it with a new item. It is not necessary to discard the whole testlet or to make changes in other testlets.

4. It seems that an implicit factor other than local item dependence affects the misfit original testlets. Even when the Rasch partial credit model was applied, unacceptable fit statistics were obtained.

5. Local item dependence effects may even exist in some developer-designed independent items in this study. However, these may be caused by random errors.

6. Truly statistically independent items should be analyzed independently, whether they belong to a context or not.

7. There is no significant difference between the Rasch partial credit model and the Rasch dichotomous model in average person ability measures. Comparable estimates were obtained by both models.

8. The Rasch partial credit model, which is usually used to analyze partial-credit items, performed efficiently in analyzing the testlet data of this large-scale assessment. The computer program BIGSTEPS provided most of the necessary information for this research in a user-friendly manner.
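To make the testlet-based configurations concrete: each set of four multiple-choice items sharing a context is collapsed into a single polytomous score from 0 to 4 (the number of correct responses), which the partial credit model then treats as one item. The sketch below shows that recoding under those assumptions; the response matrix and grouping are hypothetical, and the function name is mine, not from BIGSTEPS.

```python
import numpy as np

def items_to_testlets(responses, testlet_of):
    """Collapse dichotomous item scores into polytomous testlet scores.

    responses : (persons x items) 0/1 array
    testlet_of: sequence of testlet labels, one per item
    Returns a (persons x testlets) array of 0..m scores, where m is the
    number of items in the testlet (4 in this study).
    """
    testlet_of = np.asarray(testlet_of)
    labels = np.unique(testlet_of)
    return np.column_stack(
        [responses[:, testlet_of == t].sum(axis=1) for t in labels]
    )

# Hypothetical example: 3 examinees, two 4-item testlets.
x = np.array([[1, 0, 1, 1, 0, 0, 1, 0],
              [1, 1, 1, 1, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 0, 0, 0]])
print(items_to_testlets(x, [0, 0, 0, 0, 1, 1, 1, 1]))
# -> [[3 1]
#     [4 3]
#     [1 0]]
```

The recoded matrix is what a partial credit calibration would receive, while the dichotomous analysis works directly on the original 0/1 matrix.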
Limitations

Every study has its limitations. The major limitation of this study may be the quality of the data. Since the data were from a tryout administration, there were no previous item statistics available. Therefore, there was no reference for item quality, testlet formation, or other related information.

Another limitation is the nature of the testlet formation. Because the original testlets here were designed to assess students' multiple traits, their items were not linked to a common factor. Therefore, it is unlikely that student abilities would be affected by a single context. If these testlets had been developed as unidimensional instead of multidimensional, the results might have been quite different.

In addition, because it was a tryout and not an operational administration, the results did not have any impact on student records, so it did not matter to the students how seriously they performed. Consequently, student attitudes may confound the results of the study.

Furthermore, for the simplicity of the study, neither the response patterns of the testlets nor the constructed-response questions were considered in the research design. Whether this affected the results is not known.

Generalizability

One of the outstanding features of this study is that the data were collected from a very large and representative sample, roughly 1,000 students per tryout form drawn from the entire population of Michigan public school 11th graders. Because every 11th grade student in Michigan public schools is required by the Legislature to take the Michigan High School Proficiency Tests, it was possible to sample from the entire public school student population of the 11th grade, which helps generalize the results to similar situations. However, such a large-scale, high-stakes assessment may not be available in every field, so the methods described in the study may not be applicable to every testing situation. Other researchers who want to conduct similar studies or to generalize the results from this study need to be very cautious on this matter.

Another important and practical factor is the cost of data analysis. Even though some evidence of local dependence has been shown here, it is almost impossible to score some items as testlets with the Rasch partial credit model and other items with the dichotomous model in such a large-scale statewide assessment, because the cost would increase dramatically. What is more, this approach might also cause considerable confusion and tension in the education community and among the public, especially parents and school boards.

Recommendations for Further Research

This study demonstrated a technique for analyzing potential local item dependence with context-dependent testlets. Although the models functioned consistently, the limited quality of the data leaves some uncertainty about the inconsistent final results. To this author's knowledge, all the original testlets in the science tryout have been revised, and one context-dependent multiple-choice item has been eliminated from each testlet in the operational forms. There is a need to conduct the study again with full operational data to verify the outcomes.

Testlets in this study were multidimensional. It is necessary to use the models in this study to investigate local item dependence with unidimensional testlets. It is anticipated that the dimensionality of a testlet has an impact on the validity of the results.

As mentioned above, only the multiple-choice items within the testlets were used in the analysis. To fully investigate local item dependence effects, full testlets, that is, multiple-choice items together with constructed-response items, should be used in future studies.

The local dependence shown in the independent items and in the original testlets when they were analyzed as testlets needs to be studied further. An alternative way to examine the local item dependence of a test is to study the item relationships only when two or more items are found to be highly correlated with each other, temporarily ignoring whether they are from the same testlet or not (see Yen, 1984a).

In this study, only the fit statistics generated from BIGSTEPS were used. Other statistics such as Q2 and Q3 were mentioned but not considered in the analyses. In addition, R. Smith (April 1996, personal communication) proposed a "between-fit" statistic in contrast to Linacre and Wright's (1995) infit and outfit statistics. It would be helpful to the item/testlet analysis field to compare the efficiency of these and other currently available fit statistics.
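One way to operationalize the Yen (1984a) recommendation above is to inspect residual correlations directly. The sketch below is a minimal illustration of the Q3 idea: correlate item residuals after the modeled trait has been removed, and flag pairs whose residual correlation is unusually high. The person measures, item difficulties, responses, and the 0.2 flagging threshold are simulated placeholders chosen for the illustration; in practice the estimates would come from a prior calibration such as a BIGSTEPS run.

```python
import numpy as np

def rasch_prob(theta, b):
    """P(correct) under the dichotomous Rasch model for each person-item pair."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

def q3_matrix(responses, theta, b):
    """Q3-style statistic: correlations between item residuals
    once the modeled trait is removed (after Yen, 1984)."""
    resid = responses - rasch_prob(theta, b)
    return np.corrcoef(resid, rowvar=False)

# Placeholder calibration: theta and b are simulated here, not estimated.
rng = np.random.default_rng(1)
theta = rng.normal(size=400)   # person measures
b = rng.normal(size=8)         # item difficulties
x = (rng.random((400, 8)) < rasch_prob(theta, b)).astype(int)

q3 = q3_matrix(x, theta, b)
# Flag item pairs whose residual correlation looks suspiciously high
flagged = [(i, j, round(q3[i, j], 3))
           for i in range(8) for j in range(i + 1, 8) if q3[i, j] > 0.2]
print(flagged)
```

Because the residuals remove the common trait, high Q3 values point to dependence between specific item pairs regardless of whether they were authored as part of the same testlet.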
APPENDICES

APPENDIX A

EXAMPLES OF PARTIAL CREDIT SCORING*

Example 1. Mathematics item: 9.0/0.3 - 5 = ?
No steps taken ............................... 0
9.0/0.3 = 30 ................................. 1
30 - 5 = 25 .................................. 2

Example 2. Screening test item: Draw a circle
0 - No response
1 - Scribble, no resemblance to circle
2 - Lack of closure, much overlap, more than 1/3 of figure distorted
3 - Closure, no more than 2/3 overlap, 2/3 of figure round

Example 3. Geography item: The capital city of Australia is
a. Wellington ................................. 1
b. Canberra ................................... 3
c. Montreal ................................... 0
d. Sydney ..................................... 2

* From Rating Scale Analysis (p. 41) by B. D. Wright and G. N. Masters, 1982, Chicago, IL: MESA Press. Copyright 1982 by the authors. Reprinted with permission.

APPENDIX B

SAMPLE TESTLET IN THE MHSPT IN SCIENCE

Below is a data table which shows the melting and boiling points of common substances. Study the table. Then do Numbers 1 through 5.

Substance    Melting Point (°C)    Boiling Point (°C)
Water                 0                   100
Alcohol            -117                    78
Nitrogen           -210                  -196
Oxygen             -218                  -183

1. Which substance should be a liquid at -90 degrees?
   A. water
   B. alcohol
   C. nitrogen
   D. oxygen

2. As each substance in the table is cooled down, the atoms and molecules undergo
   A. physical changes as they move faster
   B. physical changes as they move slower
   C. chemical changes as they move faster
   D. chemical changes as they move slower

3. Because alcohol freezes and boils at lower temperatures than water, mixing alcohol and water could be a useful application for a
   A. better radiator coolant in cars during the summertime
   B. better windshield-washer fluid in cars during the wintertime
   C. clean and inexpensive alternative to gasoline
   D. clean and inexpensive alternative to engine lubricants

4. In order to change water from a solid to a liquid, energy must be
   A. removed
   B. added
   C. created
   D. destroyed

5. As water boils, the arrangement and behavior of the water molecules undergo changes. Describe at least two of these changes on the lines provided below.

APPENDIX C

MICHIGAN SCHOOL STRATUM CLASSIFICATION

The Michigan schools are classified into seven strata relative to the populations where the schools reside.

1. Large City: Central city of a Metropolitan Statistical Area (MSA) with a population greater than or equal to 400,000 or a population density greater than or equal to 6,000 people per square mile.
2. Mid-size City: Central city of an MSA with a population less than 400,000 and a population density less than 6,000 people per square mile.
3. Urban Fringe of Large City: Place within an MSA of a Large Central City and defined as urban by the Census Bureau.
4. Urban Fringe of Mid-size City: Place within an MSA of a Mid-size Central City and defined as urban by the Census Bureau.
5. Large Town: Town not within an MSA and with a population greater than or equal to 25,000 people.
6. Small Town: Town not within an MSA and with a population less than 25,000 and greater than or equal to 2,500 people.
7. Rural: A place with fewer than 2,500 people and coded rural by the Census Bureau.

APPENDIX D

ITEM CODE SHEET FOR TRYOUT FORM 22

Item  Code  Dimension Content       Type    Item  Code  Dimension Content       Type
  1   L04   CELLS-COMP/RESP          MC      28   P13   SPEED/DIR CHANGE         MC
  2   L06   CLASSFY ORGANISM         MC      29   R2    REFLECTING               MC
  3   L14   ECO RELATIONSHIPS        MC      30   P10   ATOMIC CHANGES           MC
  4   L08   FOOD STORAGE/USE         MC      31   C1    CONSTRUCTING             MC
  5   R4    REFLECTING               MC      32   P13   SPEED/DIR CHANGE         OE
  6   L12   NATURAL SELECTION        MC      33   R1    TEXT CRITICISM           OE
  7   C1    CONSTRUCTING             MC      34   R1    TEXT CRITICISM           OE
  8   L02   EXPLAIN GROWTH           MC      35   E02   USE MAPS                 MC
  9   L16   POPULATION SIZE          MC      36   E06   SOIL/SURFACE             MC
 10   C1    CONSTRUCTING             MC      37   E09   WATER BELOW SURF         MC
 11   L05   CELLS-FOOD/RESP          MC      38   E13   AIR/WEATHER              MC
 12   L05   CELLS-FOOD/RESP          MC      39   E16   HUMANS/POPULATION        MC
 13   C1    CONSTRUCTING             MC      40   E19   OBSERVE NITE SKY         MC
 14   R2    REFLECTING               MC      41   E25   SPACE SCI/TECH           MC
 15   L05   CELLS-FOOD/RESP          OE      42   R3    WIDE                     MC
 16   C1    INVESTIGATION            OE      43   C1    CONSTRUCTING             MC
 17   C1    INVESTIGATION            OE      44   C1    CONSTRUCTING             MC
 18   P01   CLASSFY SUBSTNCS         MC      45   C1    CONSTRUCTING             MC
 19   P02   MASS/VOLUME/DENS         MC      46   E23   EVOLUTION OF UNIVERSE    MC
 20   P04   ANALYZE RISK/BEN         MC      47   E23   EVOLUTION OF UNIVERSE    MC
 21   P18   SOUNDS/WAVES             MC      48   R1    REFLECTING               MC
 22   P21   TYPES OF WAVES           MC      49   E23   EVOLUTION OF UNIVERSE    OE
 23   R3    WIDE                     MC      50   C1    CONSTRUCTING             MC
 24   P11   ENERGY CHANGES           MC      51   P12   MEANS SPEED/DIRECTION    MC
 25   P15   OBJECTS/FORCE            MC      52   E24   SOLAR SYST. FORM         MC
 26   C1    CONSTRUCTING             MC      53   R1    REFLECTING               MC
 27   C1    CONSTRUCTING             MC      54   P12   MEANS SPEED/DIRECTION    OE

MC = multiple-choice; OE = open-ended.

APPENDIX E

TABLES AND FIGURES

Table 7. Comparison of Original Testlets and Context-Dependent Items on Error and Fit by Form

(Each row gives the original testlet, its SE and infit MNSQ, followed by each context-dependent item with its SE and infit MNSQ. An asterisk indicates a standardized infit statistic greater than |2.0| SDs.)

Form 20
Testlet 1  .04 1.03  | 11 .07  .93* | 12 .07 1.02  | 13 .07 1.05* | 14 .07 1.05
Testlet 2  .04  .95  | 28 .07  .97  | 29 .07 1.14* | 30 .07  .98  | 31 .07  .98
Testlet 3  .04  .97  | 45 .07  .91* | 46 .07  .94* | 47 .08  .91* | 48 .07 1.10*
Testlet 4  .04  .97  | 50 .07  .91* | 51 .07 1.00  | 52 .07  .96  | 53 .08 1.09*

Form 21
Testlet 1  .04  .94  | 11 .08  .93  | 12 .07 1.01  | 13 .07 1.10  | 14 .07 1.03
Testlet 2  .04 1.17* | 28 .07 1.03  | 29 .09 1.24* | 30 .07 1.07* | 31 .08 1.18*
Testlet 3  .03  .90* | 45 .07  .82* | 46 .07  .80* | 47 .07 1.00  | 48 .07  .94*
Testlet 4  .04  .94  | 50 .07 1.02  | 51 .07  .93* | 52 .07  .91* | 53 .07 1.08*
Form 22
Testlet 1  .04 1.03  | 11 .07 1.03  | 12 .07  .89* | 13 .08  .91* | 14 .07 1.05
Testlet 2  .04  .90* | 28 .07  .97  | 29 .07 1.02  | 30 .07 1.00  | 31 .07  .91*
Testlet 3  .04 1.03  | 45 .08  .83* | 46 .08 1.25* | 47 .07 1.16* | 48 .07  .90*
Testlet 4  .04  .91* | 50 .07  .89* | 51 .07  .96  | 52 .07 1.22* | 53 .07  .96

Form 23
Testlet 1  .04  .96  | 11 .07  .97  | 12 .07 1.11* | 13 .09  .92  | 14 .07  .99
Testlet 2  .04  .98  | 28 .07 1.05  | 29 .07  .96  | 30 .07  .96  | 31 .08  .97
Testlet 3  .04 1.22* | 45 .07 1.13* | 46 .08 1.16* | 47 .07 1.09* | 48 .07 1.10*
Testlet 4  .04  .76* | 50 .09  .88* | 51 .07  .88* | 52 .07  .86* | 53 .09  .86*

Form 24
Testlet 1  .04  .88* | 11 .08 1.18* | 12 .07  .93* | 13 .07  .91* | 14 .08  .87*
Testlet 2  .04 1.18* | 28 .08 1.12* | 29 .07 1.07* | 30 .07 1.12* | 31 .09  .92
Testlet 3  .04  .89* | 45 .07 1.08* | 46 .07 1.12* | 47 .07  .90* | 48 .08  .83*
Testlet 4  .04  .87* | 50 .07  .98  | 51 .07  .99  | 52 .07  .98  | 53 .07  .85*

Form 25
Testlet 1  .04  .94  | 11 .07  .97  | 12 .07  .92* | 13 .07 1.09* | 14 .11  .86
Testlet 2  .04  .93  | 28 .07 1.07  | 29 .07  .98  | 30 .07  .91* | 31 .07  .95
Testlet 3  .04  .92  | 45 .07  .95* | 46 .07  .99  | 47 .07  .95* | 48 .08 1.02
Testlet 4  .04 1.12* | 50 .07 1.08* | 51 .07 1.03  | 52 .07 1.03  | 53 .08 1.10*

Form 26
Testlet 1  .04  .94  | 11 .08 1.02  | 12 .08  .94  | 13 .08  .91* | 14 .08 1.03
Testlet 2  .04 1.08  | 28 .08 1.16* | 29 .08 1.09* | 30 .10  .89  | 31 .09  .89*
Testlet 3  .04  .88* | 45 .08  .95  | 46 .08 1.01  | 47 .08  .99  | 48 .08  .97
Testlet 4  .04 1.01  | 50 .07  .91* | 51 .09  .93  | 52 .11 1.27* | 53 .08 1.00

Form 27
Testlet 1  .04 1.00  | 11 .08  .96  | 12 .08 1.16* | 13 .07 1.04  | 14 .07  .93*
Testlet 2  .04 1.12* | 28 .07 1.02  | 29 .08 1.31* | 30 .09 1.04  | 31 .07  .92*
Testlet 3  .04  .79* | 45 .09  .83* | 46 .07  .90* | 47 .08  .92* | 48 .08  .86*
Testlet 4  .04  .95  | 50 .09 1.31* | 51 .07  .90* | 52 .08  .88* | 53 .08  .85*

Form 28
Testlet 1  .04  .84* | 11 .07  .90* | 12 .07  .91* | 13 .07  .94* | 14 .07  .95
Testlet 2  .04 1.20* | 28 .08 1.31* | 29 .08 1.10* | 30 .08 1.11* | 31 .07  .95
Testlet 3  .04  .88* | 45 .07  .98  | 46 .07  .90* | 47 .07  .85* | 48 .07  .96
Testlet 4  .04 1.05  | 50 .08 1.09* | 51 .07  .97  | 52 .08 1.05  | 53 .07 1.07*

Form 29
Testlet 1  .04  .90* | 11 .08 1.08* | 12 .07  .95  | 13 .08  .92  | 14 .09  .87*
Testlet 2  .04  .86* | 28 .08 1.06  | 29 .08  .87* | 30 .10  .94  | 31 .07  .92*
Testlet 3  .04 1.24* | 45 .08  .98  | 46 .07 1.09* | 47 .07 1.12* | 48 .08 1.26*
Testlet 4  .04  .90* | 50 .09  .81* | 51 .07 1.05  | 52 .07  .95  | 53 .07 1.04

Table 10. Comparison of Random Testlets and Independent Items on Error and Fit by Form

(Each row gives the random testlet, its SE and infit MNSQ, followed by each independent item with its SE and infit MNSQ. An asterisk indicates a standardized infit statistic greater than |2.0| SDs.)
Form 20
Testlet 1  .05  .78* |  1 .14  .95  |  8 .09  .91  | 24 .07 1.10* | 38 .08  .84*
Testlet 2  .04 1.11* |  2 .07 1.03  |  9 .07 1.19* | 25 .07 1.01  | 40 .08  .92*
Testlet 3  .04  .98  |  3 .07 1.04  | 18 .07  .94* | 27 .07  .88* | 41 .07 1.00
Testlet 4  .04  .88* |  4 .07 1.00  | 20 .08 1.18* | 37 .08  .90* | 43 .07  .97

Form 21
Testlet 1  .04  .85* |  1 .08 1.11* |  8 .07  .90* | 24 .07  .93* | 38 .07 1.01
Testlet 2  .04 1.05  |  2 .07  .93* |  9 .08  .92* | 25 .07 1.00  | 40 .09 1.14*
Testlet 3  .04 1.11* |  3 .07 1.02  | 18 .08  .97  | 27 .07 1.06* | 41 .07 1.04
Testlet 4  .04  .87* |  4 .07 1.00  | 20 .07 1.00  | 37 .09 1.00  | 43 .07  .95

Form 22
Testlet 1  .04  .76* |  1 .07  .88* |  8 .07  .94* | 24 .07  .96  | 38 .07 1.07*
Testlet 2  .04 1.10* |  2 .07 1.09* |  9 .07  .93* | 25 .08 1.01  | 40 .07 1.19*
Testlet 3  .04 1.20* |  3 .07  .93* | 18 .11 1.26* | 27 .07  .97  | 41 .07 1.03
Testlet 4  .04  .77* |  4 .07  .87* | 20 .07 1.05* | 37 .07  .99  | 43 .07  .88*

Form 23
Testlet 1  .04  .84* |  1 .07 1.10* |  8 .09  .83* | 24 .07 1.06* | 38 .07  .95*
Testlet 2  .04 1.11* |  2 .07  .93* |  9 .07  .97  | 25 .07  .98  | 40 .08 1.17*
Testlet 3  .04  .97  |  3 .07 1.03  | 18 .07 1.03  | 27 .07  .93* | 41 .08  .89*
Testlet 4  .04  .90* |  4 .07  .97  | 20 .07  .96  | 37 .09  .91  | 43 .08 1.25*

Form 24
Testlet 1  .04  .95  |  1 .08  .97  |  8 .07 1.11* | 24 .08  .95  | 38 .07 1.03
Testlet 2  .04  .98  |  2 .08  .94  |  9 .07 1.15* | 25 .08  .82* | 40 .07 1.10*
Testlet 3  .04  .98  |  3 .08  .89* | 18 .09 1.10  | 27 .09  .93  | 41 .07 1.03
Testlet 4  .04  .81* |  4 .09 1.01  | 20 .07 1.04  | 37 .10  .89  | 43 .07  .92*

Form 25
Testlet 1  .04  .83* |  1 .07  .90* |  8 .10  .97  | 24 .07 1.00  | 38 .07  .94*
Testlet 2  .04  .98  |  2 .08  .93  |  9 .07  .96  | 25 .07  .93* | 40 .08 1.14*
Testlet 3  .04 1.00  |  3 .07  .91* | 18 .07 1.00  | 27 .07  .94* | 41 .07 1.09*
Testlet 4  .04 1.03  |  4 .07 1.10* | 20 .07 1.08* | 37 .07 1.10* | 43 .07  .95

Form 26
Testlet 1  .05  .85* |  1 .08  .93  |  8 .13  .92  | 24 .08 1.04  | 38 .08  .90*
Testlet 2  .04  .94  |  2 .09  .90* |  9 .08 1.01  | 25 .08 1.08* | 40 .08  .90*
Testlet 3  .04 1.20* |  3 .07 1.09* | 18 .08  .99  | 27 .07 1.04  | 41 .09 1.10*
Testlet 4  .04  .85* |  4 .07 1.15* | 20 .09  .98  | 37 .08  .93* | 43 .08  .95

Form 27
Testlet 1  .04  .85* |  1 .07 1.07* |  8 .08  .93* | 24 .09 1.17* | 38 .08  .89*
Testlet 2  .05 1.08  |  2 .09  .93  |  9 .08  .92* | 25 .07 1.17* | 40 .08  .99
Testlet 3  .04  .99  |  3 .07  .93* | 18 .07 1.04  | 27 .07  .95  | 41 .07  .98
Testlet 4  .04  .87* |  4 .07 1.08* | 20 .07 1.01  | 37 .07  .95* | 43 .09  .99
Form 28
Testlet 1  .04  .77* |  1 .08  .83* |  8 .07  .99  | 24 .07  .99  | 38 .07  .91*
Testlet 2  .04 1.14* |  2 .07 1.06* |  9 .08 1.10* | 25 .08 1.15* | 40 .07  .88*
Testlet 3  .04 1.04  |  3 .07  .91* | 18 .07 1.08* | 27 .08  .93  | 41 .07 1.02
Testlet 4  .04  .92  |  4 .07 1.08* | 20 .08 1.09* | 37 .07  .95  | 43 .07  .99

Form 29
Testlet 1  .04  .80* |  1 .07 1.05  |  8 .08  .87* | 24 .07  .89* | 38 .07 1.09*
Testlet 2  .04 1.04  |  2 .08  .99  |  9 .08  .92* | 25 .08 1.27* | 40 .07  .92*
Testlet 3  .05 1.10* |  3 .08 1.01  | 18 .08 1.02  | 27 .11  .94  | 41 .09 1.10
Testlet 4  .04  .82* |  4 .08  .94  | 20 .08 1.10* | 37 .07  .91* | 43 .07  .91*

Table 11. Calibration and Fit Statistics for Context-Dependent Items in the Original Testlets by Form

(Columns: original testlet, item name, item calibration, DF, infit MNSQ, LN(infit), outfit MNSQ, LN(outfit).)

FORM 20
1  TEL11    .42  1029   .93  -.07   .88  -.13
1  TEL12    .41  1028  1.02   .02  1.03   .03
1  TEL13   -.04  1025  1.05   .05  1.07   .07
1  TEL14  -1.02  1025  1.05   .05  1.08   .08
2  TEL28   -.13  1024   .97  -.03   .97  -.03
2  TEL29    .17  1024  1.14   .13  1.21   .19
2  TEL30   -.79  1022   .98  -.02   .98  -.02
2  TEL31   -.46  1025   .98  -.02   .99  -.01
3  TEL45   -.08  1021   .91  -.09   .89  -.12
3  TEL46    .24  1017   .94  -.06   .94  -.06
3  TEL47    .77  1018   .91  -.09   .91  -.09
3  TEL48   -.50  1018  1.10   .10  1.16   .15
4  TEL50   -.26  1014   .91  -.09   .89  -.12
4  TEL51    .31  1018  1.00   .00  1.00   .00
4  TEL52  -1.11  1020   .96  -.04   .96  -.04
4  TEL53   1.05  1019  1.09   .09  1.23   .21

FORM 21
1  TEL11  -1.12  1044   .93  -.07   .87  -.14
1  TEL12    .34  1045  1.01   .01  1.04   .04
1  TEL13   -.73  1045  1.03   .03  1.07   .07
1  TEL14   -.60  1044  1.03   .03  1.04   .04
2  TEL28   -.55  1038  1.03   .03  1.01   .01
2  TEL29   1.88  1036  1.24   .22  1.63   .49
2  TEL30   -.39  1038  1.07   .07  1.15   .14
2  TEL31   1.47  1038  1.18   .17  1.39   .33
3  TEL45   -.49  1036   .82  -.20   .73  -.31
3  TEL46   -.06  1033   .80  -.22   .75  -.29
3  TEL47   -.03  1036  1.00   .00  1.00   .00
3  TEL48   -.23  1035   .94  -.06   .93  -.07
4  TEL50   -.37  1024  1.02   .02  1.02   .02
4  TEL51   -.29  1033   .93  -.07   .87  -.14
4  TEL52    .43  1032   .91  -.09   .86  -.15
4  TEL53    .75  1019  1.08   .08  1.12   .11

FORM 22
1  TEL11   -.05  1041  1.03   .03  1.04   .04
1  TEL12    .45  1039   .89  -.12   .85  -.16
1  TEL13  -1.27  1038   .91  -.09   .81  -.21
1  TEL14    .66  1038  1.05   .05  1.10   .10
2  TEL28   -.82  1038   .97  -.03   .96  -.04
2  TEL29   -.37  1040  1.02   .02  1.03   .03
2  TEL30    .84  1040  1.00   .00  1.06   .06
2  TEL31   -.42  1038   .91  -.09   .88  -.13
3  TEL45  -1.53  1035   .83  -.19   .67  -.40
3  TEL46   1.10  1035  1.25   .22  1.51   .41
3  TEL47    .48  1035  1.16   .15  1.22   .20
3  TEL48   -.72  1033   .90  -.11   .88  -.13
4  TEL50   -.33  1028   .89  -.12   .85  -.16
4  TEL51    .79  1034   .96  -.04  1.00   .00
4  TEL52    .83  1033  1.22   .20  1.33   .29
4  TEL53    .35  1033   .96  -.04   .98  -.02
FORM 23
1  TEL11   -.27  1050   .97  -.03   .97  -.03
1  TEL12   -.22  1040  1.11   .10  1.19   .17
1  TEL13  -1.64  1049   .92  -.08   .81  -.21
1  TEL14    .36  1045   .99  -.01  1.07   .07
2  TEL28    .99  1040  1.05   .05  1.08   .08
2  TEL29   -.17  1040   .96  -.04   .94  -.06
2  TEL30   -.46  1038   .96  -.04   .88  -.13
2  TEL31   1.61  1039   .97  -.03  1.10   .10
3  TEL45   -.08  1032  1.13   .12  1.20   .18
3  TEL46   1.39  1030  1.16   .15  1.27   .24
3  TEL47    .90  1032  1.09   .09  1.13   .12
3  TEL48   1.19  1031  1.10   .10  1.17   .16
4  TEL50  -1.63  1024   .88  -.13   .81  -.21
4  TEL51    .31  1028   .88  -.13   .85  -.16
4  TEL52   -.66  1029   .86  -.15   .75  -.29
4  TEL53  -1.62  1026   .86  -.15   .70  -.36

FORM 24
1  TEL11   1.58  1016  1.18   .17  1.58   .46
1  TEL12   -.60  1023   .93  -.07   .94  -.06
1  TEL13   -.65  1020   .91  -.09   .87  -.14
1  TEL14  -1.04  1023   .87  -.14   .78  -.25
2  TEL28   1.76  1018  1.12   .11  1.47   .39
2  TEL29   -.30  1020  1.07   .07  1.05   .05
2  TEL30    .01  1018  1.12   .11  1.27   .24
2  TEL31  -2.00  1021   .92  -.08   .82  -.20
3  TEL45    .60  1015  1.08   .08  1.15   .14
3  TEL46    .98  1014  1.12   .11  1.30   .26
3  TEL47   -.67  1012   .90  -.11   .87  -.14
3  TEL48  -1.34  1015   .83  -.19   .72  -.33
4  TEL50   -.20  1007   .98  -.02  1.01   .01
4  TEL51    .44  1010   .99  -.01  1.02   .02
4  TEL52   1.09  1008   .98  -.02   .97  -.03
4  TEL53   -.46  1009   .85  -.16   .79  -.24

FORM 25
1  TEL11   -.67  1013   .97  -.03   .95  -.05
1  TEL12   -.96  1013   .92  -.08   .86  -.15
1  TEL13    .26  1010  1.09   .09  1.13   .12
1  TEL14  -2.49  1014   .86  -.15   .65  -.43
2  TEL28   1.12  1005  1.07   .07  1.17   .16
2  TEL29   -.90  1008   .98  -.02  1.06   .06
2  TEL30   -.66  1005   .91  -.09   .86  -.15
2  TEL31    .85  1005   .95  -.05  1.10   .10
3  TEL45   -.26   999   .95  -.05   .92  -.08
3  TEL46    .22   996   .99  -.01  1.02   .02
3  TEL47    .36   993   .95  -.05   .92  -.08
3  TEL48   1.39   995  1.02   .02  1.14   .13
4  TEL50    .35   989  1.08   .08  1.13   .12
4  TEL51    .18   993  1.03   .03  1.08   .08
4  TEL52   -.03   992  1.03   .03  1.03   .03
4  TEL53   1.23   987  1.10   .10  1.23   .21

FORM 26
1  TEL11   -.67   894  1.02   .02  1.02   .02
1  TEL12    .99   892   .94  -.06   .99  -.01
1  TEL13   -.93   895   .91  -.09   .80  -.22
1  TEL14   -.27   894  1.03   .03  1.05   .05
2  TEL28    .87   894  1.16   .15  1.26   .23
2  TEL29    .93   893  1.09   .09  1.13   .12
2  TEL30  -2.01   893   .89  -.12   .74  -.30
2  TEL31  -1.61   891   .89  -.12   .76  -.27
3  TEL45    .26   888   .95  -.05   .95  -.05
3  TEL46    .41   889  1.01   .01  1.05   .05
3  TEL47    .71   885   .99  -.01  1.00   .00
3  TEL48   -.33   888   .97  -.03   .99  -.01
4  TEL50    .04   889   .91  -.09   .87  -.14
4  TEL51  -1.44   883   .93  -.07   .90  -.11
4  TEL52   2.62   889  1.27   .24  2.24   .81
4  TEL53    .43   890  1.00   .00   .98  -.02

FORM 27
1  TEL11  -1.10   944   .96  -.04   .88  -.13
1  TEL12    .64   936  1.16   .15  1.31   .27
1  TEL13    .42   943  1.04   .04  1.05   .05
1  TEL14   -.03   943   .93  -.07   .90  -.11
2  TEL28    .17   931  1.02   .02  1.03   .03
2  TEL29   1.05   931  1.31   .27  1.61   .46
2  TEL30   1.66   932  1.04   .04  1.31   .27
2  TEL31   -.27   931   .92  -.08   .92  -.08
3  TEL45  -1.52   923   .83  -.19   .72  -.33
3  TEL46   -.32   923   .90  -.11   .87  -.14
3  TEL47   -.96   922   .92  -.08   .88  -.13
3  TEL48  -1.24   923   .86  -.15   .73  -.31
4  TEL50   2.06   913  1.31   .27  2.95  1.08
4  TEL51    .20   920   .90  -.11   .87  -.14
4  TEL52   -.60   920   .88  -.13   .83  -.19
4  TEL53   -.72   919   .85  -.16   .77  -.26
FORM 28
1  TEL11   -.15   942   .90  -.11   .87  -.14
1  TEL12   -.64   941   .91  -.09   .87  -.14
1  TEL13   -.53   942   .94  -.06   .93  -.07
1  TEL14   -.57   940   .95  -.05   .91  -.09
2  TEL28    .67   933  1.31   .27  1.40   .34
2  TEL29   1.14   934  1.10   .10  1.21   .19
2  TEL30    .60   935  1.11   .10  1.12   .11
2  TEL31   -.39   934   .95  -.05   .94  -.06
3  TEL45    .20   926   .98  -.02   .97  -.03
3  TEL46   -.16   924   .90  -.11   .87  -.14
3  TEL47   -.56   925   .85  -.16   .79  -.24
3  TEL48   -.42   926   .96  -.04   .93  -.07
4  TEL50    .50   907  1.09   .09  1.13   .12
4  TEL51   -.27   921   .97  -.03   .96  -.04
4  TEL52    .53   919  1.05   .05  1.07   .07
4  TEL53    .08   920  1.07   .07  1.11   .10

FORM 29
1  TEL11   -.20   943  1.08   .08  1.07   .07
1  TEL12    .27   944   .95  -.05   .94  -.06
1  TEL13   -.99   943   .92  -.08   .78  -.25
1  TEL14  -1.45   944   .87  -.14   .67  -.40
2  TEL28   1.88   939  1.06   .06  1.46   .38
2  TEL29   -.28   939   .87  -.14   .78  -.25
2  TEL30  -1.62   939   .94  -.06   .79  -.24
2  TEL31    .36   942   .92  -.08   .88  -.13
3  TEL45   -.81   941   .98  -.02   .97  -.03
3  TEL46    .34   940  1.09   .09  1.14   .13
3  TEL47    .54   940  1.12   .11  1.17   .16
3  TEL48   1.81   941  1.26   .23  1.50   .41
4  TEL50  -1.29   938   .81  -.21   .65  -.43
4  TEL51    .41   941  1.05   .05  1.09   .09
4  TEL52    .73   937   .95  -.05   .93  -.07
4  TEL53    .49   940  1.04   .04  1.07   .07

Table 12. Discrepancies for Testlets in the Tryout Forms

Form   Testlet 1   Testlet 2   Testlet 3   Testlet 4
 20        4           3           4           5
 21        1           2           3           9
 22        3           2           2           6
 23        5           2           2           4
 24        7           3           3           3
 25        4           3           6           6
 26        3           3           4           7
 27        8           1           1           7
 28        2           2           2          14
 29        1           3           1           4

Table 13. One-Way ANOVA of LN(Infit) Statistics for Context-Dependent Items by Testlet and Form

(Columns: testlet, number of items, mean, standard deviation, standard error, 95 percent confidence interval for the mean.)
Form 20
Testlet 1   4    .0112   .0575   .0287    -.0803 to  .1027
Testlet 2   4    .0150   .0775   .0387    -.1082 to  .1383
Testlet 3   4   -.0388   .0907   .0454    -.1831 to  .1055
Testlet 4   4   -.0122   .0761   .0381    -.1334 to  .1089
F ratio = .4236, probability = .7396

Form 21
Testlet 1   4   -.0009   .0487   .0243    -.0783 to  .0766
Testlet 2   4    .1195   .0857   .0429    -.0169 to  .2558
Testlet 3   4   -.1209   .1073   .0537    -.2917 to  .0499
Testlet 4   4   -.0175   .0801   .0400    -.1450 to  .1099
F ratio = 5.6103, probability = .0122

Form 22
Testlet 1   4   -.0331   .0843   .0422    -.1673 to  .1011
Testlet 2   4   -.0262   .0499   .0249    -.1056 to  .0531
Testlet 3   4    .0200   .1967   .0983    -.2930 to  .3329
Testlet 4   4    .0002   .1372   .0686    -.2181 to  .2184
F ratio = .1431, probability = .9322

Form 23
Testlet 1   4   -.0049   .0791   .0396    -.1308 to  .1210
Testlet 2   4   -.0158   .0434   .0217    -.0848 to  .0532
Testlet 3   4    .1130   .0281   .0141     .0683 to  .1578
Testlet 4   4   -.1393   .0133   .0066    -.1604 to -.1182
F ratio = 18.6909, probability = .0001

Form 24
Testlet 1   4   -.0352   .1366   .0683    -.2526 to  .1823
Testlet 2   4    .0527   .0933   .0466    -.0957 to  .2011
Testlet 3   4   -.0254   .1438   .0719    -.2541 to  .2034
Testlet 4   4   -.0532   .0730   .0365    -.1694 to  .0629
F ratio = .6559, probability = .5946

Form 25
Testlet 1   4   -.0446   .1002   .0501    -.2040 to  .1147
Testlet 2   4   -.0245   .0686   .0343    -.1336 to  .0846
Testlet 3   4   -.0232   .0346   .0173    -.0783 to  .0319
Testlet 4   4    .0578   .0335   .0168     .0045 to  .1112
F ratio = 1.9327, probability = .1782

Form 26
Testlet 1   4   -.0267   .0609   .0305    -.1237 to  .0702
Testlet 2   4    .0004   .1374   .0687    -.2182 to  .2190
Testlet 3   4   -.0205   .0264   .0132    -.0624 to  .0215
Testlet 4   4    .0180   .1527   .0764    -.2250 to  .2611
F ratio = .1431, probability = .9321

Form 27
Testlet 1   4    .0186   .0985   .0493    -.1382 to  .1753
Testlet 2   4    .0614   .1491   .0746    -.1759 to  .2987
Testlet 3   4   -.1315   .0461   .0231    -.2048 to -.0581
Testlet 4   4   -.0314   .2023   .1012    -.3534 to  .2905
F ratio = 1.4697, probability = .2722

Form 28
Testlet 1   4   -.0782   .0257   .0129    -.1192 to -.0373
Testlet 2   4    .1046   .1313   .0657    -.1044 to  .3136
Testlet 3   4   -.0822   .0647   .0323    -.1851 to  .0207
Testlet 4   4    .0430   .0513   .0257    -.0386 to  .1247
F ratio = 5.5278, probability = .0128

Form 29
Testlet 1   4   -.0492   .0917   .0458    -.1951 to  .0966
Testlet 2   4   -.0566   .0832   .0416    -.1890 to  .0758
Testlet 3   4    .1026   .1032   .0516    -.0617 to  .2669
Testlet 4   4   -.0435   .1203   .0601    -.2349 to  .1478
F ratio = 2.3075, probability = .1284

Table 14. Summary of Measured (Non-Extreme) Persons Fit by Form

(Columns: item/testlet composition, mean measure, infit MNSQ, outfit MNSQ.)
Form 20 (n = 1030)
16 context-dependent items   -.27   1.00   1.01
4 original testlets          -.28    .97    .97
4 reformed testlets          -.37    .96    .96
16 MC independent items       .18   1.00   1.02
4 random testlets             .12    .94    .93

Form 21 (n = 1046)
16 context-dependent items    .06    .99   1.03
4 original testlets           .04    .95    .97
4 reformed testlets          -.02    .93    .94
16 MC independent items      -.29   1.00   1.02
4 random testlets            -.38    .97    .97

Form 22 (n = 1044)
16 context-dependent items   -.03   1.00   1.01
4 original testlets          -.04    .96    .97
4 reformed testlets          -.11    .95    .96
16 MC independent items      -.26    .99   1.05
4 random testlets            -.30    .92    .96

Form 23 (n = 1051)
16 context-dependent items    .25   1.00    .99
4 original testlets           .20    .95    .96
4 reformed testlets           .22    .94    .95
16 MC independent items       .36   1.00   1.00
4 random testlets             .37    .95    .94

Form 24 (n = 1024)
16 context-dependent items    .10    .99   1.04
4 original testlets           .08    .94    .95
4 reformed testlets           .00    .93    .96
16 MC independent items       .57    .99   1.02
4 random testlets             .71    .92    .92

Form 25 (n = 1016)
16 context-dependent items    .07   1.00    .98
4 original testlets          -.03    .96    .98
4 reformed testlets           .06    .96    .96
16 MC independent items       .21    .96    .97
4 random testlets             .10    .99    .96

Form 26 (n = 896)
16 context-dependent items    .15    .96    .96
4 original testlets           .16    .93    .95
4 reformed testlets           .05    .95    .93
16 MC independent items       .63    .98    .99
4 random testlets             .78    .93    .96

Form 27 (n = 945)
16 context-dependent items    .14    .92    .97
4 original testlets           .03    .99    .96
4 reformed testlets           .05    .94    .97
16 MC independent items       .47    .95    .99
4 random testlets             .61    .96    .97

Form 28 (n = 944)
16 context-dependent items   -.22    .96    .94
4 original testlets          -.36    .97    .96
4 reformed testlets          -.33    .94
16 MC independent items       .01    .99
4 random testlets            -.09    .92

Form 29 (n = 947)
16 context-dependent items    .47
4 original testlets           .47
4 reformed testlets           .53
16 MC independent items       .09
4 random testlets             .14

Table 15. Person Separation Ratios for Different Configurations by Form

(Columns: item/testlet composition, mean measure, real RMSE, adjusted SD, separation ratio.)

Form 20 (n = 1030)
16 context-dependent items   -.27   .61    .78   1.28
4 original testlets          -.28   .70    .85   1.21
4 reformed testlets          -.37   .71    .97   1.36
16 MC independent items       .18   .66    .85   1.29
4 random testlets             .12   .79   1.20   1.52

Form 21 (n = 1046)
16 context-dependent items    .06   .63    .89   1.42
4 original testlets           .04   .70    .89   1.27
4 reformed testlets          -.02   .76   1.20   1.59
16 MC independent items      -.29   .63    .55    .87
4 random testlets            -.38   .76    .81   1.06

Form 22 (n = 1044)
16 context-dependent items   -.03   .63    .84   1.33
4 original testlets          -.04   .72    .98   1.36
4 reformed testlets          -.11   .74   1.06   1.43
16 MC independent items      -.26   .62    .77   1.25
4 random testlets            -.30   .74   1.05   1.42

Form 23 (n = 1051)
16 context-dependent items    .25   .66    .87   1.32
4 original testlets           .20   .72    .91   1.26
4 reformed testlets           .22   .75   1.14   1.51
16 MC independent items       .36   .63    .75   1.19
4 random testlets             .37   .77   1.11   1.45

Form 24 (n = 1024)
16 context-dependent items    .10   .65    .84   1.30
4 original testlets           .08   .74   1.01   1.36
4 reformed testlets           .00   .75   1.03   1.38
16 MC independent items       .57   .68    .95   1.40
4 random testlets             .71   .79   1.25   1.59

Form 25 (n = 1016)
16 context-dependent items    .07   .64    .71   1.12
4 original testlets          -.03   .70    .75   1.06
4 reformed testlets           .06   .73    .89   1.22
16 MC independent items       .21   .62    .72   1.16
4 random testlets             .10   .74    .97   1.31

Form 26 (n = 896)
16 context-dependent items    .15   .66    .90   1.36
4 original testlets           .16   .73    .96   1.32
4 reformed testlets           .05   .79   1.19   1.50
16 MC independent items       .63   .66    .86   1.30
4 random testlets             .78   .77   1.08   1.41
Form 27 (n = 945)
16 context-dependent items    .14   .66    .87   1.32
4 original testlets           .03   .73    .93   1.27
4 reformed testlets           .05   .77   1.22   1.58
16 MC independent items       .47   .64    .80   1.25
4 random testlets             .61   .76   1.13   1.49

Form 28 (n = 944)
16 context-dependent items   -.22   .61    .77   1.26
4 original testlets          -.36   .68    .81   1.19
4 reformed testlets          -.33   .72   1.04   1.43
16 MC independent items       .01   .61    .78   1.27
4 random testlets            -.09   .73   1.04   1.43

Form 29 (n = 947)
16 context-dependent items    .47   .67    .94   1.41
4 original testlets           .47   .74   1.01   1.36
4 reformed testlets           .53   .77   1.17   1.53
16 MC independent items       .09   .64    .81   1.26
4 random testlets             .14   .76   1.06   1.39

Table 17. Comparisons of Average Measures for the Original and Random Testlets

(For each testlet, the entries are the average measures for score categories 0 through 4, with the infit MNSQ for each category in parentheses.)

Form 20 (n = 1030)
Original Testlet 1:  -1.38 (1.11)  -.78 (1.08)  -.21 (1.02)   .43 (1.04)  1.32 ( .92)
Random Testlet 1:    -2.57 ( .91) -1.79 ( .78)  -.88 ( .88)   .00 ( .75)  1.57 ( .74)
Original Testlet 2:  -1.61 ( .93)  -.98 ( .85)  -.28 ( .98)   .45 ( .89)   .93 (1.15)
Random Testlet 2:    -1.31 (1.07)  -.55 (1.08)   .36 (1.04)  1.38 (1.15)  2.39 (1.44)
Original Testlet 3:  -1.14 (1.10)  -.59 ( .98)  -.07 (1.02)   .74 ( .90)  1.46 ( .81)
Random Testlet 3:    -1.40 (1.19)  -.80 ( .93)   .01 ( .98)  1.01 ( .90)  2.30 ( .97)
Original Testlet 4:  -1.45 ( .93)  -.78 ( .96)  -.13 (1.01)   .56 ( .91)  1.24 (1.03)
Random Testlet 4:    -1.52 ( .90)  -.52 ( .81)   .33 ( .91)  1.64 ( .88)  2.76 (1.08)

Form 21 (n = 1046)
Original Testlet 1:  -1.51 ( .99)  -.92 ( .91)  -.29 ( .91)   .34 (1.03)  1.19 ( .92)
Random Testlet 1:    -2.02 ( .84) -1.15 ( .84)  -.35 ( .87)   .49 ( .77)   .99 (1.03)
Original Testlet 2:  -1.15 (1.02)  -.33 (1.20)   .30 (1.11)   .89 (1.25)  1.61 (1.55)
Random Testlet 2:    -1.61 (1.12)  -.86 (1.00)  -.12 (1.02)   .58 (1.02)   .89 (1.27)
Original Testlet 3:  -1.15 (1.00)  -.65 ( .88)  -.27 ( .90)   .47 ( .84)  1.18 ( .88)
Random Testlet 3:    -1.77 (1.06)  -.89 (1.17)  -.24 (1.05)   .35 (1.12)  1.08 (1.20)
Original Testlet 4:  -1.11 (1.04)  -.60 ( .79)   .01 ( .98)   .75 ( .88)  1.48 ( .97)
Random Testlet 4:    -1.70 ( .92)  -.84 ( .83)  -.04 ( .88)   .65 ( .91)  1.54 ( .84)

Form 22 (n = 1044)
Original Testlet 1:  -1.56 ( .97)  -.64 (1.16)  -.07 (1.07)   .45 (1.25)  1.61 ( .75)
Random Testlet 1:    -2.05 ( .83) -1.33 ( .68)  -.54 ( .80)   .37 ( .69)  1.22 ( .83)
Original Testlet 2:  -1.64 ( .99)  -.93 ( .86)  -.17 ( .95)   .62 ( .84)  1.47 ( .92)
Random Testlet 2:    -1.77 (1.19) -1.06 (1.08)  -.26 (1.10)   .70 ( .97)  1.38 (1.38)
Original Testlet 3:  -1.73 ( .98)  -.95 ( .92)  -.09 (1.06)   .80 ( .85)  1.02 (1.52)
Random Testlet 3:    -1.22 (1.24)  -.49 (1.04)   .28 (1.19)  1.29 (1.24)  1.35 (1.97)
Original Testlet 4:  -1.32 (1.01)  -.56 ( .89)   .23 ( .90)  1.03 ( .87)  1.87 ( .91)
Random Testlet 4:    -1.99 ( .85) -1.23 ( .77)  -.47 ( .83)   .40 ( .72)  1.41 ( .72)
Form 23 (n = 1051)
Original Testlet 1:  -1.19 (1.16)  -.90 ( .88)  -.17 ( .94)   .50 ( .98)  1.37 ( .94)
Random Testlet 1:    -1.91 ( .82) -1.01 ( .90)  -.17 ( .84)   .71 ( .86)  1.87 ( .81)
Original Testlet 2:   -.93 (1.05)  -.34 ( .99)   .23 (1.01)   .95 ( .96)  2.00 ( .91)
Random Testlet 2:    -1.23 (1.06)  -.46 (1.03)   .49 (1.15)  1.38 (1.10)  1.93 (1.41)
Original Testlet 3:   -.77 (1.19)  -.12 (1.21)   .46 (1.04)  1.31 (1.22)  1.56 (1.76)
Random Testlet 3:    -1.75 ( .94)  -.93 ( .88)   .05 ( .98)   .84 ( .95)  1.76 (1.12)
Original Testlet 4:  -1.52 ( .96) -1.05 ( .70)  -.46 ( .75)   .27 ( .67)  1.16 ( .76)
Random Testlet 4:    -1.80 ( .84)  -.76 ( .92)   .10 ( .82)  1.15 ( .85)  1.96 (1.21)

Form 24 (n = 1024)
Original Testlet 1:  -1.76 ( .81)  -.98 ( .84)  -.13 ( .91)   .77 ( .83)  1.53 (1.11)
Random Testlet 1:    -1.97 ( .99)  -.88 ( .88)   .21 ( .93)  1.29 (1.01)  2.32 ( .98)
Original Testlet 2:  -1.70 ( .99)  -.59 (1.27)   .06 (1.11)   .80 (1.29)  1.85 (1.15)
Random Testlet 2:    -1.64 ( .91)  -.49 ( .94)   .50 (1.01)  1.53 ( .95)  2.41 (1.14)
Original Testlet 3:  -1.69 ( .93)  -.93 ( .86)  -.02 ( .77)   .79 ( .99)  1.79 ( .93)
Random Testlet 3:    -1.82 ( .92)  -.73 ( .90)   .40 (1.01)  1.49 ( .94)  2.35 (1.22)
Original Testlet 4:  -1.34 ( .88)  -.69 ( .83)   .16 ( .84)   .98 ( .83)  1.71 ( .99)
Random Testlet 4:    -2.03 ( .95) -1.22 ( .64)   .07 ( .75)  1.15 ( .82)  2.13 ( .95)

Form 25 (n = 1016)
Original Testlet 1:  -1.74 ( .86) -1.06 ( .91)  -.49 ( .95)   .15 ( .93)   .75 (1.01)
Random Testlet 1:    -1.75 ( .87) -1.09 ( .82)  -.31 ( .84)   .55 ( .80)  1.61 ( .87)
Original Testlet 2:  -1.26 ( .88)  -.60 ( .99)  -.07 ( .98)   .61 ( .89)  1.41 ( .96)
Random Testlet 2:    -1.37 (1.07)  -.88 (1.00)  -.10 ( .94)   .79 ( .88)  1.67 (1.10)
Original Testlet 3:   -.98 (1.02)  -.51 ( .89)   .12 ( .83)   .70 ( .97)  1.74 ( .82)
Random Testlet 3:    -1.28 (1.13)  -.76 ( .94)  -.09 (1.05)   .80 ( .87)  1.52 (1.09)
Original Testlet 4:   -.94 (1.15)  -.41 (1.05)   .10 (1.05)   .69 (1.19)  1.55 (1.18)
Random Testlet 4:     -.95 (1.11)  -.39 (1.04)   .33 (1.08)  1.26 ( .93)  2.19 (1.01)

Form 26 (n = 896)
Original Testlet 1:  -1.30 (1.08)  -.78 ( .92)  -.09 ( .94)   .60 ( .98)  1.60 ( .86)
Random Testlet 1:    -1.60 ( .77)  -.76 ( .88)   .05 ( .86)   .98 ( .81)  2.14 ( .89)
Original Testlet 2:  -1.74 ( .89) -1.09 ( .76)   .04 (1.25)   .45 (1.22)  1.46 (1.18)
Random Testlet 2:    -1.54 ( .73)  -.64 (1.03)   .05 ( .85)   .99 ( .98)  1.91 (1.00)
Original Testlet 3:  -1.13 (1.00)  -.57 ( .81)   .13 ( .93)  1.01 ( .73)  1.68 ( .94)
Random Testlet 3:     -.56 (1.34)  -.11 (1.05)   .78 (1.13)  1.68 (1.22)  2.33 (1.35)
Original Testlet 4:  -1.18 (1.09)  -.54 ( .94)   .21 (1.01)  1.11 ( .90)  1.86 (1.50)
Random Testlet 4:    -1.16 ( .92)  -.42 ( .93)   .30 ( .82)  1.25 ( .88)  2.44 ( .74)

Form 27 (n = 945)
Original Testlet 1:  -1.40 (1.01)  -.74 ( .94)  -.04 ( .92)   .68 (1.10)  1.48 (1.04)
Random Testlet 1:    -1.36 ( .92)  -.63 ( .79)   .36 ( .86)  1.26 ( .96)  2.38 ( .80)
Original Testlet 2:  -1.01 (1.06)  -.29 (1.12)   .31 (1.22)   .99 (1.27)  2.19 ( .88)
Random Testlet 2:    -1.50 (1.13) -1.11 ( .86)  -.20 (1.06)   .88 (1.01)  1.58 (1.29)
Original Testlet 3:  -1.77 ( .81) -1.23 ( .77)  -.69 ( .79)   .06 ( .80)   .91 ( .77)
Random Testlet 3:    -1.21 (1.05)  -.53 (1.00)   .27 (1.04)  1.19 ( .94)  2.13 ( .96)
Original Testlet 4:  -1.35 ( .98)  -.81 ( .86)   .01 ( .91)   .84 ( .84)  1.35 (1.54)
Random Testlet 4:    -1.30 ( .88)  -.39 ( .95)   .44 ( .78)  1.54 ( .90)  2.49 ( .88)
Form 28 (n = 944)
Original Testlet 1:  -1.55 ( .85) -1.05 ( .89)  -.61 ( .87)   .08 ( .78)   .86 ( .83)
Random Testlet 1:    -2.14 ( .67) -1.21 ( .71)  -.28 ( .80)   .38 ( .84)  1.40 ( .85)
Original Testlet 2:  -1.24 (1.05)  -.56 (1.28)  -.02 (1.11)   .47 (1.41)  1.81 (1.12)
Random Testlet 2:    -1.32 (1.15)  -.45 (1.12)   .14 (1.22)  1.10 (1.07)  1.80 (1.28)
Original Testlet 3:  -1.35 (1.03)  -.95 ( .84)  -.56 ( .96)   .23 ( .74)   .94 ( .83)
Random Testlet 3:    -1.77 (1.10)  -.98 ( .99)  -.24 (1.06)   .45 (1.13)  1.39 ( .98)
Original Testlet 4:  -1.22 (1.05)  -.81 (1.01)  -.15 ( .96)   .42 (1.14)  1.42 (1.08)
Random Testlet 4:    -1.64 ( .94)  -.71 ( .96)   .02 ( .94)   .89 ( .88)  1.90 ( .84)

Form 29 (n = 947)
Original Testlet 1:  -1.48 ( .97)  -.97 ( .75)  -.10 ( .96)   .57 ( .93)  1.54 ( .94)
Random Testlet 1:    -2.20 ( .75) -1.32 ( .80)  -.44 ( .86)   .45 ( .78)  1.45 ( .80)
Original Testlet 2:  -1.31 ( .90)  -.64 ( .86)   .23 ( .86)  1.03 ( .87)  2.12 ( .83)
Random Testlet 2:    -1.69 (1.04)  -.99 ( .89)   .11 ( .97)   .91 (1.05)  1.39 (1.45)
Original Testlet 3:  -1.01 (1.13)  -.17 (1.31)   .41 (1.23)  1.19 (1.27)  1.92 (1.25)
Random Testlet 3:    -1.49 ( .99)  -.40 (1.08)   .42 (1.15)  1.26 (1.24)  2.27 (1.00)
Original Testlet 4:  -1.22 ( .88)  -.45 ( .94)   .18 (1.00)  1.03 ( .83)  1.94 ( .91)
Random Testlet 4:    -1.43 ( .91)  -.66 ( .83)   .17 ( .86)  1.09 ( .80)  1.92 ( .76)

Table 18. Ranges of Average Measures for Original and Random Testlets

Form   Testlet   Original Range   Random Range
 20       1          2.70             4.14
 20       2          2.54             3.70
 20       3          2.60             3.70
 20       4          2.69             4.19
 21       1          2.70             3.01
 21       2          2.76             2.50
 21       3          2.33             2.85
 21       4          2.59             3.24
 22       1          3.17             3.27
 22       2          3.11             3.15
 22       3          2.75             2.57
 22       4          3.19             3.40
 23       1          2.56             3.78
 23       2          2.93             3.16
 23       3          2.33             3.51
 23       4          2.68             3.76
 24       1          3.29             4.29
 24       2          3.55             4.05
 24       3          3.48             4.17
 24       4          3.05             4.16
 25       1          2.49             3.35
 25       2          2.67             2.93
 25       3          2.72             3.10
 25       4          2.49             3.80
 26       1          2.90             3.74
 26       2          3.20             3.45
 26       3          2.81             2.89
 26       4          3.04             3.60
 27       1          2.88             3.74
 27       2          3.20             3.08
 27       3          2.68             3.34
 27       4          2.70             3.79
 28       1          2.41             3.54
 28       2          3.05             3.12
 28       3          2.29             3.16
 28       4          2.64             3.54
 29       1          3.02             3.65
 29       2          3.43             3.08
 29       3          2.93             3.76
 29       4          3.16             3.35

Figure 1. Classification of Testlets

Forms of Testlets
  Content Forms:
    1. Pictorial Form
    2. Interlinear Form
    3. Interpretative Exercise
    4. Problem-Solving Scenario
  Logic Relationships:
    1. Linear relationship
    2. Hierarchical relationship

[Figure 2. An example of a two-level, 3-item, 4-outcome hierarchical testlet (Levels I and II; Outcomes A through D).]

[Figure 3. An example of a three-level, 3-item linear testlet (Levels I, II, and III).]

[Figure: Assessment framework for the Michigan High School Proficiency Test in Science, showing the using, reflecting, and constructing dimensions of scientific knowledge across the life, physical, and earth content areas.]