PREDICTING DIFFERENTIAL ITEM FUNCTIONING IN CROSS-LINGUAL TESTING: THE CASE OF A HIGH STAKES TEST IN THE KYRGYZ REPUBLIC By Todd W. Drummond A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirement for the degree of DOCTOR OF PHILOSOPHY Educational Policy 2011 ABSTRACT PREDICTING DIFFERENTIAL ITEM FUNCTIONING IN CROSS-LINGUAL TESTING: THE CASE OF A HIGH STAKES TEST IN THE KYRGYZ REPUBLIC By Todd W. Drummond Cross-lingual tests are assessment instruments created in one language and adapted for use with another language group. Practitioners and researchers use cross-lingual tests for various descriptive, analytical and selection purposes both in comparative studies across nations and within countries marked by linguistic diversity (Hambleton, 2005). Due to cultural, contextual, psychological and linguistic differences between diverse populations, adapting test items for use across groups is a challenging endeavor. The validity of inferences based on cross-lingual tests can only be assured if the content, meaning, and difficulty of test items are similar in the different language versions of the test items (Ercikan, 2002). Of paramount importance in the test adaptation process is the proven ability of test developers to adapt test items across groups in meaningful ways. One way investigators seek to understand the level of item equivalence on a cross-lingual assessment is to analyze items for differential item functioning, or DIF. DIF is present when examinees from different language groups do not have the same probability of responding correctly to a given item, after controlling for examinee ability (Camilli & Shephard, 1994). In order to detect and minimize DIF, test developers employ both statistical methods and substantive (judgmental) reviews of cross-lingual items. In the Kyrgyz Republic, item developers rely on substantive review of items by bi-lingual professionals. In situations where statistical DIF detection methods are not typically utilized, the accuracy of such professionals in discerning differences in content, meaning and difficulty between items is especially important. In this study, the accuracy of bi-linguals’ predictions about whether differences between Kyrgyz and Russian language test items would lead to DIF was evaluated. The items came from a cross-lingual university scholarship test in the Kyrgyz Republic. Evaluators’ predictions were compared to a statistical test of “no difference” in response patterns by group using the logistic regression (LR) DIF detection method (Swaminathan & Rogers, 1990). A small number of test items were estimated to have “practical statistical DIF.” There was a modest, positive correlation between evaluators’ predictions and statistical DIF levels. However, with the exception of one item type, sentence completion, evaluators were unable to predict which language group was favored by differences on a consistent basis. Plausible explanations for this finding as well as ways to improve the accuracy of substantive review are offered. Data was also collected to determine the primary sources of DIF in order to inform the test development and adaptation process in the republic. Most of the causes of DIF were attributed to highly contextual (within item) sources of difference related to overt adaptation problems. However, inherent language differences were also noted: Syntax issues with the sentence completion items made the adaptation of this item type from Russian into Kyrgyz problematic. 
Statistical and substantive data indicated that the reading comprehension items were less problematic to adapt than analogy and sentence completion items. I analyze these findings and interpret their implications to key stakeholders, provide recommendations for how to improve the process of adapting items from Russian into Kyrgyz and highlight cautions to interpreting the data collected in this study. Copyright by Todd W. Drummond 2011 ACKNOWLEDGEMENTS I feel fortunate to have had the opportunity to pursue doctoral work in the College of Education at Michigan State University. I would like to express my sincere gratitude to my dissertation director, Dr. Mark Reckase, for his patient guidance as he mentored me throughout my doctoral studies. Special thanks also to my academic advisor Dr. Jack Schwille for his thoughtful probing and encouragement. Committee members Dr. Jim Fairweather and Dr. Ed Roeber were always accessible and provided constructive feedback. Dr. Michael Sedlak has been a consistent source of moral and financial support for all the students in the educational policy program at MSU. Though not directly involved with this dissertation, I learned a tremendous amount about leadership from Dr. John Hudzik in the Office for Global Engagement and about educational politics from the wisdom of Dr. Phillip Cusick. Thanks to Seung-Hwan Ham and Wang Jun Kim for their friendship and interest in reading my work. Colleagues and friends at the American Councils for International Education in both Washington, D.C., and the Kyrgyzstan field office influenced my thinking about this dissertation. I would like to acknowledge Dr. Dan Davidson, Dr. David Patton, Michael Curtis and Kimberly Verkuilen as well as past and current members of the American Councils team in Bishkek for their friendship and the role they have played in my professional development for more than a decade. This dissertation would not have been possible without the support of dozens of colleagues, students and friends in Kyrgyzstan. I thank Nina Dolzhenko, my former students, the entire collective at school number one in Kant, colleagues at the Ministry of Education, and the Shirinovi and Chokubeavi families for “introducing me” to Kyrgyzstan almost two decades ago. I also thank former Minister of Education, Camilla Sharshekeeva, for her friendship, inspirational courage, and tenacious optimism. v Past and current staff of the Center for Educational Assessment and Teaching Methods (CEATM) in Bishkek led by Dr. Inna Valkova have not only my deepest gratitude for their enthusiastic collaboration, but my sincere admiration for the outstanding work they do in very difficult conditions: Constantine Titov, Natalia Naumova, Merim Kadyrova and Asel Bazarbaeva, study participants, and the rest of the CEATM team supported me in every possible way in the summer of 2010. This research was made possible by support from the U.S. State Department’s (Title VIII) Research Scholars program administered by the American Councils for International Education. Finally, I want to express a very heartfelt thanks to my family. Vitaly and Lubov Stolyarovi as well as their extended families in Kyrgyzstan have been constant supporters. I thank my parents, R. Wayne and Gayle Drummond, for their unwavering love and a lifetime of opportunities and encouragement. I dedicate this dissertation to the most important person in my life, my wife and best friend, Natalia, who has been with me every step of the way. 
vi TABLE OF CONTENTS LIST OF TABLES ................................................................................................................... X CHAPTER 1: PREDICTING DIFFERENTIAL ITEM FUNCTIONING IN CROSSLINGUAL TESTING ............................................................................................................... 1 OVERVIEW ................................................................................................................................... 1 THE CHALLENGE OF CROSS-LINGUAL ASSESSMENT .................................................................. 2 RESEARCH QUESTIONS ................................................................................................................ 6 UTILITY OF THIS STUDY............................................................................................................... 7 SITUATING THE STUDY AND KEY TERMS ................................................................................... 10 STUDY LIMITATIONS .................................................................................................................. 13 ORGANIZATION OF THE STUDY .................................................................................................. 15 CHAPTER 2: EDUCATION & LANGUAGE(S) OF INSTRUCTION IN THE KR ..... 16 OVERVIEW ................................................................................................................................. 16 CONTEMPORARY SCHOOLING AND LANGUAGE ISSUES .............................................................. 26 THE STATUS OF RUSSIAN AS A MEDIUM OF INSTRUCTION ........................................................ 32 QUALITY OF EDUCATION BY LANGUAGE OF INSTRUCTION........................................................ 38 TERTIARY EDUCATION AND THE NST ....................................................................................... 43 STUDENT SELECTION IN THE SOVIET PERIOD ........................................................................... 47 THE NATIONAL SCHOLARSHIP TEST AND LANGUAGE POLITICS ............................................... 55 CHAPTER 3: LITERATURE REVIEW ............................................................................ 58 SUBSTANTIVE REVIEW AND DIF PREDICTION .......................................................................... 58 LEVELS OF DIF IN CROSS-LINGUAL TESTING .......................................................................... 67 CAUSES OF DIF IN CROSS-LINGUAL TESTING .......................................................................... 70 DIF AS STATISTICAL ARTIFACT ................................................................................................ 77 CHAPTER 4: METHODS .................................................................................................... 81 CONTENT AND DEVELOPMENT OF THE 2010 NST .................................................................... 81 THE ITEM ADAPTATION PROCESS ............................................................................................. 84 STATISTICAL DIF DETECTION METHOD ................................................................................... 85 PREPARING FOR THE STATISTICAL ANALYSIS ........................................................................... 93 SAMPLE SELECTION .................................................................................................................. 
93 THE INDIVIDUAL ITEM ANALYSIS RUBRICS............................................................................... 94 SELECTING THE EVALUATORS ................................................................................................... 96 ADMINISTERING THE RUBRICS ................................................................................................ 100 GROUP ITEM ANALYSIS ........................................................................................................... 102 vii SUMMARY RUBRIC ................................................................................................................... 103 ESTIMATING INTER-RATER RELIABILITY ................................................................................ 105 ESTIMATING EVALUATORS’ ACCURACY IN DIF PREDICTION ................................................. 106 CHAPTER 5: RESULTS .................................................................................................... 108 DIF DETECTION RESULTS ...................................................................................................... 108 INTER-RATER RELIABILITY AND RANK ORDER ESTIMATIONS ................................................ 111 DIRECTION OF DIF ................................................................................................................. 115 READING COMPREHENSION ITEMS ......................................................................................... 118 SENTENCE COMPLETION ITEMS .............................................................................................. 120 ANALOGY ITEMS ...................................................................................................................... 122 SOURCES OF DIFFERENCE ...................................................................................................... 125 TRANSLATION AND ADAPTATION ISSUES ................................................................................. 128 SOCIO-CULTURAL ISSUES ........................................................................................................ 145 FORMAT................................................................................................................................... 146 GRAMMAR ............................................................................................................................... 148 OTHER ISSUES ......................................................................................................................... 149 CHAPTER 6: DISCUSSION & CONCLUSIONS ........................................................... 151 UNDERSTANDING EVALUATORS’ DIF PREDICTIONS............................................................... 151 ACCURACY IN SUBSTANTIVE ITEM REVIEW ............................................................................. 152 RECOMMENDATIONS FOR RESEARCHERS AND CEATM.......................................................... 156 UNDERSTANDING THE CAUSES OF DIF ................................................................................... 158 RECOMMENDATIONS FOR RESEARCHERS AND CEATM.......................................................... 163 STATISTICAL DIF AND THE NST VERBAL ITEMS .................................................................... 167 CAUTIONS TO STATISTICAL DIF INTERPRETATION ................................................................. 169 RECOMMENDATIONS FOR IMPROVING STUDIES OF SUBSTANTIVE METHODS ......................... 
175 CHALLENGES TO COLLECTING AND INTERPRETING DATA FROM THE SUBSTANTIVE REVIEW 178 CONCLUSION ........................................................................................................................... 180 APPENDICES ....................................................................................................................... 184 APPENDIX A: SCHOOLS BY LANGUAGE(S) OF INSTRUCTION IN THE KR ...... 185 APPENDIX B: STUDENTS (%) IN MAIN LANGUAGE TRACKS BY OBLAST ........ 186 APPENDIX C: NST PARTICIPATION RATES IN THE KR ......................................... 187 APPENDIX D: DEMOGRAPHICS AND TEST SCORES (2010) ................................... 188 APPENDIX E: DEMOGRAPHICS OF SCHOLARSHIP WINNERS ............................ 189 APPENDIX F: SELECTIVITY OF HEIS IN THE KR ..................................................... 190 APPENDIX G: COMPLETING THE ITEM ANALYSIS RUBRICS............................. 191 APPENDIX H: GLOSSARY OF KEY RUBRIC TERMS................................................ 192 APPENDIX I: ITEM RUBRICS 1.A & 1.B ......................................................................... 196 APPENDIX J: ITEM RUBRIC 2 ........................................................................................ 198 APPENDIX K: UNIFORM DIF STATISTICS ................................................................. 205 viii APPENDIX L: ITEMS WITH MODERATE OR LARGE DIF....................................... 207 APPENDIX M: ITEMS WITH NO DIF ............................................................................. 208 APPENDIX N: NON-UNIFORM DIF STATISTICS ....................................................... 209 APPENDIX O: ITEM LOCATION ACROSS EFFECT SIZE VALUES ....................... 211 APPENDIX P: EVALUATOR SCORING MATRIX ....................................................... 212 APPENDIX Q: INTER-RATER RELIABILITY ............................................................. 214 APPENDIX R: RAW DATA FOR RANK ORDER ESTIMATION ............................... 216 APPENDIX S: RANK ORDER CORRELATION ............................................................ 217 APPENDIX T: EVALUATOR MARKS AND DIF STATISTICS ................................... 218 APPENDIX U: NUMBER, NATURE OF DIFFERENCES BY ITEM............................ 219 APPENDIX V: KYRGYZ ONLY DIF ANALYSIS........................................................... 222 APPENDIX W: SUMMARY ITEM ANALYSIS RUBRICS ........................................... 224 REFERENCES...................................................................................................................... 289 ix LIST OF TABLES TABLE 2-1: THREE LARGEST NATIONALITIES IN THE KYRGYZ REPUBLIC ............. 28 TABLE 2-2: PERCENTAGE OF STUDENTS IN MAIN LANGUAGE TRACKS .................. 29 TABLE 2-3: NST 2010 SCORES BY LANGUAGE OF INSTRUCTION ................................ 37 TABLE 2-4: PISA 2006 MATHEMATICS SCORES BY LANGUAGE OF INSTRUCTION 39 TABLE 2-5: NAEQ 2007 READING SCORES BY LANGUAGE OF INSTRUCTION ......... 39 TABLE 2-6: SOVIET AND CONTEMPORARY SELECTION PROCEDURES .................... 50 TABLE 2-7: SCHOLARSHIP WINNERS BY QUOTA CATEGORY (2010).......................... 57 TABLE 4-1: DESCRIPTIVE DATA FROM THE NST 2010 .................................................... 82 TABLE 4-2: EXAMPLE ANALOGY AND SENTENCE COMPLETION ITEMS .................. 84 TABLE 4-3: FLOW CHART FOR TEST ITEM ADAPTATION ............................................. 
85 TABLE 4-4: TYPOLOGY OF ETHNIC KYRGYZ RUSSIAN LANGUAGE KNOWLEDGE 98 TABLE 4-5: BACKGROUND CHARACTERISTICS OF SELECTED EVALUATORS ........ 99 TABLE 5-1: ITEMS (%) BY EFFECT SIZE LEVELS AND ITEM TYPE ............................ 110 TABLE 5-2: EVALUATOR MARKS AND STATISTICS FOR PREDICTED DIF ITEMS . 114 TABLE 5-3: PREDICTION OF DIF DIRECTION FOR ITEMS PREDICTED AS DIF ........ 115 TABLE 5-4: PREDICTION OF DIF DIRECTION FOR ALL ITEMS .................................... 116 TABLE 5-5: STATISTICALLY SIGNIFICANT READING COMPREHENSION ITEMS .. 119 TABLE 5-6: STATISTICALLY SIGNIFICANT SENTENCE COMPLETION ITEMS.......... 121 TABLE 5-7: STATISTICALLY SIGNIFICANT ANALOGY ITEMS ..................................... 123 TABLE 5-8: SUMMARY OF EVALUATORS’ MARKS BY ITEM TYPE ........................... 124 TABLE 6-1: ITEMS ABOVE MEDIAN EFFECT SIZE WITH THREE OR MORE DIF MARKS .............................................................................................................................. 175 x TABLE A-1: SCHOOLS BY LANGUAGE(S) OF INSTRUCTION IN THE KR .................. 185 TABLE A-2: STUDENTS (%) IN MAIN LANGUAGE TRACKS BY OBLAST ................... 186 TABLE A-3: NST PARTICIPATION RATES BY OBLAST & LANGUAGE ........................ 187 TABLE A-4: DEMOGRAPHICS AND TEST SCORES ......................................................... 188 TABLE A-5: NST WINNERS BY LANGUAGE, OBLAST (2010) ......................................... 189 TABLE A-6: AVERAGE NST SCORES OF SCHOLARSHIP WINNERS ............................ 190 TABLE A-7: UNIFORM DIF STATISTICS FOR 38 VERBAL ITEMS ................................ 205 TABLE A-8: VERBAL ITEMS WITH MODERATE OR LARGE DIF ................................. 207 TABLE A-9: NON-SIGNIFICANT VERBAL ITEMS ............................................................ 208 TABLE A-10: NON-UNIFORM VERBAL DIF STATISTICS ............................................... 209 TABLE A-11: CONTINUUM OF EFFECT SIZE VALUES BY ITEM TYPE ........................ 211 TABLE A-12: EVALUATOR ITEM SCORING MATRIX ..................................................... 212 TABLE A-13: RELIABILITY STATISTICS ........................................................................... 214 TABLE A-14: CHI-SQUARE VALUES & EVALUATORS’ SCORES ................................. 216 TABLE A-15: RANK ORDER CORRELATION RESULTS ................................................... 217 TABLE A-16: EVALUATOR MARKS AND DIF STATISTICS ........................................... 218 TABLE A-17: NUMBER AND NATURE OF DIFFERENCES BY INDIVIDUAL ITEM ... 219 TABLE A-18: DIF STATISTICS FOR KYRGYZ RURAL AND URBAN STUDENTS ...... 222 xi Chapter 1: Predicting Differential Item Functioning in Cross-Lingual Testing Overview Cross-lingual tests are assessment instruments created in one language and adapted for use with another language group. Practitioners and researchers use cross-lingual assessments for various descriptive, analytical and selection purposes both in comparative studies across nations and within countries marked by linguistic diversity (Ercikan, 2002; Hambleton, 2005). In 2002, educational policy makers in the Kyrgyz Republic (KR) changed the selection criteria for awarding state scholarships to higher education by replacing oral admissions examinations with a 1 standardized, cross-lingual test (Clark, 2005). The new test, known as the National Scholarship Test (NST), is conducted in May of each year in the Kyrgyz, Russian, and Uzbek languages (Valkova, 2004). 
The introduction of standardized testing in Kyrgyzstan merits scholarly attention for many reasons. In general, high stakes selection testing is a political endeavor with distributive consequences. For some students, success on the NST represents a once in a lifetime chance to access higher education (Drummond & De Young, 2004). As NST results are the sole criterion for university scholarship distribution, the public is counting on the NST to be fair to all examinees, regardless of ethnic or language background. Research is needed to determine the extent to which the NST has met its stated goal of reducing corruption in access to university scholarships. Another inquiry worthy of exploration is the extent to which the new selection criterion has impacted schooling. Selection testing for tertiary admissions can impact secondary school classrooms as administrators, teachers, and pupils adjust to the incentives created by what is assessed on high stakes tests (Yeh, 2005). While the above issues are important, this study addresses key questions in cross-lingual assessment at the test item level. Though not often the focus of policy makers’ attention, item level analyses are essential because valid selection inferences in cross-lingual testing must be based upon the foundation of equivalent test items (Hambleton, 2005).

Footnote 1: Kazakhstan, Georgia, Russia, Ukraine, Azerbaijan and Uzbekistan have also replaced oral examinations with cross-lingual, standardized admissions tests since the collapse of the USSR. The primary rationale for change has been to overcome corrupt practices that have plagued university admissions in the post-Soviet era (Drummond & De Young, 2004; Clark, 2005; Osipian, 2007; Heyneman, Anderson, & Nuralieva, 2008).

Footnote 2: The “Kyrgyz Republic” is the official name of the country, but “Kyrgyzstan” is also commonly used.

The Challenge of Cross-Lingual Assessment

The validity of inferences based on the results of any assessment must be carefully substantiated (Messick, 1988). However, cross-lingual testing introduces additional complexity into measurement and interpretive processes. Inferences derived from cross-lingual test results are based on the assumption that the items are measuring the same constructs at the same level of difficulty across language groups. In fact, cross-lingual item adaptation is a highly complex task due to the myriad of linguistic, cultural and psychological differences between groups: Item equivalence, and thus comparability across groups, can not simply be assumed (Hambleton, 2005). Successful cross-lingual item adaptation requires not only an understanding of test specifications, item aims and content knowledge, but also cultural and nuanced linguistic expertise in order to ensure that all examinees experience “the same” test items (ibid, 2005). The evidence from empirical studies of test adaptation across languages is that accurately adapting items is not always a straightforward task (Reckase & Kunce, 2002; Ercikan, 2002; Van de Vijver & Poortinga, 2005; Grisay & Monseur, 2007). Unintentional differences between item versions can manifest themselves in many ways: Variation in content, presentation, translation or adaptation, format, or mistakes in one version (e.g., grammar mistakes) can all result in differential performance across groups.
Even when two language versions of an item appear to be linguistically equivalent and convey similar content, meaning and difficulty, there may be less visible but critically important cultural, contextual, and psychological background differences between diverse groups that impact a group’s performance on an item. For example, variation in curricular or content exposure, opportunity to learn, instructional differences or other background phenomena may impact item performance by group differentially (Gierl & Khaliq, 2001; Van de Vijver & Poortinga, 2005; Hambleton, 2005). One way investigators seek to understand the level of item equivalence on cross-lingual assessments is to analyze items for differential item functioning, or DIF. DIF is present when examinees from two or more distinct groups do not have the same probability of responding correctly to a given test item, after controlling for examinee ability (Camilli & Shephard, 1994). As with factor analytic studies, the utility of DIF studies is that they provide an understanding of the measurement invariance of a test between studied groups. A large number of un-interpretable or un-rectifiable DIF items can result in invalid selection, categorization, or policy decisions and consequently have important political and social implications (Ercikan & Koh, 2005; Grisay & Monseur, 2007).

Footnote 3: In theory, “rectifiable DIF” (typically the result of overt issues such as translation mistakes) can be directly addressed after DIF analysis and therefore does not represent as serious a threat, assuming that analyses are conducted and steps taken before test scoring.

Researchers conduct DIF studies on gender, racial, language and other group differences. When professional capacity and large sample sizes are readily available, such studies typically employ statistical methods to detect DIF. Sometimes, they include a substantive item review to predict or interpret DIF post-hoc (Ercikan, 2002). Substantive review relies on experts’ “best estimates” to identify and/or interpret differences and estimate how groups will be impacted by those differences. Previous research has shown that substantive reviews are not consistently effective at accurately predicting or interpreting statistical DIF. In some studies, there has been a low correlation between reviewers’ predictions and statistical DIF outcomes (Plake, 1980; Engelhard, Hansche & Rutledge, 1990). However, in some recent cross-lingual DIF studies, substantive review has proven to be relatively successful in interpreting DIF causes post-hoc (Allalouf, Hambleton, & Sireci, 1999; Gierl & Khaliq, 2001). The choice of substantive review methods, timing of review (before or after statistical analyses), knowledge of whether or not items have been flagged as statistical DIF, expertise of evaluators, and other contextual factors appear to impact the results of such studies (Ercikan, 2002). Ideally, in order to both accurately detect and interpret causes of DIF, both statistical and substantive analyses of test items are needed (Sireci & Allalouf, 2003). However, in many cross-lingual testing contexts, there is little or no capacity to employ statistical DIF detection methods. In countries of the former Soviet Union, those charged with developing assessments rely almost exclusively on substantive methods in item review and analysis (Drummond & De Young, 2004). Historically, standardized testing was considered “ideologically incorrect” and no investment was made in the field of educational measurement.
As there was no standardized testing, there was no need for statistical DIF detection methods. The application of quantitative methods to educational outcomes in general was rare, and the validity of inferences was not typically empirically tested. In the development of cross-lingual educational materials in general, substantive review relying on bi-lingual educators and translators (not necessarily panel review, sometimes a single translator) was considered to be a satisfactory verification of adaptation quality. The concept of an educator whose expertise was in “measurement” or “psychometrics” did not exist in the KR until the introduction of standardized testing in 2002 (Drummond & De Young, 2004). Today there are still no higher education courses offered in educational assessment and measurement in the KR, and only a handful of specialists have received any training in the basic concepts of psychometric theory. To my knowledge, and the knowledge of the test center staff who conduct the NST, no educators associated with the development of assessment instruments in the republic have ever participated in a DIF analysis.

Footnote 4: This is primarily due to the fact that in the Soviet period educational assessment at both the secondary and tertiary levels relied heavily on oral examinations (Drummond & De Young, 2004).

In contexts such as the KR where test developers rely on substantive item review, it is essential that bi-lingual personnel be able both to identify overt item differences between language versions and to predict how differences in examinee backgrounds will impact group performance. If bi-linguals can not detect differences or predict performance patterns with at least a modicum of accuracy, this calls into question the feasibility of accurate test adaptation across groups and hence the feasibility of cross-lingual assessments: Thus the need to “problematize” the ability of the bi-lingual evaluator to accurately predict DIF in the republic at this time. There are also good reasons to probe for the quantity and causes (sources) of differential item functioning (DIF) on the NST. Recent research has shown that the more disparate the language families involved in a cross-lingual assessment, the more challenging it can be to ensure the equivalence of test forms or unambiguously interpret assessment results (Sireci, Pastula, & Hambleton, 2005; Ercikan & Koh, 2005; Grisay, de Jong, Gebhardt, Berezner, & Halleux, 2006; Grisay & Monseur, 2007). The Russian and Kyrgyz languages come from very different language families, Slavic and Altaic (Oruzbaeva, 1997). In other words, while there may be some “common challenges” to cross-lingual test adaptation in general (regardless of the specific languages involved), it is increasingly clear that the feasibility of employing equivalent cross-lingual tests is also a function of the particular languages in question.

Research Questions

In this study I explored two research questions related to the two cross-lingual item adaptation issues highlighted above. First, to what extent were bi-lingual item evaluators able to predict differential item functioning (DIF) on cross-lingual, verbal skills test items from the 2010 National Scholarship Test (NST)? This question was answered by determining how accurately evaluators predicted statistical DIF and by how accurately they estimated which language group was favored by DIF. Second, what were the causes or sources of DIF on the Kyrgyz and Russian test items?
Were DIF causes related to overt item adaptation issues like poor adaptation, or due to background characteristics of examinees such as cultural or inherent linguistic differences in the way a particular language expresses or represents certain meanings or constructs? (Reckase & Kunce, 2002). To answer these research questions I designed and conducted a substantive review of thirty-eight verbal skills test items from the 2010 NST. Ten bi-lingual evaluators were selected to complete the item review process. This work took place in Bishkek, Kyrgyzstan, in June of 2010, between the time that the 2010 tests were administered and the time examinee score reports were released. The items evaluated consisted of eighteen analogy items, ten sentence completion items, and ten reading comprehension items. The item analysis rubrics developed for this study required the evaluators to: (1) estimate the level of difference(s) (if any) in content, meaning and difficulty between the two versions of each item; (2) characterize the nature of difference(s); (3) describe the difference(s); (4) estimate which group was favored; (5) suggest improvements to make the items more equivalent; and (6) participate in a group discussion about each item pair. Then, I analyzed the items for statistical DIF using the logistic regression (LR) method to provide empirical data about the actual item response patterns by language group (Swaminathan & Rogers, 1990). An effect size measure proposed by Jodoin and Gierl (2001) was applied to each item analysis to limit Type I error in statistical estimation (the logic of this two-step procedure is sketched below). With this data I was able to compare the predictions of the evaluators with actual statistical outcomes and analyze the relationship between these two estimation approaches. Data for understanding the causes of DIF came from the item evaluators’ descriptions of the items on the evaluation rubrics and the group discussion of each item pair.

Utility of this Study

In general, there are relatively few studies that seek to identify the causes of DIF on cross-lingual assessments (Ercikan, 2002; Hambleton, 2005). To my knowledge, no DIF studies comparing items from the Altaic and Slavic language families had been carried out at the time of this study. An important goal of this study is to contribute to an understanding of the unique challenges to test adaptation between these two language groups: Characterizing the sources of DIF will inform the planning and design of future cross-lingual assessments in the KR (Gierl & Khaliq, 2001; Jodoin & Gierl, 2001). At present, there are large performance gaps between the Kyrgyz and Russian language groups on the NST. Thus, the study touches on not only technical but also sensitive political issues, and the results can either provide support for inferences based on the NST or reveal critical areas where further work needs to be done to improve item equivalence. Performance gaps of course do not automatically mean high DIF levels. There are urban and rural cleavages in educational outcomes in the KR that parallel the language gaps on the NST (chapter two). Despite evidence that demographics, socio-economic conditions, and selectivity bias plausibly best explain the performance gaps by language, poor test adaptation could nonetheless be a contributing factor to these gaps, and any improvements in test quality would improve the validity of inferences based on test results.
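To make the statistical procedure referred to above concrete, the following minimal sketch (in Python, using statsmodels) illustrates the general logic of the Swaminathan and Rogers (1990) logistic regression approach together with the Nagelkerke R-squared effect size classification proposed by Jodoin and Gierl (2001). It is not the code used in this study; the variable names, the group coding (Russian as the reference group, Kyrgyz as the focal group), and the use of the total test score as the matching criterion are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm


def nagelkerke_r2(result, n):
    """Nagelkerke pseudo R-squared for a fitted statsmodels Logit result."""
    cox_snell = 1.0 - np.exp((2.0 / n) * (result.llnull - result.llf))
    max_cox_snell = 1.0 - np.exp((2.0 / n) * result.llnull)
    return cox_snell / max_cox_snell


def lr_dif(item, total, group):
    """Logistic regression DIF analysis for one dichotomously scored item.

    item  : 0/1 responses to the studied item
    total : matching criterion, e.g., total test score (an assumed choice)
    group : 0 = reference group, 1 = focal group (an assumed coding)
    """
    item = np.asarray(item, dtype=float)
    total = np.asarray(total, dtype=float)
    group = np.asarray(group, dtype=float)
    n = len(item)

    # Three nested models: ability only; + group; + ability-by-group interaction.
    x1 = sm.add_constant(np.column_stack([total]))
    x2 = sm.add_constant(np.column_stack([total, group]))
    x3 = sm.add_constant(np.column_stack([total, group, total * group]))
    m1 = sm.Logit(item, x1).fit(disp=0)
    m2 = sm.Logit(item, x2).fit(disp=0)
    m3 = sm.Logit(item, x3).fit(disp=0)

    # Likelihood-ratio chi-square tests (1 df each): group term flags uniform DIF,
    # the interaction term flags non-uniform DIF.
    chi2_uniform = 2.0 * (m2.llf - m1.llf)
    chi2_nonuniform = 2.0 * (m3.llf - m2.llf)

    # Jodoin & Gierl (2001) effect size: difference in Nagelkerke R-squared
    # between the full model and the ability-only model (2 df overall test).
    delta_r2 = nagelkerke_r2(m3, n) - nagelkerke_r2(m1, n)
    if delta_r2 < 0.035:
        magnitude = "negligible"
    elif delta_r2 < 0.070:
        magnitude = "moderate"
    else:
        magnitude = "large"

    return {"chi2_uniform": chi2_uniform,
            "chi2_nonuniform": chi2_nonuniform,
            "delta_R2": delta_r2,
            "effect_size": magnitude}
```

Under these guidelines, an item is ordinarily treated as showing practically important ("practical statistical") DIF only when the model comparison is statistically significant and the R-squared difference reaches at least the moderate range, which is how an effect size criterion limits Type I error relative to significance testing alone.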
As noted above, the exclusive reliance on substantive review needs to be problematized until there is empirical evidence that bi-linguals can effectively adapt and analyze cross-lingual test items without the help of statistical DIF detection techniques. The Ministry of Education in the KR has an interest in this study as policy makers seek to enact a selection policy for university scholarships that is fair to all ethnic and language groups (Presidential Decree No. 91, 2002). In the event that item adaptation in this context appears fraught with irreconcilable problems, policy makers have choices: They could consider different policy options like administering separate - not cross-lingual - assessment instruments. They could consider modifications to the NST if the results of this study indicate this might be necessary. Or, they could consider returning to oral examinations and abandoning cross-lingual, standardized testing entirely.

The Center for Educational Assessment and Teaching Methods (CEATM), the organization that conducts the NST, also has a stake in the results of this study. While they have procedures in place for test adaptation and item review, results could shed light on weaknesses in these processes and indicate areas for improvement. Results could demonstrate that different approaches to adaptation are necessary for some item types or that more stringent curriculum surveys or other analyses need to be carried out. They could alter the way they invest in adaptation procedures or take new steps that would improve DIF predictability and lower DIF levels. Methods could also be explored, such as special equating or scaling methods that take systemic DIF into account by adding points to groups that have been discriminated against.

Finally, outside of the KR, countries continue to join international assessment regimes like the Trends in International Mathematics and Science Study (TIMSS), the Programme for International Student Assessment (PISA), the Progress in International Reading Literacy Study (PIRLS), and the Teacher Education and Development Study in Mathematics (TEDS-M). Cross-lingual testing is likely to remain a highly visible endeavor that includes more and more countries in the coming years (Hambleton, 2005). Many of the newcomers to these regimes are not from countries with high capacity in the field of psychometrics and measurement. Despite the fact that the item development protocols for the above regimes are designed in countries with a longer history in testing and measurement, all countries must still conduct much of the adaptation from core languages (English and French in PISA, for example) into other languages. In the newly independent countries of the former Soviet Union, substantive item review is still used as the primary means of reviewing and analyzing adapted tests (Drummond, 2011). These Eurasian countries also employ cross-lingual, standardized tests in the Slavic and Altaic languages. Depending on the level and nature of DIF discovered and the efficacy of bi-lingual reviewers in this study, the results could assist policy makers in these countries in developing their own assessment capacity. In the short term, the results could help them decide whether or not cross-lingual assessments should serve as the single selection criterion for high stakes university admissions tests (Clark, 2005).

Footnote 5: For PIRLS and TIMSS information see: http://timss.bc.edu/; for TEDS-M see: http://www.iea.nl/teds-m.html; for PISA see: http://nces.ed.gov/surveys/pisa/

In the rest of this introductory chapter I situate the study, define key terms and then set the stage for what follows.
Situating the Study and Key Terms

Differential item functioning (DIF) is occasionally treated as synonymous with bias (Hambleton, 2005). However, this is not a bias study. In fact, there are important distinctions between DIF and bias. The term bias tends to imply inherent unfairness, and the term is often used broadly in a social rather than statistical sense. Items identified as DIF, however, may or may not be fair, depending on the sources of DIF. In cross-lingual studies, item pairs marked as DIF indicate only that the two versions of the item are performing differently in the two groups, not the reasons for that differential performance. Only by collecting and analyzing more information (usually through post-hoc substantive review) can it be determined if bias exists. In essence, DIF is an essential prerequisite for bias but is not the same thing as bias (Camilli & Shephard, 1994). Despite the popular notion that large achievement gaps between groups are usually due to bias in testing, this is not necessarily the case. Zumbo (2003) points out that when two groups demonstrate different probabilities of answering correctly due to true differences in the underlying ability being measured, this indicates item impact, not item bias. Van De Vijver and Poortinga (2005) help clarify the distinction between DIF and bias by noting that bias occurs when one group of examinees is likely to perform less well because of some characteristic not relevant to the assessment purpose. Or, “A measure is considered to be biased if scores of different language versions of the same instrument are differentially affected by an unwanted and undesirable source of variance” (Van De Vijver & Poortinga, 2005, p. 41). Thus, bias is usually associated with the presence of some “nuisance factors” which hinder our ability to attain a closer approximation of a true score.

Bias on an assessment can be the result of some background factor that differentiates tested populations (for example, differences in the populations’ experience with a testing format) or due to overt, item related issues like translation problems or unclear items due to format mistakes. An example of overt bias on a cross-lingual test is the adaptation of a word from the source language that results in the use of a word with a different meaning or multiple meanings in the target language. The different meaning may result in an item that confuses the examinees in the target language and in a failure to assess knowledge of the intended, original word. The resulting variance in performance between the two groups can not be said to be due to knowledge of the original word or construct but rather to artificially introduced differences in the difficulty of the item due to confused word meaning. In this case, the nuisance factor is the poor quality of test translation or adaptation (Hambleton, 2005). Van de Vijver and Tanzer (1999) identify three types of bias that can impact cross-lingual tests – construct bias, item bias, and method bias. Construct bias occurs when there is an incomplete overlap of psychological or linguistic constructs between the cultural groups in question. Entire ways of conceptualizing problems can differ, and the existence of certain concepts and ideas might not be found to the same degree between disparate groups (Hambleton, 2005).
Method bias can occur due to variation in the conditions under which an instrument is administered or due to differences in exposure to certain techniques, like “filling in bubbles” on multiple-choice tests (Van De Vijver & Poortinga, 2005). Method bias was less of a concern for this study as all regions of the Kyrgyz Republic experienced the introduction of standardized testing at the same time. All NST test administrators are trained by the same central authority using detailed test administration manuals, and NST testing is conducted under tightly controlled, standardized conditions (Valkova, 2004).

The term DIF is utilized in reference to statistically identified differences in item response patterns, not the claims of item reviewers as to whether or not an item has the same meaning in different groups. That is, evaluators do not identify DIF per se, but instead estimate the likelihood of difference in their professional opinion or, as in the case of post-DIF analysis reviews, provide interpretations as to what might be the cause of DIF identified by statistical methods (Camilli & Shephard, 1994). While some researchers use the terms “DIF review” and “substantive review” interchangeably, in this study I use the term “DIF review” to imply statistical analysis, not the estimation of differences from a substantive review. At the same time, statistical tests alone reveal nothing about the nature of differences between groups – only that respondents in the groups have different odds of answering an item correctly. In order to interpret DIF it is essential to conduct substantive reviews with bi-lingual expert panels or committees (Ercikan, 2002). As noted above, the reasons for differences in outcomes may be due to true differences or bias. It is possible that through the statistical DIF detection and substantive review methods employed in this study, bias on the NST items will be detected. However, in this study I sought to understand how sensitive bi-lingual evaluators were to overt item differences and to background differences of examinees (i.e., differences from either item impact or bias), as well as how accurately they could predict which group these differences favored - not necessarily to distinguish between bias and DIF per se.

Finally, strict measurement equivalence on cross-lingual tests is rare in practice. Van De Vijver and Poortinga (2005) maintain that the constituent elements of constructs like behaviors, attitudes, or norms are never identical across all cultural or linguistic groups. Representatives of different groups are likely to always be somewhat differentially impacted by certain situational types of questions, curricular coverage, and background knowledge. Thus, the finding of some DIF on the NST does not automatically invalidate the comparative inferences based on the NST: DIF results must be put into perspective and context.

Study Limitations

Despite the importance of DIF studies, this type of analysis gathers only a portion of the validity evidence necessary to support the appropriateness of an assessment for the purpose for which it is employed (Messick, 1988). For example, the absence of DIF on a cross-lingual selection test such as the NST in the Kyrgyz Republic reveals nothing about whether or not the domains covered by the test are the most useful for university selection purposes. The various language versions of the test might be equivalent but lack predictive validity.
Determining the validity of inferences from any assessment is “an overall evaluative judgment, founded on empirical evidence and theoretical rationales, of the adequacy and appropriateness of inferences and actions based on test scores” (Messick, 1988, p. 35). In short, the appropriateness of the NST as a selection instrument (and its overall fairness) can not be determined by a DIF study alone. Further, the practical utility of any validity or DIF study in policy making is not always related to the meaningfulness of the study’s results. As Margaret Archer (1979) reminds us, educational policy is not a natural response to evolving “societal needs,” but rather the expression of the will of actors with the power and ability to influence policy and institutionalize their version of reality. Policy decisions are political and can be arbitrary or confused, or made by policy makers whose intentions are not benign; data and evidence can be utilized, or not (Archer, 1979). In countries like Kyrgyzstan, where institutional corruption and test score abuse have a long history, validity issues can be peripheral or even completely irrelevant to policy outcomes (Clark, 2005; Drummond & De Young, 2004). Archer’s perspective helps us maintain realistic expectations as to the power (or lack of power) of validity and DIF studies to impact policy decisions.

Nonetheless, this study provides important foundational validity evidence because the results provide information about the challenges to item adaptation from Russian into Kyrgyz as well as the utility of employing substantive item review. As Messick (1988) emphasizes, there is no way to judge responsibly the usefulness of score inferences in the absence of evidence as to what the scores mean. Of course, the overall selection inferences made from cross-lingual assessments can be valid or invalid even when DIF levels are low. However, if DIF levels are high between the various groups tested, and test developers are unable to understand why, there can be few valid selection inferences based on the test, regardless of how transparently test results are utilized by higher education institutions (HEIs).

A final general limitation of the study is that the nature of the work itself is highly interpretive. Test adaptation is a human process and evaluators bring different skills and dispositions to their work (Engelhard, Hansche, & Rutledge, 1990). Evaluators can hypothesize and provide plausible predictions and interpretations, but never be 100% certain of those claims. As will be argued in later chapters, even if the statistical DIF estimates are reasonably accurate, the determination of the exact DIF rate by statistical means is also influenced by contextual factors such as differences in the ability distributions of the two groups under study (Narayanan & Swaminathan, 1996). A detailed explication of the statistical limitations in DIF studies is presented in Chapters 3 and 6. This does not mean assessment practitioners and policy makers should not try, however. The purpose of a DIF study is to generate empirical data in order to address the root of as many challenging issues as possible and adapt policies and methods accordingly.

Organization of the Study

As Kyrgyzstan is not well known by western scholars, brief historical context about the educational system, language politics in the Soviet period, and the politics of contemporary language issues is provided in Chapter 2.
I place particular emphasis on the proportion of pupils being schooled in the various language media as well as educational outcomes by language group. I also present trends over time in enrollment by language medium since the collapse of the Soviet Union. In the last half of Chapter 2 I detail how NST results are utilized as the selection criterion for state scholarships to higher education. In Chapter 3 I review the relevant DIF literature. In Chapter 4 I present the design of the study and the methods utilized to collect and analyze data, in Chapter 5 the results, and in Chapter 6 I analyze and discuss the results as well as offer recommendations for future cross-lingual DIF studies.

Chapter 2: Education & Language(s) of Instruction in the KR

Overview

Every language is a unique system of communication that conveys meaning, ideas, and culture. However, languages evolve and develop in the context of social and political systems. The trajectories of their evolution are thus framed by social conditions and power relations (Korth, 2005). Language use also demarcates class, privilege, and social boundaries, and thus has meaning beyond the conveyance of literal meaning. A language can be heavily influenced by a more “powerful” or prestigious language, the interaction with which differs through time and place. Therefore, language as a “variable” in research in multi-lingual societies needs to be understood in relation to other competing languages that co-inhabit the same linguistic and cultural space. This is especially true in societies where one language has enjoyed a hegemonic position over all others for a considerable amount of time, as Russian did in the Soviet period (Grenoble, 2003). Understanding the development and political place of the Kyrgyz and Russian languages in the Kyrgyz Republic helps set the context of this DIF study. This chapter highlights the salient historical and demographic issues related to language and schooling in the republic. After a brief overview of language politics in the Soviet era, contemporary data on school enrollment rates and quality of education by languages of instruction in the republic are presented. The chapter concludes with a discussion of the resilience of the Russian language as a language of instruction in the republic and a brief overview of the higher education system and the contextual conditions that gave rise to the new selection test, the NST.

Historical Context of Languages of Instruction

Education in the Soviet era was characterized by centralized administration and tight ideological control. This resulted in the standardization of most educational norms and practices and, in theory, an egalitarian, mass approach (Glenn, 1995). In Central Asia, as in other parts of the USSR, a success of the Soviet state was the development of a mass education system and the attainment of high literacy rates. According to Dienes (1987), the literacy rate for Uzbek males in 1926 was under 25%; by 1979, over 90% of Uzbeks had access to some form of education. Not more than 3% of the population living on the territory of what is today Kyrgyzstan was literate before the Soviet period. By independence in 1991, literacy rates were near 100% (Fierman, 1991). Despite standardization in approach to educational policy throughout the USSR, the achievement of literacy was initially made through the use of multiple languages of instruction.
In the early Soviet years many citizens in Eurasia had their first exposure to formal education through the medium of their native language (Grenoble, 2003). Some accounts of the Soviets’ assumption of power in Central Asia emphasize the widespread poverty and absence of mass schooling at the time (Glenn, 1995). Indeed, the Bolsheviks faced many challenges consolidating power at the end of the Russian Revolution. In addition to establishing law and order and creating new administrative structures, schools had to be built and “new literary languages” had to be developed (Korth, 2005). The written Kyrgyz language as it exists today is a Soviet era creation.

The narrative that emphasizes the “educational successes” in the early Soviet period, however, has been strongly contested in recent years (Hu & Imart, 1989; Oruzbaeva, 1997; Megoran, 2002). Whatever the claims of the early Bolsheviks, a limited number of Kyrgyz (and other Central Asian) elites did have access to the written word at the time of the Soviet conquest (Hu & Imart, 1989). Further, many people of the Eurasian steppe did not identify themselves as belonging to the distinct ethnic or linguistic groups that the Soviets were busy constructing (Grenoble, 2003; Korth, 2005). The common literary language used by Kazakhs, Kyrgyz and other literate Turkic peoples at the turn of the 19th century was the one learned “at the Tatar speaking medressehs of Ufa, Kazan or, to a lesser extent, in Orenburg” (Hu & Imart, 1989, p. 70). While there were differences between the “oral reality and written word,” the “Turkic” produced by early writers was mutually intelligible to many on the steppes and had the potential to serve as a unifying lingua franca: A language utilizing the Arabic script which could serve to unite, not divide, the peoples of the steppes. Hu and Imart (1989) conclude: “Such a lofty aim maybe was surrealistic and in any case hard to attain: it demanded time and above all wide autonomy in cultural and educational matters. The impending historical events were to show that this was precisely what the Kazakh-Kirghiz intellectuals lacked, or, more exactly, what they were denied” (Hu & Imart, p. 73). The development of a written Pan-Turkic language was not to be.

Between 1926 and 1931, under the direction of the Soviet authorities, a distinct written Kyrgyz language was developed with the Latin alphabet, which would be utilized until the end of the 1930s. The fact that the Soviets initially selected the Latin script for the newly codified Kyrgyz written language indicates that they perhaps felt threatened by the development of a pan-Turkic language that could serve to unify millions of Muslim subjects. Some scholars also contend that Latin (instead of Cyrillic) was selected in order to avoid being seen as “Russifying” the Kyrgyz language while at the same time avoiding the use of the Arabic script (Grenoble, 2003). Lenin himself spoke of the need to provide educational opportunities in the native languages of the newly “liberated” peoples of Central Asia (ibid, 2003). Scholars debate whether he believed that the native medium was essential over the long term, but there is no question that the Bolsheviks had political aims in mind when calling for education through native language media. Because the Tsar had outlawed native language schools, the Bolsheviks were sensitive to the language question and did not want to alienate Central Asians: Hence, the guarantee of the right to education through native language (Glenn, 1995).

Footnote 6: According to Grenoble (2003), Lenin clearly believed that there should be no “state language” of the USSR. Indeed, Russian never actually became the “state language” of the USSR - at least not until 1990, when the move was more of a desperate reaction to the outbreak of national assertiveness exemplified by the 1989 language laws in many of the republics (Fierman, 1991).

This policy of native language education fit well within the overall strategy of “korenizatsiia,” or “indigenization,” the Soviets’ initial approach to institutionalizing the communist state through the appropriation and utilization of local elites in visible social, economic and political positions (Grenoble, 2003). “Korenizatsiia” officially began in June 1923, and was seen as necessary to develop a strong communist movement and institutions. In December of 1923, a decree mandated that official documents be produced in the local languages in all the Central Asian Republics and Autonomous Regions. Initially, even ethnic Russian functionaries were encouraged to learn local languages (Fierman, 1991). Grenoble (2003) proposes that it was highly unlikely, however, that the committees with policy making power (which were comprised of highly educated, urban, Russian intelligentsia) were actually willing to give up power to the “uneducated” indigenous peoples in matters of culture and education. For example, he notes that Soviet philologists, while claiming to support indigenous languages, “did much to influence them (new languages) to acquire a vast number of Russian lexical items, collocational and grammatical patterns, as well as to directly impose Russian orthography and spelling” (ibid, 2003, p. 37). Thus, native language development was encouraged to the extent it was politically expedient for the new regime.

Nonetheless, the Soviets saw basic education and literacy as essential for the economic, social and ideological development of the region. And, for better or worse, they were successful both in establishing new educational institutions and creating written languages for the titular majorities in the new Soviet Republics, including Kyrgyz. Soviet census data show dramatic increases in literacy rates in all the new republics within the first 20 years of Soviet rule. In Kyrgyzstan, as early as 1923 there were reportedly 251 new Kyrgyz language schools out of the 357 total schools in the republic. In 1913, no books had been published in the Kyrgyz language, but by 1924 the first school textbooks were already being produced (Grenoble, 2003).

The early emphasis on education through native languages was to be short lived, however. Despite the fact that Article 121 of the Soviet Constitution of 1936 guaranteed the right to native language education for the titular majorities in the new Soviet Republics, by the mid 1930s a change towards overt Russification was already evident. In June of 1934, the Communist Party in Kyrgyzstan promoted the maximum use of “Sovietisms” and internationalist terms in the Kyrgyz language (Huskey, 1995; Korth, 2005). In 1938 a law was passed that required all Soviet citizens to study the Russian language. According to Grenoble (2003), “the rationale for the law was the need for a common inter-ethnic lingua franca for communication, economic and cultural development, the need for Russian to promote science and advanced training, and defense” (p. 54).
This policy had the effect of stimulating growth in the number of citizens studying in the Russian medium of instruction. Further, in 1938 the Latin alphabet was eliminated for written Kyrgyz and replaced by the Cyrillic script. Perhaps not coincidentally, the late 1930s also mark the period of consolidation of Soviet state power and the brutal persecution of all forms of dissent, political and intellectual. Huskey (1995) believes that the Kyrgyz were in no position to resist the overt linguistic Russification of this period. Not only did ethnic Kyrgyz make up just 40% of the total population of the republic - living mostly in the peripheral regions - but the destructive Stalinist purges decimated the ranks of Kyrgyz intellectuals. A historical account of the 1937-38 mass killings of Kyrgyz elites at Chong-Tash notes that most of the executed were accused of having connections to "Pan-Turanic" (Turkic) parties, actively working against the USSR (Helimskaia, 1994). Huskey (1995) argues that the obvious "sycophantism" in the immediate post-repression years and the eager embrace of the Russian language on the part of local party leadership further expedited Russification. By the end of the post-war period, Russian language education was clearly the means to professional advancement in industry, agriculture, science, medicine and culture. A 1954 decree in Kyrgyzstan eliminated the requirement that Europeans [9] study Kyrgyz as a second language (Huskey, 1995). Higher education at this time was conducted almost exclusively in Russian, and movement through the communist party hierarchy at any level was difficult, if not impossible, without Russian language skills (Glenn, 1995). Another blow to the Kyrgyz language came in the form of Clause 19 of the 1958 education reforms. According to Clause 19, study of the native language (for non-Russians) was no longer compulsory. In Kyrgyzstan, the result of this policy was that Kyrgyz became a marginalized language, for use at the primary levels of schooling, not for secondary schooling and certainly not for those with ambitions to higher education (Grenoble, 2003, p. 57). The Brezhnev era (1962-84) has been characterized as a time of further Russian language expansion.
[9] The term "European" in the Eurasian context is typically used to denote non-indigenous, usually non-Muslim, inhabitants of the region who are of "European" origin. While most of these groups are highly "Russified" today, the use of this term allows the inclusion of Germans, Jews (considered a "nationality" by the Soviets), Poles, Ukrainians, Belarusians, Moldovans, the Baltic nationalities, etc. into a category of peoples sharing common traits (Russian medium schooled, typically not functional in local languages, etc.).
Propaganda, policy, and funding promoted the Russian language as the great unifier that would bind all peoples of the USSR, enable socio-economic development, allow for demographic mobility of elite cadres and ultimately serve as the language of the new "Soviet man" (Fierman, 1991; Grenoble, 2003). By the end of the 1970s, 82% of the entire Soviet population had at least basic knowledge of Russian. In 1989 a total of 35.2% of all ethnic Kyrgyz in the republic reported fluency in Russian (ibid, 2003). At independence in 1991, only 4% of all books in the national library and only 9% of all films produced by the state cinematography industry were in the Kyrgyz language (Huskey, 1995). Translation of educational materials and books in the sciences and industry now became almost exclusively a one-way process – Russian to local languages (Grenoble, 2003). In 1989, only 7% of all schools in the capital (Frunze at that time) were Kyrgyz language medium, while 54% were Russian language medium. The rest were mixed medium schools. [10] Approximately 42% of ethnic Kyrgyz who were not in a Kyrgyz medium track were also not studying Kyrgyz as a second language (Fierman, 1991). Between 1989 and 1992, the number of Russian language medium schools in the republic declined from 234 to 142. The number of Kyrgyz language schools increased from 1,018 to 1,122. However, without the total enrollment numbers for each group it is difficult to determine the actual proportion of change in enrollment in the Kyrgyz and Russian language tracks. That is because many Russian schools did not in fact close but instead became mixed medium schools by opening parallel Kyrgyz language groups. The number of mixed schools in the republic increased during this same period from 332 to 409 (Huskey, 1995, p. 562). With Perestroika in the mid-1980s there were calls for change in this status quo in cultural and educational affairs. The Central Committee of the Kyrgyz Communist Party issued a proclamation on "National (Kyrgyz) and Russian Bi-lingualism" in August 1988. Supporters of the proclamation bemoaned the lowly status of the Kyrgyz language. However, the proclamation did not challenge Russian language hegemony directly but instead called for the redoubling of efforts to improve the knowledge, use and quantity of instruction in languages besides Russian, including, of course, Kyrgyz. The idea was not to "bring the status of Russian down" but rather to bring other languages "up" to equal status (Huskey, 1995). The 1989 language law, however, was less equivocal. Its main features included the renaming of Russian place names, the requirement that official government documentation be in Kyrgyz, and the introduction of new norms about language use in the workplace.
[10] In a "mixed medium" school, two or more separate cohorts are taught in two or more different languages all in the same school building. Pupils may attend school at the same time of day or be organized so that one language cohort attends in the morning, one after lunch, all depending on the logistical arrangements, size of the student body and facilities available. Class cohorts in mixed schools are known as "Kyrgyz A" or "Russian A." School buildings characterized as mixed medium do not provide bi-lingual education. The combined attendance of various linguistic groups in the same school building is an administrative and logistical, not pedagogical, arrangement (Korth, 2004).
The law called for the mandatory provision of all business and social services in the Kyrgyz language by 1999 (ibid, 1995). According to Huskey (1995), perhaps most controversial was Article 8, which required that ethnic Russian managers be able to speak and carry on normal work activities in the Kyrgyz language. For some, however, the law did not go far enough. There were compromise provisions that noted "either language could be used" in certain official activities and that where Kyrgyz was being used, translation into Russian was to be provided. Further, while Kyrgyz became "the state language," Russian remained "a language of interethnic communication." Critics of the law pointed out that the provision of these "Russian options" reduced the incentive to learn Kyrgyz (ibid, 1995). Events over the next few years would conspire to make implementation of the 1989 language law incomplete. President Askar Akaev, despite occasional opposition, was perhaps more concerned about keeping Europeans in Kyrgyzstan than the "Kyrgyzification" of the political, economic and educational system. In the summer of 1990, deadly clashes between ethnic Kyrgyz and Uzbeks in the south of the country raised the stakes of nationalist politics. President Akaev was a moderate, committed to the development of the Kyrgyz language over the long run, but eager to avoid fanning the flames of nationalism. In 1993, President Akaev extended the time for full implementation of the 1989 language law from 1997 to 2000 (Wright, 1999). In characterizing language politics in the republic in the 1990s, Huskey (1995) identified two political camps in addition to the centrists. One group consisted of highly Russified elites, the "internationalists," who had both personal and professional stakes in the Soviet system and the Russian language. The nationalists or "indigenizers" were primarily embedded in the various Kyrgyz language committees and enforcement agencies. They were primarily Kyrgyz speakers who sought faster movement on language reform. According to Huskey (1995), both of these groups used "alarmist tactics" to push their agendas, threatening dire consequences for failure to act decisively. President Akaev, however, maintained his moderate position through the promotion of the motto "Kyrgyzstan: Our Common Home," intentionally designed to allay European fears of a nationalist resurgence that would jeopardize their livelihoods and security. By the mid-1990s the rate of European emigration abated as much of the early nationalist fervor faded (Korth, 2005). In fact, new Slavic educational institutions such as the Kyrgyz-Russian Slavonic University were opened and other efforts were made to maintain close cultural ties to Russia. In 1996, the lower house of parliament even approved the return of Russian as an "official" language. The proposal, however, was rejected by the upper house in 1998 (Wright, 1999). Nonetheless, at the end of the Soviet period and well into the 1990s, Russian remained the dominant language of higher education and continued to carry significant political and cultural capital in Kyrgyzstan. [11] There were of course practical realities that impeded significant change to the status quo in regard to the status and use of the Kyrgyz language. The transition to the Kyrgyz language in official and educational life would have been a daunting financial burden under "normal" economic and political conditions; in the wake of the collapse of the USSR, it was exceedingly difficult (Fierman, 1995).
In such conditions, a small, economically dependent state like Kyrgyzstan could hardly have been successful in overseeing the costly transition from Russian to Kyrgyz overnight, or in other grandiose projects such as reintroducing a new alphabet (Latin script) as some had proposed. Korth (2005) provides considerable evidence that there were not only financial challenges, but also attitudes, dispositions, and pedagogical challenges which coalesced to make the transition unsuccessful. Whatever the dispositions of policy makers, today the Russian language and its role in education have remained relatively unchanged since the Soviet era. Indeed, not only was the 1989 language law not vigorously applied in practice, in 2000 the Russian language was given new life by becoming an "official" language of the Kyrgyz Republic. [11] The resilience of the Russian language is connected to history, demographics, and culture, as well as contemporary political, economic, and pedagogical issues, including the relative strength of Russian medium education, a topic addressed in the next section of this chapter (Korth, 2005).
[11] The maintenance of Russian in Kyrgyzstan as the lingua franca of business, government and education well into the 1990s contrasts sharply with other countries of the former Soviet Union: the Baltic countries, the Caucasus, and even other neighboring countries in Central Asia, with the exception of Kazakhstan. According to Bruner and Tillet (2007), 67% of all those enrolled in higher education in the KR today are receiving it through the Russian language medium.
Contemporary Schooling and Language Issues
Kyrgyzstan inherited and maintained both the Soviet tradition of centralized authority and a multi-lingual system of education. The Ministry of Education in the capital, Bishkek, makes all major education policy decisions (Johnson, 2004; Bruner & Tillet, 2007). Education departments in the 7 oblasts (provinces), 2 cities of "republican status" (Bishkek and Osh), 40 rayons (regions), and 23 gorono (city) administrations implement policy at the local level (Census, 2010). Overall, representatives of over 100 natsional'nosti (nationalities) reside in the republic. [12] It is possible to receive an education from kindergarten to the completion of an advanced degree in three languages of instruction: Kyrgyz, the state language; Russian, an official language since 2000; and Uzbek, the language of the most numerically predominant minority in the republic. [13] There are also four Tajik language schools in the Batken Oblast. Approximately 80% of schools in the republic are designated by the ministry as rural schools.
[12] Soviet and many post-Soviet Eurasians use the term natsional'nost (nationality) as American scholars would use the term "ethnicity," not citizenship. For example, an ethnic Russian born and raised in Kyrgyzstan (and a citizen of Kyrgyzstan) would nonetheless be considered to be of "Russian nationality." I use the terms ethnicity and nationality interchangeably depending on context. While the total number of different nationalities in the republic is over 100, only 15 distinct nationalities have 10,000 people or more in the republic today. The total population of the republic in 2009 was just over 5.3 million (Census, 2010).
[13] Opportunities for higher education in the Uzbek medium are limited. There are no higher education institutions offering study through the Tajik medium.
The average size of rural schools is 477 pupils, the average class size is 23.7, and the pupil to teacher ratio is 14.9. The average size of urban schools is 774 pupils, the average class size is 26.9, and the pupil to teacher ratio is 17.9 (Herczynski, 2003). [14] Pupils study in class cohorts which stay together as a group and move from teacher to teacher. There can be from one to seven or eight cohorts in a given grade level. Graduating class sizes can thus consist of as few as 10 students all the way up to 120-140 students, though this larger number is usually only found in the largest schools in Bishkek. According to the data presented in Appendix A, there were 1,911 total schools in the republic in 2003. More recent figures provided by Steiner-Khamsi, Teleshaliyev, Sheripkanova-MacCleod, & Moldokmatova (2011) put the total number of schools at 2,168. Schools in all instructional languages usually contain all grades in one building (complete secondary schools). However, a small number of buildings house grades one to four, nachal'nee shkoli (primary schools), and ne pol'nee sredniye (basic secondary) grades one to nine. For the 1999-2000 school year, primary schools comprised 6% and basic secondary schools 11% of the total number of schools in the republic (Herczynski, 2003). [15] However, it is also quite common that a large number of pupils attending schools with all 11 grades stop their schooling after the ninth grade. During the Soviet period, approximately half of those completing their last year of compulsory education (ne pol'nee sredniye) received one to two additional years of vocational education in a professional uchilishe or teknikum instead of completing the full 11 years of secondary school which was considered necessary for university study. [16] The proportion of those finishing complete secondary education has risen in recent years while enrollment in professional technical schools and other specialized vocational schools has dropped dramatically (Herczynski, 2003). According to Brunner and Tillet (2007), only 10% of the 1993 graduating cohort went on to receive higher education, while today approximately half the graduating cohort enrolls in some form of higher education. This is primarily due to the availability of inexpensive educational options with the growth of private institutions and the opening of more places for paying students in HEIs in general. Table 2-1 below presents the proportion of the three largest natsional'nosti in the republic by proportion of the total population since 1959. 1959 was selected intentionally because it represents the peak of European habitation in the republic since data was collected in 1926 (Huskey, 1995). 1989 was the last census year before the collapse of the Soviet Union. The decline in proportion of the Russian and the "other" group between 1989 and 2009 is primarily due to European emigration and the higher rate of population growth of non-Europeans (Census, 2010).
[14] This is 2003 data. More recent anecdotal reports indicate that class sizes have increased in recent years, especially in Russian language tracks.
[15] While the Russian word "sredniye" literally means "middle," western observers of Soviet and post-Soviet schooling tend to use the term secondary to denote schooling across all grades 1-11.
[16] In the Soviet era, mandatory education (basic) took eight years, while "complete" secondary education took ten years.
Table 2-1: Three Largest Nationalities in the Kyrgyz Republic
Nationality     1959 [17]   1989     1999     2009
Kyrgyz          40.5%       52.4%    64.9%    70.9%
Russian         30.2%       21.5%    12.5%    7.8%
Uzbek           10.6%       12.9%    13.8%    14.3%
Others [18]     18.7%       13.2%    8.8%     7.0%
[17] Data from 1959 and 1989 come from Huskey (1995). The 1999 and 2009 figures are from census data.
[18] The large "other" category is composed of many nationalities but includes several highly Russified European groups which were at one point quite substantial in the republic. For example, in the 1959 data the Ukrainian population of the Republic was 6.6% while the German population was almost 2% (Huskey, 1995).
Upon entering the first grade, pupils select a language medium according to the options available in their communities. Russian or Kyrgyz, if not the first choice of instructional medium, is then studied as a second language (Korth, 2005). Some communities provide no language options while others have two to three options. Data on the total number and proportions of pupils studying in the different language tracks in 1989 and 1999 is presented in Table 2-2. As can be seen, the overall proportion of Kyrgyz and Uzbek language enrollees grew while the proportion of Russian language enrollees declined during this period.
Table 2-2: Percentage of Students in Main Language Tracks
Language   Total (thousands)   1988/89*   1998/1999**
Kyrgyz     474                 52.4%      63.3%
Russian    323                 35.7%      22.7%
Uzbek      106                 11.7%      13.4%
Tajik      2                   .2%        .3%
* Narodnoye obrazovanie i kultura v SSSR (1989)
** Herczynski (2003)
A breakdown of the number of schools by language of instruction is presented in Appendix A. At the time this data was collected in 2003, almost 20% of all pupils in the republic studied in mixed schools that offered both Kyrgyz and Russian language tracks. [19] Despite the overall diversity of the republic, there is some geographical clustering of the various language cohorts by oblast. Russian medium schools, while found throughout the republic, tend to be concentrated in urban areas, oblast capitals and in the north of the country. All Uzbek schools except for one small school in the northern city of Tokmok are located in the south of the country. Mixed Russian and Kyrgyz schools can be found in all oblasts, while some southern cities and towns offer various combinations of Kyrgyz, Russian, and Uzbek tracks. Appendix B presents the dispersion of students enrolled in the main language tracks by oblast in the year 2000. Osh, Djalal-Abad, and Batken are southern oblasts, separated from the north by the Tien-Shan Mountain range. The capital city of Bishkek is listed separately in Appendix B and in other presentations of data. Language track availability reflects the cultural and linguistic demographics of the communities in which a school is located.
[19] The reader should avoid extrapolating numbers of actual speakers or proportions of a natsional'nost in the population at large from the total proportion of schools by language track. For example, data from 1955-56 indicated that there were 324 Russian medium schools compared to 1,376 Kyrgyz medium schools in the republic at that time (Grenoble, 2003). However, 49 percent of all children studied in the Russian medium while 51% of children studied in the Kyrgyz medium. At that time only about 33% of the population was ethnic Russian (Grenoble, 2003). It is also important to note that in 1955 - just as today - language of instruction is not a marker for ethnicity.
In both the north and south of the country, some communities are highly homogeneous while others are quite heterogeneous in ethnic constitution. Further, while sometimes a community is ethnically homogenous today, some towns were more significantly influenced by European settlers than others in the recent past. For example, some mining and industrial towns are now mono-ethnic Kyrgyz settlements; in the past however, they had high concentrations of Russian speakers and today have both a considerable number of bi-linguals and retain some Russian language track options (Korth, 2005; De Young, 20 Reeves & Valyaeva, 2006). The overwhelming majority of ethnic Russians study in Russian language tracks.21 However, the Russian language tracks are diverse in ethnic composition. Koreans, Ukrainians, Germans, Dungans, Tatars, Kurds, Turks, Kazakhs, Azerbaijanis, Chechens and other natsional’nosti also study in Russian language tracks in large proportions (Korth, 2005). Some these natsional’nosti speak primarily Russian at home as well as at school. However, with the exception of ethnic Kyrgyz in these schools, functional command of the Kyrgyz language on the 20 Uzbek communities like Uzgen and Aravan, have both Uzbek and Russian options as Kyrgyz communities in the south have Kyrgyz and Russian options. Larger cities like Osh and DjalalAbad have all three. Russian language options can still be found in all three southern oblasts in communities such as Mali-Suu (formerly Mali-Sai), Tash-Kumyr, Kizil-Kiya, Kadamjai, etc. 21 In the nine years since the NST was introduced, CEATM staff can recall only two or three ethnic Russians sitting for the NST in the Kyrgyz language. According to census data, 322 individual Russians claimed Kyrgyz as their native language in 2009. 30 part of non-Kyrgyz is quite low. In general, only 7% of all Russians over the age of 15 in the republic claim knowledge of a second language. Further, of those Russians claiming knowledge of a second language, more actually claim knowledge of English (42%) than Kyrgyz (36%) (Census, 2010). In contrast, 42% of all Kyrgyz over the age of 15 in the republic report fluency in a second language; in 94% of such cases, that fluency is in Russian (Census, 2010). Some ethnic Kyrgyz in urban areas grew up in families where the Russian language was the primary home language. 22 However, there are also many bi-lingual Kyrgyz who are schooled in Russian for whom Russian is indeed a second language, learned primarily once schooling began, rather than from the home environment. Kyrgyz language schools are usually attended by pupils whose strongest language is Kyrgyz though they are also attended by bi-linguals with varying levels of Russian competency (Korth, 2005). The city of Osh has the highest rate of bi-lingualism in the republic with 62% of the population reporting knowledge of a second language. 23 The Naryn region has the lowest bi- lingualism rate with 27% (Census, 2010). 
While Russian is studied as a second language in most Kyrgyz schools where teachers are available, it is difficult to generalize across the entire Kyrgyz school population in regard to how well these students know Russian: exposure to a language is not the same as functional capacity in that language. In areas of the country where school study plus the opportunity for daily interaction in Russian is available (primarily in the north as well as in the regional capitals), it is probably safe to assume that the percentages of those who have functional command of Russian are relatively higher than in more isolated, homogenous Kyrgyz or Uzbek communities. By any measure, however, ethnic Kyrgyz in Kyrgyz schools are more likely to have command of at least basic Russian than non-Kyrgyz in Russian schools are to have command of Kyrgyz (Korth, 2005). Regardless of the ethnic constitution of the community, Kyrgyz language schools are typically composed almost entirely of ethnic Kyrgyz pupils, with the exception of a relatively small number of non-European natsional'nosti (Kazaks, Tatars, Uighurs, Turks, those claiming mixed heritage, or in some cases Uzbeks) who have assimilated into predominately Kyrgyz speaking communities through marriage or long time residence. [24] Many southern Uzbeks and a small number of Tajiks in the Batken Oblast are schooled through their native language, though many members of these groups also attend Russian medium schools (Herczynski, 2003).
[22] While it is perhaps common for an ethnic Kyrgyz person (in the post-Soviet period) to identify their "native language" as Kyrgyz, this does not necessarily indicate that they have functional ability in the Kyrgyz language, especially in urban areas where many non-Russians know and speak primarily (sometimes almost exclusively) in Russian. In general, both Soviet and post-Soviet census data rely heavily on respondent self-reporting. For a discussion on interpreting language data in the Soviet and post-Soviet census see Grenoble (2003), pp. 28-32.
[23] This figure is from the total population (not just those above 15 years of age). Of course, the second languages known differ by region. In Osh city, for example, where 49% of the population is ethnic Uzbek, 33% of bi-linguals report their second language as Kyrgyz while twice that amount report their second language as Russian. Tri-lingualism is also common in places like the city of Osh.
[24] These ethnicities are the only ones (besides Kyrgyz) with more than 1% of their group claiming Kyrgyz as their native language: Tajiks 2%, Tatars 3%, Uighurs 4%, Turks 16%, and Kazaks 26% (Census, 2010). The native languages of these groups are all in the Turkic language family (except for Tajiki). Overall, however, 98% of all Kyrgyzstanis report their native language to be that of their ethnicity.
The Status of Russian as a Medium of Instruction
Despite the decrease in the proportion of Russian language enrollment from 1989 to 1999, data from NST participation rates indicates that the Russian language has retained its status as an important language of instruction in the republic since independence. Recall that in 1999, 22.7% of the total cohort was enrolled in Russian medium education. At that time, approximately 20% of the entire population was ethnic Russian or "others," which consisted of many Russian speaking, European groups. By 2009, the total combination of Russian and "others" was only 14.7% of the total population.
Yet, 36% of those sitting for the NST in 2010 did so in the Russian language (CEATM, 2010a). CEATM estimates that perhaps half of the 10,994 Russian language examinees were ethnic Russians. Thus, Russian as a language of instruction appears to have retained popularity among non-Russians, at least for those seeking higher education. [25] The under-representation of Uzbek language pupils in the NST is plausibly due to demographic and background factors. In 2009, in the southern Osh Oblast (excluding Osh city), 84% sat for the NST in Kyrgyz, 7% in Russian, and 9% in Uzbek. However, according to the 1999 data presented in Appendix B, a full 28% of pupils in the Osh Oblast studied in Uzbek language tracks. Demographic data indicates that the overall proportion of the Uzbek population has not declined in this period but increased (Census, 2010). Because of the lag in data points between 1999 and 2009, it is of course possible that there was a mass exodus of pupils from Uzbek to other language tracks. However, the more likely explanation for this under-representation (proportionally) of Uzbeks in the NST is simply the lower overall tertiary matriculation rates of Uzbeks in higher education. Few opportunities to receive higher education in their native language - with the exception of one or two institutions - as well as the high concentration of Uzbeks in high poverty, rural regions in the Fergana Valley are plausible explanations for this state of affairs. It is also possible that some Uzbeks seek higher education in Uzbekistan instead of Kyrgyzstan. Yet, the Uzbek regime has not been especially welcoming to Uzbeks from Kyrgyzstan in recent years, and there are a myriad of visa and other hindrances making the border not as permeable as it once was (Megoran, 2002). In the wake of the summer 2010 violence between Uzbeks and Kyrgyz in the south of the country - and the destruction of several Uzbek higher education institutions - the higher education matriculation rate for Uzbeks is not likely to increase in the near future. Appendix C presents the number and percentage of NST examinees by language in 2009 and 2010 for all oblasts of the republic. In 2010, a total of 30,264 examinees sat for the NST, just under half of all secondary school graduates. [26] In Bishkek and the northern Chui Valley the majority sat for the NST in the Russian language. In two homogenous (Kyrgyz) provinces, Talas and Naryn, the proportion of NST examinees sitting in the Russian language was 20% or more.
[25] As students are free to select any language for the NST, NST language is not necessarily indicative of the language of schooling for a given individual. Some bi-lingual ethnic Kyrgyz, for example, have received education through different languages of instruction at different times in their schooling. The researcher is aware of children of mobile parents who were sent to regions far from the capital when they were completing their secondary education. In their previous location they studied in Russian for eight or nine years, but they completed their secondary education in the Kyrgyz language. Then, they took the NST in Russian. However, according to CEATM, the overwhelming majority of students sit for the NST in the language in which they completed their schooling.
Interestingly, the longer term trend in NST participation reveals that despite the continued emigration of Europeans out of the republic, [27] as well as the higher growth rates of the Kyrgyz and Uzbek populations, the proportion of Russian language examinees for the NST has actually increased gradually since the NST was introduced in 2002 (CEATM, 2009; Census, 2010). Approximately 60% of all examinees in 2010 sat in the Kyrgyz language, down from 63% in 2009. These figures are down from highs of 71% in 2003 and 69% in 2004 (American Councils for International Education, 2004a). In those same two years, 23% and 26% sat in Russian, respectively. By 2010, the proportion of Russian language examinees reached 36% and the proportion of Uzbek examinees had dropped to 4%. From 2009 to 2010, the proportion of Russian medium examinees increased by two to three percentage points in five oblasts and the two major cities of Bishkek and Osh, decreased in one oblast (Naryn) and was the same in two others, Talas and Batken (CEATM, 2010a). One explanation for these percentages is simply the difference in tertiary matriculation rates for the various language groups as a whole. For example, the participation data from 2003 indicate that 93% of all eligible school leavers from Bishkek (n = 5,089) sat for the NST. Only 34% of all those eligible from the southern Djalal-Abad Oblast (n = 4,852) did so in that same year. The Bishkek cohort overwhelmingly sat for the NST in Russian while the Djalal-Abad cohort overwhelmingly sat for the NST in Kyrgyz. These two NST participation rates were the highest and lowest in that year (American Councils for International Education, 2004a). Another explanation is that Russian medium enrollment has held steady or perhaps grown since 1999, despite the continued emigration of Europeans. [28] Several scholars have suggested there was an initial "surge" in Kyrgyz medium enrollment in the wake of independence in the early 1990s and following the passage of the 1989 language law (Fierman, 1995; Korth, 2005). In the early 1990s, parents might have believed that their children would have had better life chances with a Kyrgyz language education; by the mid-1990s, this was no longer so apparent (Huskey, 1995). The 2009 and 2010 graduating cohorts began schooling in the late 1990s, after the initial wave of nationalistic enthusiasm had subsided. The trend in NST participation data towards Russian language corroborates the assertion made by De Young (2007) in his ethnographic study of teachers in rural Kyrgyzstan.
[26] A caution to interpreting educational data in the republic is that the proportion of various groups in the youth population (for those under 18) is not necessarily the same as in the overall population. The proportions of Kyrgyz and Uzbek populations are higher in the under-18 group than in the overall population (Census, 2010). Therefore, neither NST data, nor other data on language of instruction in general, will reflect the overall proportions of an ethnicity in the republic.
[27] Between 1989 and 1993 alone, 50% of the 100,000 Germans left the republic (Wright, 1999). Between 1989 and 1999, the number of Russians dropped from 916,558 to 603,198 (Korth, 2005, p. 119).
[28] Unfortunately, I was unable to locate enrollment data beyond the 1999-2000 year.
According to De Young: "… In each of our Chui and Naryn samples, we heard stories about initial enthusiasm for learning in Kyrgyz among Kyrgyz parents and teachers, but a swing back to Russian as at least equally important by the late 1990s. Ethnic Kyrgyz teachers teaching in Russian language schools in particular made this claim. Kyrgyz nationalism and pride led early on to demands for Kyrgyz as an instructional language in many schools, but the lack of Kyrgyz language texts and other printed materials was one immediate problem, as was the realization among many parents that at the universities they wanted their children to attend, classes were usually taught in Russian." (p. 5)
There are of course other plausible explanations for the over-representation of Russian medium examinees on the NST. First, as higher education itself has traditionally been primarily a Russian medium endeavor, it is understandable that more Russian speakers matriculate (Korth, 2005; Bruner & Tillet, 2007). Urbanites are still more likely to receive higher education in general than their rural peers (Census, 2010). More rural Kyrgyz and Uzbeks have perhaps simply become relatively more marginalized in recent years due to increasing poverty – and thus, less able to afford higher education. [29] It is also theoretically possible that Russian language schools are actually gaining slightly in overall enrollment proportion as they attract more Kyrgyz and Uzbek youth due to the increasing disparity (or perception of disparity) in the quality of education between the different language tracks (Herczynski, 2003).
[29] Data on language enrollment represent the overall number of pupils enrolled throughout all grades, 1-11. A grade level breakdown of data might be revealing, as those eligible for higher education typically have "complete" secondary education (11 years of schooling). A large number of pupils finish schooling after grade nine. It is quite possible that the proportion of Kyrgyz language track pupils who exit schools after grade 9 is higher than for the Russian tracks. That is, Russian track enrollment in grades ten and eleven might be 27-30% of the total, not 22%. I was unable to find data to confirm this hypothesis.
In a series of interviews with key stakeholders, Toursunov (2010) provides anecdotal evidence from a credible source at the Ministry of Education that the demand for Russian-language education has grown in recent years. According to one of his respondents, the head of the secondary education department at the Ministry of Education: "There is a definite trend where the number of children at Russian schools is increasing … there are many overcrowded schools. Every year, 5,000 to 6,000 children start attending Russian schools … even many children who attended Kyrgyz primary schools switch to Russian secondary schools when the time comes…" (Toursunov, 2010). Indeed, since NST inception in 2002, highly publicized test results have indicated that the Russian medium examinees have consistently and significantly outperformed their Kyrgyz and Uzbek peers. In 2003, expressed in z-scores, the average difference between Russian and Kyrgyz mathematics test scores was .94, almost a full standard deviation in difference. The Russian-Uzbek difference was just over one standard deviation at 1.04 (American Councils for International Education, 2004a). The urban-rural divide parallels the language divide. The data from 2010 (below) reveal similar gaps.
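Before turning to those 2010 figures, it may help to make explicit how a standardized score gap of this kind is typically computed. The sketch below is illustrative only, assuming a pooled standard deviation in the denominator; it does not reproduce the exact procedure used by CEATM or the American Councils analysts, and the symbols are introduced here for illustration. If $\bar{X}_R$ and $\bar{X}_K$ denote the mean scores of the Russian and Kyrgyz language cohorts, $s_R$ and $s_K$ their standard deviations, and $n_R$ and $n_K$ the cohort sizes, one common form of the standardized difference is

\[
d = \frac{\bar{X}_R - \bar{X}_K}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{(n_R - 1)\,s_R^{2} + (n_K - 1)\,s_K^{2}}{n_R + n_K - 2}} .
\]

Read this way, the reported value of .94 means that the average Russian medium examinee scored almost one pooled standard deviation above the average Kyrgyz medium examinee; the same kind of calculation can, in principle, be applied to the means and standard deviations reported in Table 2-3 below.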
Table 2-3: NST 2010 Scores by Language of Instruction
Test Language   No. Participants   Mean Score   Std. Dev.   Cronbach's Alpha
Kyrgyz          18,270             103.4        24.6        .89
Russian         10,994             131.4        38.4        .95
Uzbek           1,000              100.8        23.3        .87
Republic        30,264             113.5        33.2        .93
(CEATM, 2010a) [30]
[30] Data is not aggregated by gender in this study. However, females have consistently both outperformed males on the NST and received more scholarship places than their male counterparts. In 2003, the first year data was collected, females captured over 60% of the scholarship places. The gap is most in favor of females from rural areas. In 2003, almost 66% of all winners from rural regions were young women (American Councils for International Education, 2004a).
Quality of Education by Language of Instruction
By many accounts there is a crisis in education in the republic today. There is a shortage of funding for education, a shortage of quality teachers, textbooks, and teaching materials, and a crumbling infrastructure (OSI, 2002; De Young & Santos, 2004; Korth, 2005; Silova, 2009; Shamatov & Niyozov, 2010). However, the "crisis within the crisis" is the state of education in the Kyrgyz medium schools (Toursunov, 2010; Shamatov, 2011). [31] While the dramatic differences in schooling outcomes seem clear, the reasons for these gaps are in some ways straightforward and in other ways complex and multi-faceted. NST results do not provide data about the quality of education in the republic because the NST was not designed to assess educational quality. Further, NST examinees are not fully representative of the student cohort, as the NST is an optional, tertiary admissions test. However, in recent years representative studies of educational quality have been conducted in the republic. In 2006 and 2009, with support from the World Bank, Kyrgyzstan participated in the Programme for International Student Assessment (PISA). In 2007 and 2009, a nationally representative evaluation of educational quality was also conducted (CEATM, 2010b). [32] Both studies utilized sophisticated sampling designs which adequately covered all three languages of instruction and demographic regions of the country. While the purposes of these assessments differed, the results of both studies indicated wide performance gaps by language tracks. [33]
[31] And, one could of course contend, the Uzbek medium. However, for obvious political reasons, the public and government focus has been on the state of Kyrgyz language education.
[32] During Soviet times, standardized assessments for determining education quality were not conducted (Bereday, 1960).
[33] See www.testing.kg for the technical reports of both NAEQ and PISA. For general data on PISA, see http://www.oecd.org/dataoecd/15/13/39725224.pdf.
PISA assesses the "life skills" of fifteen year olds in reading, mathematics, and science. One hundred and one schools and 3,412 pupils participated in PISA Kyrgyzstan 2006, including 54 Kyrgyz schools, 34 Russian schools, and 13 Uzbek schools. Kyrgyzstan showed the poorest results in all three subjects of all participating countries. The average mathematics score for the KR was 311. However, when aggregated by language of instruction, the average Russian track score was 331.5 while the average Kyrgyz track score was 286.7. Table 2-4 below presents a breakdown of the Russian and Kyrgyz cohorts by percentages in the various mathematics score ranges.
Table 2-4: PISA 2006 Mathematics Scores by Language of Instruction
           Percentage of Examinees in Score Range
Language   100-180   180-240   240-300   300-360   360-420   420-500
Kyrgyz     6%        12%       33%       42%       7%        0%
Russian    3%        4%        17%       42%       27%       7%
Source: www.testing.kg
The results from the National Assessment of Educational Quality (NAEQ) by language of instruction were similar. NAEQ assessed knowledge and skills at both the 4th and 8th grade levels. Unlike the NST and PISA, the NAEQ was explicitly intended to assess how well students were mastering national standards in mathematics, reading comprehension and science. Over 3,000 pupils in schools across the country participated in both the 2007 and 2009 test administrations. Cleavages in grade eight results by language of instruction were large (see Table 2-5).
Table 2-5: NAEQ 2007 Reading Scores by Language of Instruction
Levels Achieved    Kyrgyz   Russian   Uzbek
High               .8%      5.8%      0%
Above Base         4.5%     15.6%     3.0%
Base               12.3%    23.4%     7.8%
Lower than Base    82.4%    55.1%     89.1%
(CEATM, 2010b)
The first qualification to interpreting the data about educational quality is the aggregation of the data itself: recall that the language gap closely parallels the urban-rural divide. Most Kyrgyz and Uzbek schools are concentrated in rural, resource-poor areas while Russian schools are more typically found in urban areas. Rural pupils miss more school hours at harvest time, have teachers who are less educated, and in general face greater poverty levels (Herczynski, 2003). Urbanites are two times more likely to have higher education than their rural counterparts (Census, 2010). This makes disentangling the various explanations for the gaps challenging, though it is probably safe to assume that socio-economic conditions, rather than language of instruction itself, are of primary importance. Appendix D presents data on poverty levels, levels of higher education, NST score averages, and the percent sitting for the NST in the Russian language for each of the oblasts and Bishkek. The poorer southern regions (except for the city of Osh) have the highest poverty levels, the lowest levels of higher education per capita, and the lowest NST scores in the entire KR. While demographics plausibly explain most of the disparities in educational outcomes, there is evidence that Kyrgyz language schools face unique challenges, regardless of location (Korth, 2005). According to Toursunov (2010), though the Russian Federation provides 40,000 to 70,000 school textbooks a year to the republic, only 60% of all Russian schools have enough. The state of textbook provision to Kyrgyz schools is worse, with only 39% of schools having adequate textbooks. This is especially challenging for teachers who by tradition are accustomed to teaching with textbooks (De Young et al., 2006). Further, according to the Asian Bank's 1997 School Mapping Project, teachers in Russian medium, urban schools have significantly more contact hours (seven more per week) with students than their urban or rural Kyrgyz school counterparts (Herczynski, 2003). Finally, in 2010, many policy elites raised in the Soviet era continue to send their children to Russian schools (Korth, 2005). Those who are responsible for improving the situation in education are not necessarily personally affected by the low quality of Kyrgyz schools. In essence, there may be a class element to choice of language of instruction and some "selectivity bias" at work.
De Young (2007) presents data from ethnographic research in the Naryn Oblast in which participants associate Russian medium schooling with modernity, sophistication and cosmopolitanism. The loss of opportunities in the Russian language in this rural region was perceived by some to be a serious problem, despite the fact that 98% of the provinces’ population is ethnic Kyrgyz (Census, 2010). Several of De Young’s (2007) respondents also noted differences in classroom cultures between Russian and Kyrgyz schools. The acquisition of Russian was linked by some to active and independent learning. According to a teacher from At-Bashy: “… If you tell something to a kid, he will obey without delay; and kids will not express their opinions or defend their opinions, just do what you told them, and that’s it. (But) in Russian language schools, kids defend their points of view; they can even add something better, or even change the direction of an assignment… in sum, they are more or less how to say - maybe more democratic? … I would say (this) is partly to do with the community where they live. You know, when we teach Russian, and start teaching Russian literature and Russian lyrics of freedom, it has its impact in child development. (Meanwhile), Kyrgyz (stories) also has the same freedom lyrics and also the same democrats and fighters (akyns and writers), but (it is not the same)….” (p. 9). De Young (2007) summarizes his interviews with teachers at a Russian school in a Kyrgyz community: “All the staff and all the teachers we interviewed at Kazybek claimed that their school was the best in the raion (region), and that Russian as the instructional language was a primary reason for their success. Importantly, almost every school in our study, including Kazybek, gauged school success in terms of how many graduates went on to the universities in Bishkek (and secondarily to Naryn among schools in that oblast), as a result of the education they received” (p.5). 41 Toursunov (2010) concluded that the core issue is simply the failure of a corrupt, authoritarian regime to care for its citizens by providing quality education in their native language. In a series of interviews with parents and educational administrators, he found that parents send their children to Russian language schools simply because they believe the quality of education there is better. One respondent, a 30-year-old teacher and ethnic Uzbek from Osh whose daughter attends a Russian school, argued: “…migration is not the key issue. The main reason why parents take their children to Russian schools is that they offer a better education than Uzbek and Kyrgyz schools” (Toursunov, 2010). And, from a sociologist in southern Kyrgyzstan: “Since Kyrgyzstan obtained independence after the collapse of the Soviet Union in 1991, the quality of services provided by secondary schools has been declining… the situation is most alarming at Uzbek and Kyrgyz language schools. Services are better at Russian schools, which attract more and more parents seeking better education for their children” (Toursunov, 2010). Other explanations for the persistence of the Russian language are that TV and media available in Russian is seen as superior to local equivalents (Huskey, 1995; Korth, 2005). Increased labor emigration on the part of ethnic Kyrgyz to Russia (and hence the need for 34 Russian language) has also been noted by some. 
Finally, many post-Soviet Kyrgyz elites might simply still strongly identify with Russian culture and language as an inherent part of their own identity. Identity formation is a complex phenomenon and it is not necessarily the case that all Kyrgyz feel the need to be educated through the Kyrgyz medium in order to “feel Kyrgyz.” There is evidence that many ethnic Kyrgyz identify strongly with Russian culture and language (Faranda & Nolle, 2010). 34 According to some estimates, over 500,000 Kyrgyzstani citizens currently work in Russia (Podolskaya, 2011). 42 Whether it is ineffective governance, sensitivity to economic ties with Russia or the domestic European population, utilitarianism, the need to be seen as modern, cultural affinity, or simply the lack of motivation and interest on the part of highly “Russified elites” to address the issues, the evidence is clear that there are serious problems with the provision of quality education through the Kyrgyz medium. Education through the Russian language medium is still perceived as higher quality, both at the secondary and tertiary levels. In the next section of this chapter I turn to a discussion of tertiary education and the NST. Tertiary Education and the NST There were only two higher education institutions (HEIs) in the Kyrgyz Republic in 1932. By the early 1980s there were 10 with 57,109 students enrolled (Soktoev & Usubaliev, 1982). The total number of first year students enrolled in 1988 was 12,106 (National Statistical Committee of the USSR, 1989). Eight of the 10 HEIs in the republic were located in the capital at that time, Frunze. Soktoev & Usubaliev (1982) record 87 degree options in the 1980s and lists economics, engineering, pedagogy, medicine, and agronomy as popular specializations (majors). According to official statistics there were five general HEIs with humanities, pedagogy, and natural sciences programs - including one “university” - one medical academy, one institute of physical education, one arts academy, one agricultural institute, and one building and construction institute in the republic in 1988 (National Statistical Committee of the USSR, 1989). The provision of tertiary education in the USSR was funded entirely by the state. Not only were the operating budgets and fixed capital provided by the state, but all students also received full state funding for the duration of their studies (Bereday, 1960). As in other sectors of the economy, centralized planning characterized all aspects of higher education provision: The 43 allocation of resources for the support of operations and facilities, academic programs and materials, the number of professorships and student places available, and even curricula were all determined according to state planning needs (Reeves, 2005). Upon completion of a course of study, graduates were typically assigned a job based on the needs of the relevant scientific, economic or social sector at the time of graduation. As a key provider of human resources for the state planned economy, HEIs did not have the institutional authority to enlarge their faculties or student bodies, significantly alter program offerings, create their own curricula, or make other major institutional decisions without direction from the central Ministry of Education (Bereday, 1960; Reeves, 2005). Another characteristic of the Soviet higher education system was that HEIs were not all subordinate to the Ministry of Education. 
For example, the medical institute was under the Ministry of Health, the agricultural institute was under the Ministry of Agriculture, and the military and police academies were under the Ministry of Defense and Internal Affairs, respectively (Bereday, 1960). Higher education in the republic was (and still is) also distinguished from some western systems by the institutional separation of research and teaching. Scientific research is the responsibility of the Academy of Sciences, not HEIs. Another distinction is that academic programs are characterized by the high number of contact hours compared to their western peers. In some courses of study, students are in the lecture halls as much as 35-40 hours per week (Reeves, 2005). Once enrolled, students do not select classes and elective options but follow a prescribed course of study. As in secondary education, they move through their courses in cohorts (groups) which attend all classes together throughout their years of study. 44 With independence, most HEIs opened Kyrgyz language tracks for degree courses which had previously been taught only in Russian (Korth, 2005). While there are now both Russian and Kyrgyz groups for most fields of study, it is widely considered that for fields like medicine and the sciences the Russian groups still have better access to quality materials and teachers. Today, 67% of students overall continue to receive their tertiary educations through the Russian language medium (Bruner & Tillet, 2007). Higher education in the republic has been dramatically affected by the collapse of the USSR. Educators have struggled to define the mission of higher education which had been so tightly coupled with state planning in the past. Today many remain proud of the Soviet era accomplishments in science and research and there is little agreement on whether reorienting the purpose of higher education away from “the needs of the state” to the needs of the market or the individual is either needed or appropriate (Reeves, 2005). 35 The biggest change however, has been the decline in state financial support and its impact on the higher education system. According to data provided by Bruner & Tillet (2007), by 1994 only 61.2% of all funding for 36 higher education was from the national budget while by 2005 it had declined further to 30.4%. These are tremendous decreases considering that just a few years ago 100% of all HEI funding had come from the state. 35 This can often be seen in the contradictions between stated intentions and actual policy implementation. Rhetoric about the market aside, in regard to the way the budget places in HEIs are assigned, individuals do not select how to use their scholarships but instead choose from places available according to the needs of the state: Over half of all scholarship places are for teaching positions; i.e., the budget places are used to provide incentives to fill positions the state has prioritized (Silova, 2009). 36 Anecdotally, depending on the institution, some university rectors would argue that today the actual state funding levels are around 10% but it depends on how “budget funding” is defined, i.e. figures vary depending on whether or not the value of buildings and other fixed capital inherited from the USSR is included in estimation. The main point is the dramatic decrease from the Soviet era. 45 Despite this decline in support, or perhaps because of it, the number of HEIs in post Soviet Kyrgyzstan has grown. 
Private institutions as well as institutions partly sponsored by foreign governments are now prevalent in the republic. 37 The founding of regional state universities in Naryn, Talas, Djalal-Abad, Kara-Kol, and Batken is also a post-Soviet development. In the Soviet Era, the only non-capital institutions of note were Osh State University and the pedagogical institute in Kara-Kol (Soktoev & Usubaliev, 1982). With the 1992 Law on Education, HEIs now have the discretion to collect tuition and fees from students and engage in other revenue generating activities. In some cases, entrepreneurial rectors have created “for profit” departments or institutes within state institutions as a way to generate resources though the quality of many of the new programs has been questioned (Bruner & Tillet, 38 2007). Bruner and Tillet (2007) report that in 2005 the number of HEIs in the republic was 49. This figure includes both the new state HEIs as well as many smaller, private institutions. Many of these institutions receive no state funding to support budget (scholarship) students however, and therefore are not obligated to accept NST results for admissions. Each HEI negotiates with the Ministry of Education the exact number of scholarship places it makes available every year. In 2010, 21 state-funded institutions enrolled budget students according to NST results 37 Kyrgyz Russian Slavonic University (KRSU), The American University in Central Asian (AUCA), and Kyrgyz-Turkish Manas University are three of the most popular HEIs in the republic today. All three are partially funded by partnering countries. 38 I have argued elsewhere that power relations between HEIs and the Ministry of Education have also changed (Drummond, 2011). While formally the ministry has retained many of their Soviet era oversight prerogatives, in reality, the funds generated by HEIs themselves have empowered them relative to other state institutions. 46 a (CEATM , 2010). 39 Of course, along with the increase in institutions there has been an increase in overall student enrollment, a subject to which I turn in the next section after a brief review of Soviet HEI selection policy. Student Selection in the Soviet Period Though overall HEI admissions policy was made at the ministerial level in the Soviet period, each institution selected abiturients with internally created and administered 40 examinations. Examination scores served as the single criterion for selection for the majority of abiturients: The exception being special conditions for “gold medal” winners (perfect marks throughout the school career), winners of academic “Olympiads” and quotas for disabled, orphaned, or other special categories of students granted special admissions privileges. HEIs administered oral examinations in subjects deemed necessary for a particular course of study plus a written essay in the Russian language for humanities majors. Mathematics also required a written exam in addition to an oral exam (Clark, 2005). School transcripts, interviews, portfolios, abiturients’ community or social activities and other criteria were not utilized as selection criteria. 39 Private institutions are not required to take scholarship students. Foreign sponsored organizations are also not required but some have complicated admissions arrangements. For example, KRSU has over 500 scholarship “budget places” provided by the Russian Federation and only around 120–150 provided through the Kyrgyz republican budget. 
This means that admissions requirements are different depending on which budget supports the particular student. For the Russian Federation funded budget places, NST results are not considered in admissions decisions. The American University in Central Asia also has its own admissions requirements, though it has traditionally provided considerable scholarship support for high scorers on the NST.

40 The term abiturient most likely entered Russian from German (in the German system, the Abitur certifies completion of the type of secondary education that allows a pupil to apply to a university). In Russian, the term is commonly used to denote an HEI applicant, or entrant (лицо, поступающее в учебное заведение — a person entering an educational institution). Of course, there is also the assumption that applicants have completed the secondary education necessary to apply to an HEI, i.e. an abiturient is no longer a pupil, but not yet a student.

Abiturients prepared for admissions exams by studying many "topics," or questions about a particular theme or subject area relevant to their desired major. At the appointed time in July of each year, abiturients then went to each HEI to which they were applying to sit for examinations. Abiturients came before an examination committee and randomly selected one or more of these topics, turned face down, on small cards or strips of paper (Drummond & De Young, 2004). After abiturients demonstrated their knowledge of the selected topic, admissions committees composed of specialists in the subject area being assessed asked the abiturients questions relevant to the topic. The examiners assigned marks on a scale of two to five, five being the highest mark. After the completion of examinations, the admissions committees forwarded their lists of recommended abiturients up the institution's chain of command for official approval and eventual enrollment (Drummond & De Young, 2004). 41

After the breakup of the Soviet Union some HEIs continued to select abiturients through these procedures. However, with the loosening of bureaucratic controls, many institutions throughout Eurasia began to introduce written, multiple-choice tests. Representatives of some HEIs believed that multiple-choice testing was more efficient in handling the increasing numbers of abiturients to higher education. According to one recent analysis, total HEI enrollment in the 19-24 year old cohort went from 14% to 36% in Kyrgyzstan from 1989 to 2001 (Bruner & Tillet, 2007). Using multiple-choice tests, HEIs could screen larger numbers of abiturients than with the more time-consuming, individually administered oral exams (Drummond & De Young, 2004). Some perhaps saw such testing as representing a more "modern" approach to student selection (Valyaeva, 2006). Others, however, saw standardized testing as a threat to their own educational heritage and as unwanted "Americanization" of their education system (Reeves, 2005).

41 Ministerial approval of lists of recommended abiturients was a formality but was necessary in order to enable state funding for the abiturients selected.

The new HEI-administered tests in the Kyrgyz Republic were not standardized in the sense that one test was utilized to assess all abiturients throughout the country; each HEI required abiturients to sit for its own tests. Critics pointed out that the multiple-choice tests administered by the universities were of low quality and could be easily manipulated by test administrators for abiturients willing to pay for the service (Clark, 2005).
In the era of instability that followed the Soviet collapse, evidence mounted throughout Eurasia that HEIs were abusing their power in the admissions process through both oral examinations and the new multiple-choice tests (International Crisis Group, 2003; Osipian, 2007; Heyneman et al., 2008). By 2000, ministerial bureaucrats and even some university officials in Kyrgyzstan and other Eurasian countries began to propose major changes to their admissions systems (Valkova, 2001; Valyaeva, 2006; Osipian, 2007). While the timetable for reform was different in each country, the focus on corruption in selection was a common rationale for policy change across Eurasia. According to a report by the Russian Ministry of Education and Science and the Moscow School of Economics, abiturients in Russia were allegedly paying the equivalent of several years of tuition in illicit payments to enter higher education (Clark, 2005). At a September 19, 2006, address to participants at a conference on university admissions and examinations, the Minister of Education of the Georgian Republic, Alexander Lomaia, claimed that as many as 80% of students admitted to Georgian HEIs in the late 1990s were enrolled for nonacademic reasons (Conference Program, 2006).

Fighting corruption emerged as the primary rationale for the move to standardized testing in Kyrgyzstan as well (Drummond & De Young, 2004). President Askar Akaev issued his first decree in support of admissions reform on April 18, 2002. The decree called for the introduction of the National Scholarship Test (NST) in June of that same year. 42 The decree explicitly eliminated all HEI discretion in selecting budget students: All scholarship (budget) places were to be allocated strictly according to test results (Presidential Decree No. 91, 2002). The Presidential Decree also called for public observation of the enrollment process, similar to the kind of monitoring that accompanies major political elections.

The NST is a high stakes selection test used for the distribution of over five thousand full university scholarships. The purpose of the NST is to determine which examinees have the scholastic aptitude and academic skills for study at the tertiary level (Valkova, 2004). The NST has mathematical and verbal reasoning domains. 43 Table 2-6 below highlights the differences between the Soviet examination system and the new NST in the Kyrgyz Republic.

Table 2-6: Soviet and Contemporary Selection Procedures
Country           | Administered                  | Oversight             | Purpose   | Format
USSR              | HEI-administered              | Ministry of Education | Selection | Oral exam, subject-based achievement
Kyrgyzstan (2002) | Non-governmental organization | Board of Trustees     | Selection | Multiple-choice test, scholastic aptitude
Sources: Presidential Decree (2002); Drummond & De Young (2004)

42 In the Russian language, the NST is known as "Общереспубликанское тестирование," which translates literally as "General Republican Testing." However, I use the name "National Scholarship Testing" as it captures the idea that results are used for HEI scholarship allocation.

43 Additional subject tests (scored separately) are required for examinees seeking certain academic majors such as medicine and foreign languages (Valkova, 2004).

Acrimonious struggles over which institutions should have the discretion to conduct the NST and select students have accompanied the NST reform since its inception. HEIs initially strongly resisted the introduction of the NST in 2002.
In 2003 and 2004, opposition to the NST came from the Ministry of Education itself, which sought to usurp the right to test from the non-governmental Center for Educational Assessment and Teaching Methods (CEATM) (Drummond, 2011). In 2002, the Minister of Education and Culture, Camilla Sharshekeeva, had insisted that the new testing center be a non-governmental organization, overseen by a board of trustees, not the ministry (Drummond & De Young, 2004). Minister of Education Ishengul Boljurova, who replaced Sharshekeeva in June of 2002, initially supported the new test center (Mambetaliev, 2003; Boljurova, 2003; Drummond & De Young, 2004). However, at a White House presentation for university rectors on November 24, 2003, the minister articulated a new vision for admissions testing starting in 2004. That new plan entailed the Ministry of Education owning the rights to the student testing databases, ministerial oversight of test scoring, and the ministerial production of test score certificates (Drummond, 2011).

The politics of NST implementation have been addressed elsewhere and will not be analyzed in detail here. However, it is important to highlight the fact that while HEIs are essential stakeholders in the NST, their representatives played (and continue to play) no formal role in deciding admissions policy for budget students, neither in terms of determining the selection criteria nor in terms of how the enrollment process is organized. 44 Due to the public's loss of trust in HEIs in the 1990s, they were cut out of the policy making circle on university admissions both by Sharshekeeva and the ministers who followed her since 2002 (Drummond, 2011). 45

44 Note that the selection reform was implemented with a Presidential, not ministerial, decree. That is, while the Ministry of Education has formal power to make selection policy, major decisions need to be backed by the President of the Republic. As I have argued elsewhere, HEIs are in fact quite powerful in relation to the Ministry of Education (Drummond, 2011).

The introduction of the standardized NST and the elimination of HEI-administered oral examinations represent the reassertion of administrative control by the Ministry of Education over HEIs, which were perceived to be out of control in terms of selection corruption. However, this renewed ministerial oversight is not indicative of total control. As noted above, the majority of students now enrolled in higher education are those who pay for their educations, so-called "contract students." The majority of contract students are still selected primarily according to HEI examinations. 46 Thus, the impact of the NST on the students and the HEIs it is designed to assist has been different for different student cohorts. On the one hand, the NST is a high stakes test in that there are no other criteria utilized for the distribution of full state scholarships. On the other hand, scholarships to study academic majors designated by the state are not necessarily that popular, and there are a myriad of low cost, easily accessible opportunities for some kind of higher education in the event that NST results for a given abiturient are low (Bruner & Tillet, 2007). In general, even with the decline in funding, the number of students in the post-Soviet era has sky-rocketed since independence.
In 1992, there were only 53,670 total students enrolled in higher education in the republic, almost 100% of whom were budget students on a full government scholarship. 47 By 1998, 120,986 students were enrolled, but only 27.5% received full state support (Bruner & Tillet, 2007).

45 However, there is evidence that the NST is meeting HEI needs in terms of student selection. A predictive validity study of the NST demonstrated reasonable correlations between NST scores and the academic achievement of students at the completion of one year of course work (Davidson, 2003).

46 However, in 2010, the MOE required HEIs to accept 50% of their contract students based on NST results.

47 This number includes both "day students" and "zaochnoye" (correspondence) students, who make up about 20% of the total student population.

While higher education was free in the Soviet era, access to higher education was competitive throughout the USSR. At the end of the Soviet period, there were approximately three applicants for every available place in the Kyrgyz Republic (National Statistical Committee of the USSR, 1989). Considering that up to fifty percent of the entire age cohort left school after the eighth form, this three-applicants-per-place ratio was competitive. Today, while many students must pay, there are more access points to higher education available than ever before. 48

Since 2002, approximately 5,200 to 5,700 full state scholarships have been allocated annually based on NST results. During this same period, the average size of the cohort graduating from secondary school has been between 72,000 and 82,000 pupils per year. Approximately 40-48% of the graduating secondary cohort (30,000-36,000) sits for the NST each year (CEATM, 2010a). The enrollment data indicate that most participants go on to some form of higher education whether or not they win a scholarship place (Bruner & Tillet, 2007). It is estimated that another 10,000-15,000 enroll in some form of correspondence education, putting annual matriculation at approximately 40,000-50,000 and the total number of students at over 200,000 in the entire system at any given time (Bruner & Tillet, 2007). It can be deduced from these figures that only around 10% of all entering students currently receive full scholarships.

48 Anecdotally, many of the new institutions are popularly perceived to be little more than "business ventures" (providing low quality education), though it should also be noted that three of the most popular HEIs in the republic were all founded in the 1990s, all with outside support (KRSU, AUCA, and Turkish Manas).

While there are more opportunities for higher education today, there are of course few HEIs with "elite status." An indicator utilized by the public to assess the prestige of HEIs is the annual NST report, which contains the average NST scores of the entering scholarship classes. A full list of those institutions enrolling scholarship students can be found in Appendix F along with the average NST scores and number of budget entrants for each of these state HEIs. Note the wide dispersion in average scores across HEIs in Appendix F. The average 2010 scores for those entering with scholarship support at the prestigious Kyrgyz Russian Slavonic University and the Kyrgyz-Turkish Manas University were 182.2 and 182.1 (about two standard deviations above the mean), respectively. The Medical Academy average was also high at 177.7.
At the same time, regional HEIs such as Talas State and Naryn State awarded scholarships to abiturients whose NST scores averaged just 116.4 and 115.4, respectively, barely above the NST average score of 113 for the nation as a whole. One can conclude that the competition for budget places at elite HEIs is fierce (the average score of the entering cohort is more than two standard deviations above the national average), while for "middle of the road" institutions it is hardly competitive at all.

One reason for the low competitiveness of places at the middle and lower tier HEIs is related to the purpose of the scholarships. An issue that is not visible in the data presented in Appendix F is what kind of budget opportunities are offered (by department). Recall that the scholarship does not "follow the student" but rather the student "follows the scholarship." The ministry has traditionally allocated approximately half of all budget places to "pedagogical faculties" and other specializations needed by the state, which are less popular than subjects like economics, international relations, and computer science (Drummond & De Young, 2004; Silova, 2009). The result is that many high NST scorers do not take part in the scholarship competition and prefer to pay for their educations and study the major of their choice. In a study of 2007 NST results, Silova (2009) found that the dispersion of average NST scores by faculty is as great as the geographical divide noted above. That is, those enrolling in areas of study like international relations had considerably higher NST scores than those enrolling as pedagogy majors.

The National Scholarship Test and Language Politics

Despite the vast performance gaps by language of instruction on the NST, the introduction of the NST has been relatively non-controversial, even popular among rural and non-Russian speaking cohorts. 49 There are several reasons for this. First, despite some initial resistance from elite HEIs, all examinees are allowed to sit for the NST in the language of their choice, regardless of the language of instruction in the HEI department to which they are applying (Drummond & De Young, 2004). This policy was introduced in 2002 in order to ensure that the brightest rural students were not denied educational opportunities at elite institutions due to a lack of language knowledge; more specifically, so that graduates of Kyrgyz language schools in rural regions could not be denied access to elite universities like the Kyrgyz-Russian Slavonic University because they did not speak Russian. Minister Sharshekeeva argued that if examinees could score high enough on the NST in their native language, they were capable of learning Russian in a year of pre-enrollment language preparation (Drummond & De Young, 2004).

49 In the winter of 2003 and spring of 2004, over 900 school directors were surveyed on their attitudes towards the new selection system (American Councils for International Education, 2004b). The overwhelming majority of school directors favored independent testing, with only 5.6% noting that universities should conduct selection testing. According to survey results, school directors believed that the motivation to learn had increased among pupils due to the introduction of the NST.
Perhaps the primary explanation for the continued support of the NST might have more to do with the fact that rural, Kyrgyz speaking students are winning scholarships in equal proportion to their Russian language counterparts, despite their overall lower average NST scores. This is because scholarships are awarded according to a quota system that places examinees in competition only with those from within the same quota category. Bishkek abiturients compete only against abiturients from Bishkek, not rural regions, and rural abiturients compete only against other rural abiturients for scholarship places. Each abiturient is assigned one of four possible demographic categories depending on the location of the school from which they graduated. Each village, town, or city has its own official designation. The purpose of the quota system was (and is) to "level the playing field" between rural and urban examinees (American Councils, 2004a).

The result of this quota system is close proportional representation from each of the demographic and language categories in the overall proportion of scholarships awarded. This proportional representation persists despite the fact that urban and Russian track examinees score almost a full standard deviation above the other groups on the NST. As can be seen from Appendix E, for example, 66% of 2010 total scholarship winners were from Kyrgyz language tracks while these tracks represented only 60% of the total test takers. Note that the average score of this group was 125.6 while the average for the Russian language track examinees was 153.9. In fact, the two most rural and impoverished quota categories - "village" and "high mountain" - are actually over-represented in the proportion of scholarships received (Table 2-7). While the quality of higher education varies between regions of the republic, it appears that not only are "village" and "high mountain" winners well represented in scholarship winnings overall, they are also well represented in urban institutions. 50 The trend between participation and winnings is fairly consistent throughout each oblast, and it is likely that without the quota system this proportional representation in winnings would not be occurring.

Table 2-7: Scholarship Winners by Quota Category (2010)
Category  | % Participation | % Scholarships | Avg. Score | Avg. Scholarship Score
Republic  | 100.0%          | 14.8%          | 113.5      | 134.9
Bishkek   | 21.0%           | 14.6%          | 135.4      | 158.1
Towns     | 14.7%           | 13.5%          | 122.7      | 146.4
Village   | 49.8%           | 52.4%          | 104.4      | 127.9
Mountain  | 14.5%           | 19.5%          | 102.2      | 125.6
Source: CEATM (2010a)

50 See CEATM's 2010 Annual NST report for the demographic breakdown of scholarship winners at each urban university. www.testing.kg.

Chapter 3: Literature Review

Since the 1960s, a considerable amount of both applied and theoretical research on DIF, bias, and item equivalence has been conducted (Holland & Wainer, 1993; Camilli & Shephard, 1994). Studies have analyzed for racial, gender, and language differences on a variety of assessments and tests. DIF studies vary in purpose: Some are designed to help practitioners identify and interpret causes of DIF while others compare the efficacy of DIF detection methods. In regard to statistical DIF, there has been considerable comparative research on various item response theory models, Mantel-Haenszel chi-squared, and logistic regression methods (Clauser & Mazor, 1998).
Much of the statistical DIF detection research has utilized simulated data sets in order to create experimental conditions for testing hypotheses (Hambleton, Clauser, Mazor, & Jones, 1993). Most DIF studies conducted in the USA have focused on racial or gender differences (Holland & Wainer, 1993). However, there is a growing literature on DIF in cross-lingual assessments (Hambleton, 2005). This literature review focuses on the DIF research most relevant to this study: Studies of item reviewers' ability to predict DIF through substantive review and studies of the causes of DIF on cross-lingual, verbal assessment items. The last section is an analysis of the literature that addresses how the particular statistical methods employed to detect DIF can impact DIF detection results.

Substantive Review and DIF Prediction

Studies of the relationship between substantive review and statistical DIF detection methods have been conducted in various contexts for some time (Mazor, 1993). Some early analyses in the USA focused on racial, gender, or other group differences (Plake, 1980; Engelhard, Hansche & Rutledge, 1990). More recently, Gierl & Khaliq (2001) have conducted research on Canadian achievement tests in which the relationship between substantive and statistical methods was assessed. However, the overall number of studies is quite small and there have been very few studies, if any, on the congruence of these methods on cross-lingual assessments in developing country contexts. 51

Cross-lingual substantive analyses often employ post-hoc review in which linguists, translators and content specialists analyze items flagged as DIF by statistical procedures (Joldersma, 2008). However, as in this study, it is also possible to work in the opposite direction and collect substantive data first, then statistically analyze the items in order to understand how well item reviewers predict DIF. Various protocols, coding guides and rubrics, questionnaires and focus groups have all been employed to collect and analyze such data (Engelhard, Hansche, & Rutledge, 1990; Allalouf, Hambleton & Sireci, 1999; Gierl & Khaliq, 2001; Ercikan, 2002).

Early studies of substantive reviews reflect the socio-political issues important in those times (Holland & Wainer, 1993). In the 1970s and 1980s, analyses typically focused on whether minority groups and women were represented in a positive light on educational and professional tests, whether they were represented at all, and whether the content presented in tests and items would be equally familiar to all examinees across groups (Tittle, 1982). In the early literature, the term "bias" was often used in a broader way than is currently accepted in the psychometric literature. Bias was sometimes used in reference to any test with "poor representation" of minority groups or for test items that appeared to place women or minorities in stereotypical roles.

51 Cross-lingual testing is increasingly entering the domain of US policy makers. In particular, members of the Obama administration often reference the US's "poor performance" on such cross-lingual assessments as the Programme for International Student Assessment (PISA) and the Trends in Mathematics and Science (TIMSS) assessment programs. See Duncan, A. (June 14, 2009). "States Will Lead the Way Towards Reform," Address by the Secretary of Education at the 2009 Governors Education Symposium. www.ed.gov.
Much early writing about substantive review also had a highly prescriptive character, with "how to" type recommendations for test developers and reviewers (Holland & Wainer, 1993). In 1982, Carol Kehr Tittle presented a comprehensive plan for how substantive reviews could be employed at all stages in the test development process – planning, specifications development, item try-outs, post-test review, etc. She provided recommendations and detailed rubrics for scoring and collating problematic items. Other work from this period has a similarly prescriptive character regarding how to ensure item fairness. According to Scheuneman (1982), Coffman (1961), Donlan (1971) and Dwyer (1979) conducted studies of the relationship between substantive reviews and statistical outcomes, but all three focused on gender differences on the Scholastic Aptitude Test (SAT). Medley and Quirk (1974) studied black-white differences on the National Teacher Examination.

Such studies were viewed as important not only for political purposes but because the computational costs of statistical analyses were high at that time (Plake, 1980; Holland & Wainer, 1993). Further, as Plake (1980) argued, in many testing situations the number of examinees was sometimes too low to conduct statistical analyses, even when the technology was available. Thus, it was considered important to find ways to improve the quality of the substantive evaluation process, as test developers did not always have the luxury of statistical DIF detection methods. In cross-lingual testing today, high profile assessments like the Trends in Mathematics and Science (TIMSS) and the Programme for International Student Assessment (PISA) rely on sophisticated, quantitative methods for item analysis. However, not all multi-lingual countries have the financial and personnel resources to conduct such sophisticated DIF detection techniques. Thus, in some ways, many developing countries still face the same challenges to DIF detection that many western analysts and researchers faced in the 1960s and 1970s.

The results of many of the early studies on reviewers' ability to predict DIF were mixed at best. Tittle (1982) found that overall, the outcomes were inconsistent, with results highly dependent on the methods employed, the type of prediction study, and the expertise and background of item reviewers. A decade later, Mazor (1993) stated more conclusively that the cumulative result of the early research was that accurate DIF prediction by substantive review was the exception rather than the rule. In order to understand the challenges involved in substantive DIF review it is necessary to present several of these studies in greater detail. In the next section I present findings from some of the better known studies of the efficacy of substantive review.

Using data from the Iowa Test of Basic Skills, Plake (1980) analyzed whether raters could identify DIF for students from the 4th through 8th grades who all had 5th grade skill levels in mathematical concepts. In order to control for ability level as a confounding factor, she paired 5th grade examinees with examinees of similar ability from the other grades and created separate test groups. Three specialists in elementary math education then predicted which items would be easier or harder for non-5th graders. When two out of three specialists selected an item, it was deemed a DIF item. Plake utilized ANOVA to analyze the test results and compared these to her panel results.
The result was that raters predicted twice as much DIF as the statistical procedures yielded. In terms of the direction of DIF (which group was favored), one third of the items favored the direction opposite to that predicted by the specialist raters. The raters also differed greatly in the number of DIF items they identified, at 41, 38, and 16 cases.

Engelhard, Hansche & Rutledge (1990) analyzed the ability of item raters to predict DIF between blacks and whites on a series of three different teacher education examinations. Forty-two judges examined 40 test items from teacher certification test batteries. Twenty-four evaluators were black and 16 were white. The judges were divided into three separate review committees - one for early childhood, one for administration and supervision, and one for middle childhood examinations. All participants in the study were experienced members of previous bias review committees. They received 45 minutes of training and written guidelines for identifying potential problems. They categorized items as "favors blacks, no difference, or favors whites." From the results of the review, the researchers created a categorical index called the "Judged Category Index" with categories coded as -1 = favors blacks, 0 = no difference, 1 = favors whites. They then compared the results from this index with results from a statistical DIF detection method, the Mantel-Haenszel (MH) chi-square method, a method commonly used by the ETS (Camilli & Shephard, 1994).

The MH procedure tests whether the odds of success on a given item are proportional for both groups across levels of the matching criterion (ability). The null hypothesis is that, at each level of ability, the proportion of examinees answering correctly in the reference group is the same as the proportion in the focal group. The MH method employs a 2 x 2 contingency table for each item at each ability level, in which item response data (correct/incorrect) are entered along with group membership for those examinees with the same ability (Mazor, 1993). 52

52 Swaminathan & Rogers (1990) argue that the MH procedure is best conceived as a special case of the logistic regression method (LR). The main difference is that MH treats ability as a discrete variable while in LR ability is treated as continuous. They note that treating the variable as continuous enables analysis of an interaction effect between ability and group. Mazor et al. (1992) argue that this is important because in the MH model, if an item favors one group at one end of the ability distribution but the other group at the other end, key information gets canceled out and no DIF is reported (Mazor, Clauser & Hambleton, 1992).

Engelhard et al. (1990) then computed the correlation between the substantive estimates and the empirical estimates that were calculated using MH. The result was little agreement between the two estimation methods. The correlations ranged from .00 to .11 for the three different tests. However, they did find significant individual differences between reviewers, with one having a .52 correlation with the statistical results. At the same time, another one of the reviewers in the administration and supervision group had a negative correlation of -.36. The two main results of this study were that (1) there was significant variation in the ability of the item raters to accurately detect differences and (2) as a group, raters were not able to predict DIF very well.
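To make the stratified comparison described above concrete, the short sketch below pools 2 x 2 tables across score levels into the Mantel-Haenszel common odds ratio. It is my own illustration rather than code from any study reviewed here; the toy data set, the score range, and all variable names are invented for the example, and no DIF is built into the simulated item.

# A minimal sketch of the MH DIF check: examinees are matched on total score,
# a 2 x 2 table (group x correct/incorrect) is formed at each score level, and
# a common odds ratio is pooled across levels. Assumed, illustrative data only.
import numpy as np

rng = np.random.default_rng(0)

def mh_odds_ratio(item_correct, group, total_score):
    """Pool 2 x 2 tables over ability strata; returns the MH common odds ratio.

    item_correct : 0/1 responses to the studied item
    group        : 0 = reference group, 1 = focal group
    total_score  : matching criterion (proxy for ability)
    """
    num, den = 0.0, 0.0
    for s in np.unique(total_score):
        idx = total_score == s
        n = idx.sum()
        a = np.sum((group[idx] == 0) & (item_correct[idx] == 1))  # reference, correct
        b = np.sum((group[idx] == 0) & (item_correct[idx] == 0))  # reference, incorrect
        c = np.sum((group[idx] == 1) & (item_correct[idx] == 1))  # focal, correct
        d = np.sum((group[idx] == 1) & (item_correct[idx] == 0))  # focal, incorrect
        num += a * d / n
        den += b * c / n
    return num / den  # close to 1.0 under the null hypothesis of no DIF

# Toy illustration: 2,000 examinees, equal ability distributions, no DIF injected.
group = rng.integers(0, 2, 2000)
total_score = rng.integers(10, 31, 2000)            # matching scores 10..30
p_correct = 1 / (1 + np.exp(-(total_score - 20) / 4))
item = rng.binomial(1, p_correct)

alpha_mh = mh_odds_ratio(item, group, total_score)
delta_mh = -2.35 * np.log(alpha_mh)                 # ETS delta metric
print(f"MH odds ratio = {alpha_mh:.2f}, delta = {delta_mh:.2f}")

Under the null hypothesis of no DIF the pooled odds ratio stays near 1.0; the final lines convert it to the ETS delta metric, one common effect-size scale for judging whether flagged DIF is practically meaningful.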
From each of the three analyzed groups of items, only one or two evaluators (out of 42) demonstrated better than chance agreement with the statistical data. Engelhard et al. (1990) concluded that item reviewers could not predict which test items would perform differently for black and white examinees when they had no empirical data. They argued that a primary reason for the low agreement between the two indices was the infrequent use of the category "favors blacks." They proposed that because many reviewers were asked to represent the interest of their social category (race) in a high stakes situation, this might have influenced their estimations. Another conclusion they drew from this study was the need to conduct experimental research (using simulated data) which would allow them to compare how well reviewers could identify flaws with test items. The authors argued that such an experimental study would have practical utility for selecting quality reviewers for review committees.

Engelhard, Davis, and Hansche (1999) conducted such an experimental study with thirty-nine reviewers on a statewide student assessment program in the state of Georgia. The reviewers were practicing elementary teachers and administrators, were diverse in age, and all had experience either writing test items or participating in bias review committees. Before beginning, the evaluators received a sixty minute training session which included the overall purpose of the assessment system and guidelines for identifying flaws. The key to this study was that some of the more than seventy test items had known flaws, which served as the criteria (a baseline) with which to assess the accuracy of the judges' ratings. Test items came from a variety of content areas from grades three through eight. Twenty-eight of the items had no known flaws while 47 items had flaws; nineteen had one flaw, 22 had two flaws, 5 had three flaws, and 1 item had four flaws. The flaws were broken down into cultural flaws and technical flaws. After reviewing the items, the reviewers responded to 16 questions. The questions on cultural flaws had to do with gender, race, handicaps, socio-economic status, demographics (rural vs. urban), etc. For the technical flaw category, reviewers answered questions about the comparability of the difficulty levels in format, language, prior knowledge, grammar, typographical errors, item content, appropriateness of topic, etc. Each reviewer then spent two to three hours evaluating the 75 items. Reviewers marked "yes" if they believed the items exhibited any flaws. They left the questions blank if they found no flaws. The accuracy of their estimations was determined by the agreement of their marks with the predetermined a priori classification of the items. The researchers utilized a logistic transformation of ratios to determine the probability of accuracy vs. inaccuracy of evaluators' predictions. False positives and false negatives were scored as inaccurate. The most accurate reviewer was 94% accurate while the least accurate was 83% accurate. Overall, accuracy rates were higher on the cultural flaws than the technical flaws. The study demonstrated that substantive committees could be quite accurate in detecting various item flaws. However, as the authors noted, identifying flaws is not the same thing as predicting DIF.

In recent cross-lingual studies of DIF prediction, bi-lingual reviewers have been more successful than in Engelhard et al.'s 1990 study.
Gierl and Khaliq (2001) conducted a study with eleven reviewers analyzing French and English social studies and mathematics tests at the 6th and 9th grade levels. Their method included having the reviewers generate a priori hypotheses about types of DIF and which groups might be favored. Cognizant of the potential for multidimensionality (addressed below), evaluators attempted to discern not only the primary traits assessed by the items, but also what secondary traits these items might be assessing and how these traits might impact the two groups differentially. It was a matter of judgment as to whether the secondary dimension was benign or adverse. Items with similar characteristics were organized into "bundles" for analysis. They utilized the Simultaneous Item Bias Test (SIBTEST) to test for statistical DIF. Across both grade levels, the evaluators predicted the direction of DIF correctly 7 of 8 times for the mathematics items and 8 of 13 times for the social studies items.

Intuitively, the results of Gierl and Khaliq's (2001) study are plausible, as differences between languages may at times be more explicit and somewhat easier to detect than differences in how racial groups will respond to items that are in the same language. Overt mistakes like poor translation or typographical errors might be easier (on average) to detect than, say, how females or males might react to different kinds of items in the same language. However, as Ercikan (2002) points out, the raters in Gierl and Khaliq's (2001) study also knew in advance which items had been flagged as DIF. Thus, they were not so much "predicting DIF" and DIF direction as "interpreting DIF direction" for known DIF items. They employed "a consensus-building model wherein the reviewers worked as a group and focused on standardizing interpretations and ratings across reviewers, which may have contributed to high success rates of explaining DIF" (Ercikan, 2002, p. 201). Thus, it would appear that the method of DIF evaluation, whether DIF is known or not, and whether individual or group analyses are employed might also contribute to success rates.

In addition, Ercikan (2002) argues that it makes a difference whether a cross-lingual DIF study is conducted on test items utilized within a single country or across countries. This is because the potential number of DIF sources is higher in cross-country studies. In cross-country analyses there is greater potential for variation in opportunities to learn or curricular coverage to cloud reviewers' estimations. Further, in within-country studies there is a larger pool of potential reviewers with the intimate linguistic knowledge and cultural understanding that may not always be available for a cross-country study. Languages, conditions, and cultures in different language groups can be relatively well understood by within-country bi-lingual reviewers who not only have lifelong experience with both languages, but in many senses may consider themselves to be bi-cultural as well. Item evaluators in the Kyrgyz Republic might also be expected to do relatively well in prediction in comparison to some other types of DIF studies. It is possible to find item reviewers who are themselves the products of both Russian language and Kyrgyz language educations, and many have intimate knowledge of the cultural differences between the two groups. As the NST is an aptitude test, curricular differences are not expected as they might be with achievement tests (Ercikan, 2002).
School teachers and other educators from both language groups are trained in the same institutions, sometimes with the same materials. School textbooks for both languages are often the same (translated from Russian to Kyrgyz), or at least have historically been so for the generation of evaluators participating in this study (De Young, Reeves, & Valyaeva, 2006). On the other hand, accuracy in DIF detection and prediction might be more of a function of evaluators' expertise and experience. Gierl and Khaliq's (2001) study involved highly trained and experienced reviewers. Lack of training and experience in sophisticated evaluation techniques may present challenges to accurate DIF identification in some contexts. In the Kyrgyz Republic there are few (if any) specialists with experience in undertaking such analyses. Further, at this point, no comparative research has been conducted on test items produced in Russian and the Turkic languages.

Ercikan (2002) also contends that the results of substantive review depend on whether or not the reviewers know the DIF statistics before or during their analyses. When evaluators know that DIF has been identified statistically, some "DIF cause" may always be found – whether accurate or not – which can lead to an inflated success rate. She also argues that it makes a difference whether items are evaluated individually or as item pairs (reviewed simultaneously). When both items are presented at the same time, evaluators tend to focus on the comparability of details like format, content, and language use. Reviewers of a single item focus more on the context and content that might make the item biased for a particular group. Ercikan (2002) proposes that the single item review approach leads to a more nuanced analysis of item content and context and the consideration of different cognitive processes among comparison groups.

In the section that follows I first review various studies that have determined the amount of DIF on cross-lingual assessments. I then address studies that sought to determine the causes of bias or DIF on cross-lingual assessments. This review will set the context for the second research question in this study in regard to DIF sources.

Levels of DIF in Cross-Lingual Testing

Several cross-lingual DIF studies have reported large percentages of items as DIF. Gierl, Rogers and Klinger (1999) found that 52% of English–French item pairs on a Canadian elementary social studies test exhibited DIF. Ercikan and McCreith (2002) discovered DIF rates of 41% on TIMSS science items. Robin, Sireci and Hambleton (2003) reported that 21% of items on a credentialing exam exhibited DIF when the two languages studied were both European languages; when looking at a European and an Altaic language on the same exam, DIF rates were 46%. They go on to say, "By any reasonable criterion for interpreting the delta-DIF statistics, the DIF results reveal major problems with the translation/adaptations with the Altaic versions of the exam" (Robin et al., 2005, p. 15). On the verbal section of a university admissions exam in Israel, Russian-Hebrew DIF rates on the test were 34% (Allalouf, Hambleton & Sireci, 1999). On Programme for International Student Assessment (PISA) reading items, Grisay and Monseur (2007) found DIF rates of 25%-30% on European to European language comparisons, but the rates increased to 45% when the items were from highly dissimilar language groups.
Interpretation of any given DIF result in light of other DIF studies, however, is not necessarily straightforward. Different studies use different criteria to define DIF levels. Therefore, determining how much a given DIF level threatens test comparability is not simply a matter of percentages of DIF or DIF item counts, as these figures mean different things in each study (Grisay & Monseur, 2007). For example, how one employs an effect size measure to distinguish between statistical DIF and practical DIF (or does not) impacts how one classifies items as DIF.

Grisay and Monseur (2007) evaluated PISA data from the 2000 reading assessment to determine item performance across various groups. Utilizing data from 47 countries, they analyzed 32 reading passages with a total of 132 test items. They found that adapting a test from a source version always had at least a basic cost in terms of loss of equivalence. They found that using tests in the same language (but developed separately or in another location, as with the several Spanish language tests developed in each of the Spanish speaking PISA countries) is not as valuable as using identical (twin) tests. 53 This is because any translated version is just one in an infinite number of potential "sister versions." Reckase and Kunce (2002) also found that different translators produce highly variable results in terms of accuracy and quality of translation.

53 For example, the English version used in five countries was more or less the same test, slightly adjusted for local differences. Each Spanish version, however, was adapted from English and French in complete isolation from all other Spanish versions, and these different Spanish versions were not compared with each other prior to test administration. The result was higher levels of DIF in the Spanish versions.

In Grisay and Monseur's (2007) study, DIF levels increased when comparisons were made to "cousin versions," or different language versions within the same Indo-European family (German to English, for example), with, on average, 25%-30% of items displaying DIF. However, the most fascinating finding was the comparison across language families. When examining the Indo-European and Asian language groups, the level of average DIF was around 45%. In other words, it was difficult to interpret whether or not about 45% of the total items were actually measuring the same way in, say, the English and Japanese versions of the PISA reading section. Grisay and Monseur (2007) also found: "A highly positive correlation between communality and test reliability (.72), as well as the negative correlation between reliability and Asian country (-.70). This suggests that some non-random factor affecting the geographic or cultural distribution of DIF items was deteriorating, to some extent, the reliability of the scale in a number of countries" (p. 76). Their study was not the first to show the lack of construct invariance across European and non-European languages in DIF studies. A study by Grisay, de Jong, Gebhardt, Berezner, and Halleux (2006) with TIMSS data also found a high level of DIF between Indo-European languages and non-Indo-European languages.

PISA's 2003 Technical Report also suggests that despite expertise and highly developed protocols for item adaptation, some versions have higher percentages of what they claim to be "weak items" than others; for example, 18% of the items for the Japanese test and up to 32% for the Arabic language, Tunisian version (OECD 2003, pp. 77-79).
They note that one explanation may be the overall instability of the scale, as these language groups tend to be located at either the upper or lower extremes of the scale. However, they also offer this potential explanation for the larger portion of weak items for the non-European languages:

… a second possible explanation might be of some concern in terms of linguistic and cultural equivalence, i.e. the fact that the group of outliers included all but two of the ten PISA versions that were developed in non-Indo-European languages (Arabic, Turkish, Basque, Japanese, Korean, Thai, Chinese and Bahasa Indonesian)…and, finally, a third explanation may well be that competent translators and verifiers from English and French are simply harder to find in certain countries or for certain languages than for others (OECD 2003, p. 79).

Thus, an important finding of recent cross-lingual DIF studies is that DIF levels appear to vary depending on the relationship between the two language groups in question. That is, while there may be common challenges to all cross-lingual adaptation, not all languages "compare" across these commonalities in the same way. In particular, assessments involving languages from within the same "language family" tend to exhibit lower DIF levels than assessments involving languages from more disparate language families (Grisay & Monseur, 2007). These are significant findings for this study, as Russian and Kyrgyz come from very different language families, Slavic and Altaic (Turkic) (Oruzbaeva, 1997).

Causes of DIF in Cross-Lingual Testing

Several studies have focused on the causes or origins of DIF on cross-lingual assessments. Studying an intelligence test with German and English language examinees, Ellis (1995) concluded that most of the DIF was due to translation error. Van de Vijver and Poortinga (2005) argue that the most significant sources of item bias in cross-lingual testing are poor test adaptation resulting from poor translation, careless work, lack of subject knowledge, or lack of understanding of the principles of test development. Hambleton (2005) lists five general sources of item bias – the test itself, selection and training of translators, the process of translation, poor protocols for adapting tests, and poor data collection designs and data analysis for establishing equivalence. In Gierl and Khaliq's (2001) study, with data from several content areas, their 11-member review committee found four sources of adaptation/translation DIF: (1) omissions or additions of words that affect meaning, (2) differences in the words, expressions, or sentence structure that are inherent to the language and/or culture, (3) differences in the words, expressions, or sentence structure of items that are not inherent to the language or culture, and (4) differences in item format. Several other studies have concluded that the issue of word difficulty (the inability or failure to use words of equal difficulty) is a common cause of DIF. Schmidt and Belistein (1987), Bejar, Chaffin and Embertson (1991), and Roccase and Moshinsky (1997) all found word difficulty to be problematic.

However, not all DIF on cross-lingual assessments is caused by translator-related adaptation error. In Gierl, Rogers, and Klinger's (1999) study of French and English examinees, only 2 of 7 math items detected as DIF were found to contain translation errors after substantive review. Only 6 of 26 DIF items on that same test (the social studies items) were found to have translation errors.
Similarly, Ercikan and McCreith (2002) found large levels of DIF on the TIMSS science section, but poor adaptation was the cause of DIF for only 22% of the mathematics items and 40% of the science items flagged as DIF.

Other hypotheses for explaining DIF causes have been proposed. Less visible psychological factors like the different response strategies used by examinees might also cause DIF (Gierl et al., 1999). By looking at the distribution of DIF items by curricular topic area, Ercikan, Gierl, McCreith, Phan, & Koh (2004) discovered that different opportunities to learn could lead to DIF. Even word count might be an important DIF issue. In the exam that Gierl et al. (1999) examined, there were 24% more words on the French exam than the English exam. They concluded that the longer test length might make the exam more difficult for the French examinees. In regard to causes of DIF specifically on cross-lingual verbal assessments, Angoff and Cook (1988) discovered that in some cases additional text is a good thing. They hypothesized that longer texts are sometimes necessary in one of the languages in order to provide enough context and sufficient explication of meaning. This is related to the idea that inherent linguistic differences can sometimes make some item types more conducive to item adaptation than others. They found greater DIF in antonym and analogy (shorter) items and less in sentence completion and reading comprehension. Beller (1995) and Gafni and Canaan-Yehishafat (1993) also found that DIF was greater in analogy items than in sentence completion and reading passages.

Using data from the Israeli Psychometric Entrance Test (PET), Allalouf, Hambleton and Sireci (1999) examined the causes of DIF between Russian and Hebrew examinees on verbal test items. They concluded that analogies were problematic, with 65% of items demonstrating DIF. Reading comprehension items showed a small amount of DIF. Through post-hoc substantive review, they found the primary causes of DIF to be: (1) Changes in the difficulty of words or sentences – i.e., the translation was accurate and meaning relatively intact, but words or sentences became easier or more difficult after adaptation for one of the languages; this could also be due to how literally or symbolically the meaning of questions is presented; (2) Changes in content – i.e., an item lost its meaning for one of the languages after adaptation; (3) Changes in format – i.e., an item became much longer, shorter, or more awkward for one of the languages after adaptation; (4) Differences in cultural relevance – i.e., items contained meaning, symbols, norms, content, or expressions that had no equivalent connotation in the other language group.

The findings from these four studies indicate that there are characteristics of item types, like the length of items, that seem to have a similar impact on DIF levels across a range of language groups. There are a myriad of explanations as to why it is difficult for evaluators to predict DIF: Lack of training and experience, poorly designed procedures and protocols, lack of time and resources to do the evaluations, personal dispositions or pressures for certain outcomes, and of course simply the difficult task of trying to anticipate how the background and psychological make-up of any given group will impact how they respond to any given set of items. Whatever the care and the methods employed, identifying the sources of DIF through substantive studies is a challenge.
Mazor (1993) argues that the failure of substantive studies (on both real and simulated data) to consistently identify DIF also challenges some of the fundamental assumptions that researchers make in DIF studies. In the next section I turn to some of the most important of those assumptions.

Statistical DIF detection methods like the Mantel-Haenszel (MH) and logistic regression (LR) often condition on total test scores as a proxy for examinee ability. The items that compose these scores are typically hypothesized to be tapping into a single trait or skill. However, it has been known for some time that test items are often not uni-dimensional. Items may be multi-dimensional, which means that they are measuring more than one latent trait (Reckase, 1985). As Ackerman (1992) noted in his definition of DIF, we should keep in mind that the single test score, typically used as the proxy for ability, is an alleged conditional ability. This is important to keep in mind for any DIF study, especially for those involving "real world" items like the NST in the Kyrgyz Republic (Kok, 1988). Several early DIF studies demonstrated that the uni-dimensionality assumption of DIF detection methods was untenable (Birbaum & Tatsuoka, 1982; Subkoviak, Mack, Ironson, & Craig, 1984). A commonly cited example of an item with a high probability of multidimensionality is the mathematics word problem that demands considerable reading or verbal ability (a secondary trait) in addition to mathematics skills (the primary trait) in order to solve the item correctly. The ramification for DIF studies is that if there are underlying differences between groups on secondary traits, DIF could actually be caused by this multidimensionality (Kok, 1988). Interpreting DIF results can become ambiguous if these secondary traits are not identified and parceled out (Shealy & Stout, 1993).

Ackerman (1992) calls the primary ability the target ability and the secondary ability the nuisance ability. Ackerman contends that all test items tap into at least some level of nuisance ability. He believes that small amounts of DIF are likely in conditions where a secondary trait is tapped and the distribution on that secondary trait differs across groups. At the same time, an item may be multidimensional but not DIF if the groups involved have equal distributions on all the traits assessed. Kok (1988) cites background knowledge, language skills, and "test wiseness" as examples of secondary skills and knowledge upon which populations may differ. He notes that sometimes multidimensionality is unavoidable when tests employ complex items with the intent to approximate situations in which many skills need to be applied simultaneously by an examinee.

Douglas, Roussos, and Stout (1996) make a useful distinction between types of secondary abilities: Auxiliary abilities are those that can legitimately be a part of the construct measured, while nuisance abilities are those not related to the construct of interest in any way. They thus conclude that DIF arising from auxiliary abilities is benign DIF, while DIF arising from nuisance ability is adverse DIF (Douglas et al., 1996). They note that in practice, in order to determine which kind of DIF is prevalent, a priori substantive reviews are needed to hypothesize which item bundles might exhibit multidimensionality. The low correlations between statistical DIF and substantive review methods in some studies could be related to this unaccounted-for multi-dimensionality, because it is hard to identify.
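The mechanism can be made concrete with a small simulation. The sketch below is my own illustration under assumed parameter values (it is not drawn from Kok, Ackerman, or any study cited here): an item loads on both a target ability and a nuisance ability, the two groups are identical on the target trait but differ on the nuisance trait, and a gap in proportion correct appears even among examinees matched on the target ability.

# Illustrative simulation: DIF produced purely by a secondary (nuisance) trait.
# All parameter values and variable names are assumptions made for this example.
import numpy as np

rng = np.random.default_rng(2)
n = 5000

theta_target = rng.normal(0, 1, n)                    # identical for both groups
group = rng.integers(0, 2, n)
theta_nuisance = rng.normal(0, 1, n) - 0.8 * group    # focal group lower on nuisance trait

def p_correct(a1, a2, b, t1, t2):
    """Two-dimensional, compensatory IRT-style response probability."""
    return 1 / (1 + np.exp(-(a1 * t1 + a2 * t2 - b)))

# The studied item loads on both dimensions; matching is (implicitly) on the target trait.
p = p_correct(1.0, 0.7, 0.0, theta_target, theta_nuisance)
y = rng.binomial(1, p)

# Compare proportions correct for examinees matched on the target ability.
mid = np.abs(theta_target) < 0.25                     # a narrow "equal ability" slice
p_ref = y[mid & (group == 0)].mean()
p_focal = y[mid & (group == 1)].mean()
print(f"Matched on target ability: reference {p_ref:.2f} vs focal {p_focal:.2f}")
# The gap printed here is DIF that originates entirely in the secondary dimension.

Whether such a gap counts as benign or adverse DIF then depends, in Douglas, Roussos, and Stout's (1996) terms, on whether the secondary dimension is a legitimate part of the intended construct.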
Assessing for multidimensionality on cross-lingual test items requires evaluators to know about more than just test adaptation processes and linguistic issues; knowledge of examinee exposure to a broad variety of content, and of their cognition as they engage with content, is also important. It may be difficult in many instances to find reviewers who are both bi-lingual and equally knowledgeable about the nuances of item response.

One way to try to minimize the effect of multi-dimensionality in DIF estimation has been to condition on sub-scores rather than total scores during statistical DIF detection. Theoretically, this provides analysts with a cleaner estimate of the particular ability under study (Clauser, Mazor, & Hambleton, 1991; Ackerman, 1992; Mazor, 1993; Clauser, Nungester, Mazor, & Ripkey, 1996). For example, using the MH method on 91 items, Clauser, Mazor, and Hambleton (1991) examined a sample of 1,000 examinees from two subgroups (Anglo-Americans and Native Americans), with an average test score difference of about one standard deviation. The test had four sub-domains - mathematics, reading, prior reading, and charts. They discovered that the choice of conditioning variable made a difference in the level of DIF identified. Twenty-two items were identified as DIF when conditioning on total score. When they conditioned on sub-scores alone, the amount of DIF identified fell by one third, reducing the overall type 1 error. At the same time, however, when they conditioned only on sub-scores, some items emerged as DIF that had not been previously identified.

Mazor, Kanjee, and Clauser (1995) conducted a study on two achievement tests. They compared males and females but also took into consideration knowledge of English (those who reported it as their best language vs. others who reported some other language as their best). They used both the logistic regression (LR) and MH procedures and first conditioned on total score. Then, using LR, they added the sub-scores SAT-verbal and SAT-math to their model. They found that with the LR procedure the number of items identified as DIF was reduced when conditioning on sub-scores. Clauser, Nungester and Swaminathan (1996) employed a logistic regression model and conditioned on both a total score and educational experience (area of specialization of medical students) as a secondary variable. As men and women generally tend to select different areas of specialization (on average, more men in surgery and more women in pediatrics), the authors hypothesized that males and females may differ in ability distributions across backgrounds. They believed that the conditioning variable would reduce the number of flagged items as it would partially account for those differences in group performance on this secondary ability. When conditioning only on total score, 30% of items were identified as DIF. When the background variable was added, the number of DIF items was reduced to 19%. Although the main result was a reduction in the total number of items identified, some new DIF items were identified when using the background variable that were not identified using the total test score alone. Nonetheless, these studies that address the multi-dimensionality issue have informed the design of this study and the methods that I present in the next chapter.

DIF as Statistical Artifact

A final and very important consideration in disentangling the sources of DIF is the extent to which the methods employed are themselves producing reliable DIF estimates.
That is, it is possible that one reason why substantive evaluators sometimes cannot identify DIF is that there may in fact be no DIF: Items that have been identified as DIF may simply be the result of statistical artifacts (Mazor, 1993; Gierl et al., 1999; Ercikan et al., 2004). For example, inflated type 1 error due to a poor choice of conditioning variable may muddle statistical outcomes. It cannot be assumed that DIF levels indicated by a particular statistical method are infallible. In research conducted by Jodoin and Gierl (2001), power rates for a host of real-world DIF conditions using LR methods were only 70-80%; thus the interpretation of DIF statistics needs to be made with caution.

Fortunately, there is a way to gain a general understanding of the effectiveness of various detection methods utilized in varying conditions. Much of what we know about the efficacy of DIF methods comes from simulation studies, because simulations allow a comparison of the efficacy of different approaches under controlled conditions, i.e. we know "how much DIF actually exists a priori" (Hambleton et al., 1993). Through simulation studies, researchers can create DIF levels or other necessary experimental conditions by adjusting the difficulty and discrimination parameters of artificially generated item data. Thus, they can compare the effects of large and small sample size, variation in the ability distributions of examinees, item types, DIF levels per test, test length and dimensionality, among other factors (Hambleton et al., 1993; Narayanan & Swaminathan, 1996; Jodoin & Gierl, 2001).

In general, comparative research done on various methods consistently shows that IRT, MH, and LR methods are equally effective in the identification of uniform DIF (Swaminathan & Rogers, 1990; Rogers & Swaminathan, 1993; Roussos & Stout, 1993; Narayanan & Swaminathan, 1994). Nonetheless, there are differences between methods. While both the MH and LR consistently show similar results in their capacities to detect uniform DIF, the MH method has not been able to identify non-uniform DIF (Swaminathan & Rogers, 1990; Narayanan & Swaminathan, 1994; Hambleton et al., 1993). While non-uniform DIF is less common than uniform DIF, it does occur in practice. Thus, one possible advantage of logistic regression over other DIF methods is that it can assess the interaction of group membership and examinee ability (Swaminathan & Rogers, 1990; Gierl et al., 1999). On the other hand, these comparisons also found that while type 1 error rates (identifying items as DIF when they were not) were within expected limits for the MH procedure, they were somewhat higher for the LR procedure.

The size of the examinee sample is also important. The converging evidence from DIF detection studies is that larger sample sizes allow for more accurate DIF detection. In Rogers and Swaminathan's (1993) comparison of MH and LR methods, they discovered that detection rates increased by 15% when the sample size was increased from 250 to 500 for both methods. In Mazor et al.'s (1995) study, various sample sizes were created from 100 to 2,000 per group. The study demonstrated that a small sample size (100) was not adequate but sizes of 200 to 1,000 were satisfactory. Hambleton et al.'s (1993) review of simulation studies also indicates that these findings about sample size hold across combinations of item types, ability distributions and other experimental conditions.
On the other hand, while it would appear that large sample sizes are necessary, there is evidence that type 1 error (over-identification of DIF) increases with larger sample sizes. Thus, one concern with overly large samples is that even the most trivial differences between groups can be identified as statistically significant even though they are of little practical significance. Hambleton (1989) argues that while small sample sizes fail to capture much DIF, with sample sizes around 5,000 it is conceivable that much of the DIF detected will be of no practical significance. Jodoin and Gierl (2001) have proposed that DIF detection methods using chi-squared tests must have a reliable measure of effect size.

Another factor that can impact DIF results is related to the ability distributions of the two groups under study. In cross-lingual DIF studies ability distributions are often not the same: In fact, gaps may be quite large, which is why several simulation studies have created experimental conditions in which the groups tested differ by as much as one standard deviation. Narayanan and Swaminathan (1996) found that DIF detection rates were higher when examinees were sampled from equal ability distributions for the MH, SIBTEST, and LR methods. The differences in detection rates dropped when two differing ability distributions were analyzed, but not equally across all methods; the biggest drop was 14%, for the LR method. For all three procedures, the type 1 error rates were higher for the unequal ability distributions than for the equal ability distributions. At the .05 level with equal distributions, error rates were 4.1% for the MH and 6.1% for the LR method; with unequal distributions, they were 5.5% for the MH and 9% for the LR method (Narayanan & Swaminathan, 1996).

Simulation studies also tend to agree that item characteristics can impact DIF results. In Rogers and Swaminathan's (1993) study, the items with DIF that were most easily detected by both the LR and MH procedures were items of moderate difficulty and high discrimination. For these items, detection rates were as much as 15% greater than for other types of items (Rogers & Swaminathan, 1993). Hambleton et al.'s (1993) study also indicated that items with lower discrimination were more likely to be missed by the MH procedure, regardless of differences in difficulty. They also found that very difficult items were more likely to be missed by DIF detection methods, regardless of ability level.

The statistical issues raised through the literature review above are directly relevant to this DIF study in the Kyrgyz Republic. Sample size is not likely to be a problem. However, the difference in ability distributions between the two groups under study is large and the item characteristics for the Russian and the Kyrgyz items do differ. In Chapter 6 of the study, I will return to these issues and discuss them in relation to the findings of this study. I now turn to a presentation of the study's methods, Chapter 4.

Chapter 4: Methods

Multiple research methods were employed in this study. Before reviewing each of them, I first introduce examples of the item types analyzed along with descriptive statistics from the 2010 NST. I then highlight the statistical DIF estimation method, logistic regression, utilized in the study. Next, I discuss the purpose and design of the individual item analysis rubrics employed in the substantive review, the process for selecting item evaluators, the steps in administering the rubrics, and the use of group discussion for each item.
In the last two sections I present the methods used for determining the inter-rater reliability of the evaluators' marks and the rank order correlation estimation procedure for determining the relationship between the statistical DIF and evaluators' predictions.

Content and Development of the 2010 NST

The NST is administered at the end of May in all regions of the republic over a two-week period. Examinees receive their NST score reports at the end of June. The NST lasts 3 hours and 35 minutes and in 2010 had 150 test items (CEATM, 2010). The items in this study were taken from the NST verbal reasoning (словесно-логический) domain. This domain consists of four sections: Reading comprehension (24 items, 3 texts), analogies (20 items), sentence completion (10 items), and grammar use (20 items) (Valkova, 2004). All items are multiple-choice with three distractors and one answer key. The verbal reasoning format of the NST contrasts with what was historically assessed for university entry, native language and literature, which focused on knowledge of grammar and literary works (Drummond & De Young, 2004). Descriptive data from the test variant analyzed is presented below. Reliability estimates are presented for the full complement of items from all variants, all verbal items, and for the test items analyzed in this study.

Table 4-1: Descriptive Data from the NST 2010

Descriptive Statistics for Test Variant Analyzed
                        N       Min   Max   Mean     Std. Error   Std. Dev.
Russian (All items)     2,850   47    241   137.20   .743         39.652
Kyrgyz (All items)      1,557   24    204   102.7    .658         25.973
Russian (Verbal)        2,850   10    119   69.35    .388         20.716
Kyrgyz (Verbal)         1,557   10    96    49.18    .298         11.745

Reliability Estimates
                                                Cronbach's Alpha   N Items
All Variants/Math and Verbal Items, Russian     .956               150
All Variants/Math and Verbal Items, Kyrgyz      .896               150
All Variants/Verbal Items only, Russian         .907               60
All Variants/Verbal Items only, Kyrgyz          .702               60
Analyzed Variant/Studied Items only, Russian    .871               40
Analyzed Variant/Studied Items only, Kyrgyz     .660               40

The last two reliability estimates given above are based on 40 verbal items. However, I analyzed only 38 of these items from the analogies, sentence completion and reading comprehension sections because two of the item pairs in fact contained different items: 18 analogy items, 10 sentence completion items, and 10 reading comprehension items were analyzed in total.
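The reliability estimates in Table 4-1 are Cronbach's alpha coefficients. As a point of reference, a minimal sketch of how alpha can be computed from a dichotomous (0/1) item-response matrix is given below; the function name and the randomly generated demonstration matrix are illustrative assumptions, not CEATM's actual analysis files.

import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """Cronbach's alpha for an examinee-by-item matrix of 0/1 responses."""
    n_items = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)       # variance of each item column
    total_variance = responses.sum(axis=1).var(ddof=1)   # variance of examinee total scores
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Random, uncorrelated demonstration data, so alpha will be near zero here;
# the estimates in Table 4-1 came from the actual NST response matrices.
rng = np.random.default_rng(0)
demo = (rng.random((1000, 40)) < 0.6).astype(int)
print(round(cronbach_alpha(demo), 3))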
According to the test developers, the purpose of the analogies and sentence completion sections was to check verbal reasoning skills at the word, sentence and text level. More specifically: "Analogies check (a) lexical richness, (b) ability to analyze logical relations between concepts, (c) ability to find relations (dependencies) between words in pairs, (d) ability to determine similarities or differences by one or several indicators, (e) ability to analyze, synthesize, compare, generalize, and classify" (CEATM, 2007, pp. 14-16). In regard to sentence completion items: "Sentence completion checks (a) the ability to understand logical connections between different parts of verbal expression, (b) vocabulary richness" (CEATM, 2007, pp. 14-16). In regard to the reading comprehension items: "The questions from this section evaluate the ability to carefully read different texts of 400 to 850 words, understand and analyze what has been read. Fragments of texts can be taken from different domains of knowledge: humanities, social science, and physical science. Popular literature is also utilized. This section has two independent texts and two related text fragments for comparison with each other. Each text or pair of texts is accompanied by questions that check: (a) understanding of the content of the text, its basic concept; (b) ability to interpret portions, connections between such portions in the text; (c) connections between the text and the real world; (d) ability to understand hidden meaning; (e) ability to determine the style of the author and his/her disposition, as articulated in the text, and; (f) understanding of the structure of the text and its connection to content. This 60 minute section has 30 items" (CEATM, 2007, pp. 14-16).

Below are two English language versions of the type of items analyzed in this study. These are example items from a previous year, as items from the 2010 test remain secret. Due to the length of the reading comprehension texts I did not translate items from that section here. However, the reading comprehension section is similar to the reading comprehension sections found on tests such as the American SAT or Graduate Record Examination. For more examples of NST items in the Russian or Kyrgyz languages, including reading comprehension sample items, see Valkova (2004) or CEATM (2007).

Table 4-2: Example Analogy and Sentence Completion Items

Analogies
Instructions: Every task has five pairs of words. The highlighted pair of words presents a relationship between two words. Determine the relationship between those two words and then select another pair below with the same relationship. The order of the words should be the same as in the example.
7. music : composer
(А) poem : poet
(B) aerodrome : pilot
(C) fuel : engineer
(D) doctor : patient

Sentence Completion
Instructions: Each sentence below contains two to four blanks. There are four groups of possible answers to complete the sentence. Select the best answer to make the sentence logical.
3. ______ to believe this theory, ______ nobody has ______ yet.
(А) It is easy / because / formulated it
(B) It is not possible / for / refuted it
(C) It is easy / although / proven it
(D) It is common / although / cancelled it
(Valkova, 2004)

The Item Adaptation Process

In 2010, the source language of all sections of the NST except 'grammar use' was Russian.54 As highlighted in chapter two, a large percentage of non-Russians in Kyrgyzstan speak, read, and write in the Russian language with native proficiency. Therefore, finding personnel to adapt items is not difficult. CEATM test developers rely primarily on peer review and substantive methods of item evaluation to determine adaptation quality and equivalence of the adapted test forms. However, they also calculate p-values (difficulty) and discrimination coefficients of items in order to get a more complete understanding of how items are performing. CEATM reports that in addition to the use of back translation (see Table 4-3 below), close cooperation between item development groups and translators is maintained to ensure adherence to test specification(s) in all language versions as well as consistency in item construction and adaptation. Table 4-3 presents the item adaptation process for the 2010 NST.

54 According to Dr. Valkova, the director of the testing center, CEATM has experimented with developing different test sections in various languages in the past. In other words, test items have not always been developed in Russian first and then adapted into Kyrgyz.
As can be seen, the Russian items are evaluated both substantively and statistically, while the Kyrgyz and Uzbek items receive primarily substantive review.

Table 4-3: Flow Chart for Test Item Adaptation
1. Russian items created
↓
2. Russian items pre-tested
↓
3. Russian items are reviewed and revised based on pre-test results
↓
4. Russian items adapted into Kyrgyz and Uzbek
↓
5. K/U translator compares own version to Russian version
↓
6. K/U translator verifies the translation with the author of the Russian items by looking at the K/U items but reading them in Russian as the item author checks original meaning (oral back translation)
↓
7. Specialists in Kyrgyz and Uzbek grammar review the target language translations
Source: Interview with CEATM's head of test item development

Statistical DIF Detection Method

Investigators employ a wide variety of statistical DIF detection methods depending on study aims, skill level of the researcher, resource constraints, and the nature of the specific tests and items examined. The most commonly utilized methods are Item Response Theory (IRT) methods, the Mantel-Haenszel (non-parametric) chi-squared method and logistic regression (LR). Because it cannot be assumed that statistical DIF indices are always correct (i.e., serve as a 100% reliable baseline against which to compare substantive evaluations), it is necessary to carefully select the statistical approach to be used in any DIF study and qualify any findings based on the analyses (Jodoin & Gierl, 2001).

As noted in the literature review, one challenge to DIF estimation is that the ability distributions of the compared groups are typically not the same, especially in cross-lingual DIF studies. In selecting the appropriate statistical method for this study, an important consideration was the large difference in ability distributions between the Russian and Kyrgyz groups. Russian examinees on average have performed consistently better on the NST since its inception in 2002 (Valkova, 2004). Narayanan and Swaminathan (1996) found that DIF detection rates were more accurate when examinees were sampled from equal ability distributions than when unequal distributions were examined. However, if large enough sample sizes are used - and access to large sample sizes was not a problem in this study - this challenge can be addressed to some extent (Hambleton et al., 1993).

After a careful review of methods, I elected to utilize the logistic regression (LR) method for DIF detection as articulated by Swaminathan and Rogers (1990). The LR model is easy to implement for the novice researcher, is flexible, can detect both uniform and non-uniform DIF, and has power comparable to other DIF detection methods (Swaminathan & Rogers, 1990; Zumbo, 1999; Gierl et al., 1999; Jodoin & Gierl, 2001). The LR method is a non-parametric probabilistic approach to DIF detection. In the LR method examinees must represent the complete population of interest because non-representative samples will impact the results (Hambleton et al., 1993).55 Unlike IRT models, non-parametric models utilize observed scores to test for the likelihood of a difference in group performance on an individual item after conditioning on ability. The LR approach to DIF analysis relies on a chi-squared test of statistical significance and has an established measure of effect size.56 In most non-parametric DIF studies, the total test score or sub-score on the instrument examined serves as a practical proxy for ability (Sireci, Patsula & Hambleton, 2005).57
Considering the issues highlighted in the literature review about dimensionality, I elected to condition on verbal scores rather than the total NST score for this study. The logistic regression model for predicting the probability of a correct response to an item is based on (Swaminathan & Rogers, 1990):

P(u = 1 | θ) = e^(β0 + β1θ) / [1 + e^(β0 + β1θ)],

where:
u = the response to the item
θ = the observed ability of an individual
β0 = the intercept parameter, and
β1 = the slope parameter

55 One, two and three parameter IRT models have been used to estimate DIF. Each allows for estimation of item characteristic curves (ICC), which specify the relationship between the probability of success on the item and the underlying ability or trait. A key assumption in IRT models is that the estimates are invariant and do not depend on the sample. This differs from non-parametric models, which utilize observed scores and thus to some extent depend on the samples utilized. For this reason, some have argued that IRT methods are superior because they allow for conditioning on true ability, not observed scores, which are at best proxies for true ability (Camilli & Shephard, 1994).

56 This was not the case with LR originally, until Zumbo (1999) and Jodoin and Gierl (2001) introduced pseudo R-squared measures of effect size.

57 It is important to note that most non-parametric DIF studies measure against internal criteria. In essence, DIF detection assumes at least a modicum of overall validity because if all items were biased (systematically) no DIF would be evident (Hambleton et al., 1993).

According to Swaminathan and Rogers (1990), by specifying separate equations, the probabilistic model presented above can be adapted for two separate groups of interest as:

P(u_ij = 1 | θ_ij) = e^(β0j + β1jθ_ij) / [1 + e^(β0j + β1jθ_ij)],   i = 1, ..., n_j,  j = 1, 2,

where:
u_ij = the response of person i in group j to the item
β0j = the intercept parameter for group j
β1j = the slope parameter for group j, and
θ_ij = the ability of individual i in group j

This model can also be formulated as (Swaminathan & Rogers, 1990):

P(u = 1) = e^z / [1 + e^z],

where:
z = β0 + β1θ + β2G + β3(θG)

In simple terms, the use of this equation allowed the determination of whether or not to reject the null hypothesis of "no difference" in item response for the two groups (Kyrgyz and Russian) on the particular item under study. In this study, a chi-square test of significance was applied to assess this null hypothesis at the .05 level. With 1 degree of freedom at the .05 level, the test statistic was 3.841. It is important to remember that the DIF analysis with LR proceeds at the item level; the data was entered into the equation for each test item individually. In this study that meant thirty-eight separate analyses for determining uniform DIF and thirty-eight separate analyses for determining non-uniform DIF.

For each item analysis, the dependent variable was a dichotomous variable - either a "1" for a correct item response, or a "0" for an incorrect response. On the right-hand side, θ was a measure of examinee ability - the observed sub-score (verbal scores in this case). Language group membership was a categorical variable "G" coded "1" for Kyrgyz or "0" for Russian, sometimes called the reference and focal group, respectively. The term θG represented an interaction between these two independent variables.
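To make the notation concrete, the combined model above translates directly into code. The sketch below is only an illustrative rendering of the equation with made-up parameter values; it is not the SPSS routine used in the study.

import math

def p_correct(theta: float, group: int,
              b0: float, b1: float, b2: float, b3: float) -> float:
    """Probability of a correct response under z = b0 + b1*theta + b2*G + b3*(theta*G)."""
    z = b0 + b1 * theta + b2 * group + b3 * (theta * group)
    return math.exp(z) / (1 + math.exp(z))

# Made-up parameter values: with b2 > 0 and b3 = 0, the item uniformly favors
# the group coded 1 (Kyrgyz) at every ability level.
print(p_correct(theta=50, group=1, b0=-3.0, b1=0.05, b2=0.8, b3=0.0))
print(p_correct(theta=50, group=0, b0=-3.0, b1=0.05, b2=0.8, b3=0.0))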
In DIF studies using LR methods, a significant interaction means there is evidence of "non-uniform DIF." Non-uniform DIF occurs when differences between two groups are not the same across all ability levels (Swaminathan & Rogers, 1990). For example, Russian examinees might perform better at the upper ability levels, but worse at the lower ability levels on the same item, or vice versa. In sum, the parameters β0, β1, β2, and β3 represented the intercept followed by the weights for ability, language group, and the ability by language group interaction term, respectively (Jodoin & Gierl, 2001).

Jodoin and Gierl (2001) propose assessing separately for uniform and non-uniform DIF in order to capitalize on the use of a 1 degree of freedom model. Using the steps they recommend, I assessed each item in a two-step process. In order to assess for uniform DIF, two models were identified. The "compact model" - where z = β0 + β1θ - was entered first. The presence of uniform DIF was then tested by examining the improvement in chi-square model fit when the group membership (G) term was added, the "full model" (z = β0 + β1θ + β2G). The chi-square value of the "compact model" was then subtracted from the chi-square value of the "full model" and this difference was compared to the test statistic for statistical significance.

Then, the presence of non-uniform DIF was tested in similar fashion by examining the improvement in chi-square model fit associated with the "full model" (above) and the addition of the interaction term (θG) (z = β0 + β1θ + β2G + β3(θG)). In other words, the chi-square value from the "full model" (z = β0 + β1θ + β2G) was subtracted from the chi-square value from the third model with the interaction term and compared to the test statistic for significance (Jodoin & Gierl, 2001). In practical terms, for both the uniform and non-uniform tests, chi-square values lower than 3.841 indicated a very close correspondence in item response patterns between the two groups: I.e., I did not reject the null hypothesis of "no difference." Further, such "identical" items had a β2 (group) value at "0," or close to it. The Exp(B), or odds ratio, for "no DIF" items was at or close to "1."

In the LR model, which group is favored is determined by the sign of the β2 value. When β2 > 0, the uniform DIF favored the reference group (Kyrgyz language). When β2 < 0, the uniform DIF favored the focal group (Russian language). In general, non-uniform DIF is present when β3 ≠ 0, regardless of the value of β2. When β3 > 0, the item favored high ability Kyrgyz and low ability Russian examinees. Items with negative values for β3 favored high ability Russian and low ability Kyrgyz examinees (Jodoin & Gierl, 2001).

An early criticism of the LR approach was that it did not have a measure of effect size (Kirk, 1996). This was considered a weakness, as the power of the statistical test is somewhat dependent on sample size and large samples have a tendency to generate high type 1 error (over-identification of significance). In recent years this problem has been addressed by Zumbo (1999) and Jodoin and Gierl (2001) with the introduction of R²Δ (R-squared delta), a weighted least squares effect size measure. In this study I utilized the R²Δ effect size measure proposed by Jodoin and Gierl (2001).
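For readers who want to see the two-step procedure in code, the sketch below approximates it with maximum-likelihood logistic regression in Python. The column layout, function name, and use of the statsmodels library are my own assumptions rather than the SPSS workflow actually used; the improvement in chi-square model fit between nested models equals twice the difference in their log-likelihoods, which is what the likelihood-ratio statistics below compute.

import numpy as np
import statsmodels.api as sm

CRITICAL = 3.841  # chi-square critical value, 1 df, alpha = .05

def lr_dif_tests(y, theta, group):
    """Two-step logistic regression DIF test for a single item.

    y, theta, group: one-dimensional NumPy arrays of equal length, where
    y is the 0/1 item response, theta the conditioning ability (the verbal
    sub-score in this study), and group is 1 = Kyrgyz, 0 = Russian.
    """
    X1 = sm.add_constant(np.column_stack([theta]))                        # compact: ability only
    X2 = sm.add_constant(np.column_stack([theta, group]))                 # full: ability + group
    X3 = sm.add_constant(np.column_stack([theta, group, theta * group]))  # full + interaction

    m1 = sm.Logit(y, X1).fit(disp=0)
    m2 = sm.Logit(y, X2).fit(disp=0)
    m3 = sm.Logit(y, X3).fit(disp=0)

    uniform_chi2 = 2 * (m2.llf - m1.llf)      # improvement from adding the group term
    nonuniform_chi2 = 2 * (m3.llf - m2.llf)   # improvement from adding the interaction term
    beta2 = m2.params[2]                      # sign indicates which group is favored

    return {
        "uniform_chi2": uniform_chi2,
        "uniform_significant": uniform_chi2 > CRITICAL,
        "nonuniform_chi2": nonuniform_chi2,
        "nonuniform_significant": nonuniform_chi2 > CRITICAL,
        "beta2": beta2,
        "odds_ratio": float(np.exp(beta2)),
    }

Run once per item pair, a routine of this kind would yield per-item chi-square differences, β2 values, and odds ratios of the sort reported in Appendix K.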
Below I outline the steps I took to test for "practical DIF significance." After testing for the statistical significance of each item, it was essential to interpret the results in terms of practical significance through the effect size measure. If the null hypothesis of "no DIF" was not rejected, there was no need to employ the effect size measure. However, if the chi-square test was significant, the R²Δ needed to be assessed. For example, to determine the magnitude of significance for an item identified as statistically significant uniform DIF, the R² for the test score term (θ, compact model) was subtracted from the R² for the group membership term (G, full model). To determine the magnitude of significance for an item identified as non-uniform DIF, the R² for the group membership term (G, model 2) was subtracted from the R² for the interaction term (θG) model.

The resulting R²Δ levels were then interpreted in light of Jodoin and Gierl's (2001) R²Δ effect size measures. In simulation studies, Jodoin and Gierl (2001) demonstrated that this approach results in more powerful detection and lower type one error. These effect size measures were effective in trials with both simulated data and real data (Zheng, Gierl, & Cui, 2005). In a study by Gierl, Rogers and Klinger (1999), the R²Δ effect size measure utilized in the LR method correlated at .91 with the MH effect size measure for an analysis of math items and at .93 with the MH effect size measure for a social studies test. The values utilized to classify the practical significance of DIF were the following:

• Negligible DIF: R²Δ < .035
• Moderate DIF: .035 ≤ R²Δ < .070, and the null hypothesis is rejected
• Large DIF: R²Δ ≥ .070, and the null hypothesis is rejected

In order to demonstrate how I utilized the logistic regression method and effect size measure proposed by Jodoin and Gierl (2001) in this study, I present two example item analyses here, for items 7 and 32 (uniform DIF). For item 7, the chi-square value for the compact model (z = β0 + β1θ) was 159.771. The chi-square value for the full model (language group added, z = β0 + β1θ + β2G) was 161.089. The difference in these two chi-square values was 1.318, lower than the test statistic of 3.841 at the .05 significance level. The β2 (group variable) was low, estimated at .122. Recall that a β2 value at zero or very close to zero indicates no difference. The odds ratio for this item, Exp(β), was 1.13, and odds ratios at or close to 1 indicate the same odds of response for both groups. For item seven, the null hypothesis of no difference was not rejected and the response patterns to the Russian and Kyrgyz versions have a close one-to-one correspondence after controlling for ability; i.e., there was "no DIF" for this item pair, neither statistical nor practical.

For item 32, the difference in chi-square values between the compact model and the full model was 96.334, statistically significant and much higher than the test statistic, 3.841. The R²Δ (effect size) difference was .057. The β2 was also far from zero, at 1.101. Further, the odds ratio was not near 1, but 3.007. This means that the odds of a correct response for the Kyrgyz group were just over three times those of the Russian group (recall that a positive β2 means the item favors the Kyrgyz group). All of the 38 items were analyzed and interpreted in turn per the above steps for both uniform and non-uniform DIF.
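The classification rules and the two worked examples condense to a few lines of code. The helper below simply restates the cut-offs above, with the R²Δ values assumed to come from the pseudo R-squared differences of the fitted models as described earlier; the function name is my own.

def classify_dif(chi2_diff: float, r2_delta: float, critical: float = 3.841) -> str:
    """Classify practical DIF using the Jodoin and Gierl (2001) R-squared-delta cut-offs."""
    if chi2_diff <= critical:
        return "no statistical DIF"   # null hypothesis of "no difference" not rejected
    if r2_delta < 0.035:
        return "negligible DIF"
    if r2_delta < 0.070:
        return "moderate DIF"
    return "large DIF"

# The two worked examples from the text (effect size is not needed for item 7
# because its chi-square test was non-significant):
print(classify_dif(chi2_diff=1.318, r2_delta=0.0))     # item 7  -> "no statistical DIF"
print(classify_dif(chi2_diff=96.334, r2_delta=0.057))  # item 32 -> "moderate DIF"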
The results of these analyses are presented in the next chapter and can be found in full in Appendix K (uniform DIF) and Appendix N (non-uniform DIF).

Preparing for the Statistical Analysis

Before analyzing each item for DIF with the LR method I took several preliminary steps. First, I physically examined the test booklets from both languages to ensure that the items were indeed the same for both language versions. The result of this investigation revealed that of the 40 items initially selected for analysis, two item pairs (items 1 & 6) actually contained different test items. Based on their own preliminary analyses, the test center believed that the original items were not satisfactory and resolved to utilize two completely different items. I thus removed these two items from the analysis. After confirming that the rest of the items were in fact the same, I requested the item response data from the test center for the test version under study. Data was provided in Excel format and included an indicator for the language version of the test, an item response matrix which included a dichotomous "1" or "0" (correct or incorrect) for each item, and a verbal score (scaled) for each student from the analogy, sentence completion, and reading comprehension sections. Each student was denoted by an eight-digit identification number which was tied to the student's test registration center.

Sample Selection

There is converging evidence from DIF detection studies that larger sample sizes enable more accurate DIF detection (power). In Rogers and Swaminathan's (1993) comparison of MH and LR methods, they discovered that detection rates increased by 15% when the sample size was increased from 250 to 500 for both methods. In Mazor et al.'s (1992) study, various sample sizes were created from 100 to 2,000 per group. They found that when the smaller sample sizes were used perhaps only 45-65% of DIF items were being correctly identified, while in the larger samples, 65-85% of DIF items were being correctly identified. Hambleton et al.'s (1993) review of simulation studies confirmed that these findings about sample size hold across combinations of item types, ability distributions and other experimental conditions. However, there is also evidence that type 1 error (over-identification of DIF when none is actually present) can increase with larger sample sizes. Thus, one concern is that even the most trivial differences between groups can be identified as statistically significant even though they are of little practical significance. While small sample sizes fail to capture much DIF, with sample sizes around 5,000 it is conceivable that much detected DIF will be of no practical significance, i.e. have an unacceptable type 1 error rate (Hambleton, 1989). Hence the need for large (but not excessively large) sample sizes of roughly 200 to 1,000 per group.

In 2010, 30,264 examinees sat for the NST: approximately 18,720 in Kyrgyz, 10,994 in Russian and 1,000 in the Uzbek language (CEATM, 2010). However, there were several versions of the NST, each with about 4,000-6,000 examinees. The test version provided by CEATM had a total of 4,407 examinees and was administered to both rural and urban participants throughout the country, including examinees from the capital and surrounding areas. This selection included a total of 1,550 Kyrgyz language and 2,850 Russian language examinees. From this test version, using SPSS software, I randomly selected a sample of 1,000 examinees per language group to be analyzed.58

58 The investigator did not have access to the schools or names of the individual examinees who sat for the 2010 NST.
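The sampling step itself is straightforward; a rough equivalent of what was done in SPSS is sketched below, with a randomly generated stand-in for the CEATM data file (the real export and its column names differ).

import numpy as np
import pandas as pd

# Stand-in for the CEATM export; the real file held an examinee ID, a language
# indicator, the 0/1 item-response matrix, and a scaled verbal score.
rng = np.random.default_rng(2010)
data = pd.DataFrame({
    "language": rng.choice(["Kyrgyz", "Russian"], size=4407, p=[0.35, 0.65]),
    "verbal_score": rng.integers(10, 120, size=4407),
})

# Draw 1,000 examinees per language group, mirroring the random selection described above.
sample = data.groupby("language").sample(n=1000, random_state=2010)
print(sample["language"].value_counts())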
The Individual Item Analysis Rubrics

In order to answer the research questions, item analysis rubrics had to be developed that would capture not only the evaluators' estimations of content, meaning and difficulty differences between item pairs, but also elicit hypotheses about the cause or source of those differences. They needed to be short enough to allow efficient administration but thorough enough to ensure that essential data was captured to facilitate interpretation. I designed the rubrics based on insight gleaned from similar studies (Allalouf et al., 1999; Reckase & Kunce, 2002; Ercikan, 2002; Ercikan et al., 2004). An overall path model for the process of collecting data through the individual rubrics is provided as Appendix G.

After consultation with the director of the Center for Educational Assessment and Teaching Methods (CEATM), the items selected for analysis came from the NST 2010. The items to be analyzed were collated in test booklets. As the construction of the test item booklets required access to the test items, this booklet was put together only after I arrived in Bishkek. The test booklets (rubric 1.a) consisted of each of the 38 item pairs, one item pair per page. There was also space to write notes and a place to mark whether or not items were identical or exhibited differences. Rubric 1.b was a graphic organizer which required evaluators to provide an initial categorization of the type of differences (if any). English versions of rubrics 1.a and 1.b are presented together as Appendix I.

For rubric 1.a the evaluators first attempted to correctly answer all the items in both the Kyrgyz and Russian versions in the test booklet. This was a "blind review" in the sense that the evaluators did not know which items had been identified as DIF by the statistical methods (Ercikan, 2002). Evaluators took notes only on the most important problems that arose. After going through all items, item pairs coded as "identical" on rubric 1.b were set aside as they were not needed for the completion of rubric 2.

I developed and translated rubric 2 before I arrived in country. Rubric 2 had the following sections: (2.1) estimation of the level of difference(s) in content, meaning, or difficulty (if any) between the two items in the pair; (2.2) the specific nature of the difference(s); (2.3) description of the difference(s) in detail; (2.4) estimation of which group might be advantaged (favored) by the differences; (2.5) suggestions for improving the equivalency of the item pairs. Rubric 2 was printed in three colors for three categories of difference: Content (violet form), format (green form), or cultural/linguistic (pink form). This color scheme allowed the researcher to easily collate the forms by the nature of the issue during later analysis. English and Russian versions of rubric 2 can be found in Appendix J. Section 2.1, level of difference(s), required evaluators to classify each pair of items as "somewhat similar," "somewhat different," or "different" in meaning, content or difficulty.
A coding scheme, adapted from both Ercikan's (2002) and Reckase and Kunce's (2002) work, defined these terms as follows:59

0 - Identical: no difference in meaning, content, or difficulty between the two versions;
1 - Somewhat similar: small differences in meaning, content, or difficulty between the two versions, not likely to lead to differences in performance;
2 - Somewhat different: clear differences in meaning, content, or difficulty between the two versions, may or may not lead to differences in performance between the two groups;
3 - Different: differences in meaning, content, or difficulty between the two versions that are expected to lead to differences in performance between the two groups.

59 The actual choices on rubric 2 were only "somewhat similar," "somewhat different," and "different," because evaluators did not fill in rubric 2 for the items they marked as "identical" on rubric 1.b.

Before presenting the process for the administration of the item analysis rubrics, I first review how the participating bi-lingual item evaluators were selected.

Selecting the Evaluators

Recall from Chapter 1 that there are no professional psychometricians in the Kyrgyz Republic. There are many educators with experience adapting textbooks and other educational materials from Russian into Kyrgyz, but few with experience in cross-lingual standardized test development.60 As there has been no standardized testing until recent years, there is not a large pool of human resources to draw upon with experience in DIF studies or item review. Since 2002, the test center has relied upon bi-lingual educators and translators in test development (item writing) and adaptation. Through experience, the test center has gradually identified those who have shown ability in this area, and they maintain a small pool of personnel with whom they work on a short-term basis as needs arise.

60 With the exception of a few CEATM employees and ministerial assessment specialists who have been receiving training since 2002.

It was important that the pool of selected evaluators be as skilled as possible, bi-lingual, and preferably with some experience in testing, but at the same time not have direct experience with the particular 2010 items. In other words, the challenge was to select a pool of evaluators who were a proxy for "as qualified as any other feasible sample" of the potential evaluators, but who did not have a conflict of interest (an inability to evaluate objectively) due to experience working with the 2010 items. It was decided that eligible candidates could be those with experience writing or adapting NST test items in previous testing years, item writers who worked on other sections of the NST, translators with good reputations, and content specialists who were known to be bi-lingual and relatively knowledgeable about assessment issues. Ultimately, four of the ten evaluators selected had never written nor adapted test items at any point in their professional careers, two had been item writers for previous iterations of the NST, three had been item writers for NST 2010 items not evaluated in this study, and one evaluator was selected who had participated in the adaptation of the NST 2010 items under study.

Selection of competent bi-linguals was essential. Perhaps the biggest challenge in selecting the evaluators was ensuring that all participants were as close to being purely bi-lingual as possible.
While finding bi-linguals was not difficult in Kyrgyzstan, pure bi-lingualism is rare, as bi-linguals are usually stronger in one language than in the other (Korth, 2005). I included not only linguists and translators in the evaluation process but also teachers. This is because the item review process requires not only the identification of linguistic differences in the two language versions, but also a judgment as to whether these differences might lead to performance differences (Mazor, 1993; Ercikan et al., 2004). As highlighted in Chapter 2, it is primarily ethnic Kyrgyz who are bi-lingual in Russian and Kyrgyz, as Russian speakers of other nationalities tend not to know Kyrgyz (Korth, 2005). However, there is a wide spectrum of skills and knowledge amongst those who claim to be Kyrgyz-Russian bi-lingual. Table 4-4 below presents an approximate typology of Kyrgyz and Russian knowledge levels found in the ethnic Kyrgyz population.

Table 4-4: Typology of Ethnic Kyrgyz Russian Language Knowledge
Kyrgyz Only | Kyrgyz Primary (Basic Russian) | Kyrgyz Primary (Good Russian) | Bi-lingual | Russian Primary (Good Kyrgyz) | Russian Primary (Basic Kyrgyz) | Russian Only

Potential evaluators were identified with the assistance of test center employees. Each prospective candidate was contacted and provided with information about the study. If they agreed to participate, they first completed a brief questionnaire which elicited detailed information about their language knowledge and skills as well as their educational backgrounds. In order to encourage only true bi-linguals to participate, participants were informed in an interview that they would be required to use both Russian and Kyrgyz equally, not only on the individual written analysis but in discussion with their peers – many of whom would be translators, linguists and other knowledgeable specialists. As part of this investigation, evaluators would be required to state and perhaps defend their views on the test items under study using both languages. Several of the candidates who initially applied declined to participate in the study after they learned about this requirement.

Through the survey, each candidate provided information about his or her professional background and language ability. I then selected the ten evaluators who provided a balance in terms of competency levels in both languages. All the evaluators had completed higher education and nine of the ten were women. The majority were women because women are overrepresented in teaching and in areas related to translation and linguistics in the republic (De Young et al., 2006). The majority of participants selected more than one profession. This is because in Kyrgyzstan bi-lingual educators often serve in many capacities: As translators, teachers, test item writers, consultants, or in other capacities on a short-term basis in addition to their primary place of work (ibid, 2006). This broad spectrum of professional experience was beneficial as bi-linguals who know the school program or educators who have experience creating test items can approach the evaluation task from a multitude of perspectives and with practical experience in a relevant discipline. None of the selected evaluators had ever participated in a formal DIF study before. The table below presents the characteristics of those selected to serve as evaluators.
Table 4-5: Background Characteristics of Selected Evaluators

Profession(s): Teacher (secondary and tertiary) (5), Test item writer (3), Philologist/language specialist (6), Methodologist (1), Translator (5), Linguist/editor (2), Lawyer (1)

Language Medium                      Kyrgyz   Russian   Both/Equal
Medium of secondary education?       5        5         0
Medium of higher education?          2        5         3
Main medium at work?                 1        2         7
Main medium at home?                 4        0         6
Medium in which you think?           2        4         4
Slightly more literate in?           3        4         3

In terms of schooling, half the evaluators completed their secondary education in the Russian language medium and half in the Kyrgyz language medium. Three evaluators received higher education in both languages while only two completed their higher education in the Kyrgyz language medium. Seven evaluators reported using both languages at work and six of them reported using both languages in the home. None of the evaluators reported that Russian was their primary home language. Interestingly, however, four evaluators reported that they "think" primarily in the Russian language. Four marked that they were slightly more literate in Russian than Kyrgyz, three marked that they were slightly more literate in Kyrgyz than Russian, and four marked that they were equally literate in both languages. All participants signed consent forms and were compensated for their work.

Administering the Rubrics

The administration of the item analysis rubrics and group discussion required three half-days of work to complete. Prior to convening, each evaluator received a glossary of technical terms which defined all key concepts (English version, Appendix H). Evaluators familiarized themselves with this material prior to coming to the analysis on June 19th. On June 18th I conducted a pre-test of the rubrics with one evaluator in order to determine if adjustments were needed to the rubric or glossary. The pre-test yielded important results: In addition to the discovery of some minor formatting and typographical mistakes, in a debriefing the pre-test evaluator reported that the most challenging aspect of the rubric was interpreting the coding categories in section 2.2. Although definitions of "adaptation, translation, format and cultural issues" were provided in the glossary, the pre-test participant claimed that these categories were easily confused and open to various interpretations. She noted, for example, that she spent an inordinate amount of time attempting to classify whether a problem with an item was a "cultural" or a "linguistic" problem. She questioned the utility of coding the nature of the problem and was in favor of more focus on the description of the problem (section 2.3).

As the main purpose of the rubrics was to get an estimation of differences and gather good descriptive data about each item, I instructed the other nine evaluators to focus on sections 2.1, 2.3, 2.4 and 2.5. Emphasis was placed on section 2.3, description of the issues that they discovered with each item. It is indeed the task of the researcher to characterize and interpret what kinds of problems were being discovered, after collecting the data from all ten evaluators. However, as the full rubrics had already been printed, section 2.2 was left intact. In the section that follows I present the steps of the data collection by each day's activities and tasks. The evaluator panel was convened at 98 Tynustanova Street at the Center for Educational Assessment and Teaching Methods (CEATM) at 9:00 am on June 19th, 2010.
All ten evaluators came on time and participated in a forty-five minute overview of the item evaluation process. Evaluators were then split into two groups and each group started with different item numbers. One group started with item 2 while the other started with item 20. This ensured that all items received at least a minimum amount of coverage. Then, evaluators were seated in individual work stations and began their individual analyses. The first task was for the evaluators to answer and analyze the thirty-eight test items in rubric 1.a (the test booklet). Then, they provided an initial mark as to the nature of the difference (if any) on rubric 1.b. Each evaluator completed the analysis individually. Evaluators wrote their comments in the rubrics in Kyrgyz and Russian. This process took approximately three and one half hours. All rubrics were collected at approximately 13:00 and stored in a secure location until the continuation of work the next day.

On Sunday, June 20th, all ten evaluators arrived again at 9:00 am and worked until lunch time. Their task was to complete rubric 2 for each item they had marked with any rating other than "0 = identical" the day before. This step required the evaluators to take their notes from day one and code their comments on the four sections (2.1, 2.3, 2.4 and 2.5) presented above. This stage of the process took approximately four hours to complete. A fifteen minute coffee break was organized after the second hour. At the end of this session, the booklets and rubrics were collected and analyzed in the evening for key patterns and issues.

I reviewed the rubrics on June 20th because the time allocated on day three for discussion was three hours: it was essential to make sure that items were prioritized for discussion. The initial review focused on the estimated "level of differences" (section 2.1) and "description" (section 2.3) for each of the items. If certain items elicited high marks, much commentary or varying views, it was essential that the group discuss these issues on day three. As it turned out, the time for the group analysis on day three was adequate to cover all the items.

Group Item Analysis

A three hour group discussion was held on Monday, June 21st. Including this discussion time, the total time spent with evaluators was approximately ten and one half hours. I facilitated the discussion in the Russian language and a note taker from the test center recorded the conversations. As facilitator, I allowed the conversation to flow but on occasion needed to intervene to keep the discussion on track. Areas of agreement and disagreement were noted and recorded. Evaluators shared their thoughts and feedback freely about each item. Data from these discussions were later utilized to examine the relationship between evaluators' marks and the DIF statistics as well as to disentangle the many potential sources of DIF on the test items. The English version of the group discussion for each item is presented in the summary rubrics in Appendix W.

While evaluators marked each item individually, it was important to come to agreement about how to interpret their total marks as a group for each item. In order to establish an operational definition of group "DIF prediction," each evaluator stated their opinion on how to best interpret their marking scheme. In simple terms, it was necessary to determine how many marks by evaluators would serve as "a vote for DIF" from the group.
Several opinions were stated, but ultimately they agreed that four total marks in any combination from the two "upper categories" of "somewhat different" or "different" would be considered a vote for DIF. Recall that these are the marks that received 2 or 3 points. While the term "group discussion" has been used up to this point, the term "group analysis" will be used going forward. The term group "analysis" underscores the point that throughout the discussion process, evaluators continued to analyze, study, and process the items. In the discussion of some items, evaluators changed their minds, saw the items in a different light, debated, argued, or discovered nuances of the item pairs that they had not noticed during their individual analyses. Thus, group analysis better characterizes what actually happened during the discussion of each item.

Summary Rubric

Descriptive data from the individual analyses and discussion notes provided data about evaluators' predictions of DIF levels and information about causes of DIF. As over 150 individual rubrics were filled in, it was important to have a way to collate this data in summary form. All data from each evaluator were thus recoded onto one summary rubric. For each of the 38 items, the full range of commentary from all ten item evaluators is coded in one place (Appendix W). For example, under section 2.3 for each of the items on the summary rubric, each bullet point and comment represents a statement from a different evaluator. All comments from the individual rubrics were translated verbatim without editing or synthesis on the summary rubric. This presentation of the full data allows the reader to see the entire scope of comments for each item. Further, it allows the reader to see the "strength of agreement" in the commentary. For example, if six or seven individuals all seem to be saying the same thing, this is visible. Or, the opposite: if only one or two people are noting certain issues or tendencies, this is also on display.

The summary rubrics presented as Appendix W differ from the individual rubrics completed by each evaluator in a few important ways. On the summary rubric section 2.2, the "nature of difference" data was not recoded from each of the individual rubrics. Recall that after the pre-test, evaluators were instructed to focus on item description in section 2.3 and not to worry about the accuracy of their coding in section 2.2. The a priori coding categories under section 2.2 were used to guide evaluators' thinking in how best to characterize the differences between the item versions. The "level of difference" on section 2.1 of the summary rubric was coded under the color-coded categories (content, cultural/language, or format) as submitted by each evaluator. I used these categories as a way to collate the data but did not focus on the consistency of the evaluators in marking these categories. Notice in the summary rubrics that evaluators' comments about the same issue often fell under different headings. A difference that was defined by one as "cultural," for example, might have been characterized by another as a "content" issue. The important data for analysis was the totality of the description, not how the issues were coded according to each individual evaluator. Otherwise, the summary rubric in Appendix W reflects the same organizing principles and data as collected from each of the individual rubrics.
Estimating Inter-Rater Reliability

After collecting data from the evaluator rubrics, the group discussion and the statistical analyses, there was considerable data about both the perceived DIF levels and the actual DIF levels based on the statistics. An important question in the use of evaluators/raters in any study is the extent to which their estimations can be considered reliable. As the bi-lingual evaluators represent a sample of a larger population of possible evaluators, it was necessary to see how much measurement error existed. Thus, the first step of analysis was to determine the inter-rater reliability of the evaluators' marks and how much variation there was in their estimations. In order to do this, an intraclass correlation coefficient was estimated with SPSS software. Intraclass correlations are ratios of rating variance to total variance and can be used as reliability coefficients for assessments of raters that are deemed to be in the same category or class (McGraw & Wong, 1996).

In order to estimate this coefficient I first had to develop a scoring system that would allow the coding of the evaluators' marks for each item. Recall that on section 2.1 of rubric 2 each evaluator estimated the level of difference between the item versions under study. The coding scheme was "0" (identical), "1" (somewhat similar), "2" (somewhat different), and "3" (different). In order to estimate reliability, I produced a matrix of their scores for each item in an Excel file. Each column represented an evaluator and the thirty-eight rows represented each of the items analyzed. In the matrices I placed their marks of 0, 1, 2 or 3 in each cell based on their perception of the level of differences in the item pairs as defined above.

Before conducting this analysis I reviewed the data from their individual evaluation rubrics and decided to drop two evaluators from the analysis. The one evaluator who had worked as a translator on the NST 2010 filled out only six total rubrics and his rubrics contained a considerable amount of missing values.61 A second evaluator filled out the rubrics incorrectly, using the same single rubric to record marks for many different items. This led to confusion and I could not determine which marks were meant for which items. Approximately one third of her rubrics were filled in this way. Using these rubrics would have demanded considerable guesswork in trying to interpret the intent of this evaluator. Nonetheless, after dropping these two evaluators, a group of eight evaluators remained to provide an ample number of marks for each of the items.

61 While it might be expected that one of the specialists who worked on the NST 2010 adaptation was not likely to offer critical commentary, it was nevertheless important to include one of them in the study. Moreover, this individual, having more experience with the items, made an especially valuable contribution to the group discussion. Note that on the rubrics in Appendix W the total number of marks comes from the eight retained evaluators: That is, all estimations utilized only the marks from the eight evaluators. However, the commentary under section 2.3 and the group discussion comes from all ten evaluators.

After these two evaluators' data were removed, the marks from the eight remaining evaluators were examined for missing data. There were 13 missing entries from a total of 304 possible entries (38 items x 8 evaluations). I imputed data for these missing scores by entering the average score from the other seven evaluators into each cell where data was missing. I then calculated the reliability coefficients in SPSS. Two-way random effects models are used when both the people (rater) effects and the measures effects are random. I selected "two-way ANOVA, random" and selected "consistency." I then selected "absolute agreement" to see if there would be differences in these estimates. I report the results of these analyses in the results chapter.
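For reference, the two-way random-effects intraclass correlations can also be computed outside of SPSS. The sketch below uses the single-measure formulas from McGraw and Wong (1996) on a randomly generated stand-in matrix; SPSS additionally reports average-measures versions and confidence intervals, which are not reproduced here.

import numpy as np

def two_way_random_icc(x: np.ndarray) -> dict:
    """Single-measure two-way random-effects ICCs (McGraw & Wong, 1996)
    for an items-by-raters matrix x."""
    n, k = x.shape
    grand = x.mean()
    ss_total = ((x - grand) ** 2).sum()
    ss_items = k * ((x.mean(axis=1) - grand) ** 2).sum()    # between-item variation
    ss_raters = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between-rater variation
    ss_error = ss_total - ss_items - ss_raters
    ms_items = ss_items / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    consistency = (ms_items - ms_error) / (ms_items + (k - 1) * ms_error)
    agreement = (ms_items - ms_error) / (
        ms_items + (k - 1) * ms_error + (k / n) * (ms_raters - ms_error)
    )
    return {"ICC consistency": consistency, "ICC absolute agreement": agreement}

# In the study, the 38 x 8 matrix of 0-3 marks came from the evaluators' rubrics;
# missing cells (NaN) would first be replaced by the item's mean across the other raters:
#   marks = np.where(np.isnan(marks), np.nanmean(marks, axis=1, keepdims=True), marks)
# Random integers are used here purely so the sketch runs on its own.
rng = np.random.default_rng(0)
marks = rng.integers(0, 4, size=(38, 8)).astype(float)
print(two_way_random_icc(marks))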
Estimating Evaluators' Accuracy in DIF Prediction

The key question of interest in this study was the extent to which evaluators could accurately predict statistical DIF. Therefore, the relationship between evaluators' DIF predictions and the statistical outcomes needed to be established. Evaluator DIF predictions consisted of two separate steps. First, they had to estimate the extent of differences between the items in a pair. Second, they had to predict which group, if any, would be favored by these differences. In order to assess the relationship I conducted a rank order correlation analysis between the marks of the evaluators and the chi-square difference values for all 38 items. Recall that as the chi-square values ascend, they move towards DIF and away from the item equivalence indicated by values below the test statistic, 3.841. That is, the DIF items have the highest chi-square difference values while the non-significant items have very low chi-square difference values.

The correlation was estimated in the following manner. Recall that in order to quantify the meaning of the distinct "levels of difference," I assigned points for the different levels of categorization (0, 1, 2, and 3). These scores were totaled across all eight evaluators to produce a combined total score for each item: Higher scores thus represented a stronger belief in DIF while lower scores represented a weaker belief in DIF. After calculating the scores for each item, I conducted the rank order correlation analysis using Spearman's rho in SPSS. I employed Spearman's rho because it is thought to be less sensitive to outliers in the data than Pearson's coefficient (SPSS User's Guide, Version 16). The results of this analysis are presented in the next chapter.
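The rank order correlation can be reproduced with any standard statistics package; the sketch below uses SciPy with purely illustrative numbers rather than the study's actual 38 item scores and chi-square differences.

import numpy as np
from scipy.stats import spearmanr

# Illustrative values only; the study used the 38 summed evaluator scores and the
# 38 uniform-DIF chi-square difference values from the logistic regression analyses.
evaluator_totals = np.array([2, 5, 16, 4, 9, 7, 12, 3])
chi2_differences = np.array([1.3, 4.2, 96.3, 2.8, 15.6, 6.1, 40.2, 3.0])

rho, p_value = spearmanr(evaluator_totals, chi2_differences)
print(round(rho, 2), round(p_value, 3))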
This chi-square difference value is checked against the test statistic of 3.841 with 1 degree of freedom to assess significance. Thus, in the table, the lowest values (and non-significant items) come first, while the four practical DIF items have the highest values and are the last four items in ascending order. Of the four items identified as practical DIF, three favored the Russian group while one favored the Kyrgyz group. Of the 32 items classified as negligible, moderate or high DIF, 18 items favored the Kyrgyz group and 14 items favored the Russian group. The statistical results for the four items classified as practical DIF are presented separately in Appendix L. The six items with no statistically significant DIF are presented in Appendix M. Recall that there was no need to report an effect size for non-significant items since all of their chi-square difference values were below the test statistic, 3.841.

All items were also tested for non-uniform DIF in the same two-step process. This time the compact model included the group (language) variable and the full model included an interaction term, β3(θG). Twenty-one of the items had no statistically significant non-uniform DIF. Seventeen items had statistically significant chi-square values but all were classified as "negligible DIF." The largest effect size (r-squared delta) was .018. Thus, there were no practically significant non-uniform DIF items. The full results of this analysis are presented in Appendix N. These items are also arranged by chi-square difference values in ascending order.

In Table 5-1 below I present the percentage of items in each uniform DIF category by item type. As is evident, the majority of items under study fell into the negligible DIF category. In order to be classified as negligible DIF (statistically, but not practically, significant) an item had to be statistically significant with an effect size below .035 (Gierl & Jodoin, 2001). The effect sizes of the 28 negligible uniform DIF items ranged from .003 to .031: The higher the effect size, the closer to the cut-off for moderate DIF at .035. To enhance interpretation, I split the group of negligible DIF items into halves. This is because there appeared to be "clustering" by item type along the effect size distribution. The median effect size value was .009. When put in rank order, an item with a .009 effect size is the 14th item in the range of 28 negligible items. Note in Table 5-1 below that the analogy items were spread throughout all classification categories relatively evenly. The sentence completion and reading comprehension items were concentrated more heavily in particular categories.

Table 5-1: Items (%) by Effect Size Levels and Item Type

Item Type               Non-DIF   Neg. Low   Neg. High   Practical DIF   Total
Analogy                 22%       28%        33%         17%             100%
Sentence Completion     20%       20%        60%         0%              100%
Reading Comprehension   0%        70%        20%         10%             100%

Of the reading comprehension items, 90% were categorized as negligible DIF. Of the sentence completion items, 80% were categorized as negligible DIF. However, the reading comprehension items tended to cluster with lower effect size values (below the .009 median) while the sentence completion items tended to cluster with higher effect size values (above the .009 median). Fifty percent of all the items below the median in the negligible DIF category were reading comprehension items while only 14% of them were sentence completion items.
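A breakdown like Table 5-1, including the median split of the negligible items, can be tabulated directly from per-item results. The pandas sketch below is illustrative only; the file name and the column names (item_type, dif_category, r2_delta) are hypothetical stand-ins, not the study's actual variables.

```python
import pandas as pd

# Hypothetical per-item table: one row per analyzed item with columns
#   item_type    - "analogy", "sentence completion", "reading comprehension"
#   dif_category - "no DIF", "negligible DIF", "moderate DIF", "large DIF"
#   r2_delta     - effect size (NaN for non-significant items)
items = pd.read_csv("item_level_results.csv")

# Percentage of items in each DIF category by item type (rows sum to 100%).
pct = pd.crosstab(items["item_type"], items["dif_category"], normalize="index") * 100
print(pct.round(0))

# Split the negligible items at the median effect size (about .009 in this study).
negligible = items[items["dif_category"] == "negligible DIF"]
median_es = negligible["r2_delta"].median()
negligible = negligible.assign(
    half=negligible["r2_delta"].apply(lambda v: "neg. high" if v > median_es else "neg. low")
)
print(pd.crosstab(negligible["item_type"], negligible["half"]))
```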
At the same time, only 14% of all the negligible DIF items above the median were reading comprehension items while 43% were sentence completion items. In other words, there were proportionally more sentence completion items close to moderate DIF levels than reading comprehension items. Appendix O presents each item by item type and effect size level in order to demonstrate this distribution across effect size levels.

Overall, from the perspective of those who adapted the items, having only 4 of 38 items classified as "practical DIF" is a positive result. This is a considerably lower percentage of DIF items than has been found in many cross-lingual DIF studies, as noted in the literature review (Chapter 3). As I will argue below, however, these estimations might be a bit conservative in terms of the actual number of items that merit further review by CEATM. There were other items that both received criticism from evaluators and had relatively high effect size values near the .035 cutoff. I return to the issue of effect size categorizations and the reasons why some non-practical DIF items might be problematic in Chapter 6. Next I present the results from the inter-rater reliability estimation, the rank order analysis, and evaluators' predictions about DIF direction.

Inter-Rater Reliability and Rank Order Estimations

The average number of rubrics filled out per evaluator was 17. The most active of the evaluators filled in 31 rubrics while the least active filled in 6 rubrics.[62] Such wide variation in the number of rubrics completed was also reported by Plake (1980). The least active evaluator was on the team of translators who worked on the NST in 2010.

[62] Appendix U charts the number of distinct item issues with each individual test item noted by the evaluators. It is important to keep in mind that in many cases, these should be understood as "alleged" item issues, not necessarily proven issues, as some issues were clearly disputed during group analysis.

As highlighted in the methods chapter, I conducted an analysis of inter-rater reliability using marks from eight evaluators. The evaluators and measures were both considered random. The inter-rater reliability coefficient when I selected "consistency" was .66 with a 95% confidence interval of .473 to .804. The inter-rater reliability coefficient when I selected "absolute agreement" was .66 with a 95% confidence interval of .462 to .796. These modest, positive correlations are indicative of a fair amount of agreement between evaluators. The full matrix with the evaluators' marks used in the statistical analysis can be found in Appendix P. The SPSS output from these analyses for "consistency" can be found in Appendix Q.

Recall that the rank order correlation estimation assessed the relationship between the evaluators' total score for each item and the chi-square difference value for that item. After summing the individual marks for each item, the total item scores ranged from 0 to 16 points per item; the higher the number, the stronger the belief in difference by the evaluators. The mean score for the 38 items was 6.62; the mode was 5; the median was 5.5; the standard deviation was 4.48. Using Spearman's rho in SPSS, the result of the rank order correlation was a significant, positive relationship of .45 (p = .004, significant at the .01 level). The two columns with item scores and their corresponding chi-square difference values used in the analysis are presented in Appendix R. The SPSS output from this analysis is presented in Appendix S.
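The rank order analysis just described pairs each item's summed evaluator score with its chi-square difference value; in Python this reduces to a single call to scipy. The file and column names below are hypothetical stand-ins for the two columns shown in Appendix R.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical two-column file mirroring Appendix R: one row per item with the
# evaluators' summed 0-3 marks and that item's chi-square difference value.
appendix_r = pd.read_csv("item_scores_vs_chisq.csv")

rho, p_value = spearmanr(appendix_r["evaluator_total"], appendix_r["chisq_difference"])
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")   # the study reports .45, p = .004
```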
This modest correlation indicates that as evaluators’ total scores for the items increase, so do the chisquare difference values. These results provide support for a modest correlation between evaluators’ DIF predictions (of difference estimations, not DIF direction) and statistical DIF outcomes. This relationship between the evaluators’ marks and the chi-square difference values is also visible through graphical representation. Appendix T presents the evaluators’ marks in one column next to the chi-square difference values arranged in ascending order. Instead of using item sum scores, to enhance visual representation I simply entered an “X” for each mark of “2” or “3” that the item received from evaluators. Any marks of “somewhat similar” (1 point) for example, were not included as an X in this table. Recall from Chapter 4 that evaluators created an operational definition of what total quantity of evaluator marks indicated “belief in statistical DIF.” It was decided during the group discussion that a total of four marks in any combination of “somewhat different” (2 points) and/or “different” (3 points) would be considered a vote for “probable DIF.” Thus, in Appendix T, each item with four Xs represents a vote for DIF from the evaluators for that item. In essence, four total marks serves as a “cut score” for DIF from the perspective of the evaluators; less than four total marks for any item pair means evaluators (as a group) believed statistical DIF unlikely. Note that as the chi-square difference values ascend in Appendix T, the items with a larger number of evaluator marks tend to cluster in the bottom half of the table. While there are 112 a low number of evaluator marks for some items with high chi-square difference values (e.g. items 20, 28 and 13), for the first sixteen items with low chi-square values, only one of those items has four or more marks (item 7). Eleven of these initial items have a total of 0 or 1 mark and four items have a total of two marks. Looking at the very bottom of the table however, it is also apparent that three of the four practical DIF items did not in fact receive four or more marks from the evaluators. Only item 3 exhibited a high statistical DIF level and received many marks (six) from evaluators as probable DIF. Thus, the positive rank order correlation can not be attributed to the close correspondence between the four practical DIF items and evaluators’ predictions for these particular items but rather to the general tendency for clustering near the bottom. Five of the six items with five or more DIF marks from evaluators are located in the lower half of the table. In total, eight items received four or more marks from the evaluators. Seven of the eight items predicted by evaluators were statistically significant and most were located in the lower part the table in Appendix T with relatively high squared values. The only item predicted to be DIF by evaluators that turned out to be not statistically significant was item 7 (four marks). From the eight items they predicted as DIF, two items received four marks, four items received five marks, and two items received six marks. The eight items predicted as DIF and their effect size values are presented in Table 5-2 below. 
Table 5-2: Evaluator Marks and Statistics for Predicted DIF Items

Item   Evaluators' Marks   χ2 Difference   χ2 Rank Order   Effect Size
7      xxxx                1.318           4               n.s.
15     xxxxx               14.890          17              .008
18     xxxx                15.464          18              .008
25     xxxxx               23.006          26              .016
21     xxxxxx              42.413          30              .024
33     xxxxx               43.427          32              .027
11     xxxxx               49.326          33              .028
3      xxxxxx              111.086         37              .050

Note: Item 7 was not statistically significant, so no effect size is reported for it.

Several of the items that received high marks from evaluators were negligible DIF items that had relatively high effect sizes. Five of their eight predictions had effect size values above the effect size median of .009. For example, item 21 had a .024 effect size and received six marks.[63] Item 11 received five marks and had a .028 effect size. Item 33 received five marks and had a .027 effect size. In other words, several negligible DIF items that were very close to the "cut-off" for moderate DIF (.035) were also marked as probable DIF by evaluators. It seems that evaluators' moderately accurate estimations in the middle to higher part of the effect size order best explain the positive rank order correlation of .45. There were of course outliers in terms of correspondence between the two indicators, which plausibly kept the overall correlation from being higher. For example, item 15 received five marks from evaluators but had a fairly low effect size measure of .008. Items 7 and 18 also demonstrated little correspondence between evaluators' marks and the DIF statistics (many evaluator marks but non-significance or a low effect size value).

[63] Note that the chi-square difference values and r-squared values (effect size) are very closely (though not perfectly) correlated.

Direction of DIF

Despite the reasonable inter-rater reliability and the modest correlation between the evaluators' predicted differences and the statistical differences for the item pairs, the inference that evaluators had a reasonably good understanding of DIF would be tenuous: That is because the data collected from section 2.4 of the item rubrics indicate that evaluators did not correctly predict the "direction of DIF" (which group was favored by differences) on a consistent basis. For the eight items they predicted as DIF, they correctly predicted the direction of DIF only 29% of the time (2 of 7 statistically significant items). The data in Table 5-3, arranged in order of chi-square difference values, highlight this fact. Note the difference between their predictions of direction and the actual DIF direction in columns five and six. Five of the seven items favored the Kyrgyz group. The evaluators were only correct in their predictions with the one practical DIF item (item 3) and with item 21.

Table 5-3: Prediction of DIF Direction for Items Predicted as DIF

Item   Evaluators' Marks   χ2 Difference   Effect Size   Evaluators Predict*   Statistics Favor
7      xxxx                1.318           n.s.
15     xxxxx               14.890          .008          Russian (5)           Kyrgyz
18     xxxx                15.464          .008          Russian (4)           Kyrgyz
25     xxxxx               23.006          .016          Russian (1)           Kyrgyz
21     xxxxxx              42.413          .024          Russian (5)**         Russian
33     xxxxx               43.427          .027          Russian (3)           Kyrgyz
11     xxxxx               49.326          .028          Russian (3)           Kyrgyz
3      xxxxxx              111.086         .050          Russian (3)           Russian

* Numbers in parentheses are the number of votes for DIF direction.
** Item 21 also received 1 vote for favoring Kyrgyz.

Note that for the eight items they predicted as DIF, with the exception of one lone vote, they predicted DIF to favor the Russian group for each item. Thus, their only two correct predictions were when the Russian group was actually favored.
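The direction comparison in Table 5-3 amounts to checking each prediction against the sign of β2 from the uniform DIF model. The short sketch below is illustrative only: the β2 numbers are placeholders chosen to mirror the signs in Table 5-3, not the study's estimates, and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical per-item table: beta2 from the uniform DIF model (positive = favors
# Kyrgyz, negative = favors Russian) and the evaluators' majority direction vote.
items = pd.DataFrame({
    "item": [15, 18, 25, 21, 33, 11, 3],
    "beta2": [0.21, 0.18, 0.15, -0.24, 0.19, 0.22, -0.55],   # placeholder values only
    "predicted": ["Russian"] * 7,
})

items["statistics_favor"] = items["beta2"].apply(lambda b: "Kyrgyz" if b > 0 else "Russian")
items["correct"] = items["predicted"] == items["statistics_favor"]
print(items[["item", "predicted", "statistics_favor", "correct"]])
print(f"Direction accuracy: {items['correct'].mean():.0%}")   # 2 of 7, about 29%
```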
In fact, from the total pool of 38 items assessed, evaluators marked a total of 26 items as "favoring Russian" and only two as "favoring Kyrgyz." One item received a mark of "no advantage" and four items received no marks at all. Of the items that received mixed marks, however, the most marks any item received as "favoring Kyrgyz" was one (items 16 and 21). Table 5-4 below presents a breakdown of the evaluators' marks for all 38 items in response to section 2.4 of the rubric - "which group is advantaged (favored)?"

Table 5-4: Prediction of DIF Direction for All Items

Direction        Total   Items (by number of direction marks received)
Favors Russian   26      1 mark: 22, 27, 29
                         2 marks: 2, 4, 9, 12, 13, 17, 24, 25, 28, 31, 32, 36
                         3 marks: 7, 10, 11, 23, 26, 33
                         4 marks: 3, 18, 19, 38
                         5 marks: 15
Favors Kyrgyz    2       1 mark: 5, 30
Mixed Vote       4       8 (1R, 1 N/A); 16 (1R, 1K); 21 (5R, 1K); 35 (1R, 1 N/A)
No Advantage     1       37
No Estimation    5       14, 20, 34, 39, 40

From Table 5-3 it is clear that the number of marks in section 2.4 (which side is favored) was sometimes less than the total number of marks for DIF in section 2.1. It would appear that in some cases evaluators were a bit more confident that there were differences in items than they were in which group might be advantaged by those differences. However, there were also cases when items received very few "different" marks in section 2.1 of the rubric but several marks for "favoring the Russian group" in section 2.4. For example, only three evaluators marked item 19 as "somewhat different" or "different," yet four total evaluators marked it as favoring the Russian group. Items 9 and 28 received no marks for "somewhat different" or "different" but still received two marks per item as "favoring Russian." Of course there is no reason why evaluators could not have selected the category "somewhat similar" and also selected a group to be favored.

As the majority of items marked "favoring Russian" received only 1, 2, or 3 total marks as such, I re-examined each individual rubric to check whether this result might be a function of the dispositions of certain evaluators. I discovered this not to be the case, as there was a roughly equal distribution of "favoring Russian" marks across all evaluators. In terms of the four practical DIF items, three of these items advantaged the Russian group and the evaluators got all three of these predictions correct. Items 3 and 19 received four marks in favor of the Russian group, while item 13 received two marks in favor of Russian. Item 32, which advantaged the Kyrgyz group, was not predicted to be a DIF item but still received two marks as "favoring Russian."

This apparent lack of accuracy (overall) in predicting DIF direction was similar to results from Plake (1980) as well as Engelhard et al. (1990) with black and white group differences. Plake (1980) found that the raters identified twice as much DIF as the statistical procedures yielded. In this study, evaluators also identified two times more DIF than the DIF statistics indicated. In Plake's study, one third of the items favored the direction opposite to that predicted by the raters, while in this study the evaluators (while accuracy rates differed by item type) were overall only 52% accurate when including all their predictions in the analysis (including negligible DIF items) and only 29% accurate for those items they predicted as DIF. At the same time, these results contrast with Gierl and Khaliq's (2001) study of cross-lingual DIF that found Canadian evaluators to have better than random prediction rates for DIF direction for French and English versions of mathematics and science items.
Methodological approaches are perhaps important in understanding the accuracy of their substantive review, however (Ercikan, 2002). In the Gierl and Khaliq (2001) study, the evaluators had knowledge of the statistical data and they set out to classify DIF direction on item pairs they knew had been flagged as DIF. Perhaps it was therefore a bit easier for them to estimate DIF direction than in the Kyrgyz situation, where evaluators had no knowledge about statistical DIF beforehand. The larger point is that if evaluators cannot accurately predict who is advantaged by differences in the two versions of an item, it is difficult to determine how well they actually understood alleged item differences, regardless of the inter-rater reliability and rank order outcomes. It also underscores the difficulty of the task that substantive committees face in item analysis in general. I now turn to a presentation of the data by item type.

Reading Comprehension Items

Analysis of the reading comprehension items entailed the analysis of a reading text (195 lines in Kyrgyz, 165 lines in Russian) in addition to 10 individual item pairs, item numbers 31-40. Nine of the reading comprehension items were classified as negligible DIF and one, item 32, was moderate DIF. As noted at the beginning of the chapter, 70% of all the reading comprehension items had effect size values lower than .009, the median effect size value of all the significant DIF items. Six reading items had chi-square difference values less than 10.30 and effect size values at .006 or less. In the rank order of negligible DIF items from lowest to highest, these items occupied the 1st, 2nd, 3rd, 7th, 8th, 10th, 14th, 17th, and 26th places respectively (see Appendix O). Only one negligible DIF reading item, 33, had a relatively high effect size value at .027. It also received five marks as "DIF" from the evaluators.

Overall, as can be seen from the rubrics in Appendix W, these items generated the least discussion in comparison to the sentence completion and analogy items. The reading comprehension items had the lowest average number of distinct issues per item at 1.5, and received the lowest average number of evaluator marks for DIF per item at 1.6. Eight items received 0, 1 or 2 total marks as DIF from evaluators. The highest numbers of marks for any reading items were five (item 33) and three (item 38). The average effect size value for the statistically significant items was .0129. The most commonly noted issue for the reading comprehension items was format mistakes in the Kyrgyz language (noted 5 times).

In Table 5-5 below, the reading comprehension items are presented in order of ascending chi-square values. "Marks" indicates DIF votes from evaluators, while "Predicted" indicates which group evaluators believed the item favored. The numbers in parentheses indicate the number of evaluators who voted for the predicted DIF direction. The last column, "Statistics," indicates the statistical direction of DIF.

Table 5-5: Statistically Significant Reading Comprehension Items

Item   Marks   R²Δ    DIF Category   Predicted            Statistics
39     0       .003   negligible     None                 Kyrgyz
35     X       .003   negligible     No Adv. (1), R (1)   Russian
36     XX      .003   negligible     Russian (2)          Kyrgyz
31     XX      .004   negligible     Russian (3)          Russian
34     X       .006   negligible     No Est.              Russian
40     0       .006   negligible     None                 Russian
37     0       .009   negligible     No Adv. (1)          Kyrgyz
38     XXX     .011   negligible     Russian (4)          Russian
33     XXXXX   .027   negligible     Russian (3)          Kyrgyz
32     XX      .057   moderate       Russian (2)          Kyrgyz
Five reading comprehension items favored the Russian group and five items favored the Kyrgyz group. In the six cases in which evaluators predicted an advantage,[64] only three predictions were correct (50% accuracy). The conversation around the reading comprehension items was perhaps tempered by the nature of the task. Evaluators had not only to read and analyze the items, but to compare the texts as well. For a full list of comments about the reading comprehension text, see the last page of Appendix W. The analysis of reading comprehension items generated commentary about issues of adaptability in general but few strongly supported and highly agreed upon hypotheses about problems with specific items. This was not the case for the sentence completion items, to which I now turn.

[64] "Predicted advantage" includes only items where there was at least one prediction but no cases of "split decisions" (one vote for one group, another vote for the other group).

Sentence Completion Items

No sentence completion items were classified as moderate or high DIF. Two items, 24 and 29, were non-DIF items. As noted above, however, six of the eight statistically significant sentence completion items had effect size values higher than the median effect size value of .009. One item had a value of .016, two items had values of .019, one of .024, and one of .029, all somewhat close to the negligible-moderate DIF border at .035. In the rank order of the twenty-eight negligible DIF items by effect size value, these items occupy the 4th, 20th, 21st, 22nd, 23rd, 24th and 28th highest positions respectively. This can be seen from the rank order of values in Appendix O. The item data for the sentence completion items are presented below in Table 5-6.

Table 5-6: Statistically Significant Sentence Completion Items

Item   Marks    R²Δ    DIF Category   Predicted      Statistics
27     X        .003   negligible     No Est.        Kyrgyz
30     X        .004   negligible     Kyrgyz (1)     Kyrgyz
22     XX       .013   negligible     Russian (1)    Russian
25     XXXXX    .016   negligible     Russian (2)    Kyrgyz
26     XX       .019   negligible     Russian (3)    Russian
23     XXX      .019   negligible     Russian (3)    Russian
21     XXXXXX   .024   negligible     Russian (5)*   Russian
28     0        .029   negligible     Russian (2)    Kyrgyz

* There was also one vote for favoring Kyrgyz for this item.

One item received six marks from the evaluators, one item received five marks and one item received three marks. The average number of evaluator marks for DIF per item for the sentence completion items was 2.5. The average number of issues per item was 3.1, twice that of the reading comprehension items. The average effect size value was .0159, higher than the reading comprehension value of .0129. Interestingly, compared to the other item types, evaluators correctly predicted the direction of DIF at a greater than random rate for the sentence completion items (5 of 7 times, or 71% correct). The most commonly cited problem for these items was the lack of syntactic equivalence between the Russian and Kyrgyz items, which made these items difficult for the Kyrgyz group (more below). As will be seen in the individual item analysis section, no items generated more discussion than the sentence completion items, especially items 21, 23, 25, 26 and 28.

Analogy Items

There were almost twice as many analogy items (18) examined as reading comprehension (10) and sentence completion items (10).
Fourteen of the 18 analogy items were statistically significant; 11 were negligible DIF, 3 items were practical DIF, and 4 items were non-DIF. Unlike the sentence completion items, which tended to cluster at the higher end of effect size values, the effect size values for these items were spread relatively evenly across the whole range. For example, in the lower range there was one item at .004, one at .006, two items at .008, and one each at .009, .010, and .011. There were also several middle-level values as well as two negligible items with very high effect size values, item 11 (.029) and item 16 (.031). The three practical DIF analogy items, of course, had effect size measures over .035. The average effect size value was .0225, making it the highest effect size average of the three item types. The average number of evaluator marks for DIF was 2.6. The average number of distinct issues per item was 2.0, placing it between the other two item types. The dispersion of evaluators' marks was wide, with four items receiving only zero or one mark for DIF while three items received marks of two. Item 3 received the most marks at six. Two other items received five marks, one item received four marks, and three items received three marks. Table 5-7 presents the data from the analysis of the analogy items.

Table 5-7: Statistically Significant Analogy Items

Item   Marks    R²Δ     DIF Category   Predicted          Statistics
14     0        0.004   negligible     No est.            Russian
12     XX       0.006   negligible     Russian (2)        Kyrgyz
18     XXXX     0.008   negligible     Russian (4)        Kyrgyz
15     XXXXX    0.008   negligible     Russian (5)        Kyrgyz
10     XXX      0.009   negligible     Russian (3)        Russian
8      X        0.010   negligible     Rus (1), Kyr (1)   Kyrgyz
4      XX       0.011   negligible     Russian (1)        Kyrgyz
20     0        0.015   negligible     No est.            Kyrgyz
5      XXX      0.015   negligible     Kyrgyz (1)         Kyrgyz
11     XXXXX    0.028   negligible     Russian (3)        Kyrgyz
16     XX       0.031   negligible     Rus (1), Kyr (1)   Kyrgyz
19     XXX      0.048   moderate       Russian (4)        Russian
3      XXXXXX   0.050   moderate       Russian (3)        Russian
13     X        0.072   high           Russian (2)        Russian

While the direction of DIF for the three practical DIF analogies was predicted correctly, overall the evaluators correctly predicted the direction of analogy DIF 40% of the time (4 of 10 predictions correct for which there were no split estimates). The negligible DIF analogy items tended to favor Kyrgyz (9 of 11 items) while the practical DIF items all favored the Russian group. Overall, the evaluators overwhelmingly selected the Russian group as favored for the analogy items. There was no consistently marked "typical problem" for the analogy items: A wide variety of translation and adaptation, cultural and format issues were noted as problematic. As will be highlighted below, many predictions about DIF for analogy items did not come to fruition. Sometimes issues like loan word use or social-cultural issues were projected to cause DIF on analogy items but did not. However, plausible causes were identified for the three items flagged for DIF. For these three items, mistakes in "key location of meaning" such as answer keys (3, 13) and item stems (19) all plausibly led to DIF (Ercikan, 2002).
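The item-type comparisons drawn from Tables 5-5 through 5-7, and summarized below in Table 5-8, could be tabulated with a simple group-by over the per-item records. The pandas sketch that follows is illustrative only; the file name and column names are hypothetical, not the study's actual variables.

```python
import pandas as pd

# Hypothetical per-item table combining the statistics and rubric tallies, with columns
# item_type, r2_delta, n_dif_marks, n_issues, and direction_correct (True/False).
items = pd.read_csv("item_level_results.csv")

summary = items.groupby("item_type").agg(
    n_items=("item_type", "size"),
    avg_effect_size=("r2_delta", "mean"),
    avg_dif_marks=("n_dif_marks", "mean"),
    avg_issues=("n_issues", "mean"),
    pct_direction_correct=("direction_correct", "mean"),
)
print(summary.round(3))
```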
Overall, the reading comprehension items had the lowest effect size measures, the lowest number of distinct issues identified by evaluators, and the lowest average number of DIF marks from evaluators. They also generated the least amount of discussion. Only one item was practical DIF (32), though one other item was very close to moderate DIF (33). Evaluators were not, however, able to offer an explanation for DIF for item 32. The analogy items demonstrated the most variation, both in terms of evaluator marks and the DIF statistics across the statistical distributions (effect sizes). They also had the highest average effect size values. There were no practical DIF sentence completion items, but many of these items were concentrated near the moderate DIF cut-off of .035. These items also received the second most marks for DIF on average (2.5, while analogies received 2.6), the highest number of distinct issues per item (3.1), and generated the most discussion. Table 5-8 below presents summary information about the evaluators' marks by item type.

These results by item type are consistent with other cross-lingual DIF studies of verbal items. For example, Angoff and Cook (1988) argued that longer texts (reading comprehension) allow for the flexibility necessary in item adaptation to more accurately convey meaning. Indeed, if there are inherent differences between languages, they are perhaps less constraining when longer texts are involved.

Table 5-8: Summary of Evaluators' Marks by Item Type

Item Type                      Avg. Number of Issues Per Item   Avg. Number of Marks for DIF   % Correct DIF Direction
Analogy (n = 18)               2.0                              2.6                            40%
Sentence Completion (n = 10)   3.1                              2.5                            71%
Reading (n = 10)               1.5                              1.6                            50%
Data source                    all items                        sig. items                     sig. items

As in this study, other researchers have also found greater DIF in analogy items and less in reading comprehension items. In this study, three of the four practical DIF items were analogy items. Beller (1995) and Gafni and Canaan-Yehishafat (1993) found greater DIF in analogy items than in reading passages. Using data from the Israeli Psychometric Entrance Test (PET), Allalouf, Hambleton and Sireci (1999) also concluded that analogies seemed most problematic, with 65% of items demonstrating DIF. Reading comprehension items showed the smallest amount of DIF in their research. In this study, reading comprehension items were also concentrated in the lowest range of effect size values. In the final section of this chapter, I turn to the individual item analyses with a focus on specific sources of DIF in Russian and Kyrgyz language items.

Sources of Difference

The second goal of this study was to determine the source (cause) of DIF and the specific challenges to item adaptation from Russian into Kyrgyz. As presented in the methods chapter, data from each item pair were collected on the item analysis rubrics completed by each evaluator and from the group discussion. Though only eight total items were predicted as DIF, there was at least one distinct issue or problem noted with all but 2 of the 38 items (items 20 and 39). Of course the identification of an issue or problem does not mean that the issue was in fact widely agreed upon or correctly identified and characterized. In Appendix W the reader can get a sense of just how much agreement there was on any particular issue raised.[65] Further, most of the issues or problems did not lead to DIF, as indicated by the overall low number of DIF items. Nonetheless, test developers need to consider the evaluators' full spectrum of comments on item quality because their comments can assist evaluators in improving the quality of their work.

[65] While the marks under 2.1, which were used for inter-rater reliability and rank order scoring, came from only eight evaluators, the commentary under section 2.3 contains the comments from all ten reviewers (Appendix W).

Recall that Engelhard et al.
(1999) carried out a study in which evaluators tried to locate technical and cultural mistakes in items. The most accurate reviewer was 94% accurate while the least accurate was 83%. Thus, there is some evidence to believe that whatever their accuracy in DIF prediction, evaluators can be reasonably accurate in identifying mistakes in substantive review. According to the analysis of the 157 individually completed rubrics, there were 82 distinct issues raised with the 39 items: 53 related to adaptation/translation issues, 17 related to Kyrgyz grammar, 8 related to item format, and 4 related to socio-demographic or cultural issues. The number of distinct issues per item ranged from zero issues (two items) to five issues (one item). Ten items were marked as having one issue, 11 items had two distinct issues, 11 items had three distinct issues, and 3 items had four distinct issues (see Appendix U). The average number of issues per item was 2.15. Eight of the 36 items with comments received no suggestions for how to improve the items (section 2.5) while the remaining 28 items received suggestions for how to make the item pairs more equivalent. All 17 grammar issues were related to Kyrgyz grammar. Not a single issue was raised with Russian grammar for any of the 38 items. Further, despite the potential impact of background factors (e.g. cultural or curricular) to impact test results, except for the four issues raised related to socio-demographic/cultural issues, commentary and item discussion focused on overt language and format issues between the two versions of the items examined. This is perhaps explained by several factors. First, the NST is an aptitude test, not directly tied to specific school curricula. Second, as a test from within a single country the NST developers likely did a reasonably good job of considering variation of conditions, culture, and content across groups. Third, as will be seen below, many issues related to the quality of the Kyrgyz 126 items kept attention and debate squarely on the more overt item characteristics of that language group. As will be demonstrated, the few hypotheses generated about background factors that might have led to DIF were not tenable as 3 of the 4 practical DIF items had overt technical flaws due to poor adaptation or typographical mistakes. In the following section I break down the data by sources of difference identified by item evaluators. In some areas of concern like poor translation, many pairs of items exhibited similar issues or elicited similar debate. In those cases, two or three items are presented below as examples of recurring themes and additional examples are referenced in the summary rubric. Parts of conversations that seemed to be especially insightful are presented in the text as quotations from evaluators. In order to facilitate a coherent presentation of results, I developed the following system of references to the individual item rubrics and group analyses. I reference the two data sources as either “IA” for data coming from individual analysis rubrics and “GA” for data coming from the group analysis. For example, IA12 indicates that the data came from the individual analyses from item 12. GA33 indicates that the data came from the group analysis of item number 33. 
In order to avoid confusion I have kept the original answer key and distractor names (letters of the alphabet) in the Russian style as presented in the summary data: (A), (б), (B) and (г), (A, B, V, and G in English), which is similar to the American style of distractor labeling (A), (B), (C) and (D). As noted in Chapter 4, each item has four possible answer choices. The term “item stem” is used to denote the prompt or question, and “answer key” to denote the correct answer choice while “distractor” denotes any one of the three incorrect choices. Item evaluators are referred to by two initials - MD, CJ, AB, etc. 127 Translation and Adaptation Issues In the western literature, the term item adaptation is generally preferred to translation because of its connotation of flexibility in conveying meaning (Sireci & Allalouf, 2003). Adaptation implies that as long as the essential meaning, nuance, and difficulty level is kept intact and conveyed, words and phrases appearing in the source language may be changed as necessary for the target linguistic/cultural group. This use of adaptation is intentional - as a way to distinguish it from translation which implies a more literal, word for word approach (Hambleton, 2005). In this study however, I subsume translation under the larger umbrella of adaptation as I do not believe the issue is “either – or.” Usually, flexibility is needed to make sure that nuance is accounted for in the target language version. However, as the data in this study will reveal, literal translation is sometimes more appropriate than misguided or overly creative attempts at adaptation. Therefore, I present issues of adaption and translation together and distinguish the nuances in the application of the terms as necessary in the context of each item under discussion. A myriad of adaptation and translation issues arose during item analysis. If an item was adapted but not directly translated, evaluators tended to make a note of it as potentially problematic. In the ensuing group analysis, it was then debated whether or not the adaptation was appropriately done. Sometimes, evaluators argued that flexible adaptations were necessary while at other times they claimed that such adaptations were problematic. In analogy item 2 for example, the Russian stem was given as “chef: borscht.” The logical relationship was “maker, preparer: something made/prepared by him/her.” The evaluators noted the literal translation in the two versions did not correspond. In the Kyrgyz stem, the second word given in the pair was “шорпо” (broth), which in Kyrgyz can have a wide meaning and imply not only “something 128 liquid” but also “something eaten as a first course.” In the Russian pair, “борщ” (borscht) is the name of a particular kind of soup (IA2). The evaluators noted: MD: I think we agree that the words utilized in the analogy stem are not strictly equivalent; however, there is disagreement as to whether or not this lack of equivalence should be considered a serious enough difference to estimate a lack of equivalence in outcomes. KM: Yes, they are different, but I don’t think the differences affect the relationship of the words in the analogy pair (GA2). 
The evaluators agreed with KM as three of them marked the item as “identical” while four of them marked it as “somewhat similar.” Their rationale was that although the second words in the item pairs were different, this did not impact essential meaning as the primary relationship was still “chef: something prepared by a chef” in both versions of the item. In fact, this item displayed no statistical DIF with a chi-square difference value well below the test statistic of 3.841 at .733. In analogy item 5 the item was adapted (not directly translated) but the relationships in the two different versions were maintained. It was noted that in distractor (A), the Kyrgyz pair of words was “бут: из” (leg (foot): track) while the Russian version was “палец: отпечаток” (finger: fingerprint) (IA5). This time both the first and second words in the pairs were different. Evaluators were more divided over this item as three of them marked the versions as either “somewhat different” or “different” while four of them marked it as “identical.” However, this item was also not a DIF item but displayed negligible DIF with an effect size of .015 (greater than the median effect size). There were other examples of analogy item adaptation that did not result in changes in the relationships between word pairs. In item 8, one Russian distractor was “ladle: pour” while the Kyrgyz version was “bucket: pour.” The item received only one vote as DIF from evaluators and was in fact negligible DIF. In both language versions, this particular 129 distractor was the least popular: It was selected by 7% of the Russian group and 1% of the Kyrgyz group. In distractor (B) of item 9, the Russian version was “mокрый: cушить” (wet: to dry) while the Kyrgyz version was “суу: кургатуу” (water: to dry). There were differences of opinion about the appropriateness of this adaptation. One evaluator noted, “If these words were used in context (in a sentence) then it would be okay. For example – ‘I was in the rain and got wet.’ However, when no context is given, this is a problem and literal translation is necessary” (IA9). The concern expressed was a minority opinion however, as no evaluators marked “somewhat different” or “different” and four evaluators marked it as “somewhat similar.” The difference in words in fact did not impact examinees as this was a non-DIF item with almost perfect one to one correspondence. The item had the lowest chi-square difference of all 38 items. Not all adaptations of analogy items maintained essential meaning however. The evaluators noted cases of poor adaptation and outright translation mistakes. In item 10 the second word in distractor (B) of the Kyrgyz version was mistranslated from the Russian version and resulted in the Kyrgyz word having the opposite meaning than was intended. The word used in the Russian item was “Ярче” (brighter) but the Kyrgyz distractor (B) was translated incorrectly as “карарaak” (darker). The Kyrgyz word for ‘brighter’ that was needed was “ачыгыраак” (IA10). Three evaluators marked this item as DIF but the DIF level was negligible at effect size .009. The mistake was located in the distractor that was least attractive to both the Russian and Kyrgyz group, which might explain why the mistake did not seem to have an impact on item responses. 130 Five evaluators believed that multiple translation errors on item 11 would lead to DIF. And, this item had an effect size of .028, putting it right at the verge of moderate DIF (recall the cutoff of .035). 
Two of the Kyrgyz distractors - “г” and “б” - had translation problems that changed the meaning of the distractors. One evaluator noted that the mistakes made the item difficult and confusing for the Kyrgyz group (IA11). During group analysis however, a different evaluator claimed that distractor б (which was not the answer key) was an attractive distractor in the Russian version, but not in the Kyrgyz version (GA11). Nothing was found to be wrong with the Russian version. Interestingly, while three evaluators marked the item as favoring the Russian group, the item in fact favored the Kyrgyz examinees. Because one evaluator noted that the number of attractive distractors in the Russian version was greater than in the Kyrgyz version, it seems plausible that mistake in the distractor of one language version could perhaps have made the odds of correct selection greater by reducing the total number of viable distractors for this group, assuming of course that the mistake was obvious to examinees. Item 18 had translation problems in distractors (б) and (г) and a comment that one of the words in the Kyrgyz answer key pair was “used in simple speech” not literary language. According to evaluators: In (б), “шашма” (hurried), does not correspond to the Russian version “откровенный” (open) and “шамдагай” (dexterous) is not the same as “болтливый” (talkative). In other words, neither word in this pair corresponds well to the pair in the other language. This distractor does not work. (г) also has an incorrect adaptation as “колдойгон”k. (attack) which is used in simple speech, not as a literary term. Further, the meaning of the pair of words in Kyrgyz does not correspond well to the meaning of the words in Russian” (IA18). And during group analysis: ZS: There is incorrect, inaccurate translation in several of the distractors in this item and the use of the incorrect meaning of some words. NO: the problem is related to the specifics and nuances of the Kyrgyz language. The thing is, some words can be used only 131 in combination with each other; in certain contexts they can’t be used individually. Therefore, this issue is poor adaptation. RM: Yes, the problem is that some words must be used in combination. MD: The use of some words out of context makes them impossible to understand, they can not be used individually, so the problem is adaptation. (GA18). Four evaluators believed that item 18 would be a DIF item. In fact, item 18 was categorized as negligible DIF with an effect size value of .008, right at the median effect size level. As with item 11, item 18 also favored Kyrgyz respondents, the group in which these mistakes with word combinations were allegedly occurring. However, as in item 11, the four votes cast for DIF direction were for “favoring Russian” (IA18). This pattern of identifying mistakes in the Kyrgyz version and voting for “favoring Russian” occurred frequently, especially with the analogy items. As presented in the beginning of this chapter, 9 of the 11 “negligible DIF” analogy items actually favored the Kyrgyz group but all were predicted by evaluators as “favoring Russian.” This was not the case with the sentence completion items for which prediction rates for DIF direction were better. Another important adaptation issue that arose several times was the use of Russian loan 66 words in Kyrgyz versions of the items. For example, on item 4 most evaluators noted that a commonly known Russian loan word should have been retained in one of the distractors (IA4). 
Instead, the Russian word “шарф” (scarf) was adapted into the Kyrgyz “моюн жоолук” (lit. neck wrap) (IA4). There was consensus on this point: ZS: I think foreign words should stay in their original form. AA: I agree; if there are no commonly used equivalents for foreign words, use the commonly used version. MK: It is best to use active, commonly used, words (GA4). 66 The term cognate means a word that is the same in several languages (i.e. shares the same origin). I use the term loan word here to emphasize that the most cognates were introduced through Russian in the 20th century. 132 MD: the problem here seems to be a too literal translation; sometimes there is no reason to translate. NO: Actually, I think there is a Kyrgyz equivalent to “шарф” (scarf) but it is not used very often. MD: Well… how can we say what “often” is – how do we know this? (GA4) While recommending the maintenance of a loan word for this item, the evaluators were divided about whether this was a DIF item or not. Two marked it as “somewhat different” with another four marking it as “identical” and two as “somewhat similar.” In fact, it was a negligible DIF item favoring the Kyrgyz group. The Kyrgyz stem of item 24 also contained an adaptation of a Russian word that several evaluators noted should have been left in the Russian original (IA24). In a different discussion just a few minutes later, evaluators made the opposite recommendation in regard to loan word use. Six of the evaluators noted that several Russian loan words in item 7 would not be understood by Kyrgyz speakers (IA7) and four of them marked the item as “somewhat different.” The Russian words “терапевт” (therapist), “слесарь" (metalworker), “Адвокат” (advocate) were all identified as problematic, especially for rural (Kyrgyz) students. An alternative Kyrgyz word was proposed for ‘metal worker’ - “темир уста”, which means literally “мастер по железо” (master of iron) in Russian (IA7). Item 7 however, showed no significant DIF. In item 2 the Kyrgyz version of distractor (Г) also employed a loan word from Russian, “деталь” (detail). Four evaluators proposed that the Kyrgyz equivalent “тетик” (detail) be used instead. It too, however, showed no indication of DIF with a chi-square difference value well below the test statistic. It would appear that in general, the Kyrgyz examinees were not troubled by the Russian loan words in most cases. However, it would be incorrect to say that the evaluators strongly believed that they would be; based on their marks, they were divided or had mixed feelings about how loan word use would or would not impact item response patterns. 133 Sometimes they felt they should be used, and sometimes not, depending on the item under evaluation. In any event, there is no evidence that the use of loan words on Kyrgyz versions caused DIF on any of the items analyzed with the possible exception of sentence completion item 23 (discussed below). Group analysis of item 23 raised the question of how to deal with new words and concepts and the fact that their incorporation into the Russian and Kyrgyz languages (from other foreign languages) might proceed at an unequal pace, especially considering the demographic distributions of the two populations within the KR. One of the Kyrgyz words utilized, “камсыздандыруу” (insurance, provision) in the item stem “has no meaning in Kyrgyz” in the context of this item according to one evaluator because the concept of “insurance” is unknown (IA23). 
According to another, the word is technically correct, but only understood by a small number of specialists. Yet another opinion was that the Kyrgyz word “камсыздандырылган” (to be guaranteed) might fit the item but that its meaning has a wider connotation than the Russian word utilized in the item, “застрахование” (insured). It was generally agreed that the item had an urban (pro-Russian) bias and three evaluators believed it would be a DIF item favoring the Russian group. While not a practical DIF item, in fact this item did have a relatively high level of negligible DIF at .019 and the Russian examinees were favored. In the discussion of whether the word for “insurance” was known or unknown by Kyrgyz examinees, evaluator ZS noted: “… Many new terms are constantly being formed all the time in Kyrgyz while in Russian the concepts are well known. For example, in Kyrgyz there are four or five completely different ways to say “entertainment center.” People do not know which is correct at this point. Therefore, Kyrgyz people often use Russian loan words. Some use Kyrgyz words that are not known. As teachers we see this on a regular basis. Many words are ‘created’ but not yet well known. In this item not everyone knows how to say “uninsured,” especially in rural areas where there is no such thing as insurance” (GA23). 134 Despite the lack of strong evidence for differential impact of loan/new words on item response patterns overall, the discussions of the above items demonstrate the difficulty of the loan word/new word issue in standardized testing situations. In GA4 evaluator MD raised the core issue: Given the demographic diversity of the republic, how does an item developer get a handle on the “commonality” of a given loan word? Or, as the evaluators often stated, the “activeness” of a given word. The evaluators noted that there were differences in the extent to which Kyrgyz speakers lived and interacted daily with Russian speakers and thus differences in the extent to which they would be exposed to loan words and new words. Recall that the examinee sample in this study comes from a broad, representative slice of the population, including the capital city of Bishkek. This may mean that a moderate or large part of the Kyrgyz sample is relatively well acquainted with Russian loan words which might explain the lack of DIF on the items presented above. That is, there may be some Kyrgyz speakers who are penalized by the use of Russian loan words but this doesn’t show up in the overall statistics because they are small proportion of the examinees in the sample. One way to get a better understanding of this issue would be to conduct experimental DIF studies with two Kyrgyz language groups – one from an ethnically mixed area and one from a more isolated area where Russian is not well known or had less penetration historically. 67 I will return to this issue in the discussion in the next chapter. Another challenge for evaluators was rectifying the multiple meanings of individual words in many of the items. Apparently, there were cases where a word had a clear, singular meaning in either Russian or Kyrgyz but several meanings in the other language, thus 67 On the other hand, many non-Russian speaking Kyrgyz would still likely know many loan words that found their way into Kyrgyz usage decades ago and, in a sense, “became Kyrgyz words,” regardless of their knowledge of Russian. 135 complicating the logic/meaning of certain items for one group. 
This finding of multiple meanings as a potential DIF cause is consistent with other DIF studies of verbal test items (Gierl & Khaliq, 2001; Sireci & Allalouf, 2003). Evaluators speculated that examinees would be confused on analogy item 3 as they wouldn’t know which of the meanings in Kyrgyz was needed to solve the item: MK: There are many problems with this item, especially with the item distractors. The first problem I see is confusion in distractor (A) because of the translation of the Russian “сад: яблоня” (orchard: apple trees) into Kyrgyz is incorrect. The given Kyrgyz version is – “бак: алма” (tree: apple). NO: Yes, but in Kyrgyz “бак” can mean tree or orchard. MK: OK, but we must consider that the Russian variant “сад” (orchard) is only fruit garden, not trees - that is the problem. A better analogy might thus be “tree: apple” – not “orchard: apple trees.” In other words, “from what/where” (material) comes (GA3). MD: I agree, “бак”k. (tree) is “сад”r. (orchard) and “дерево”r. (tree). The word “бакча” k. is “огород”r. (vegetable garden). I think a problem arises in analogies when the Kyrgyz words have many different meanings, and these same words in Russian have only one meaning. I do not know how much this affects overall results but this is true. Again, the problem is the use of multiple meaning and uncommon words in the Kyrgyz language when in the Russian language they have only one meaning (GA3). Item 3 was in fact a practical DIF item and it received six DIF marks from the evaluators. However, there was also a serious typographical error in the answer key (discussed below under format) that most believed was the cause of DIF because there was no viable answer key in the Kyrgyz version (GA3). Multiple issues within the same item occurred often in the Kyrgyz items which made disentangling potential sources of DIF challenging. Analogy item 13 was another example of how multiple meanings might cause DIF. Item 13 was also a practical DIF item favoring the Russian group though only one evaluator initially predicted DIF. During the group analysis however, it became apparent that many believed this to be a DIF item (GA13). The analogy item stem was “television: watch.” The answer key (б) was “automobile: go.” The main problem was the second word in the Kyrgyz key - “жүрүү” – 136 which in Kyrgyz means “to go.” However, depending on the combination of words used with this word, it can mean to go by foot or by car. The Russian version, in contrast, employed the word “ездить” which means going only by some form of transportation: Russian has a different word for going on foot. Thus, the essential relationship between words that was crucial for the analogy to work was clearer in the Russian version. As the evaluators noted: RM: the problem is that the distractors are not good. MD: Yes, maybe the main problem is in answer key (б). In my comments, I wrote that “жүрүү” k. (go) is different from “ездить” r. (go by transport) because “жүрүү” can be walking by foot or going by car while “ездить” means going by transportation. In Kyrgyz perhaps “айдоо” (drive) would be a better choice for the pair because it has meaning like the Russian “ездить.” … If they used “айдоо,” (drive) they will get it quickly… I think this is an issue of translation – it is a good item but the direct translation is incorrect. Many of us thought for a very long time about what the correct answer here was to this item… (GA13). Analogy item 19 was another DIF item favoring the Russian group. 
Initially, three evaluators marked it as DIF on their individual analyses. During the group analysis however, there was considerable discussion about a serious problem with the item stem. The Russian item stem would roughly be the English equivalent of “hot: recoil/jerk back quickly.” Apparently, the second word in the Kyrgyz stem pair “тартып алуу” has two meanings – “take away” and “pull” (IA19). The second word in the Russian stem, “отдёрнуть”r. (recoil/jerk back quickly), has only one meaning. In the Kyrgyz stem, in combination with the first word in the pair “ысык” (hot), the second word, could be understood as “attract or “pull warmth,” which implies attraction, not repulsion, the opposite of the what the Russian item stem implied with “recoil” (IA19, GA19). RM: there are multiple meanings of some words in the item stem; there needs to be a more careful selection of pairs of words – otherwise, the item misleads and it becomes impossible to find the correct answer. ... NO: I agree, depending on how they define the terms in the stem they could come to complete opposite meanings of the analogy... MD: Yes, the stem needs to be more clearly defined (contain no double meanings). MK: Absolutely, the stem and distractors should have only one meaningful interpretation (GA19). 137 While the above practical DIF items (3,13,19) all had issues with multiple meaning, it is important to highlight that in all three cases, the core problems were with either the answer key (3,13) or the item stem (19), not other distractors. Thus, the DIF in these cases seems to be as much about “location of the problem” or the extent to which the overall meaning of an item becomes confused as much as it is about “multiple meanings” in general. This finding is consistent with Ercikan (2002) and underscores how only through an examination of the minutia at the item level can DIF analysis be fruitful. I will return to this issue in the discussion chapter. Discussion around sentence completion item 26 demonstrated that due to grammatical issues some Kyrgyz items were difficult to comprehend, even for the evaluators. While the DIF was negligible the effect size value was fairly large at .019 (favors Russian, two DIF marks). On how lack of clarity can complicate understanding, evaluators noted: ZS: “себептүү”k. (due to) in the item stem is not needed. It needs a different affix here. MK: I do not agree…. without this, the item loses the main idea. What will be the correct answer? ZS: This item is confusing, the translation is not clear in several places. MD: Hmmm… It seems that the Russian text allows a “double meaning,” and Kyrgyz only one meaning. However, that meaning (for the Kyrgyz item) leads to a wrong answer. This is due to the way the item is structured. MK: What is the correct answer to the Kyrgyz item? MD: The complication is over the meaning of the word “production” which is quite unclear. ZS: If we can’t find the correct answer, I don’t think the children will either! (GA26). Reading comprehension item 33 elicited discussion as several mistakes were noted. Item 33 was a negligible DIF item but had a relatively high effect size of .027. Five evaluators marked it as a DIF item. There was consensus about a translation mistake in distractor (A) of the Kyrgyz version, though distractor A was the least attractive distractor for both groups. 
Though several evaluators thought “the overall meaning of the text is similar in Russian and Kyrgyz,” the multiple meanings of some words were noted at the end of the Kyrgyz stem (IA33). It was also noted that the Kyrgyz stem contained a sentence that was too long. However, despite the mistakes in the Kyrgyz item, the item favored the Kyrgyz group. Item 33 also elicited some general commentary about the reading comprehension text: ZS: There are many problems with the translation of this difficult text; it is not well adapted. One resolution is to take an original Kyrgyz language text, related closely to this theme, and then select the Russian text because it is difficult to completely pass on the entire meaning and deeply consider the question in a foreign language… I must say that the Russian text is quite good, as are most of the items in Russian. I can’t find any difficult words, grammar mistakes, etc. But syntax issues might explain any differences. MD: It is easy to see a lack of connection due to translation issues. I think that the analytical thinking on the part of the Kyrgyz is different. ZS: I did not find any difficult words or issues with the item itself. Maybe some issues with the form of the sentences (constructions – syntax) in the reading text though. It is clear that the key is (B) but (G) is also an attractive answer (GA33). An important issue that came up during group analysis was the alleged “Russification” of Kyrgyz syntax and linguistic expression in some test items (and the Kyrgyz language in general). This is interesting in light of the historical discussion presented in Chapter 2 about the Russification of Kyrgyz in the 1920s and 1930s. Analogy item 15 is an interesting case of how a Russian (source) item can allegedly influence the adaptation of a Kyrgyz (target) item. According to evaluators, the word employed in distractor (Г), “кубанычсыз” (lit. happiness + the form for “without,” the ending сыз), was “artificially created,” a made-up word. While “кубаныч”k. (happiness) is a word, the addition of the suffix in this case was inappropriate. In the Kyrgyz language “сыз” is often added to nouns to indicate “without.” In theory, adding it here could have seemed like a creative way to convey the meaning of “unhappiness,” which was easily conveyed in the Russian version. Evaluators posited that it was likely created to help “fit” the Kyrgyz item to the Russian version (GA15). At the same time, one evaluator offered an improvement with a different Kyrgyz word choice - “көңгүлсүз” (unhappy). While five evaluators marked this as a DIF item, the item had a relatively low chi-square and effect size measure (.008), making it a “negligible DIF” item. Further, this particular item actually favored the Kyrgyz group: another example of an item with allegedly poor adaptation into Kyrgyz that nonetheless did not seem to be causing DIF in favor of the Russian group. Again, it seems plausible that obviously faulty distractors (a nonsense word in this case) could actually assist the Kyrgyz group by eliminating these particular distractors as viable answer choices. There was considerable discussion about the linguistic “Russification” of the Kyrgyz versions of the sentence completion items. Sentence completion items introduced more complexity into the discussion as stems and distractors got longer and more complicated. In particular, evaluators argued that the “Russian origin” of item 21 was obvious. 
Evaluators maintained that they could tell that the original item was developed by a “Russian thinker” because the Kyrgyz item had a Russian form. The result was that the Kyrgyz version was less authentic and even “artificial.” The main issue was the inappropriate use of the Kyrgyz connector “жана” (and), which most evaluators argued could lead to considerable confusion (IA21). At the same time, several evaluators noted that, incorrect Kyrgyz or not, many Kyrgyz people use this expression in their everyday speech. From the GA21 discussion: MD: I think this item needs to be completely changed as it will not be easy to simply adapt. The main problem is the incorrect use of the Kyrgyz term “жана” (and), which is obviously the result of a direct translation from the Russian sentence. AA: I agree, but the problem is that this is a common usage in Kyrgyz. It’s on the radio, the national TV stations and other official media sources. MD: It is common but it is not correct. AA: I understand... MD: I believe this usage is one of those “Russianisms” that has crept into Kyrgyz through ethnic Kyrgyz who are Russian-language speakers. The main problem is that in Kyrgyz we don’t use “and” as a connector when connecting two different verbs. Two verbs often come together to convey a different meaning than when they are used singly. The two verbs are simply put together, without the use of any connectors. ZS: I agree with MD, villagers don’t use “жана” (and) in this sense – they use Kyrgyz correctly… I think if Kyrgyz original texts had been used, there wouldn’t be this problem… Our syntax is different and should be Kyrgyz – not Russian. MD: Well, theoretically, I agree of course. A big problem is that much of our literature in the sciences and the arts is translated in this way – translated directly from Russian. Much of the Russian influence is inevitable. Little is produced in Kyrgyz due to a lack of specialists and resources. ZS: OK, but what if we had several specialists work on developing the items at the same time and then decided whether they would work or not? MD: To me, this item raises a bigger question from the perspective of the test translators. Should the items contain only language that is 100% correct or contain language that is incorrect but commonly used? Unfortunately, there is often a gap here. The situation and state of the Kyrgyz language is very sad. Further, we have to take into account language as it is used on a daily basis. Many people in the cities – and not only in the cities – speak Kyrgyz with lots of words and forms taken from Russian. Sometimes the language is simply all mixed up. This is a result of the language environment we live in. We combine Russian and Kyrgyz all the time in a sort of hybrid colloquial language. For example, “канча (how many - kyr) листов (sheets of paper - rus)?” or “сиз (you kyr) домой (home rus) барасызбы? (are going? kyr)” – lit. “Are you going home?” There are hundreds of ways we do this. This item raises some big issues. (GA21). Item 21 was predicted as a DIF item by six of the eight evaluators. While it was a negligible DIF item (favoring the Russian group), the effect size level of .024 was very close to “moderate” DIF at .035. Five evaluators correctly predicted that the item favored the Russian group. Along with item 3, these marks represented perhaps the strongest sense of agreement on the part of the evaluators throughout the analysis. 
Discussion around item 21 also captured many of the issues that arose elsewhere in item and group analysis. One issue was that the Kyrgyz language was like a “moving target,” constantly under pressure and evolving, lacking standardization, and thus too easily influenced by Russian-speaking ethnic Kyrgyz who introduced Russian language forms into Kyrgyz, and by the arbitrariness and dispositions of individual translators. Again, historical context presented in Chapter 2 about the “unfinished business” of Kyrgyz language codification and standardization seems relevant to this discussion (Korth, 2005). Another issue was that evaluators began to question the methods of item adaptation from the source language (Russian) to the target language (Kyrgyz) for the sentence completion items. The challenge of the adaptation of complex material also came up in the discussion of sentence completion item 23 (GA23). Like item 21, this item was also negligible DIF (again favoring the Russian group) and the effect size measure was also large at .019. Three evaluators predicted DIF for this item and three evaluators correctly predicted that the item favored the Russian group. Evaluators noted that inherent differences between the two languages made the sentence completion items difficult to successfully adapt (GA23). The fact that Kyrgyz is an agglutinative language allegedly makes certain types of long Russian sentences too complex to be clear in Kyrgyz without significant adaptation. In agglutinative languages, meaning is typically conveyed through the addition of affixes (typically suffixes in Kyrgyz) to nouns, which can indicate possession, number, location, direction, etc. For example, the noun “kiz” (girl) becomes “kizdar” in the plural form. To indicate “to the girls,” the form becomes “kizdarga” (Oruzbaeva, 1997). In this way, words can become quite long but sentences usually remain relatively short. According to the evaluators, long Russian sentences can complicate understanding of complex material in Kyrgyz. Yet, in order to create the logical relations needed to make a sentence completion item work, sentences need to be relatively long, often complex enough to allow three to four “blank spaces” in the sentence where examinees must fill in the needed words to form coherent meaning. Some of the conversation about item 23 led to further recommendations about how to modify the item development process for the sentence completion items. In the words of the evaluators: CJ: I had difficulty reading and understanding the Kyrgyz text; then, I read the Russian text and I understood. I think that with background knowledge (knowing Russian) they might be able to understand some of the meaning. However, Kyrgyz-only speakers will find it confusing. That is, if the Russian concepts are covered first, then one knows what the Kyrgyz authors meant to say. However, the students do not get this advantage because they do not know the Russian version. AA: We (item analysts) have an advantage because we can read both items at the same time! … So, the text is not well adapted, and this makes it difficult to comprehend (GA23). KK: The desire to pass on the main idea of the task was too much, as there is the possibility of losing the literary nuances of Kyrgyz; it is important to consider the characteristics of the language – in Kyrgyz sentences are usually short – as words are “complex” (compounded – i.e. 
agglutinative), and the result of the direct translation is that translated texts (from Russian into Kyrgyz) are longer than usual for Kyrgyz speakers. As they become longer, they become more confusing. And the sentences eventually become even longer than the Russian versions. KK: … I believe that it is possible to find original texts in Kyrgyz and then translate them into Russian - then you will see the richness of differences between the languages. ZS: Yes, the problem is the translation and the adaptation due to stylistic differences… One adaptation suggestion would be to not have the Kyrgyz sentences copy the Russian style but to make them “more Kyrgyz.” This means making the sentences shorter, even if it means more sentences... MD: ZS, I agree with your first point – in Kyrgyz we have “long words” but short sentences. This is important to remember in comparison with Russian (GA 23). With this conversation in mind, I conducted an experiment in which I split the Kyrgyz speakers into two groups and conducted more DIF analyses; I present the results of this experiment in the discussion chapter. The issue of the length of Russian sentences (too long for concise adaptation) came up again in the discussion of the item 24 stem (IA24), and the recommendation to break longer Russian sentences into shorter Kyrgyz ones came up again in GA28: MD: The challenge for test writers is that for some Kyrgyz texts it becomes complicated when we try to repeat the Russian syntax and constructs. It becomes complicated when translation is literal. The best way to keep the Kyrgyz intact is to break the Russian sentences into more sentences rather than trying to capture the Russian structure. In Kyrgyz, ideas are built not through one complex sentence, but through a series (many sentences) with simpler ideas that when compounded, express the same idea (GA 28). Agglutination was not the only alleged challenge to sentence completion item adaptation. Another issue raised was word order. In Kyrgyz, verbs, and hence essential meaning and ideas, come at the end of a sentence. In Russian, they can come before or after the noun. (For example, “Men kizdarga bara jatamin” is roughly “I am going to the girls” in English. Literally, the words are Men (I) kizdarga (kiz/dar/ga = girl/plural form/to) bara jatamin (go). Note both the agglutination, as the single word kizdarga indicates direct object, number, and direction, and the word order, with the compound verb at the end of the sentence.) The evaluators noted: ZS: … We often start to translate from the end of the sentence because the main idea comes last (the verb is at the end). Word order is different in Kyrgyz and Russian, which can also cause complexities. Because of the word order, sometimes I translate the literal sentences first and then rearrange them in order. The strategy here is to read sentences several times or to hear them in Russian first, and then piece together the puzzle… (GA28). Another conversation about item 28 looked at the same issue from a different perspective. One of the test center employees, a Russian-only speaking individual who was observing, asked, “In the Russian version of item 28, the syntax is difficult. Is it difficult in Kyrgyz?” ZS: In general, syntax is easier in Kyrgyz than Russian. We have a straightforward “cause – result.” The structure of Russian is more difficult. MD: I think that there are several levels of structure in Russian. In Kyrgyz, it is “single level” – it is this and this and this. New ideas are “added” while in Russian there is a different structure. In Kyrgyz it is all part of the same syntactical level. ZS: My Kyrgyz students also tell me that the Russian constructions are difficult to learn at first. In Russian you have “Due to the fact …,” “Because of the fact …”; in Kyrgyz, more direct statements… MK: Yes, for example, in Russian you may have … “event/phenomena … which is/that is…” etc. 
In Kyrgyz we have “this happened” (stop) and “that happened” (stop) and then something else. It is all on “one level.” ZS: In general, syntax is easier in Kyrgyz (GA28). From the data generated from the item rubrics and discussions it would appear that there are challenges to adapting the more complex ideas in the short sentence completion items. Interestingly, the evaluators correctly predicted DIF direction (favors Russian) on these items with better than chance accuracy (71%). These items were also concentrated at the upper end of effect size values, close to the moderate DIF cutoff. I will return to a discussion of these items in the final chapter. Socio-Cultural Issues There was also discussion about the potential for cultural, socio-economic and demographic differences to impact item results on some items. This concern was usually expressed in terms like “kids from villages won’t know …” (GA23) or once, “urban kids won’t know…” (IA3). Issues related to contextual knowledge, regional dialect, and interface with Russian speakers were all noted. On occasion, the discussion digressed from looking at item specifics to conversations about how rural kids might be disadvantaged on the test in general. No curricular or instructional issues were noted by evaluators as potentially problematic. One Kyrgyz item that was identified as containing “dialect” was item 25. Of interest here were not the response patterns of Russian and Kyrgyz groups but rather whether different segments of the Kyrgyz population would be differentially impacted by regional differences in the Kyrgyz language. Two of the distractors contained forms of the Kyrgyz word “Пас” (low, down) that were consistently marked as “southern dialect” that “northern kids might not understand” (IA25). Five evaluators believed that this would be a DIF item (perhaps assuming that enough northern Kyrgyz would be penalized by the item to result in overall DIF between the Russian and Kyrgyz groups). The item was negligible DIF and favored Kyrgyz speakers overall. In order to understand how these dialect differences impact results among Kyrgyz speakers it would be necessary to conduct a DIF study by region (north vs. south, for example), utilizing data from two different Kyrgyz groups. Item 19 also contained words in Kyrgyz that were allegedly dialectal (IA19). Other concerns were raised about familiarity with terms that certain demographic groups might not know. For example, for item 3, the concern was raised that the Kyrgyz word “күл” (ash) in distractor “г” would not be known by city kids. One evaluator noted that city kids do not encounter “күл” (ash) as “they live in apartments … so this is a lack of vocabulary, nuance” (IA3). While item 3 was a DIF item and highly marked, GA3 indicated that evaluators believed a crude formatting mistake (presented below) was the most plausible cause of DIF. There was also debate over item 22 and whether or not examinees would know the word “баобаб”r. (baobab tree) because it does not grow in Kyrgyzstan. Some felt that examinees would not know this word but others disagreed. 
Two evaluators marked this item as DIF and one as favoring the Russian group. As in sentence completion items 21 and 23, this item did in fact favor the Russian group, though the DIF was negligible (.013 effect size). Evaluators noted that an incorrect pair of antonyms in distractor (A) made the item more difficult for the Kyrgyz group. They also noted an inappropriate word combination in item distractor (б) and that the Kyrgyz distractors were much longer than the Russian distractors (IA22). In general, none of the items with practical DIF were clearly associated with socio-demographic or cultural issues. Format There was minimal discussion about the moderate DIF on reading comprehension item 32, which favored the Kyrgyz group. One evaluator noted that the Kyrgyz item wasn’t clear but another disagreed and believed there was nothing wrong with the item (GA32). There was a typographical mistake in the Kyrgyz version, as one Russian letter, “и” (and), which is also a word, was used, but most evaluators did not believe that this would lead to problems (IA32). Two evaluators predicted that the item favored Russians due to the general quality of the reading text in Kyrgyz. Despite the high DIF value, only two evaluators marked item 32 as DIF and only one evaluator offered the specific feedback that Kyrgyz item distractor “г” needed to be more clearly worded (IA32). During the conversation about this item, one evaluator noted that there were structural differences in the way the reading comprehension items 32, 33, and 36 were phrased in the Russian and Kyrgyz versions. In the Kyrgyz version of the item pairs, respondents were asked to answer a question, while for the Russian version the respondents were required to “complete the sentence” (i.e., the stem was not phrased in question form). He noted: MD: When there is a “question – answer,” it might be easier than when you have to “build” a sentence. In some way this might make it (Kyrgyz) easier to solve than the Russian item but I am not sure about that; the distractors all seem pretty clear (GA32). It turned out that items 32 and 33 did in fact favor the Kyrgyz group but item 36 was a non-DIF item. A larger study of item formats could test the hypothesis that the format was affecting DIF levels. There was little evidence, however, that evaluators had strong, plausible hypotheses for why item 32 was a DIF item. Three evaluators predicted DIF for item 38 due to format mistakes. Item 38 was correctly predicted by four evaluators to favor Russian examinees. Several evaluators noted that the form of the item stem in Kyrgyz made overall understanding difficult. Several evaluators also noted a format mistake where a nonsense word makes distractor (б) confusing - “токтуу”k. (no meaning) should be replaced with “токтотуу”k. (to stop). However, it was a negligible DIF item with an effect size of .011. Many of the format issues noted appeared not to impact item responses. One exception was item 3. In item 3 a typographical error resulted in a nonsense word in the answer key and thus no viable answer choice from among the distractors in the Kyrgyz version. Instead of “Чопо”k. (clay), the word “Чоно” (no meaning) was written, a one-letter misprint that resulted in a total loss of meaning (IA3). Perhaps not surprisingly, this item had both the highest number of marks (6) from evaluators and the largest DIF level of all 38 items. For the Russian version the item had a difficulty level of .64 and for the Kyrgyz version .21. 
Kyrgyz examinees selected from all the distractors in equal proportions. There was clear consensus from the evaluators that the item was highly problematic and that the format error was to blame (GA3). For the other format issues noted, most format or typographical errors, such as a different arrangement of the distractor order or missing letters in certain words, did not seem to pose major problems for understanding. Items 35 and 40 both contained format errors, but there were few evaluator marks for DIF and no evidence of practical DIF on these items. Grammar Many items allegedly contained Kyrgyz grammar mistakes in individual words. These mistakes consisted of incorrect suffix use (items 10, 14, 15, 18, 26, 28, 29, 30, 31, 37), inappropriate use of compound words (item 17), incorrectly constructed word combinations (items 5, 18, 22, 33), incorrect use of connectors “and, but, because” (items 21, 27, 28), and word choice (items 28, 29, 30, 35, 38). Item 37 contained a grammar mistake in the Kyrgyz stem, as “Эмнеде” was used instead of just “эмне” (what). However, most evaluators believed that simple grammar mistakes would not cause DIF on this item, and the only mark it received was a single mark for “somewhat similar.” This item had negligible DIF with an effect size of .009. In most cases, the item response patterns did not seem to be influenced when the grammar issue was related to a single word. However, as presented in the discussion above, major syntax issues came up in regard to the sentence completion items, which made producing equivalent sentences and ideas quite challenging for several items (GA21, GA22, GA23, and GA28). Other Issues A careful review of the data from the rubrics indicates that most analyses centered on discussion of the Kyrgyz (target language) items in the item pairs and issues with their adaptation. As the facilitator, I often asked, “What about the source language (Russian) items?” In most cases the response was that the items were clear and correct. However, the Russian version of item 16 elicited some discussion. This item was classified as negligible DIF but had a very high effect size of .031 and favored the Kyrgyz examinees. Two evaluators marked it as a potential DIF item, one marking it as favoring the Russian group and one the Kyrgyz group. Unlike most items, in which response patterns were similar across groups in terms of the order of attractive responses, the Russian and Kyrgyz groups selected different distractors as their first choice on this item (neither of them the answer key). Interestingly, while negligible, this item also had the highest effect size level (.018) for non-uniform DIF. The stem for item 16 was “author: writer.” The same word “author” was used in both the Russian and Kyrgyz versions (Автор). The answer key was “furniture: table.” The relationship was supposed to be “class of objects/member of that class.” That is, a writer is part of the class or family of “authors” (including authors of screenplays, plays, etc.). However, 42% of the Russian group selected “numeral: digit” as their preferred choice. The Kyrgyz respondents were attracted to neither the answer key nor “numeral: digit,” but instead 45% of them selected “journal: book” (журнал: китеп). One evaluator noted that the relationship in the stem could have been construed as a relationship of two synonyms instead of “part of class/family” as was the intent. This would explain the attractiveness of the other choices available. 
Other than item 16, there were virtually no other in-depth analyses of the Russian (source) items. As noted above, and clearly demonstrated in the rubrics, the overwhelming majority of discussion did not focus on issues within the Russian items themselves that might explain DIF. This item was one of many that favored the Kyrgyz group yet was perhaps the only item where the Russian version received focused attention. I will return to this issue of difference in focus of analysis in the discussion chapter. Chapter 6: Discussion & Conclusions Understanding Evaluators’ DIF Predictions In this study I sought an understanding of what bi-linguals were capable of accomplishing in a “blind review” of cross-lingual test items. This is important because a considerable amount of item adaptation and review throughout the world is conducted without the assistance of statistical DIF detection methods. As highlighted above, item evaluators in the KR have minimal (if any) formal training in psychometrics or experience as participants in DIF studies. The predictions of the selected evaluators served as proxies for “the best possible substantive estimates” in the KR due to their previous work experience (Chapter 4). Relative to other DIF prediction studies, a .45 correlation between their ratings of item difference levels and statistical DIF estimations can be considered a relatively high correlation. This indicates that evaluators were able to identify some differences in content, meaning and difficulty between the two item versions that threatened equivalence. Evaluators were also able to identify problems with item pairs that were a function of the particular languages under study (e.g., agglutination in Kyrgyz led to complications in adapting sentence completion items). These insights into the unique, language-specific challenges of adapting Russian items into Kyrgyz items are also important. However, as presented in Chapter 5, the overall results of the study were somewhat ambiguous. This is because, with the exception of the sentence completion items, evaluators were not able to predict which group was favored by DIF with more than chance accuracy. In fact, the overwhelming majority of their predictions were for differences to favor the Russian group. Thus, while the .45 correlation indicates a modest association between what they believed were “different items” and items with high chi-square difference values, without accuracy in determining which group was favored by DIF, it would be incorrect to infer that the evaluators were accurate in “predicting DIF” overall. Evaluators’ analyses focused almost exclusively on the quality of the Kyrgyz items and the challenge of adaptation from Russian into Kyrgyz, especially the sentence completion items (GA21, GA23, GA26, GA28). Virtually no hypotheses were generated as to problems that might lead to items favoring the Kyrgyz group, even though the majority of the statistically significant (not practically significant) items favored that group. In this last chapter, I analyze these findings and interpret their implications for key stakeholders, highlight cautions to data interpretation, and provide recommendations for how to improve both item adaptation and DIF prediction accuracy based on lessons learned. Accuracy in Substantive Item Review The greatest challenge to evaluator accuracy was their inability to predict the direction of DIF. 
Of all 32 items classified as negligible, moderate, or large DIF, 18 items actually favored the Kyrgyz group while 14 favored the Russian group. In total, evaluators marked 26 items as favoring one of the groups. Of these 26 items, evaluators marked the Kyrgyz group as favored only twice, and this was done with a single mark in both instances; that is, there were only two individual votes for “favors Kyrgyz” in the entire study. This finding of one-sidedness in prediction of DIF direction is somewhat consistent with another study where DIF by racial group was analyzed in the USA (Engelhard et al., 1990). The researchers found that evaluators could not predict which test items would perform differently for black and white examinees when they had no empirical data. They proposed that one reason for the low agreement was the infrequent use of the category “favors blacks.” They concluded that perhaps because some reviewers were asked to represent the interests of their race in a high-stakes situation, this might have proved stressful for some of them and influenced their marking. As in the Engelhard et al. study (1990), the category “favors Kyrgyz” was selected rarely in the substantive review. It seems plausible that in many contexts (not just in the KR) reviewers enter DIF analyses with the assumption that DIF and item bias most often penalize minority or disadvantaged groups. Thus, one plausible explanation for the one-sided outcome is evaluator dispositions. The overall context of the study plausibly explains these dispositions. Recall the dubious nature of the Soviets’ “creation” of the Kyrgyz literary language in the 1920s as presented in Chapter 2 (Hu & Imart, 1989). This process entailed developing a new written language, multiple changes in orthography, and the imposition of foreign “Sovietisms,” in addition to being a highly politicized endeavor in which the interests of the Soviet state were consistently prioritized over coherent or authentic Kyrgyz language development (Grenoble, 2003). Despite initial attention to native language education, the status of the Kyrgyz language was that of a second-class language as early as the end of the 1930s and certainly by the 1950s (Chapter 2). Contemporary attitudes towards Kyrgyz language use have perhaps remained relatively unchanged since independence, despite the improved symbolic status of the language (Korth, 2005). Considering this historical context and the two troubled decades in Kyrgyz language development since 1991 (Chapter 2), perhaps the tendency to mark almost all the NST items as “favoring the Russian group” should not be so surprising. The ten ethnic Kyrgyz evaluators in this study were certainly cognizant of both the large NST score gaps (favoring the Russian-medium educated) and the overall state of education in the Kyrgyz medium of instruction in the KR (OSI, 2002; Korth, 2005; De Young et al., 2006). To some extent, subtle, even subconscious, tendencies to “defend” the Kyrgyz examinees against what might be perceived as a privileged and historically hegemonic force (the Russian language) might have resulted in a tendency to mark the Russian group as advantaged without deep reflection upon the differences between item versions. This finding underscores the need to conceptualize review of cross-lingual items as a context-bound, social and political process, not simply a technical endeavor. 
Languages in DIF studies are not simply neutral “variables” but are invested with symbolic social meaning, and language politics can be the vehicle through which power relations between groups are mediated. Participants enter into the substantive review process with certain dispositions, prejudices and strongly held beliefs, all shaped by individual experience and social context. In an important sense, this result underscores Grisay et al.’s (2006) point that each study involving language comparison is a unique endeavor in its own right. While Grisay was referring to the specific linguistic properties of the language(s) themselves, this study indicates that there are also important social dimensions to DIF studies which rely on substantive review. This social dimension appears manifest in the evaluators’ consistent predictions of DIF direction to favor the Russian group. Of course, the one-sidedness of evaluators’ predictions of DIF direction may not be solely attributable to evaluator dispositions. As indicated by item evaluators on the rubrics and noted in the historical overview in Chapter 2, one of the main differences between the Russian and Kyrgyz languages is the extent to which they are both coherent, “standardized” systems (Korth, 2005). Whatever the political dimensions of language - and regardless of how incomplete the endeavor to “standardize” Kyrgyz in the 1920s was - the early Soviet language planners cannot be held responsible for all of the inherent grammatical or syntactical attributes of a language that make item adaptation challenging. Indeed, the lack of Kyrgyz standardization and the contested nature of what constitutes “correct literary Kyrgyz” kept the focus of most item analyses squarely on the Kyrgyz items. Almost all of the 82 distinct adaptation, format and cultural issues raised by evaluators were related to alleged problems with the Kyrgyz language items. Discussions often focused not on the differences in how Russian and Kyrgyz examinees would respond to item differences, but rather on the correct style, grammar, meaning, and dialect of the Kyrgyz item versions. An issue that arose consistently in the analyses was the gap between everyday usage and various (disputed) versions of “correct language.” By contrast, the Russian language has long-standing, consistent rules and enjoys relative consensus about norms, syntax, grammar, and general use, at least within the context of the KR. It is indeed difficult to compare Kyrgyz and Russian versions of an item if there is little consensus as to what “correct Kyrgyz” should be. And, as evaluators often noted, the Russian items tended to be “quite good” (GA33). The lack of evaluator experience could also have contributed to the inaccuracy in prediction of DIF direction. The evaluators were not psychometricians and had no experience with applied statistics in educational research, with probability models, or as participants in any form of DIF study. The evaluators had no information about the actual statistical DIF outcomes when they filled in the individual rubrics and participated in the group analyses. In only one case (item 32) were evaluators informed that an item was practical DIF, and only after the item had been discussed and characterized as not problematic by evaluators. While I informed the evaluators that the results of their evaluations would be compared to a statistical analysis, none of them had knowledge of how these analyses were typically conducted or what kind of results they could deliver. 
It is plausible that their lack of experience contributed to the focus on such overt, Kyrgyz-related issues and distracted evaluators from a more nuanced, in-depth examination of the psychology of item response. Russian items at times seemed to be viewed primarily as “references” against which evaluators could check their understandings of the Kyrgyz items. In addition to the high number of Kyrgyz-related item conversations, there was considerable digression away from item analysis into general discussions about the challenges posed by a lack of standardization of the Kyrgyz language in general (GA21, GA23, GA28). Perhaps many issues that could have led to Russian items being more challenging simply went unnoticed as evaluators focused on “finding the mistakes” in the Kyrgyz versions. It is conceivable that to novice evaluators, mistakes and contestation in one language version naturally seem to lead to DIF that disadvantages that group. In other words, the “high quality” (and uncontested) items could perhaps become falsely associated with “advantage” while “lower quality” (contested, more mistake-prone) items could become associated with “disadvantage” in the minds of evaluators. The fact that the Russian items appeared to be of “high quality” might have led to the assumption that the Russians were favored in most instances where differences were evident. This line of thinking seems plausible when considering the inexperience of the evaluator group with DIF analyses. Recommendations for Researchers and CEATM Whether the reasons for inaccurate prediction of DIF direction were dispositions or lack of experience, I contend that there is nonetheless some room for optimism that evaluators in the KR can improve their estimations. First, the inter-rater reliability estimate of .66 and the .45 rank-order correlation between their estimations and chi-square values indicate that their overall estimations were not completely random. Second, as presented in the tables in Chapter 5, evaluator marks on direction of DIF were often more tentative than the marks from section 2.1 (levels of difference). This indecision perhaps indicates that inexperience played as important a role in their estimations as dispositions did. Below I propose several steps that could be taken to assess the hypothesis that the evaluators can improve prediction accuracy. First, as Ercikan (2002) argues, DIF study outcomes differ depending on whether both versions of the items are reviewed simultaneously or individually by evaluators. She notes that when both item versions are presented in pairs, evaluators tend to focus on the comparability of overt issues like format, content, and language use. This seems like an accurate characterization of what transpired in this study. Ercikan (2002) proposes that when reviewers analyze a single item they focus more on the context and issues that might make the item biased for a particular group. In other words, the single-item review approach leads to a more nuanced item analysis and facilitates the consideration of the possibility of different cognitive processes among comparison groups (Ercikan, 2002). This kind of approach could lead to a more considered estimation of DIF that favors the Kyrgyz group, and with additional research this approach could be readily employed in the KR. Second, exposure to statistical DIF detection methods by embedding them in some form of action research might also improve evaluators’ accuracy. 
One way to do this would be to conduct several individual item analyses, pause, and then compare the evaluators’ preliminary predictions with the actual statistical estimations and discuss the results together as a group. Such an approach would demonstrate the complex and tenuous nature of DIF prediction and interpretation to the novice evaluator. It would show that the language group with the lower average test score is not always the disadvantaged group at the item level. It would become more apparent that mistakes do not always lead to DIF, neither in the language where the mistakes occur nor in the other language involved. Finally, it would underscore the need to think deeply about the differences between item versions before predicting the direction of DIF. This kind of fine-tuning and skills enhancement through the introduction of statistical methods holds promise for better analyses in the KR. With the increasing availability of online software and the option of relatively inexpensive statistical packages, the employment of statistical DIF detection methods is feasible in the KR in the near future. In addition to employing statistical analysis as part of the item review process as highlighted above, there are other ways statistics can be used to improve DIF prediction processes. While predicting DIF is difficult on average, there is evidence that some reviewers are more accurate than others. Engelhard et al. (1990) discovered considerable variability across reviewers in the correlations between their individual marks and statistical DIF. Estimations of individual reviewer accuracy can be used both in training and as a quality-control tool for CEATM when selecting reviewers to participate in item analyses. Individual estimations were not computed for this study but could be done in future work by CEATM with the consent of evaluators and CEATM employees. Understanding the Causes of DIF The second purpose of this study was to gather data about causes of DIF that could inform and improve the item adaptation process in the Kyrgyz Republic. One finding from this study was that in order to understand DIF causes, it was necessary to analyze the minutiae of each item: broad, categorical labels such as “translation differences” did not capture the nuances necessary to provide a real understanding of what was causing DIF in a particular item. For example, most of the loan word, new word, or socio-cultural issues projected as potential DIF causes (items 2, 7, 9, 17, 22, 25) did not lead to DIF (with the possible exception of item 23, a negligible DIF item with a high R-squared delta, where the debate was over whether “insurance” was a known phenomenon (GA23)). Some translation and adaptation problems did (items 13, 19) while others did not (items 10, 15, 18). Item 3 contained an obvious format problem that plausibly led to DIF while other items with format issues (items 35, 38, 40) remained unaffected. Incorrect Kyrgyz grammar at the word level did not seem to cause DIF in most cases but the adaptation of entire sentences in the sentence completion items was characterized as a dubious endeavor by all the evaluators (GA21, GA23, GA26, GA28). Tenable hypotheses about causes of DIF were articulated when evaluators were able to break down the minutiae of the item under review. The location of the difference or problem within the item, and the extent to which the difference impacted meaning or difficulty level across versions, were paramount. Nuances that resulted in differences in key parts of words, phrases and sentences were important, “key parts” meaning the places in the item where essential meaning is located (Ercikan, 2002). 
For example, essential meaning in analogy items is located in the item stem. If the stem is muddled in one version and the logical relationship between the pair of words is not the same in both versions, the differences between the items are plausibly going to be problematic (IA19). The same is true for the answer keys. If differences between item versions result in no viable answer key for one version, DIF is also highly possible (IA3, IA13). However, if there is a format or small translation mistake in a distractor that was obviously not plausible to begin with, this issue might be less likely to cause DIF. In this study, three of the four practical DIF items had serious issues with either the answer keys or item stems (GA3, GA13, and GA19). Thus, causes such as poor adaptation, translation and format problems, incorrect grammar, and questionable cultural comparability are best conceptualized as general constructs. They are useful primarily as organizing principles for data analysis, not as DIF causes per se, as they are less meaningful terms outside the context of a particular item (Ercikan, 2002). It is also clear from this study that mistakes in items do not necessarily result in DIF. Previous research has shown that while item evaluators are quite good at locating mistakes in test items, this is not the same thing as successfully predicting DIF (Engelhard et al., 1999). Item evaluators in this study identified mistakes overwhelmingly in the Kyrgyz versions, though not every mistake noted was widely agreed upon. However, the majority of these mistakes were not associated with statistical DIF, and in many cases evaluators were divided as to whether they would or would not lead to DIF. For example, recall that half of the statistically significant reading comprehension items favored the Kyrgyz group despite the fact that the evaluators reported problems exclusively with the Kyrgyz text. As noted in Chapter 5, it is perhaps possible that mistakes in Kyrgyz items in non-essential locations (less plausible distractors, for example) might have actually favored that group as such mistakes reduced the number of plausible answer choices (IA11, IA15, IA18, IA25, and IA33). However, as no hypotheses were generated for why any of the items might favor the Kyrgyz group, it is difficult to suggest hypotheses about what caused some items to favor that group. While three of the four practical DIF items favored the Russian group, there were many items close to the moderate DIF cut-off that favored the Kyrgyz group. A tentative explanation for why many items favored the Kyrgyz group might be related to issues with word difficulty. For example, when words are adapted from the source language, they can become easier due to a lack of corresponding vocabulary at the same difficulty level in the target language (Schmitt & Bleistein, 1987; Bejar, Chaffin & Embretson, 1991; Roccas & Moshinsky, 1997; Sireci & Allalouf, 2003). Recall that for some items in this study, there were single Kyrgyz words that are differentiated by several different words or concepts in the Russian language; that is, the Russian language might have finer degrees of distinction for some concepts, and some of these distinctions might have an impact on item difficulty. 
For example, the word for orchard and tree is the same in Kyrgyz, while in Russian there are different words for these concepts (IA3). It could be that such distinctions make the use of some words equivalent in meaning but divergent in difficulty level due to differences in commonality of use. Such differences are not overt and they are not easy to identify without deep probing and analysis. Of course, as there were no hypotheses generated about why any item might favor the Kyrgyz group in this study, this is conjecture at best. It is interesting to note, however, that nine of the eleven negligible DIF analogy items (not practically significant) did favor the Kyrgyz group. It is possible that word difficulty could be an issue for these particular language groups on analogy-type items. Unfortunately, the data allow no more than tentative hypotheses about the issue at this time. The evaluators recognized problems in three of the four practical DIF items and predicted their DIF direction correctly. However, for the one practical DIF item that favored the Kyrgyz group, item 32, a conclusive determination of the cause of DIF remained elusive, as no widely agreed-upon hypothesis was offered to explain the DIF. Only the sentence completion items tended to favor the Russian group on a consistent basis. Most of the evaluators’ specific hypotheses about key differences between Russian and Kyrgyz items were generated in discussions about these items, despite the fact that none of these items were practical DIF (GA21, GA23, GA26). Evaluators were nonetheless successful in their predictions of DIF direction 71% of the time for the sentence completion items. In short, as in previous studies, it is apparent that DIF causes overall are not always easy to identify, but as this study indicates, there may be variation in success rates by item type (Plake, 1980; Engelhard et al., 1990; Rutledge, 1990; Gierl et al., 1999; Jodoin & Gierl, 2001; Ercikan & McCreith, 2002). Despite the above qualifications about our limited ability to generalize about DIF causes beyond the location of the cause within each item, the data do point to a few recurring patterns in regard to item type. First, the reading comprehension items demonstrated the lowest (negligible) DIF levels and generated the least amount of critical commentary. This result is consistent with several other studies of verbal reasoning items noted in Chapters 3 and 5. Second, one actionable finding from the study was the issue of the lack of “linguistic fit” of Russian and Kyrgyz sentence completion items. Specific hypotheses about the Kyrgyz versions of the sentence completion items were clearly articulated and widely supported by evaluators (GA21, GA23). These items elicited both the most commentary and the most accurate DIF direction predictions on the part of evaluators. Evaluators even found a few of these items difficult to answer themselves without being able to reference the original Russian item (GA26, GA28). Though none of the practical DIF items were sentence completion items, these items were clustered around the highest chi-square and effect size values. Items 21, 22, 23, and 26 were some of the most problematic of the entire 38 items, according to the evaluators. All four of them statistically favored the Russian group. Sentence completion items 25 and 28 were also negligible DIF with high effect size values. 
The problem with the sentence completion items was related to the fact that Russian and Kyrgyz syntax was not easily reconcilable within the context of these items. Recall that syntax is the body of rules in a given language that determines how words and phrases come together to form grammatically correct sentences. The syntactical clash between Russian and Kyrgyz manifested itself in Kyrgyz items as incorrect use of compound words, incorrect word combinations, artificial “Russified” sentence structure, and general confusion in the Kyrgyz items. Evaluators consistently noted that the items failed in Kyrgyz because they were being “forced” into a Russian syntactical style that did not work (GA21, GA23, GA26, GA28). Unlike the reading comprehension items, which allow for a more natural flow of language due to the absence of constraints on text size (the Kyrgyz reading comprehension text is more than twenty lines longer than the Russian text), the sentence completion items must be short and concise by definition. They require examinees to make logical connections by filling in the missing words that make the sentence(s) most logically complete. They consist of one or two sentences at most, but the sentences must be relatively long. Recall from Chapter 5 that evaluators believe that agglutination keeps Kyrgyz sentences short and is not conducive to the longer kind of sentences necessary for the sentence completion items (GA21, GA23, GA26). It is hard to imagine sentence completion items with sentences of three to four total words. Yet, such short sentences are common in Kyrgyz (GA21, GA23). If it takes more (shorter) sentences to convey the same meaning and level of complexity in one language than in another, it makes intuitive sense that item types that allow only one or two sentences become problematic for adaptation. This finding is consistent with other studies that found that DIF can be caused by differences in sentence structure that are inherent to the language under study (Gierl et al., 1999). Recommendations for Researchers and CEATM While the claim made by some evaluators that Russian syntax is inherently “more difficult” is perhaps questionable, it is certainly plausible that specific syntax differences can create specific challenges for certain item types. Evaluators made specific proposals for rectifying this state of affairs with the sentence completion items. They proposed that the long Russian sentences be broken into shorter (but more) sentences in the Kyrgyz versions (GA21, GA22, GA23, GA25, GA28). Evaluators also proposed that, if breaking items into smaller parts was not feasible, the test center should in the future create the Kyrgyz and Russian items separately rather than adapting them from Russian, create them in Kyrgyz first and then adapt them into Russian, or perhaps not use this type of item at all. CEATM notes that bi-linguals play an important role in item development and adaptation and that procedures are in place to guarantee item equivalence throughout the test item development cycle (see Chapter 4). The overall low amount of DIF detected in this analysis supports the contention that these processes are working well. However, in their item reviews, evaluators noted on several occasions that it seemed obvious that Kyrgyz language items were developed in Russian, or by “Russian thinkers,” without enough consideration for the authenticity of the Kyrgyz version. 
For example, in regard to item 21, one evaluator stated, “It seems that this item was obviously adapted from Russian. I think if Kyrgyz original texts had been used, there wouldn’t be this problem. We could avoid syntax problems like this” (GA21). There was wide agreement among evaluators about this contention. Similar comments can be found in GA23 and GA28. In GA33, one evaluator noted, “There are many problems with the translation of this difficult text… one resolution is to take an original Kyrgyz language text, related closely to this theme and the Russian text…” In the individual analyses of the reading comprehension text, evaluators also requested more original Kyrgyz texts. At a minimum, the evaluators’ comments merit reconsideration of current adaptation procedures, especially considering the issues raised in regard to the sentence completion items. Hambleton and Kanjee (1995) recommend “de-centering” the item development process in challenging cross-lingual situations. In the context of item adaptation, “de-centering” means providing opportunities to return to the source item version (Russian in this case) and making changes to that source item if necessary due to challenges in adaptation to the target language. In GA15, one of the evaluators proposed: “Perhaps it would be possible to compare the translated Kyrgyz text with the original Russian text? That is, adjust the Russian text again if the translation into Kyrgyz does not seem to work?” Solano-Flores (2006) recommends what he calls “concurrent development” of test items. While his work focuses on English and Spanish speakers in the United States, several of his ideas are relevant to other cross-lingual contexts. He proposes that all test items be developed exclusively by bi-linguals. This forces test developers to seriously consider how culture and context are inextricably related to language. In some of his work, he has utilized two groups of bi-linguals to concurrently develop the two versions of a given test item. Through this process, modification of items becomes an iterative, negotiated endeavor that does not proceed without consensus. All recommendations for changes to one version of an item are only considered after the proposed changes have been analyzed in relation to how they will impact the other language group. He also recommends the use of “blueprints” or general item guides (like mini-specifications) to mediate discussion around each item. Through the process of “localization,” bilingual test developers work from these blueprints but have considerable freedom in adaptation in order to facilitate linguistic alignment between the two versions. The result is that some items will inevitably be slightly different versions but ultimately serve the same aims. Solano-Flores (2006) insists that “localization” in item development is essential, as research has demonstrated that even bi-linguals do not always have consistent or accurate perceptions of all the linguistic aspects of items that are critical to properly understanding their functioning. Further, simply being a native speaker does not necessarily enable evaluators to identify those key aspects. Hambleton and Kanjee (1995) also offer a useful method for determining comparability of items that could be employed in the Kyrgyz Republic during item development and analysis. They propose utilizing examinee interviews to determine the cognitive processes items elicit as examinees engage with items. 
For example, Kyrgyz-language examinees could explain their reasoning for answering certain ways on the sentence completion items. Judges would follow along with both Russian and Kyrgyz versions on hand and compare how well the items capture similar meaning and constructs. If the responses of the examinees correspond to the intent of the source item (Russian), then the items can arguably be considered equivalent. While labor-intensive, this type of analysis could be performed as a follow-up for items that seem to be problematic according to DIF statistics or other forms of analysis, not necessarily for all items. In the case of the NST 2010 items, such follow-up analyses for the sentence completion items could be fruitful. While not done formally for this study, this kind of individual interview could also be conducted with item reviewers on problematic items. While the above suggestions for possible modifications to item development procedures seem reasonable, they of course must be realistic in terms of resources available to invest in item development. The government of the KR currently does not provide CEATM with financial support. CEATM resources are generated through student fees for test services (approximately 4-5 US dollars per test per student on the NST). The use of item development groups that employ multiple levels of review and other elaborate iterative processes is a labor-intensive enterprise that demands a significant time and resource commitment. Thus, CEATM must carefully consider both what can be learned from “best practices” in cross-lingual item development and resource realities. Another important issue is how to distinguish between an “adapted” item and a “new” item altogether. For example, would breaking the Russian sentence completion items into smaller sentences constitute appropriate adaptation or the creation of completely different items with different meaning and difficulty levels? If new items are utilized, replacing one item type with another does not absolve CEATM of the need to employ items of similar aim, meaning and difficulty if they intend to make comparative inferences across groups. Testing practitioners and policymakers in the KR should be sensitive to this challenge and conduct further research, invest in training of reviewers, and experiment with different test item types to the greatest extent possible. The above findings in regard to the challenge of sentence completion items are also relevant for test developers in neighboring countries that develop standardized tests in Russian and in Turkic (agglutinative) languages such as Uzbek and Kazakh. Statistical DIF and the NST Verbal Items Not all multi-lingual countries provide opportunities for education through multiple language media or cross-lingual testing in high-stakes situations. In many Asian countries, pupils and students are schooled in and sit for examinations in their second language. Hambleton and Kanjee (1995) argue that one of the main benefits of cross-lingual testing is the elimination of bias that potentially exists when students must sit for examinations in a language that is not their native tongue. In multi-lingual societies, therefore, cross-lingual testing is potentially a good policy option for a variety of uses if inferences on cross-lingual tests can be validated. 
The identification of only 4 of the 38 analyzed items as practical DIF on the NST is a relatively low number for a cross-lingual assessment: other studies with approximately the same number of items have revealed that up to half the items are flagged as DIF (Chapter 3). This is a positive result for both CEATM and higher education admissions policy makers in the KR. While the analyses included only a portion of the total number of NST items, policy makers and stakeholders concerned about the feasibility of employing cross-lingual testing in HEI scholarship selection now have empirical evidence that supports the inference that CEATM administers a test with a very high proportion of equivalent items. The low number of practical DIF items indicates that CEATM has done a reasonably good job utilizing the available linguistic and cultural resources and suggests that the bi-linguals employed are reasonably effective at developing equivalent cross-lingual test items. If the test center can incorporate statistical methods to assist with DIF detection and improve the item adaptation process, it can feasibly further improve the reliability of the NST and enhance the validity of selection inferences based on the NST. The overall low number of practical DIF items, the modest correlation between substantive review and statistical DIF (in terms of difference levels and chi-squared values), and the relative ease with which evaluators identified some causes of DIF are all reasons for cautious optimism. The overall low number of DIF items is perhaps best explained by the nature of the NST as a within country cross-lingual test. The large number of bi-lingual and bi-cultural scholars, teachers and adapters readily available to the test center means that the cultural distance between groups is relatively small: differences in schooling, curricula, instruction, and other intervening cultural and linguistic variables that can impact DIF levels across groups can “be known” quite easily in the KR (Ercikan, 2002). Yet, despite low levels of statistical DIF, the results of this study for the item evaluators are not straightforward. In the previous section I noted that dispositions, characteristics of the Kyrgyz language, and evaluator inexperience all plausibly impacted evaluators’ inaccurate prediction of DIF direction. If the original item developers for the NST 2010 came from the same general population of item reviewers employed in this study in terms of experience and training - and CEATM believed that they did - one might expect, if not accuracy, at least some element of “randomness” in their predictions of DIF direction overall, not the one-sided estimations - “favors Russian” - across almost all the items. The paradox seems to be that while the cultural intimacy of the within country study in some ways makes cross-lingual testing more feasible than in broader cross-nation comparisons, there appears to be an added dimension of language politics (and subjectivity) when the research touches on sensitive questions such as “who benefits from item differences?” While this was not an anticipated result of this study, it was not too surprising considering the context of the DIF study and the history of Russian and Kyrgyz language politics in the KR.

Cautions to Statistical DIF Interpretation

The logistic regression method proposed by Swaminathan and Rogers (1990) with the effect size measure proposed by Jodoin and Gierl (2001) yielded clear and interpretable results.
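To make the mechanics of this procedure concrete, the sketch below shows one way the uniform DIF test and its effect size could be computed for a single item. It is purely illustrative and is not the implementation used in this study: the choice of Python with statsmodels, the use of the Nagelkerke pseudo R-squared as the basis for the R-squared delta, and all function and variable names are assumptions made for the sake of the example.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from scipy.stats import chi2

    def lr_uniform_dif(item_correct, total_score, group):
        """Swaminathan & Rogers (1990) style uniform DIF check for one item:
        compare a compact logistic model (ability only) with a model that adds
        group membership, then report the chi-square difference (1 df) and an
        R-squared delta effect size."""
        item_correct = np.asarray(item_correct)
        n = len(item_correct)

        X1 = sm.add_constant(pd.DataFrame({"score": total_score}))
        X2 = sm.add_constant(pd.DataFrame({"score": total_score, "group": group}))

        m0 = sm.Logit(item_correct, np.ones((n, 1))).fit(disp=0)  # intercept-only baseline
        m1 = sm.Logit(item_correct, X1).fit(disp=0)               # Model 1 (compact)
        m2 = sm.Logit(item_correct, X2).fit(disp=0)               # Model 2 (with group)

        # Likelihood-ratio (chi-square difference) test; critical value 3.841 at .05, 1 df
        chi2_diff = 2 * (m2.llf - m1.llf)
        p_value = chi2.sf(chi2_diff, df=1)

        # Nagelkerke pseudo R-squared for each model; the effect size is the difference
        def nagelkerke(m):
            cox_snell = 1 - np.exp(2 * (m0.llf - m.llf) / n)
            return cox_snell / (1 - np.exp(2 * m0.llf / n))

        effect_size = nagelkerke(m2) - nagelkerke(m1)

        # The sign of the group coefficient indicates which group the item favors,
        # given the 0/1 coding chosen for the group variable.
        return chi2_diff, p_value, effect_size, m2.params["group"]

The same routine could be reused, for instance, in the Bishkek versus rural probe described later in this chapter simply by passing the Kyrgyz 1/Kyrgyz 2 indicator as the group variable.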
As the purpose of this study was not to evaluate the accuracy of various statistical DIF detection methods, the actual number of DIF items detected by the logistic regression method was not of primary importance. However, there are some important qualifications to the interpretation of the statistical findings related to contextual factors that could impact statistical outcomes. In the next section I elaborate on these qualifications in detail.

71 Recall that many of these evaluators did have experience working with CEATM on NST test adaptation in previous years (see chapter four for a breakdown of professional background and experience in testing).

First, statistical methods in DIF studies are not 100% accurate in detecting DIF (Hambleton, 1995). The logistic regression (LR) method - while comparable to other DIF detection methods in accuracy - has had power rates between 70% and 80% in experimental studies with various combinations of ability levels, item types, item characteristics and sample sizes (Jodoin & Gierl, 2001). Thus, in any given study relying on the LR method, it is feasible that some DIF items could remain unidentified, though a large sample size like the one employed in this study should lead to a relatively high success rate. There are other factors, however, that could threaten the accuracy of the statistical estimations. The lower reliability of the Kyrgyz NST items and the large difference in ability distributions between the Russian and Kyrgyz populations could introduce statistical error (Narayanan & Swaminathan, 1996). Differences in item characteristics could also pose a challenge to accurate estimation. Hambleton et al.’s (1993) study indicated that items with lower discrimination were more likely to be missed by some DIF detection methods. They also found that very difficult items were more likely to be missed, regardless of ability level. The researchers indicated that this is especially true for DIF studies in which comparison groups have dissimilar ability distributions. Upon request, CEATM provided the test item characteristics for the 2010 items. The average discrimination value for the Russian items was .45, while for the Kyrgyz items it was .32. The average difficulty level for Russian items was .54 while for the Kyrgyz items it was .33. Another important issue is knowledge of the Russian language on the part of some Kyrgyz language examinees. Variation in the Kyrgyz population in terms of how much Russian they know could be influencing statistical results in hidden, unpredictable ways. Recall that evaluators noted on several occasions that they were only able to solve some Kyrgyz items because they knew Russian or had the Russian item available (IA21, GA21). They also noted that Kyrgyz examinees with Russian knowledge might be advantaged. This raises perhaps one of the more viable threats to DIF studies in the KR in general. Despite the fact that schooling is not bi-lingual by design, as shown in Chapter 2, bi-lingualism is common in Kyrgyzstan. Knowledge of Russian can be acquired through study as a second language at school or through the news media and social or cultural engagement with Russian speakers in everyday activities. In 1989, 83% of all urban Kyrgyz, 23% of the urban population, reported fluency in Russian (Fierman, 1991). Among ethnic Kyrgyz who are schooled in the Kyrgyz language, there is tremendous diversity in terms of Russian knowledge.
It is possible to be educated in a Kyrgyz language school but also live in a community where Russian is widely spoken. Northern Chui Valley communities like Sokuluk, Kant, Tokmok, Kemin and Kara-Balta, as well as many towns in the Issyk-Kul Oblast, have large numbers of both Russian and Kyrgyz speakers (Census, 2010). Knowledge of Russian might favorably impact some Kyrgyz language examinees as they struggle to decode incoherent Kyrgyz items that were initially developed in the Russian language (Ackerman, 1992). Recall that item evaluators were at times able to identify the “Russian thinking” behind some items - some examinees might be able to do the same. Another way of conceptualizing the problem is to consider the difficulty of defining the “typical Kyrgyz language examinee.” When evaluators offered, “Kyrgyz kids won’t know this…”, perhaps they were envisioning mono-linguals in the farthest outlying regions of the country; those with almost no exposure to the Russian language or the “Russianisms” in the Kyrgyz language used by many urbanites. However, there are other “typical Kyrgyz examinees” who do have at least some knowledge of Russian, if not very good functional command (Korth, 2005). Thus, this “mixing” of the composition of the Kyrgyz language sample in terms of background knowledge means that the overall statistical outcomes could be hiding some of the real impact of particular item issues for certain subgroups of the Kyrgyz population. Higher statistical DIF levels might be evident if all subjects in the study knew one and only one language. This kind of hypothesis could be tested with additional research if reliable data on knowledge of Russian as a second language could be obtained. In order to probe for how background Russian knowledge might have impacted the DIF statistics, I conducted one additional statistical analysis. Because student test identification numbers were tied to their region of registration, it was possible to determine which region each examinee came from. I conducted an additional DIF analysis by breaking the Kyrgyz sample into two groups of about 750 examinees each, Kyrgyz 1 and Kyrgyz 2. In one group I put all of the Bishkek (capital city) examinees and in the other, primarily rural examinees. In essence, I used residence as a proxy for language knowledge under the assumption that Kyrgyz language examinees from Bishkek would be more likely to have Russian language knowledge.

72 Though it is not possible to know which of the examinees who sat for the NST have this language background, Census (2010) data support the contention that Kyrgyz urbanites are more likely to know Russian as a second language than Kyrgyz rural residents in general.

In theory, all 38 items should have shown “no statistical DIF” when analyzed as groups Kyrgyz 1 and Kyrgyz 2. The results, however, were interesting: thirty-three of the thirty-eight items were indeed “no statistical DIF” (full statistics in Appendix V). Of the five items that did reach statistical significance, four were only barely significant at the .05 level (critical value 3.841, 1 df): item 5, chi-square value of 3.92 (sig. .048), effect size .003; item 3, chi-square value of 4.98 (sig. .026), effect size .004; item 18, chi-square 5.03 (sig. .025), effect size .003; and item 30, chi-square 5.72 (sig. .017), effect size .005. Item 21, however, the one Kyrgyz item that was unanimously and repeatedly claimed to have been created “by Russian thinkers,” had the highest level of negligible DIF of all thirty-eight items.
The chi-square difference value for this item was 12.68 (sig. .000), effect size .010. This analysis was of course only an investigative probe. In order to further investigate the possible impact of Russian knowledge on Kyrgyz item responses, experimental studies with more accurate data on the language background of the participants need to be conducted. Finally, while Jodoin and Gierl (2001) and Zheng, Gierl, and Cui (2004) have empirically analyzed the r-squared delta effect size measure employed in this study in both experimental studies and with actual assessment data, it has not been widely tested.

73 In these two studies the r-squared delta effect size measure correlates highly with two other commonly accepted effect size measures used for the SIBTEST and Mantel-Haenszel DIF detection methods.

In light of both the potential threats to statistical estimation highlighted above and the relatively untested state of the effect size measure, the statistical outcomes in this study need to be viewed somewhat cautiously. I recommend a more conservative interpretation that acknowledges that some items not identified as practical DIF by the LR method might nonetheless be problematic. Recall that the positive rank order correlation indicates that negligible DIF items with higher chi-square values were more closely associated with higher evaluator marks than the lower value chi-square items. Further, substantive data collected from evaluators indicate that several “borderline” negligible DIF items might be problematic. One purpose of relying on substantive evaluations (despite their modest reliability) is to provide additional confirmatory evidence of differences between items in the studied pairs (Gierl et al., 1999). Four of the negligible DIF items had effect size values above the median and at the same time received five or more marks as likely DIF from the evaluators (i.e., they were predicted as DIF by the evaluators). Recall also that the initial cut-score for evaluator DIF prediction was four votes for DIF. However, this scale was developed with the assumption that the scores from ten evaluators would be utilized. After dropping two evaluators, I nonetheless maintained the four-mark cut-score because that was the original scale. However, one could argue that three marks from eight evaluators is also a reasonably strong vote for DIF. Further, several items receiving three marks for DIF also had negligible DIF values well above the median effect size. With this in mind, I recommend that CEATM consider several other items near the practical DIF cut-off as potentially problematic. Table 6-1 below presents additional items for CEATM to consider. The criterion for their inclusion was that each item received three or more evaluator marks for DIF and had a negligible DIF effect size at or above the median value of .009. In addition, there were four other negligible DIF items with very high effect size values that received minimal marks from evaluators and are not in Table 6-1 below. These items are item 22 (effect size .013, rank order 27), item 26 (effect size .019, rank order 28), item 28 (effect size .029, rank order 34), and item 16 (effect size .031, rank order 31). These four items could also be considered “borderline” DIF items worthy of investigation. Six of these additional twelve items are sentence completion items, the only item type in which the evaluators’ accuracy in prediction of DIF direction was better than random.
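As a concrete illustration of the selection logic just described, the short sketch below computes a rank-order correlation between evaluator marks and chi-square difference values and applies the same borderline rule used for Table 6-1 below. It is illustrative only and is not part of the original analysis: the choice of Spearman's coefficient, the pandas/scipy implementation, and the restriction of the example data to the eight items that appear in the table (rather than all 38 analyzed items) are assumptions made for the sake of the example.

    import pandas as pd
    from scipy.stats import spearmanr

    # Item-level summary for the eight items shown in Table 6-1 below
    # (an illustrative subset, not the full set of 38 analyzed items).
    items = pd.DataFrame({
        "item":        [10, 38, 5, 25, 23, 21, 33, 11],
        "marks":       [3, 3, 3, 5, 3, 6, 5, 5],  # evaluators' DIF marks
        "chi2_diff":   [15.510, 20.210, 22.576, 23.006, 38.703, 42.413, 43.427, 49.326],
        "effect_size": [0.009, 0.011, 0.015, 0.016, 0.019, 0.024, 0.027, 0.028],
    })

    # Rank-order (Spearman) correlation between evaluator marks and chi-square values
    rho, p_value = spearmanr(items["marks"], items["chi2_diff"])
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

    # Flag "borderline" items using the criterion described above: three or more
    # evaluator marks and an effect size at or above the reported median of .009.
    median_effect_size = 0.009  # median over all analyzed items, as reported in the text
    borderline = items[(items["marks"] >= 3) & (items["effect_size"] >= median_effect_size)]
    print(borderline[["item", "marks", "effect_size"]])

In a full analysis the same filter would be applied to all analyzed items, with the resulting candidates passed back to the substantive reviewers for closer inspection.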
74 I emphasize that I am not trying to “find practical DIF” where it doesn’t exist. I do believe that the reliance on bi-lingual test adapters could make within country, cross-lingual DIF levels lower than those typically seen in across country studies like PISA and TIMSS (despite their greater methodological sophistication). CEATM does have item review procedures in place and their specialists make great efforts to produce quality test items. However, the sum of the accumulated evidence, including the threats to accurate measurement noted above, indicates that the underestimation of practical DIF in this study is possible.

Table 6-1: Items Above Median Effect Size with Three or More DIF Marks

Item   Evaluators’ Marks   χ2 Difference   χ2 Rank Order   Effect Size
10     3                   15.510          19              .009
38     3                   20.210          23              .011
5      3                   22.576          25              .015
25     5                   23.006          26              .016
23     3                   38.703          23              .019
21     6                   42.413          30              .024
33     5                   43.427          32              .027
11     5                   49.326          33              .028

Of course the identification of additional “borderline DIF” items is somewhat arbitrary, especially in terms of where to draw the line. For example, the test center might also want to reexamine item 15, which received many marks from evaluators but had an effect size value lower than the items noted above. In general, the best way to further test the accuracy of the statistical estimations in this study would be to employ multiple methods of DIF detection on the items and then compare the results across methods (Rogers, 1989; Hambleton, 1995; Jodoin & Gierl, 2001).

Recommendations for Improving Studies of Substantive Methods

In this last section I provide recommendations for future substantive DIF prediction studies based on lessons learned from the administration of the evaluation rubrics. Understanding the limitations of the data collection process will help to put the results in perspective. The first issue is related to the quality of the data collection tools themselves. The original item analysis rubric contained an element of unnecessary complexity. I developed a coding scheme that asked each evaluator to code “the nature of the difference/issue” as definitively as possible (section 2.2 of the rubrics). In general, the a priori categories of potential item adaptation problems were useful as a guide to help evaluators think about the items. However, asking them to “over think” in this area proved counter-productive. After conducting a pre-test, it was apparent that the coding categories were somewhat problematic. Though the issue was addressed before administering the rubrics, it is worth highlighting so future researchers can avoid adding unnecessary complexity to their data collection tools. The first problem was that the descriptive typologies in 2.2 were not always mutually exclusive or easy to disentangle. The result was that during the pre-test the evaluator lost time trying to distinguish between adaptation issues and translation issues when she could have been describing a problem with a specific test item in greater detail. Second, the purpose of having the substantive review with ten members was to collect a broad spectrum of opinions as to the nature of the problems with the item pairs. As individuals, the evaluators see only their own work in isolation. It is the researcher who should collect and categorize their work and present it in summative format, drawing on the opinions and conclusions of all ten evaluators.
The larger purpose of the individual evaluations was not to determine how consistently each evaluator precisely defined each problem, but rather to get a basic understanding of what the differences were between the language versions. In other words, it is not so much the individual’s marks that matter as the totality of the collective whole, which better represents their professional guidance. The utility of the data from section 2.5, suggestions for improving the item pairs, was also questionable for a limited study like this one. While interesting, it demanded that the evaluators take significant time away from item analysis and description and devote that precious time to what in essence became “item writing.” Perhaps due to time constraints, this section was not filled in diligently in most cases anyway. It also seemed redundant, as evaluators often “corrected” the item when they wrote their descriptive comments under section 2.3. Many of the comments in this section amounted to not much more than “next time, do a better job with translation.” Overall, however, as can be seen from the data in Appendix W, considerable data was collected from the individual analyses from sections 2.1, 2.3 and 2.4, as well as the group analyses. In the future I would recommend that evaluation protocols require each evaluator to only (1) assign a mark as to the level of difference (section 2.1), (2) describe the differences in detail (section 2.3), and (3) predict which group was advantaged (section 2.4). Finally, the statistical analyses of inter-rater reliability and the rank order correlation were both calculated with data taken from the individual analyses: that is, the initial marks evaluators made on their individual item rubrics. While not a critical mistake, the problem was that the benefits of the group analysis were not reflected in two of the key data analyses in the study. Evaluators did not have the opportunity to revise their original marks after learning more about the items through the group analysis (though the discussion transcripts reflect some newly generated hypotheses). The failure to do this was related to time and resource limitations. If I assume that the group analysis actually assisted in generating a more accurate understanding of the items - and I believe it did - it is possible that (1) inter-rater reliability would have been higher, (2) the rank order correlation between their predictions and the chi-squared values would have been higher, and (3) their estimations of which group was favored would perhaps have been more accurate. Anecdotally, I am confident that the marks of several individual analyses do not reflect the evaluators’ post-discussion predictions of DIF for moderate and high DIF items 13 and 19, neither of which had been predicted as DIF by the majority of evaluators initially. Evaluators predicted DIF for item 3 but not for item 32 - neither before group discussion, nor after. This recommendation assumes of course that the impact of the group analysis would be to improve the accuracy of their estimations. It is theoretically possible that, as a group, the accuracy of their estimations could get worse rather than better after group analysis. After all, there were persuasive personalities in the evaluator group who were not always correct in their predictions of how differences would impact DIF levels. So, peer pressure can certainly push group results in different directions.
75 In a private conversation with one evaluator, I learned that the reason why these two items (13, 19) received so few marks initially was that evaluators couldn’t answer the items with confidence themselves. Thus, paradoxically, low marks (or an “absence” of marks) can also indicate item trouble when evaluators struggle to make sense of the item and don’t know how to respond! More evidence for why it is imperative to rescore after group analysis.

Nonetheless, in future studies I recommend adding the additional step of evaluators individually rescoring each item (section 2.1) after the group analysis has been carried out. In general, in order to conduct a more informed DIF prediction study, I would recommend twice the time that was allocated for this study, as 10.5 hours was not enough.

Challenges to Collecting and Interpreting Data from the Substantive Review

Evaluation of cross-lingual test items as an individual process is influenced by the knowledge, experience, skills and dispositions of individual evaluators. Item evaluation as a collective process is influenced by the above factors plus the social dynamics of setting and context. Time and resource constraints also limited the amount of discussion that occurred for every test item. Group analysis was complicated by the fact that, unlike in individual interviews, not all side-bar conversations, comments, and issues could be fully captured. Sometimes, many participants talked at the same time. The item discussion process, while recorded, was a vociferous and, at times, muddled affair. However, I contend that the nature of this study - and its focus on language - is enhanced, not limited, by the inevitable “negotiation” that accompanied data collection. In matters of language, the collective view, even if contested by some, is more accurate than the unopposed view of one, on average and most of the time (Hambleton, 2005). Selecting what item data to highlight from both the individual rubrics and group analyses was an interpretive process. The researcher always influences data collation by selecting what data to present or not present, by proposing what is representative, informative, relevant or irrelevant, and in general by making claims as to what is worthy of attention. Capturing and faithfully representing the tone, focus, agreement and disagreement in conversations about test items was challenging: when an evaluator highlighted a certain issue that came up during discussion, how does the reader know the extent to which the rest of the group concurred, disagreed, or was divided on that issue? This challenge was present for the interpretation of both the individual analyses and the group discussion. In part, the coding system on the individual rubrics was there to “empiricize” the process to the greatest degree possible; it set boundaries on interpretation and served as the voice of participants to some extent. Nonetheless, the subjective stance of the researcher inevitably came into play in the presentation and interpretation of the raw data. Much DIF research is also complicated by the fact that the subjects themselves (item evaluators) are highly subjective respondents with their own biases and proclivities. While there may be “strength in numbers” when ascribing validity to claims and hypotheses, in DIF studies the majority view can also be the wrong view: determining “what was correct” turned out to be a highly contested endeavor. What for some were highly problematic items, for others, were not.
In the results chapter I tried to faithfully distinguish between opinions that seemed to be representative and those that might be outliers. When generalizing I have tried to note exceptions, limitations, or any problems that might have challenged my interpretations and inferences. Interpretation of data from the group analysis was especially challenging. At that point in the study, I moved from the role of data collector to that of participant as the discussion facilitator. I took on the roles that facilitation typically requires - time keeper, task manager, referee - in addition to observer and recorder of the proceedings. My facilitation of the process inevitably impacted the data collection process, and hence the data itself, through my choices of what merited discussion and how much time to allow per item. Without such facilitation, however, the collection of this kind of data is not possible. In some ways, the “outsider” is perhaps in a good position to serve as a facilitator-participant when the stakes for evaluator participants are connected intimately to language and identity. As an American whose native language is English, not Russian or Kyrgyz, I was able to maintain some distance from sensitive questions in the data collection process. Finally, what was perhaps most attractive to collect and report was data from items that elicited much conversation, items that were heavily critiqued or presented contradictions, or items that seemed to represent a systemic problem, issue, or challenge. Some items elicited few written comments and little discussion. While it was also important to understand why these items did not elicit commentary - or perhaps to understand why they worked - for the most part, the focus centered on what didn’t work.

Conclusion

Of paramount importance in the cross-lingual test adaptation process is the proven ability of test developers to successfully adapt test items across languages in meaningful ways. In situations where sophisticated statistical DIF detection methods are not utilized, the accuracy of item adapters and reviewers in discerning differences between items is especially important. In some ways, the results of this study are ambiguous. Evaluators’ marks were positively correlated with statistical DIF outcomes in terms of which item pairs had differences that made them problematic. At the same time, based on evaluators’ inaccuracy in estimating which group was favored by item differences, it is difficult to discern with certainty just how well they actually understood the differences in item pairs. An interesting finding of the study was the consistency with which all but two of their predictions were for item differences to favor the Russian group. As has been pointed out, the evaluators focused on Kyrgyz language items but generated no hypotheses for causes of DIF that favored the Kyrgyz group. Thus, in a sense, the prediction of DIF direction was not random at all, but could perhaps be more accurately characterized as “one-sided.” I offered three explanations for this result. First, many Kyrgyz language issues are highly contested; therefore, it is natural that much attention would be paid to these items. Second, evaluators had no experience with DIF studies and the complex task of disentangling DIF causality.
Third, social and historical contextual factors likely shaped the evaluators’ dispositions to the extent that it was almost taken for granted that “of course the Russians are favored.” As I argued above, I believe that two of these three issues, experience and dispositions, can be addressed by employing statistical methods in further studies. My recommendation that additional items (beyond the four identified as DIF) be carefully reconsidered is based on the uncertainty of the statistical estimation in the context of large differences in ability distribution, relatively lower Kyrgyz test reliability and the possibility of the nuisance factor of Russian language knowledge on the part of some Kyrgyz language examinees. From the perspective of the test center, three of the four items with practical DIF can be addressed because they appear to be related to overt adaptation issues. However, many problems with the Kyrgyz language items would be out of their control, even if some of these issues caused DIF (though most did not). The privileged status of the Russian language, the lack of a “standardized” Kyrgyz language, and the lack of investment of resources for its study in general (or poor use of those resources) frame the contextual setting for test development and Kyrgyz item quality in the republic. Everyday uses of hybrid Kyrgyz, dialect and regional differences in vocabulary, and knowledge of loan words are all socio-linguistic issues that cannot easily be controlled by political volition. These phenomena are the product of history, culture, and demographics. The unique geography of Kyrgyzstan, with its mountain barriers, isolated communities, and variation in the extent of engagement with other language groups, has resulted in the evolution of many unique language systems which will continue to present challenges for those developing cross-lingual tests, whatever the resources allocated.

76 I am not arguing that Kyrgyz “needs to become standardized,” but rather emphasizing that contested languages pose challenges to standardized testing. Standardization always has winners and losers. For an interesting discussion of the “ideology of standardization” see Milroy (2001).

Nonetheless, in addition to the actionable findings in regard to the three DIF items noted above, there were findings in regard to specific Russian to Kyrgyz adaptation issues that are within the power of the testing center to control. Both the substantive evaluations and the relatively high effect size values on most sentence completion items support the notion that syntax differences between the two languages make sentence completion items somewhat problematic to adapt from Russian into Kyrgyz. At a minimum, the center should evaluate these items more closely, or perhaps reconsider the need to keep this item type on the NST. This finding answers the question raised at the outset about whether or not DIF issues related to the specific languages under study could be identified. It would appear that while generalizing is difficult for many items and item types, for at least one set of items the answer is yes. At the same time, the fact that the reading comprehension items appeared to be the least problematic of the item types supports the idea that there are some “general challenges” to item adaptation, regardless of the languages employed (Angoff & Cook, 1988).
While clearly not infallible, if used properly, statistical methods can highlight inefficiencies, shed light on misconceptions and false beliefs about DIF and item bias, and demonstrate the strengths and weaknesses of a given testing program in regard to their development of instruments and specific item types. Specifically, statistical approaches can be employed to demonstrate that item response is complex and that item flaws will not always favor the Russian group. In general, statistical and substantive analyses are both needed to confirm hypotheses generated about the quality and nuances of cross-lingual test item adaptation (Hambleton, 2005). There is now empirical evidence that DIF studies can be used to identify specific challenges in cross-lingual test item adaptation from Russian into Kyrgyz in the KR. In regard to the quantity of DIF, the results are heartening: The low number of practically significant DIF items indicates that cross-lingual adaptation in Kyrgyzstan is feasible. Data from such studies such as this one can be used to improve the NST. To my knowledge, at the time of this study, there have been no DIF studies conducted on cross-lingual tests in any of the former Soviet Republics. Therefore, the results of this study will be of special interest to researchers not only in the Kyrgyz Republic but in other countries where Russian and Turkic language(s) are the primary languages of instruction and assessment. 183 APPENDICES 184 APPENDIX A: SCHOOLS BY LANGUAGE(S) OF INSTRUCTION IN THE KR Table A-1: Schools by Language(s) of Instruction in the KR Buildings with medium No. Schools % Schools Kyrgyz 1261 66.0 Russian 221 11.6 Uzbek 151 7.9 Kyrgyz/Russian 234 12.2 Kyrgyz/Uzbek 31 1.6 Russian/Uzbek 8 0.4 Kyrgyz/Russian/Uzbek 5 0.3 Total: 1911 100.0 Ministry of Education Data (2003). 185 % Students 54.9 13.1 8.8 19.9 1.8 0.9 0.5 100.0 APPENDIX B: STUDENTS (%) IN MAIN LANGUAGE TRACKS BY OBLAST Table A-2: Students (%) in Main Language Tracks by Oblast Kyrgyz Russian 63.3 22.7 Republic Northern Oblasts Bishkek Chui Talas Issyk-Kul Naryn Southern Oblasts Batken Djalal-Abad Osh Year 2000, Herczynski (2003) Uzbek 13.4 Tajik .30 34.8 39.9 88.2 72.7 88.2 65.2 60.0 11.8 27.3 11.8 .14 - - 74.5 71.4 63.8 7.2 8.4 7.4 15.2 20.2 28.7 3.1 .06 186 APPENDIX C: NST PARTICIPATION RATES IN THE KR Table A-3: NST Participation Rates by Oblast & Language 2009 2010 2009 2010 N N Kyrgyz Kyrgyz Region All Republic 33,579 30,264 63% 60% Bishkek (capital) 6,526 6,427 28% 25% Chui (northern) 4,405 3,848 41% 39% Issyk-Kul (northeastern) 3,881 3,561 69% 66% Naryn (south central) 2,703 2,481 78% 81% Talas (western) 1,724 1,533 80% 80% Djalal-Abad (southern) 4,903 4,203 79% 78% Osh City (southern) 1,398 1,186 49% 45% Osh (southern) 5,011 4,534 84% 82% Batken (southwestern) 3,028 2,491 80% 81% 2009 Annual NST Report, 2010 Annual NST Report, (www.testing.kg) 187 2009 Russian 33% 71% 59% 31% 22% 20% 15% 40% 07% 11% 2010 Russian 36% 75% 61% 34% 19% 20% 17% 43% 08% 11% 2009 Uzbek 04% (n = 4) (n = 1) (n = 2) 00% 00% 06% 10% 09% 09% 2010 Uzbek 03% (n=2) (n=1) 00% 00% 00% 05% 12% 10% 08% APPENDIX D: DEMOGRAPHICS AND TEST SCORES (2010) Table A-4: Demographics and Test Scores % Poor* Avg. 
NST Scores*** % NST Russian*** 12% 113.5 36% Bishkek (capital) 6.0 26% Osh City (southern) n/a 17% Issyk-Kul (northeastern) 30.6 13% Chui (northern) 26.6 11% Naryn (south central) 90.5 11% Talas (western) 67.0 10% Djalal-Abad (southern) 73.0 8% Osh (southern) 65.7 7% Batken (southwestern) 65.7 7% a Herczynski (2003)* Census (2009)** CEATM ( 2010)*** 135.4 120.1 111.1 116.5 104.2 103.9 106.2 100.3 103.5 75% 43% 34% 61% 19% 20% 17% 11% 08% All Republic 56.2 % Higher Ed** 188 APPENDIX E: DEMOGRAPHICS OF SCHOLARSHIP WINNERS Table A-5: NST Winners by Language, Oblast (2010) Kyrgyz NST 2010 Part. Win Avg. % % Region All Republic 60% 66% 125.6 Bishkek (capital) 25% 22% 134.0 Chui (northern) 39% 34% 125.3 Issyk-Kul (northeastern) 66% 60% 127.4 Naryn (south central) 81% 82% 123.4 Talas (western) 80% 76% 124.7 Djalal- Abad (southern) 78% 85% 122.5 Osh City (southern) 45% 52% 126.8 Osh (southern) 82% 93% 125.9 Batken (southwestern) 81% 89% 128.9 a Data constructed from CEATM (2010) Russian Part. Win % % Avg. 36% 75% 61% 34% 19% 20% 17% 43% 08% 11% 153.9 164.7 150.2 148.9 136.4 147.1 142.1 158.1 133.0 146.0 189 33% 78% 66% 40% 18% 24% 12% 45% 03% 05% Part. % Uzbek Win % Avg. 03% (n=2) (n=1) 00% 00% 00% 05% 12% 10% 08% 01% 00% 00% 00% 00% 00% 03% 03% 03% 06% 130.2 132.6 152.8 134.8 117.5 APPENDIX F: SELECTIVITY OF HEIs IN THE KR Table A-6: Average NST Scores of Scholarship Winners Institution Location Kyrgyz-Turkish Manas University Bishkek Kyrgyz-Russian Slavonic University Bishkek Kyrgyz State Medical Academy Bishkek International University Bishkek Kyrgyz Economic University Bishkek Kyrgyz State Technical University Bishkek Kyrgyz National University Bishkek Osh State University Osh Bishkek Humanities University Bishkek Building and Transport University Bishkek Issyk-Kul State University Kara-Kol Osh Technical University Osh Arabeava Pedagogical University Bishkek Institute of Mountain Technology Bishkek Kyrgyz-Uzbek University Osh Kyrgyz Agricultural University Bishkek Djalal-Abad State University Djalal-Abad Batken State University Batken Academy of Internal Affairs Bishkek Talas State University Talas Naryn State University Naryn Military Institute Bishkek All: 2009 Average Scholarships 185.0 105 184.6 150 176.2 199 --172.2 85 158.5 526 143.9 474 143.1 473 141.8 200 138.8 281 131.3 196 130.3 285 129.4 406 --127.8 245 126.6 186 126.4 344 122.6 165 121.7 230 121.2 84 117.0 145 104.1 150 139.6 4,929 190 2010 Average Scholarships 182.1 99 182.2 113 177.7 217 173.6 30 165.7 37 145.6 574 143.2 436 135.8 431 140.4 173 133.5 254 133.6 147 118.5 277 125.0 379 124.8 8 --120.9 255 120.2 343 122.7 160 107.1 230 116.4 60 115.4 150 107.1 99 134.9 4,472 APPENDIX G: COMPLETING THE ITEM ANALYSIS RUBRICS Review Glossary of Key Terms/Training Rubric 1.a: Analyze All Item Pairs Rubric 1.b: Code Difference “Identical” = 0 (STOP) Identical Content Difference Format Difference Cultural/Linguistic Difference 2.1: Level of Difference “Somewhat similar” = 1 “Somewhat different” = 2 “Different” =3 2.2: Nature of Difference Content (translation, adaptation, other) Format (adaptation, presentation, other) Cultural/Linguistic (meaning differences, contextual difference, linguistic, other) 2.3: Description of Differences 2.4: DIF direction (who is advantaged) 2.5: Suggestions for Improving Item 191 APPENDIX H: GLOSSARY OF KEY RUBRIC TERMS (English Version) Differences between two language versions of the test items can potentially invalidate the inferences based on test results. 
It is generally understood that different means not the same. Differences can be caused by variation in wording due to translation or adaptation mistakes, content differences, format or item presentation differences or the way that different cultural groups interpret the test items. In the context of this study, there are four key aspects of difference that merit attention – differences in the meaning of individual words, differences in overall item meaning, differences in relative difficulty and differences in cultural interpretation of the two versions of the item. Ultimately, equivalence of items is achieved when the item in both language versions has the same content, meaning, same relative difficulty level, and can be interpreted similarly in the different linguistic and cultural groups. Relative Difficulty Individual words as well as phrases, concepts and ideas can have similar overall meaning in both versions but still be problematic for one group of examinees. That is because certain topics or concepts can differ in their conceptual difficulty in the two groups. An obvious example is when one language has five synonyms for the same word while the other language has two. In the language that has five words, two of them might be rarely utilized, for example in literary or other scholarly circles. Thus, the commonality of their use may be as important as their actual meaning in terms of how differences in item difficulty manifest themselves in different groups. While the use of such pairs of words may technically be correct, their usage might pose the problem of relative difficulty for one language group. Another example of when linguistic adaptation appears correct but remains problematic is the issue of explicitness of words or ideas. For example, ideas that are conceptually challenging in one language might get adapted to a more literal or explicit meaning in the second language, making them easier to understand for one group. Complex metaphors are sometimes adapted to a more literal meaning in the target language which can lead to the target language examinees having greater changes of success on an item. Terms from the Rubric 1.b (Type of Difference) No Difference The two versions of the item are assessing the same thing in the same way, using equivalent words, ideas, and content, as well as a similar format. Similar cultural meaning and equivalent language is attained. You expect no differences in item performance by the two groups on this item. Content Differences Refers to the basic ideas, concepts, knowledge, skills, language, and words assessed on each item (see prompts on the content rubric). Format Differences 192 Refers to the way content is formatted, spaced, edited, and presented visually. Size of text, length of material, punctuation, capitalization, etc. (see prompts on the format rubric). Cultural/Linguistic Differences Meaning of items to both Russian and Kyrgyz examinees is different, relevance to different schooling contexts and cultures is different, lack of similarity of dispositions of two groups, lack of similarity of norms, psychological construct not present in both groups, lack of equivalence of linguistic expression, lack of similarity of linguistic structure and grammar which makes equivalence challenging, differences in symbolism, metaphorical meaning, level of explicitness different, etc. (see prompts on rubric). Terms from All Rubrics (2.1. 
Level of Differences) Somewhat Similar You note small differences between the two versions of the item but they are not very significant. The kind of “daily” differences you see are those that an examinee might also be quite familiar with and be able to negotiate with little or no difficulty. Somewhat Different These items appear to be different in more obvious and unambiguous ways. However, you are not certain that these differences will impact item response patterns. Different These items clearly have differences in meaning, relative difficulty or cultural interpretation. You are confident that these differences will impact the way students answer these questions. In other words, you are confident that these differences will impact item response patterns. Terms from the Content Rubric (Section 2.2. Nature of Differences) The incorrect translation of individual words, the addition or omission of a word can cause differences in item meaning or content. This problem can sometimes be resolved relatively easily by improvements in translation. The word translation will be used in this study in a narrow sense to refer to direct, one to one correspondence of words and sentences. In many instances, direct correspondence is needed to make words and ideas expressed by test items equivalent. If a single word is mistranslated, overall meaning can change or the item can make no sense at all. In many cases, however, two items translated correctly (word for word) can result in different overall meaning. For example if literal translation was used when the actual properties of the two languages require a more nuanced adaptation to retain similar meaning. So, the lack of direct correspondence of words is not necessarily always problematic. In recognition of the above, test developers often prefer to speak of test or item adaptation rather than translation (Hambleton, 2005). Adaptation acknowledges that direct, literal, translation is often not possible (nor desired) across disparate languages if we seek to maintain the overall similarity in meaning of two test items. A sentence or text can have little direct, literal correspondence to the same material in another language, yet maintain the same overall meaning. For this rubric, the term 193 adaptation is utilized to denote the process of conveying similar overall meaning, regardless of how individual words may or may not correspond. Terms from Cultural/Linguistic Rubric (Section 2.2. Nature of Differences) Meaning Differences Under the meaning differences for the cultural category, I refer not to meaning differences caused by translation mistakes, but meaning differences that might occur even when the translation/adaptation is accurate. In regard to comparison of Russian and Kyrgyz examinees, consider the word “family.” The definition of family is culturally informed and can very different meanings in different cultural groups (wider understanding or more narrow understanding). Other words/concepts such as “independence, freedom, love, values, respect, etc.” are all strongly influenced by cultural norms and values. Context Differences Contextual differences can impact response when examinees have different levels of exposure to ideas, knowledge or situations due to demographic or social differences. In Kyrgyzstan, Russian speaking examinees are (on average) concentrated in urban areas while Kyrgyz speakers (on average) are concentrated in rural areas. 
Urban examinees might have less knowledge about horsemanship or animal husbandry and the vocabulary, knowledge and norms that are common to rural youth. Geographic concepts and terminology about mountains might also advantage those who live in high mountain regions. Or, the opposite, urbanites might be more familiar with issues connected to the ways of life of urban dwellers. Success on an item should also not depend on exposure to similar curriculum and schooling practices that might vary by group. Cultural/Linguistic Differences Cultural understandings may differ between the groups enough to make the intended meaning of some items unclear, irrelevant, or have a different difficulty level for one group. Due to cultural understandings some words, concepts, or ideas might be more familiar to one group than the other, even after controlling for residence. For example, a focus on the cultural heroes, myths, legends might also be problematic across cultures. Like meaning and contextual differences, cultural differences might not be apparent in the quality of translation/adaptation (which may be accurate) but must be considered nonetheless. The most obvious form of “linguistic difference” becomes evident when items are poorly translated or adapted. However, there are also may be inherent differences in the way languages form, express and convey meaning that are irrelevant to the quality of adaptation. For example, an adaptation might be accurate but it might take many more words to express a concept in one language than another. How (if at all) does this impact the difficulty of an item? Some languages might have more nuances of meaning due to having more verb tenses which create meaning not easily captured in another language. The way two languages express or articulate ideas and concepts could make meaning more “difficult to locate” in some languages than others. Some languages might more efficiently convey meaning than others in some situations. Some languages might have many more words for richer variation of nuance of certain concepts. Word order can also be important. Consider the example of the item instructions in Russian and Kyrgyz below. As bi-lingual speakers, consider the times that you consciously or subconsciously 194 prefer to use one of the languages you know more often than the other because the language allows a more precise or efficient expression of your intended meaning. Are there differences in meaning and/or difficulty of the two paragraphs below? Are these differences related to inherent language differences? Is the issue easily resolved? “каждое задание состоит из пяти пар слов. Выделенная жирным шрифтом пара показывает образец отношении и между двумя словами. Определите, какие отношения сушествуют между словами в этой паре, а затем выберите в вариантах ответа пара слов с такими же отношениями. Порядок слов в выбранном вами ответе должен быть таким же как и в образце.” “Ар бир тапшырма беш жуп создон турат. кара тамгалар менен белгиленген жуп соз эки создун ортосундагы мамиленин улгусун корсотуп турат. адегенде бул жуптагы создордун ортосундагы мамелени анектаныз да, андан сонг жооптун варианттарынын ичинен ушундай мамеледе турган жуп созду тандап алыныз.” 195 APPENDIX I: ITEM RUBRICS 1.a & 1.b Directions: Please read and answer both the Kyrgyz and Russian versions of the test item below. In the comments boxes to the right, make a brief note about how well (in your estimation) these two items are assessing the same thing in the same way. 
You may write notes directly on the items. Consider the content, format and cultural/linguistic comparability. In the lower box, please comment on the quality of the translation/adaptation. Make only brief notes as you will return for a more in depth analysis of these items later. Item 1 Kyrgyz Version Notes: Equivalent/ Different ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ Item here… Russian Version Notes: Translation/Adaptation ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ ____________________________ Item here… 196 Item Rubric Summary 1.b Directions: Please review the notes you took while answering the test items in 1.a. and circle the descriptor that best characterizes each pair of items. Please circle differences (I, II, or III) if any level of difference is apparent (small, medium, or large). Item 1: 0. No differences I. Content Differences II. Format Differences III. Cultural/Linguistic Differences Item 2: 0. No differences I. Content Differences II. Format Differences III. Cultural/Linguistic Differences Item 3: 0. No differences I. Content Differences II. Format Differences III. Cultural/Linguistic Differences Item 4: 0. No differences I. Content Differences II. Format Differences III. Cultural/Linguistic Differences Item 5: 0. No differences I. Content Differences II. Format Differences III. Cultural/Linguistic Differences Item 6: 0. No differences I. Content Differences II. Format Differences III. Cultural/Linguistic Differences Etc.: etc… Item 40: 0. No differences I. Content Differences II. Format Differences III. Cultural/Linguistic Differences 197 APPENDIX J: ITEM RUBRIC 2 Directions Fill in item rubric 2 for each item not identified as “identical” on rubric 1.b (above). The purpose of item rubric 2 is to collect data that will facilitate an understanding of the level and nature of difference as well as the cause (source) of difference for each item. Please describe the issue or problem you see with the item in as much detail as possible. You need not comment on each prompt but please do your best to characterize the items in a complete and descriptive way. We will review these items together during our group discussion. The rubric is broken into three color coded categories. The main categories are: Content differences (purple), Format differences (green), and Cultural/Linguistic differences (pink). Match the color of the rubric that best fits the nature of the difference you identified in 1.b and fill it in. Note that these categories are not always mutually exclusive. However, these three categories provide a strong foundation from which to classify core item issues. You can also note other reason for difference if necessary on any of these rubrics. At the top of each rubric, you are provided a series of prompts – or possible explanations for differences. 
These prompts are not meant to be exhaustive but are examples of issues that can help you classify the nature of the differences. In section 2.1, please score the item as “somewhat similar”, “somewhat different” or “different” per the guidance in the glossary of key terms. Then, in 2.2, circle the most likely cause/source of the differences. In section 2.3, describe in as much detail as possible the problem of equivalence. Next, in section 2.4, estimate which group, if any, the item favors. Finally, in section 2.5, provide an improved item if you can, or a solution to the hypothesized problem with the item. If you find it difficult to classify the problem or see problems in more than one area, please describe the nature of the problems on one of the rubrics under section 2.3. 198 Purple Color Rubric 2: Content Item Number: _____ Consider the Equivalence of: Skills or knowledge demanded; vocabulary, ideas, situations, topics; words, expressions, sentences and phrases; word omission or word addition; grammar; the frequency of words, level of nuance, level of explicitness, literal vs. figurative meaning, the use of metaphor, idiom, etc. 2.1. The content of these items is (circle one): 2.2. The difference is best characterized as (circle one): Somewhat Similar (1) Somewhat Different (2) Different (3) (a) Translation (individual word issues) (b) Adaptation (general meaning) (c) Other 2.3. Describe the difference(s) in detail: 2.4. Advantage: If the item content is different, do you think that it favors one of the groups? Which one? (circle one): Russian or Kyrgyz 2.5. Improving Equivalence: Can the equivalence problem(s) with this item be resolved? How? 199 Green Color Rubric 2: Format Item Number _______ Consider the Equivalence of: Overall item presentation, item length, clarity of directions, order of words and ideas, number of words, punctuation, capitalization, typeface, typographical errors, missing letters or words, editing and general formatting, etc. 2.1. The format of these items is: (circle one) 2.2. The problem is best characterized as: (circle one) Somewhat Similar (1) Somewhat Different (2) (b) Presentation (a) Adaptation Different (3) (c) Other 2.3. Describe the difference(s) in detail: 2.4. Advantage: If the item format is different, do you think that it favors one of the groups? Which one? (circle one) Russian or Kyrgyz 2.5. Improving Equivalence Can the problem(s) with this item be resolved? How? 200 Pink Color Rubric 2: Cultural/Linguistic Item Number ______ Consider the equivalence of: Russian and Kyrgyz schooling context, curriculum in relation to items, importance or relevance to both cultures, similarity of dispositions, similarity of norms, psychological construct present in both groups, equivalence of linguistic expression, similarity of linguistic structure and grammar, symbolism, metaphor meaningful in both groups, level of explicitness similar, etc. 2.1. Cultural equivalence between the two items is (circle one): 2.2. The problem is best characterized as: (circle one) Somewhat Similar (1) (a) Meaning differences Somewhat Different (2) (b) Contextual differences Different (3) (c) Linguistic differences (d) Other 2.3. Describe the difference(s) in detail: 2.4. Advantage: If the items are not equivalent for cultural reasons, do you think that it favors one of the groups? Which one? (circle one) Russian or Kyrgyz 2.5. Improving Equivalence Can the problem(s) with this item be resolved? How? 
201 Опросник 2: Содержание Номер задания: _____ Рассмотрите задания на эквивалентность по следующим вопросам: Навыки и знания, словарный запас, идеи, ситуации, предмет, слова, выражения, предложения, сложность фразировки, пропуск слова или добавление слова, грамматика; частота слов, степень нюансов, степень очевидности, буквальное или переносное значение, использования метафор, идиом и т.п. 2.1. Содержание заданий (обведите одно): 2.2. Причина различий (обведите одно): небольшие различия (1) средние различия (2) значительные различия (3) (a) Перевод (суть отдельных слов) (b) Адаптация (общее понятие) (c) Другое 2.3. Подробно опишите различия: 2.4. Преимущества: Если содержание задания отличается, у какой группы больше шансов на правильный ответ: кыргызской или русской? (обведите одно) 2.5. Улучшение эквивалентности: Можно ли решить проблему эквивалентности? Как? 202 Опросник 2: Формат Номер задания: _____ Рассмотрите задания на эквивалентность по следующим вопросам: Представление задания в целом, длина вопроса, четкость инструкций, порядок слов и идей, количество слов, пунктуация, использование заглавных букв, шрифт, редактирование и общее форматирование и т.д. 2.1. Формат заданий (обведите одно): 2.2. Причина различий (обведите одно): небольшие различия (1) средние различия (2) значительные различия (3) b) Вид c) Другое a) Адаптация 2.3. Подробно опишите различия: 2.4. Преимущества: Если формат задания отличается, у какой группы больше шансов на правильный ответ: кыргызской или русской? (обведите одно) 2.5. Улучшение эквивалентности Можно ли решить проблему эквивалентности? Как? 203 Опросник 2. Культурные/лингвистические различия Номер задания: _____ Рассмотрите задания на эквивалентность по следующим вопросам: Кыргызкая и русская образовательная среда, важность и релевантность, сходство нравов, сходство норм, психологическая составная присутствующая в обеих группах, эквивалентность языковых выражений, сходство языковых структур и грамматики, символизм, значение метафор, степень очевидности и т.д. 2.1. Степень различия по культурному признаку (обведите одно): 2.2. Причина различий (обведите одно): небольшие различия (1) (a) Различия в значении средние различия (2) (b) Контекстуальные различия значительные различия (3) (c) Лингвистические различия (d) Другое 2.3. Подробно опишите различия: 2.4. Преимущества Если задания не эквивалентны по культурным признакам, у какой группы больше шансов на правильный ответ: кыргызской или русской? (обведите одно) 2.5. Улучшение эквивалентности Можно ли решить проблему эквивалентности? Как? 
204 APPENDIX K: UNIFORM DIF STATISTICS Table A-7: Uniform DIF Statistics for 38 Verbal Items Model 1 (compact) Model 2 (w/group) χ2 R2 ∆ χ2 Item 9 2 24 7 17 29 39 35 36 27 30 14 31 34 12 40 15 18 10 37 8 4 38 20 5 R2 χ2 R2 Difference (effect size) Group β2 609.384 620.330 426.294 159.771 446.021 91.075 381.248 256.366 340.026 292.44 78.421 44.124 513.364 189.212 377.694 401.746 350.341 554.431 350.534 213.746 298.915 609.253 464.279 262.557 6.929 0.365 0.372 0.256 0.103 0.297 0.061 0.245 0.162 0.228 0.186 0.056 0.030 0.302 0.12 0.238 0.243 0.226 0.324 0.214 0.136 0.192 0.379 0.278 0.203 0.005 609.878 621.063 427.046 161.089 448.098 93.444 385.981 261.162 345.016 297.733 84.629 50.523 521.002 198.916 387.473 412.05 365.231 569.895 366.044 229.341 317.089 628.754 484.489 283.736 29.505 0.365 0.372 0.257 0.103 0.298 0.062 0.248 0.165 0.231 0.189 0.06 0.034 0.306 0.126 0.244 0.249 0.234 0.332 0.223 0.145 0.202 0.390 0.289 0.218 0.020 0.494 0.733 0.752 1.318 2.077 2.369 4.733 4.796 4.99 5.293 6.208 6.399 7.638 9.704 9.779 10.304 14.890 15.464 15.510 15.595 18.174 19.501 20.21 20.749 22.576 0.000 0.000 0.001 0.000 0.001 0.001 0.003 0.003 0.003 0.003 0.004 0.004 0.004 0.006 0.006 0.006 0.008 0.008 0.009 0.009 0.010 0.011 0.011 0.015 0.015 -0.085 0.112 -0.097 0.122 -0.202 0.17 0.278 -0.242 0.298 0.268 0.307 -0.275 -0.308 -0.331 0.385 -0.351 0.451 0.456 -0.428 0.429 0.515 0.574 -0.507 0.741 0.497 205 sig. Odds ratio Exp(β) FAVORS 0.482 0.393 0.385 0.252 0.149 0.125 0.031 0.028 0.026 0.022 0.013 0.011 0.006 0.002 0.002 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.919 1.118 0.908 1.13 0.817 1.185 1.32 0.785 1.347 1.307 1.359 0.759 0.735 0.718 1.469 0.704 1.57 1.578 0.652 1.536 1.673 1.776 0.602 2.098 1.644 Kyrgyz Russian Kyrgyz Kyrgyz Kyrgyz Russian Russian Russian Kyrgyz Russian Kyrgyz Kyrgyz Russian Kyrgyz Kyrgyz Kyrgyz Russian Kyrgyz Kyrgyz Table A-7 (cont’d) 25 22 26 23 21 16 33 11 28 19 32 3 13 35.889 441.505 328.791 530.86 425.032 161.157 188.954 314.134 295.451 558.080 201.991 736.971 264.243 0.026 0.264 0.202 0.311 0.262 0.123 0.122 0.195 0.183 0.333 0.128 0.414 0.166 58.895 465.075 362.884 569.563 467.445 204.570 232.381 363.460 345.596 652.350 298.325 848.057 392.577 0.042 0.277 0.221 0.33 0.286 0.154 0.149 0.223 0.212 0.381 0.185 0.464 0.238 23.006 23.57 34.093 38.703 42.413 43.413 43.427 49.326 50.145 94.270 96.334 111.086 128.334 0.016 0.013 0.019 0.019 0.024 0.031 0.027 0.028 0.029 0.048 0.057 0.05 0.072 0.583 -0.532 -0.634 -0.694 -0.738 0.98 0.76 0.791 0.796 -1.171 1.101 -1.247 -1.218 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.792 0.587 0.531 0.5 0.478 2.663 2.138 2.205 2.127 0.31 3.007 0.287 0.296 Kyrgyz Russian Russian Russian Russian Kyrgyz Kyrgyz Kyrgyz Kyrgyz Russian Kyrgyz Russian Russian Items are arranged by order of chi-squared difference values in ascending order. At the .05 level, the test statistic for 1 degree of freedom is 3.841. When β2>0, uniform DIF favors the reference group (Kyrgyz language). When β2<0, uniform DIF favors the focal group (Russian language). 206 APPENDIX L: ITEMS WITH MODERATE OR LARGE DIF Table A-8: Verbal Items with Moderate or Large DIF χ2 Item Effect Size difference moderate large β2 Odds Ratio Exp (β) sig. 19 94.270 0.048 -1.171 0.000 32 96.334 0.057 1.101 0.000 3 111.086 0.05 -1.247 0.000 13 128.334 0.072 -1.218 0.000 Items are arranged by ascending chi-square difference values. 
207 Favors Item Type 0.31 3.007 0.287 0.296 Russian Kyrgyz Russian Russian Analogy Reading Analogy Analogy APPENDIX M: ITEMS WITH NO DIF Table A-9: Non-Significant Verbal Items χ2 Odds Ratio β2 sig. difference Item Exp (β) 9 0.494 -0.085 0.482 0.919 2 0.733 0.112 0.393 1.118 24 0.752 -0.097 0.385 0.908 7 1.318 0.122 0.252 1.13 17 2.077 -0.202 0.149 0.817 29 2.369 0.17 0.125 1.185 Items are arranged by ascending chi-square difference values. Item Type Analogy Analogy Sentence Completion Analogy Analogy Sentence Completion 208 APPENDIX N: NON-UNIFORM DIF STATISTICS Table A-10: Non-Uniform Verbal DIF Statistics Model 2 (w/language) Model 3 (interaction) χ2 2 χ R χ R Difference R ∆ (effect size) 521.002 232.381 29.505 484.489 412.05 621.063 50.523 465.075 392.577 363.460 317.089 298.325 385.981 652.350 848.057 261.162 198.916 387.473 345.586 362.884 628.754 569.563 229.341 366.044 84.629 0.306 0.149 0.020 0.289 0.249 0.372 0.034 0.277 0.238 0.223 0.202 0.185 0.248 0.381 0.464 0.165 0.126 0.244 0.212 0.221 0.390 0.33 0.145 0.223 0.06 521.004 232.397 29.554 484.626 412.297 621.345 50.913 465.513 393.526 364.417 318.l64 299.544 387.206 653.597 849.495 262.809 200.596 389.866 348.014 365.855 632.316 573.618 234.861 372.257 91.957 0.306 0.149 0.020 0.289 0.249 0.373 0.034 0.277 0.239 0.223 0.203 0.186 0.249 0.381 0.465 0.166 0.127 0.245 0.213 0.223 0.391 0.332 0.149 0.226 0.065 0.002 0.016 0.049 0.137 0.247 0.282 0.390 0.438 0.949 0.957 1.075 1.219 1.225 1.247 1.438 1.647 1.68 2.393 2.428 2.971 3.562 4.055 5.52 6.213 7.329 0.000 0.000 0.000 0.000 0.000 0.001 0.000 0.000 0.001 0.000 0.001 0.001 0.001 0.000 0.001 0.001 0.001 0.001 0.001 0.002 0.001 0.002 0.004 0.003 0.005 2 Item 31 33 5 38 40 2 14 22 13 11 8 32 39 19 3 35 34 12 28 26 4 23 37 10 30 2 2 2 209 β3 sig. odds ratio Exp(β) 0.000 0.000 0.001 -0.003 -0.004 0.005 -0.004 -0.005 -0.007 0.008 0.008 0.008 0.009 -0.009 0.010 0.009 0.009 -0.012 -0.012 -0.012 0.023 -0.015 0.017 -0.018 -0.019 0.969 0.899 0.826 0.711 0.619 0.596 0.532 0.508 0.330 0.330 0.302 0.272 0.271 0.263 0.233 0.201 0.196 0.121 0.118 0.084 0.059 0.044 0.020 0.012 0.007 1 0.999 1.001 0.997 0.996 1.005 0.996 0.995 0.993 1.008 1.008 1.008 1.009 0.991 1.011 1.009 1.009 0.988 0.988 0.988 1.024 0.985 1.017 0.983 0.981 Table A-10 (cont’d) 21 9 24 17 20 18 25 27 29 36 15 7 16 467.445 609.878 427.046 448.098 283.736 569.895 58.895 297.733 93.444 345.016 365.231 161.089 204.570 0.286 0.365 0.257 0.298 0.218 0.332 0.042 0.189 0.062 0.231 0.234 0.103 0.154 474.789 617.288 437.017 459.319 295.348 581.663 71.278 312.692 109.723 362.639 384.341 181.691 229.358 0.29 0.369 0.262 0.305 0.226 0.338 0.051 0.198 0.073 0.242 0.245 0.116 0.172 7.344 7.410 9.971 11.221 11.612 11.768 12.383 14.959 16.279 17.623 19.110 20.602 24.788 Items are arranged in ascending order by chi-squared difference values. At the .05 level, the test statistic for 1 degree of freedom is 3.841. 
Non-uniform DIF = β3≠0, regardless of the value of β2 210 0.004 0.004 0.005 0.007 0.008 0.006 0.009 0.009 0.011 0.011 0.011 0.013 0.018 -0.022 0.028 -0.023 -0.029 0.031 0.031 -0.024 -0.027 -0.026 -0.033 0.040 0.032 0.042 0.007 0.007 0.002 0.001 0.001 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.978 1.029 0.977 0.972 1.032 1.031 0.976 0.973 0.974 0.967 1.041 1.032 1.043 APPENDIX O: ITEM LOCATION ACROSS EFFECT SIZE VALUES Table A-11: Continuum of Effect Size Values by Item Type Non-DIF Items (6 items total) .000 .000 .000 .001 .001 .001 % total* Type AN 9 2 7 17 22% SC 24 29 20% RC 0% * indicates the total % of all this item type found in this classification. I.e. 22% of all analogies fall under the “non-DIF” classification. Negligible DIF Items (Below effect size median, 14 items total) .003 .003 .003 .003 .004 .004 .004 .006 Type AN SC RC 14 27 39 35 8 36 4 20 .008 .009 18 10 34 40 .019 .024 23 .027 21 26 32 .029 % total 17% 0% 10% 211 % total 28% 20% 70% .031 % total 16 28 33 13 .028 11 25 38 3 .009 37 5 22 19 .008 15 12 31 Practical DIF Items (4 items total) .048 .050 .057 .072 Type AN SC RC .006 30 Negligible DIF Items (Above effect size median, 14 items total items) .010 .011 .011 .013 .015 .015 .016 .019 Type AN SC RC .006 33% 60% 20% APPENDIX P: EVALUATOR SCORING MATRIX Table A-12: Evaluator Item Scoring Matrix Evaluator 1 2 3 4 5 Item 2 0 1 0 1 0 3 2 3 0 3 3 4 0 2 0 2 1 5 3 2 0 3 0 7 0 2 0 2 0 8 1 2 0 1 0 9 0 0 0 1 0.57 10 0 0 0 1 2 11 3 0 3 3 1 12 0 0 0 1 0 13 0 0 0 3 0 14 0 0 0 0 0 15 0 3 3 2 3 16 0 0 2 3 0 17 0 0 0 1 0 18 2 0 2 3 0 19 3 1 1.4 2 2 20 0 0 0 0 0 21 0 3 2 3 3 22 1 0 2 0 0 23 0 3 2 1 2 24 0 2 2 0 0 25 0 2 2 2 1.42 26 0 0 0 0 2 27 0 0 0 3 0 28 0 0 0 0 1 29 0 0 0 0 0.28 30 0 0.28 0 0 2 31 1 3 2 1 0 32 0 0 0 0 3 33 3 0 2 3 3 212 6 7 8 Total 0.57 0 0 0 0 1 1 1 0 0 1 0 2 0.85 0 0 0 0 2 2 1.1 0.57 0 2 0 0 0 0 1.28 3 0 1 3 0 1 2 1 1 2 2 2 1 1 1 1 1 1 1 0 2 0 0 0 2 0 0 0 2 0 1 0 2 1 2 1 0 2 0 1 2 2 2 0 1 0 0 1 3 1 0 1 0 0 0 2 1 0 0 0 0 1 0 1.85 4.57 16 6 9 8 6 4.57 8 14 5 5 2 14 6.85 3 11 11.4 0 16 5 9.1 4.57 11.42 5 3 1 2.28 2.28 10.28 6 14.85 Table A-12 (cont’d) 34 35 36 37 38 39 40 0 1 3 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 3 0 1 0 1 2 1 2 0 0.42 213 0 0 0 0 0 0 0 0 1 0 0 2 0 1 0 3 0 0 1 0 0 2 7 5 1 8 0 3.42 APPENDIX Q: INTER-RATER RELIABILITY Table A-13: Reliability Statistics Case Processing Summary N 38 100.0 Excluded 0 .0 Total Cases % 38 100.0 Valid a a. Listwise deletion based on all variables in the procedure. Reliability Statistics Cronbach's Alpha Cronbach's Alpha Based on Standardized Items N of Items .663 .657 8 Item Statistics Mean Std. 
Deviation N V1 .6316 1.07606 38 V2 .8232 1.15466 38 V3 .6684 1.02644 38 V4 1.3158 1.21043 38 V5 .9655 1.10491 38 V6 .5097 .79436 38 V7 .9474 .83658 38 V8 .7855 .92961 38 214 Table A-13 (cont’d) Inter-Item Correlation Matrix V1 V2 V3 V4 V5 V6 V7 V8 V1 1.000 .005 .273 .361 .187 -.251 .218 .262 V2 .005 1.000 .353 .229 .216 .203 .150 -.014 V3 .273 .353 1.000 .356 .217 .287 .137 .146 V4 .361 .229 .356 1.000 .173 -.112 .551 .391 V5 .187 .216 .217 .173 1.000 .341 .282 .135 V6 -.251 .203 .287 -.112 .341 1.000 -.067 -.233 V7 .218 .150 .137 .551 .282 -.067 1.000 .605 V8 .262 -.014 .146 .391 .135 -.233 .605 1.000 Intraclass Correlation Coefficient 95% Confidence Interval F Test with True Value 0 Intraclass Correlationa Lower Bound Upper Bound Value df1 df2 Sig Single Measures .198b .101 .338 2.971 37 259 .000 Average Measures .663 .473 .804 2.971 37 259 .000 Two-way random effects model where both people effects and measures effects are random. a. Type C intraclass correlation coefficients using a consistency definition-the between-measure variance is excluded from the denominator variance. b. The estimator is the same, whether the interaction effect is present or not. 215 APPENDIX R: RAW DATA FOR RANK ORDER ESTIMATION Table A-14: Chi-Square Values & Evaluators’ Scores Item Chi-Square Difference* Evaluators’ Score 9 0.494 4.57 2 0.733 4.57 24 0.752 4.57 7 1.318 8 17 2.077 3 29 2.369 2.28 39 4.733 0 35 4.796 7 36 4.99 5 27 5.293 3 30 6.208 2.28 14 6.399 2 31 7.638 10.28 34 9.704 2 12 9.779 5 40 10.304 3.42 15 14.890 14 18 15.464 11 10 15.510 8 37 15.595 1 8 18.174 6 4 19.501 6 38 20.21 8 20 20.749 0 5 22.576 9 25 23.006 11.42 22 23.57 5 26 34.093 5 23 38.703 9.1 21 42.413 16 16 43.413 6.85 33 43.427 14.85 11 49.326 14 28 50.145 1 19 94.270 11.4 32 96.334 6 3 111.086 16 13 128.334 5 *Data presented by ascending chi-square difference values. 216 APPENDIX S: RANK ORDER CORRELATION Table A-15: Rank Order Correlation Results Correlations eval Spearman's rho eval Correlation Coefficient chi 1.000 .451** . .004 38 38 .451** 1.000 .004 . 38 38 Sig. (2-tailed) N chi Correlation Coefficient Sig. (2-tailed) N **Correlation is significant at the 0.01 level (2-tailed). 217 APPENDIX T: EVALUATOR MARKS AND DIF STATISTICS Table A-16: Evaluator Marks and DIF Statistics χ2 Item Marks Difference Effect Size β2 sig. 
Exp(β) 9 2 24 7 17 29 39 35 36 27 30 14 31 34 12 40 15 18 10 37 8 4 38 20 5 25 22 26 23 21 16 33 11 28 19 32 3 13 0 0 xx xxxx 0 x 0 x xx x x 0 xx x xx 0 xxxxx xxxx xxx 0 x xx xxx 0 xxx xxxxx xx xx xxx xxxxxx xx xxxxx xxxxx 0 xxx xx xxxxxx x 0.494 0.733 0.752 1.318 2.077 2.369 4.733 4.796 4.99 5.293 6.208 6.399 7.638 9.704 9.779 10.304 14.890 15.464 15.510 15.595 18.174 19.501 20.21 20.749 22.576 23.006 23.57 34.093 38.703 42.413 43.413 43.427 49.326 50.145 94.270 96.334 111.086 128.334 0.000 0.000 0.001 0.000 0.001 0.001 0.003 0.003 0.003 0.003 0.004 0.004 0.004 0.006 0.006 0.006 0.008 0.008 0.009 0.009 0.010 0.011 0.011 0.015 0.015 0.016 0.013 0.019 0.019 0.024 0.031 0.027 0.028 0.029 0.048 0.057 0.05 0.072 -0.085 0.112 -0.097 0.122 -0.202 0.17 0.278 -0.242 0.298 0.268 0.307 -0.275 -0.308 -0.331 0.385 -0.351 0.451 0.456 -0.428 0.429 0.515 0.574 -0.507 0.741 0.497 0.583 -0.532 -0.634 -0.694 -0.738 0.98 0.76 0.791 0.796 -1.171 1.101 -1.247 -1.218 0.482 0.393 0.385 0.252 0.149 0.125 0.031 0.028 0.026 0.022 0.013 0.011 0.006 0.002 0.002 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.919 1.118 0.908 1.13 0.817 1.185 1.32 0.785 1.347 1.307 1.359 0.759 0.735 0.718 1.469 0.704 1.57 1.578 0.652 1.536 1.673 1.776 0.602 2.098 1.644 1.792 0.587 0.531 0.5 0.478 2.663 2.138 2.205 2.127 0.31 3.007 0.287 0.296 218 APPENDIX U: NUMBER, NATURE OF DIFFERENCES BY ITEM Table A-17: Number and Nature of Differences by Individual Item # Nature of Difference Item Distinct Issues Analogies 1. ADAPTATION (1 word, multiple meanings) 2 2 2. ADAPTATION (loan word used) 1. SOCIO-DEMOGRAPHIC (city kids will not know a word) 3 3 2. FORMAT (misprint resulted in unknown word in answer key) 3. ADAPTATION (multiple meanings) 4 2 1. ADAPTATION (needed direct translation, not adaptation) 2. TRANSLATION (needed literary word) 1. GRAMMAR (incorrect word combination) 5 3 2. TRANSLATION (single word translated incorrectly) 3. ADAPTATION (Different words, same relationship) 7 1 1. SOCIO-DEMOGRAPHIC (rural examinees lack knowledge) 8 1 1. TRANSLATION (incorrect translation of a single word) 9 2 1. ADAPTATION: (incorrect word combination) 2. TRANSLATION (incorrect translation which makes the pairs different) 10 2 1. TRANSLATION (incorrect direct translation of one word) 2. TRANSLATION: (incorrect translation of one word) 11 3 12 1 13 3 1. TRANSLATION (mistake in the translation produces “opposite” of what was intended) 2. TRANSLATION (incorrect direct translation in distractor) 3. TRANSLATION (incorrect translation) 1. GRAMMAR (word can only be used in combination with other words) 1. ADAPTATION (answer key has multiple meanings) 2. TRANSLATION (direct translation is incorrect) 3. TRANSLATION (incorrect translation) 219 # Marking DIF Effect Size 0 .000 6 .05 2 .011 3 .015 4 1 .000 .010 0 .000 3 .009 5 .028 2 .006 1 .072 Table A-17 (cont’d) 14 15 1 2 16 17 1 2 18 19 3 4 20 0 Sentence Completion 21 5 22 23 24 4 3 3 1. ADAPTATION (style: singular vs. plural use) 1. ADAPTATION (commonality of word used) 2. ADAPTATION (artificially created word) 1. TRANSLATION Incorrect translation of a single word) 1. ADAPTATION (two word combination makes the answer obvious) 2. SOCIO – DEMOGRAPHIC (unknown word for some regions) 1. TRANSLATION (incorrect translation) 2. ADAPTATION (a word is used in simple speech, not literary) 3. TRANSLATION (incorrect translation) 1. FORMAT (missing letter) 2. 
ADAPTATION (incorrect, makes the stem have the opposite of intended meaning) 3. TRANSLATION (incorrect literal translation) 4. TRANSLATION (incorrect nuance in meaning) NO ISSUES 1. TRANSLATION (too literal from Russian) 2. TRANSLATION (single word) 3. TRANSLATION (single word) 4. TRANSLATION (single word) 5. FORMAT (spacing differences) 1. SOCIO-DEMOGRAPHIC (regional differences in vocabulary) 2. ADAPTATION (stylistic differences in distractors (negatives) 3. ADAPTATION (stylistic differences in distractors (antonyms) make sentences longer/hard to solve in Kyrgyz) 4. ADAPTION (equivalence of two pairs of words not good). 1. GRAMMAR (Kyrgyz sentence difficult to understand) 2. ADAPTATION (lack of equivalence in word concept) 3. ADAPTATION (one word has no meaning in Kyrgyz) 1. ADAPTATION (sentences too long in Kyrgyz) 2. TRANSLATION (do not need to adapt, use loan words) 3. TRANSLATION (incorrect single word) 220 0 5 .004 .008 2 .031 0 .001 4 .008 3 .048 0 .015 6 .024 2 .013 3 .019 2 .001 Table A-17 (cont’d) 25 26 1 4 1. TRANSLATION (use of dialect) 1. TRANSATION (incorrect single word) 2. TRANSLATION (text size too big/ causes loss of meaning) 3. GRAMMAR (mistake in distractor a) 4. GRAMMAR (mistake in distractor b) 27 2 1. FORMAT (typo in one word) 2. GRAMMAR (mistake) 28 3 1. GRAMMAR (incorrect connector used) 2. GRAMMAR (incorrect ending) 3. GRAMMAR (incorrect form of word) 29 3 1. GRAMMAR (incorrect ending) 2. GRAMMAR (incorrect form of word) 3. TRANSLATION (poor word choice) 30 3 1. GRAMMAR (incorrect word choice) 2. GRAMMAR (incorrect word choice) 3. GRAMMAR (spelling mistake) Reading Comprehension 1. GRAMMAR (incorrect word combination) 31 3 2. FORMAT (distractor order different) 3. GRAMMAR (different ending needed) 32 2 1. FORMAT (typo error) 2. ADAPTATION (content of questions different) 33 2 1. TRANSLATION (incorrect direct translation) 2. ADAPTATION (question form incorrect in Kyrgyz) 34 1 1. TRANSLATION (direct translation incorrect) 35 2 1. FORMAT (distractors in different places in two versions) 2. TRANSLATION (single word incorrect) 36 1 1. ADAPTATION (item is long and complex in Kyrgyz) 37 1 1. GRAMMAR (possessive ending incorrect in Kyrgyz) 38 2 1. ADAPTATION (incorrect form of sentence) 2. FORMAT (typo, missing letter) 39 0 NO COMMENTS 40 1 1. FORMAT (distractors in difference places) 221 5 .016 2 .019 1 .003 0 .029 1 .001 1 .004 2 .004 2 .057 5 .027 1 1 .006 .003 2 0 3 .003 .009 .011 0 0 .003 .006 APPENDIX V: KYRGYZ ONLY DIF ANALYSIS Table A-18: DIF Statistics for Kyrgyz Rural and Urban Students Model 2 Model 1 χ2 Model 2 Model 1 χ χ Difference R R2 R2 ∆ Sig. 
126.02 125.02 101.65 130.11 41.56 87.81 62.10 99.07 161.75 176.00 12.67 8.76 144.21 34.04 43.62 .97 65.71 94.43 106.48 115.39 130.29 53.84 349.16 30.42 168.50 304.94 124.80 124.64 101.43 130.05 40.49 87.46 60.31 98.97 151.74 175.41 6.95 7.93 143.95 32.76 42.76 .49 65.08 94.35 106.43 102.704 130.21 53.82 344.13 30.41 167.03 304.32 1.22 0.38 0.22 0.06 1.07 0.35 1.79 0.1 0.01 0.59 5.72 0.83 0.26 1.28 0.86 0.48 0.63 0.08 0.05 12.686 0.08 0.02 5.03 0.01 1.47 0.62 .105 .117 .092 .107 .042 .077 .054 .084 .132 .146 .012 .008 .118 .030 .039 .001 .057 .084 .091 .095 .151 .058 .268 .036 .170 .243 .104 .117 .092 .107 .041 .076 .052 .084 .132 .146 .007 .007 .118 .029 .038 .000 .056 .084 .091 .085 .151 .058 .265 .036 .168 .243 0.001 0.000 0.000 0.000 0.001 0.001 0.002 0.000 0.000 0.000 0.005 0.001 0.000 0.001 0.001 0.001 0.001 0.000 0.000 0.010 0.000 0.000 0.003 0.000 0.002 0.000 .270 .539 .644 .788 .300 .552 .181 .749 .911 .441 .017 .364 .609 .258 .353 .490 .427 .780 .823 .000 .768 .886 .025 .975 .226 .432 2 Reading Comp 40 Reading Comp 39 Reading Comp 38 Reading Comp 37 Reading Comp 36 Reading Comp 35 Reading Comp 34 Reading Comp 33 Reading Comp 32 Reading Comp 31 Sentence Comp 30 Sentence Comp 29 Sentence Comp 28 Sentence Comp 27 Sentence Comp 26 Sentence Comp 25 Sentence Comp 24 Sentence Comp 23 Sentence Comp 22 Sentence Comp 21 Analogy 20 Analogy 19 Analogy 18 Analogy 17 Analogy 16 Analogy 15 2 222 2 Table A-18 (cont’d) Analogy 14 Analogy 13 Analogy 12 Analogy 11 Analogy 10 Analogy 9 Analogy 8 Analogy 7 Analogy 5 Analogy 4 Analogy 3 Analogy 2 3.48 30.94 77.24 207.56 25.38 365.94 130.80 144.50 15.69 384.10 201.13 173.30 1.91 304.32 77.15 206.79 23.22 365.65 130.46 144.05 11.77 383.06 196.15 170.01 1.57 0.27 0.09 0.77 2.16 0.29 0.34 0.45 3.92 1.04 4.98 3.29 223 .003 .026 .072 .167 .022 .280 .115 .118 .013 .301 .190 .167 .002 .026 .072 .166 .020 .280 .115 .118 .010 .301 .186 .164 0.001 0.000 0.000 0.001 0.002 0.000 0.000 0.000 0.003 0.000 0.004 0.003 .215 .601 .774 .383 .142 .591 .562 .500 .048 .855 .026 .070 APPENDIX W: SUMMARY ITEM ANALYSIS RUBRICS Evaluator Rubric (coded summary data) Item 2 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 0 2.3. Describe Differences in Detail: Content: • In the second word of the analogy pair in the item stem, there are some differences in meaning between the two language groups. In the Kyrgyz stem, the word “шорпо” (broth) suggests “first course,” that is “something liquid.” In the Russian stem, some people might not understand the corresponding “борщ” (borscht) like “шорпо,” as it is the name of a soup. • Poor translation of item stem: soup is not a direct equivalent to “борщ”r. (borscht) - soup is however, equivalent to “шорпо”k. (broth). • In distractor (г), there is a Kyrgyz word for “деталь”r. (detail) used in the Russian version. The word is “тетик” in Kyrgyz. Culture/Language: • In Kyrgyz, “шорпо” (broth) implies “first course” – soup. In Russian, “борщ” (borscht) is the name of a kind of soup. The item stems are thus not perfectly matched. • A literal translation of “шорпо”k. (broth) will be “soup.” • The translation of “шорпо”k. (broth) will be “soup.” (this is a difference in meaning) • The word Russian word “деталь” (detail) perhaps won’t be understood by village kids as this is a Russian loan word. Should have used the Kyrgyz word “тетик” (detail). • The word “деталь”r. 
(detail) – city kids (those who know Russian) will know this, but village kids may not, which will create difficulties in understanding. 224 • • The equivalent word for “деталь”r. (detail) in the Kyrgyz language is “тетик.” The word “деталь”r. (detail) = тетик; (these are linguistic differences) 2.4. Advantage: Russian: Kyrgyz: 2.5. Can these items be reconciled? Discussion: MD: I think we agree that the words utilized in the analogy stem are not strictly equivalent; however, there is disagreement as to whether or not this lack of equivalence should be considered a serious enough difference to estimate a lack of equivalence in outcomes. CJ: the problem here is the incorrect translation (not adaptation) of the item stem from Russian into Kyrgyz. KK: Yes, they are different, but I don’t think the differences affect the relationship of the words in the analogy pair. ZS: also, in regard to item stem (г) it is important to utilize commonly used words, as some terms in this item are rarely used or completely unknown. NO: Yes, I agree, the use of uncommon words and terms is problematic. So, the problem is translation, the use of uncommon words, sometimes due to the poorness of the language itself. Some kids in rural areas do not know some of these equivalents, like “деталь” (detail); And, there is a Kyrgyz equivalent for it. It is “тетик,” and it should be used. 225 Evaluator Rubric (coded summary data) Item 3 2.1. Difference Levels Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 2.3. Describe Differences in Detail: Content: • City kids do not encounter “күл”k. (ash) in distractor (г); they live in apartments and don’t know what “күл”k. (ash) means because they have not encountered this (so, this is a lack of vocabulary, nuance). • There is a problem in distractor (A). In the Kyrgyz version, “бак” (tree) can mean both “дерево”r. (tree) and “сад”r. (orchard). In the Russian distractor - “сад” (orchard) is utilized. Format: • Misprint in Kyrgyz distractor (B), which is the answer key; wrote “Чоно” (no meaning) – should be “Чопо” (clay) • Orthographical mistake in distractor (B) – student can’t understand the word “Чоно” (no meaning) - and the result is that they can’t find the correct answer. • Instead of “Чопо” (clay), the word “Чоно” (no meaning) is written, a misprint which results in a loss of meaning. • Misprint with one word in (B) – the word “Чоно” (no meaning) should be “Чопо” (clay). • The word “Чоно” (no meaning) should be “Чопо” (clay). • Misprint – instead of the letter “п” they printed the letter “н” in distractor (B). • Incorrect letter in word. The word “Чоно” (no meaning), in the pair where Kyrgyz is “Чоно” and Russian is “глина” (clay) should be “Чопо” (clay). Culture/Language: • In distractor (a) in the Kyrgyz pair “бак: алма” (tree: apple) - “бак” (tree) can mean both “дерево” (tree) and “сад” (orchard) in Russian. However, the corresponding Russian pair is “сад: яблоня” (orchard: apple trees). In the Russian language the Kyrgyz 226 • • • “алма” k. (apple) means “яблоко” r. (apple) and “яблоня” (apple trees) is “алма бак” in Kyrgyz. The problem is incorrect translation - “бак” k. (tree) is both “дерево”r. (tree) and “сад” r. (orchard). In Kyrgyz, apple trees is “алма бак”k. which is “яблоня” r. (apple trees) in Russian. The Kyrgyz “алма” k. (apple) is “яблоко” (apple) in Russian. The word “бак”k. (tree) and “алма” k. (apple) in comparison to Russian “сад” (fruit orchard) and “яблоня” (apple trees) have many meanings. 
The word “алма” k. (apple) is not correctly translated. The correct variant is ““яблоко” (apple).” (difference in meaning) 2.4. Advantage: Russian: Kyrgyz: 2.5. Can the items be reconciled? • Yes, with the correct letter added in distractor (B). • The translation needs to be tested. You can’t rely on only one person for translation. • Improve translation in distractor (A) by using “бакча”k. (garden) Discussion: MK: There are many problems with this item, especially with the item distractors. The first problem I see is confusion in distractor (A) because of the translation of “сад: яблоня”r. (orchard, apple trees) into Kyrgyz is incorrect. The given Kyrgyz version – “бак: алма” (tree, apple). NO: Yes, but in Kyrgyz “бак” can mean trees or orchard. MK: OK, but we must consider that the Russian variant “сад” (orchard) is only fruit garden, not trees - that is the problem. A better analogy might thus be “tree: apple” – not “orchard: apple.” In other words, “from what/where” (material) comes. MD: I agree, “бак”k. (tree) is “сад”r. (orchard) and “дерево”r. (tree). The word “бакча” k. is “огород” (vegetable garden). I think a problem arises in analogies when the Kyrgyz words have many different meanings, and these same words in Russian have only one meaning. I do not know how much this affects overall results but this is true. Again, the problem is the use of multiple meaning and uncommon words in the Kyrgyz language when in the Russian language they have only one meaning. This is a problem of item adaptation. RM: Another problem is distractor (B). There is a typographical error in this distractor that might cause the question not to work. ZS: Yes, the problem is the format (it could have been done correctly, but it wasn’t). The results might be influenced by the fact that kids can not determine the meaning of the word “Чоно” because there is no such word in Kyrgyz! NO: Yes, item distractor (B) Чоно is the 227 problem– this question will definitely not work because there is no correct answer; and, there is no way to find the correct answer. AA: I agree, further, many kids in Bishkek do not know the meaning of the word “Чопо” (clay) as this word is rarely used and therefore can lead to problems. So, they couldn’t have guessed that there was a misprint in this word. MD: In regard to city- village kids, we can probably divide kids in into three socio-linguistic groups – Kyrgyz who study in Kyrgyz schools in villages (and don’t know Russian), Kyrgyz who study in Russian schools (and speak primarily Russian), and Kyrgyz who study in Kyrgyz schools but communicate often in the Russian language (kids from Bishkek). AA: That’s true in general, there are different cultural groups who took the test, but I don’t see how that effects this item because all the kids tested here took the test only in Kyrgyz, which doesn’t impact the result. We can’t compare how different Kyrgyz groups will react… but it is clear that the incorrect word use is a problem. Thus, I think the problem is the typographical mistake (format). 228 Evaluator Rubric (coded summary data) Item 4 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 2.3. Describe Differences in Detail: Format: • Incorrect adaptation of the Russian word “шарф” (scarf) to the Kyrgyz “моюн жоолук” (lit. neck wrap)- should have used original Russian loan word, not a translation. Culture/Language: • The translation of “шарф”r. 
(scarf) into Kyrgyz is problematic – do not use the direct translation. • In distractor (г) should have left “шарф”r. (scarf) in both the Russian and Kyrgyz versions. • The literal translation of the word “шарф”r. (scarf) – “моюн жоолук” (lit. neck wrap) is not a widely used word and can impact understanding of the main idea. • In answer (г), the word “шарф”r. (scarf) is translated literally as ““моюн жоолук”k. (lit. neck wrap).” It is necessary to use a word more appropriate to the original meaning. • Incorrectly adapted “шарф”r. (scarf) to “моюн жоолук”k. (lit. neck wrap)” • In distractor (B) the word “жaбыштыруу”k. (to glue) is more “literary” than “чаптоо”k. (to glue) and is a better fit for this situation. 2.4. Advantage: Russian: Kyrgyz: 229 2.5 Can these items be reconciled? • Yes, leave the original Russian word in the Kyrgyz item as well – “шарф” (scarf) • It is not necessary to translate some words literally because kids can translate in their own way. • Use the original Russian. • Use a more literary term in distractor (B). Discussion: ZS: I think foreign words (cognates) should stay in their original form. AA: I agree, if there are no commonly used equivalents for foreign words, use the commonly used version. MK: It is best to use active, commonly used words. MD: the problem here seems to be a too literal translation; sometimes there is no reason to translate. I recommend leaving the original if the foreign words are used widely. If not, then it should be translated. NO: Actually I think there is a Kyrgyz equivalent to “шарф” (scarf) but it is not used very often. MD: Well… how can we say what “often” is – how do we know this? 230 Evaluator Rubric (fully coded data) Item 5 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 2.3. Describe Differences in Detail: Content: • The pair of words, “палец: отпечаток” (finger: fingerprint) in the Russian version (distractor A) is not the same pair of words in the Kyrgyz version, perhaps they relied on a full adaptation, not translation? Culture/Language: • Incorrect translation of single words in distractor (A), but the adaptation seems correct because the relationship of the words seems to be the same. • There is a difference with the translation in distractor (A). “Бут”k. (leg) means “нога”r. (leg) in Russian but “палец” r. (finger) is used in the Russian version. A direct translation of finger into Kyrgyz would be “манжа”k. (finger). However, this should not impact the test results if the Russian and Kyrgyz versions of the test will be used separate from each other. • In answer (A) the pair of words are “бут: из”k. (leg: track, footprint) while the Russian version is “палец: отпечаток” (finger: fingerprint): The translation contains the same relationship but through completely different words. • The word “бут”k. is incorrectly translated. The correct translation is “бармак”k. (finger), (A) difference in meaning. • In distractor (б), “нерсе”k. (subject, thing) should be used in conjunction with other words like “бир нерсе.” • For answer (б) the word “нерсе”k. is usually used in combination with other words. • The word “нерсе” can be understood as “что-то”r. (something) or “что-либо”r. (anything). • In distractor (г), the Kyrgyz “сургуч” has two meanings in Russian; it can mean both “тряпка” (cloth) and “щётка” (brush); “пыль” (dust) usually is wiped with “тряпка”r. (cloth) in Kyrgyz it is adapted well. • Also, “сургуч”k. has two Russian equivalents – “тряпка”r. 
(cloth) and “тёрка”r. (grater) 231 • • • The word “сургуч”k. means “тряпка” (cloth) but in Russian version the word “щётка” (brush) is used. In Kyrgyz conversation, the Russian word “щётка”r. (brush) is often used. The word “щётка”r. (brush) is translated as “сургуч”k. which may or may not be correct. 2.4. Advantage: Russian: Kyrgyz: 2.5. Can these items be reconciled? • Be more attentive to the translation. • Use different words. Discussion: MK, ZS: Word choice is a bit incorrect. It would have been better to translate according to the context. The problem (not significant) is with the translation. MD: It seems that this item should not be difficult to solve however because the differences in translation do not seem to always impact the relationships in the pairs of words. That is, if the core relationships in an analogy item are maintained, there might not be any differences in response patterns. 232 Evaluator Rubric (fully coded data) Item 7 1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 3. Describe Differences in Detail: Content: • In rural schools where testing is in Kyrgyz, they might not be familiar with the terms “Терапевт” r. (therapist) in the item stem and “слесарь" r. (metalworker) in distractor (г). • Kids from rural schools will not know the word “Терапевт” r. (therapist). • “Адвокат”r. (advocate) might be an unknown loan word. Also, “слесарь" r - in rural areas Kyrgyz use the word “темир уста” k., which literally translates as ““мастер по железо”r. (master of iron), which is not the same as “Кузнец” r. (blacksmith). • The word “слесарь" r. (metalworker) could have been changed to a different word, not a loan word, a word familiar to Kyrgyz rural kids. This is a contextual difference. • In this item, Russian loan words are used for professional specialties like “Терапевт” r. (therapist) and “Адвокат” r. (advocate). • For some village kids these words won’t be understood. 4. Advantage: Russian: Kyrgyz: 233 5. Can these items be reconciled? • Use words that are used and understand by representatives of both groups. • It is possible to have used a different medical term. • Maybe use other words, also understood by the masses. • The word “слесарь" r. (metalworker) can be translated as “темирчи”k. Discussion: All: There may be problems of contextual differences and word use. The stem and distractors contain many Russian loan words (five words total are the same in both items) that some Kyrgyz students may not understand. 234 Evaluator Rubric (fully coded data) Item 8 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 2.3. Describe Differences in Detail: Content: • In distractor (A) the Kyrgyz pair is “чака: куу” which means “Ведро: Лить” (pail: pour) in Russian; however, in the Russian version it is given “Лейка: Лить” r. (watering can: pour). • A “чака” in Kyrgyz is “Ведро”r. (pail) which is not the same thing as “Лейка”r. (watering can) in the Russian version. • Incorrect translation of the word “чака”k. (pail) from “Лейка”r. (watering can). It would be better to translate “Лейка”r. as “суу күйгүз.” But this shouldn’t impact the correct answer if the Russian and Kyrgyz versions will be used separately. • “Лейка”r. (watering can) is translated incorrectly as “чака”k. (pail) Culture/Language: • In distractor (A) the word “чака”k. is not “Лейка” but “Ведро” (pail). 
Need to use the word “суу күйгүз”if the Russian version is to remain unchanged. • The word “Лейка” can be translated as “суу күйгүз.” 2.4. Advantage: Russian: Kyrgyz: No Advantage: 2.5. Can these items be reconciled? • Adapted normal in both languages. • Need to use a different pair of words in distractor (A). Discussion (None) 235 Evaluator Rubric (fully coded data) Item 9 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 0 2.3. Describe Differences in Detail: Content: • In distractor (б), for the Kyrgyz version, instead of the word “кыйын” (difficult) it would be better to use the synonym “татаал”k. (difficult), because the combination “түшүнүү” with “татаал” is better; for example “түшүнүүгү татаал”k. (it is difficult to understand). The synonyms “кыйын – оор”k. (difficult) are more common (but not literary), in my opinion. • One of the distractors is translated incorrectly. In (B) the Kyrgyz version is “суу: кургатуу”k. (water: dry) but the corresponding Russian version is “mокрый: cушить”r. (wet: to dry). If these words were used in context (in a sentence) then it would be OK. For example – “I was in the rain and got wet.” However, when no context is given, this is a problem and literal translation is necessary. Culture/Language: • “суу” k. (water) is “Вода” r. (water) – not “mокрый”r. (wet). If it is to be translated there needs to be a correction, perhaps “суу болуп калды”k. (it gets wet) or “суу болуу”k. (to become wet). However, kids might understand from the context. • Answer (B) “суу-кургатуу”k. (water: dry) while in the Russian variant it is “mокрый: cушить” r. (wet: to dry), i.e. there is a different of the meaning of the words. 2.4. Advantage: Russian: Kyrgyz: 2.5. Can these items be reconciled? 236 • • • Consider the specifics of the language. It is necessary to use to the word “суу”k. (water) in a pair with a different word. Sometimes it is necessary to use a literal translation, especially with analogy items. Discussion: 237 Evaluator Rubric (fully coded data) Item 10 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 2.3. Describe Differences in Detail: Content: • The word “Ярче” r. (brighter) in the Russian version, distractor (B) was translated incorrectly into Kyrgyz as “карарaak” (darker) – which in principal changes the relations between the words. • In distractor (B) there is an incorrect translation of the word “Ярче” r. (brighter) into “карарaak” k. (blacker, darker). “Ярче” r. (brighter) is actually “ачыгыраак” in Kyrgyz. Culture/Language: • In distractor (B), “карарaaк”k. is actually “чернее”r. (darker), not “Ярче”r. (brighter) which is given. In order to make the pair equivalent you need the Kyrgyz word “ачыгыраак”k. (brighter). • In Kyrgyz, in the pair of words there is a contradiction: “даана: карарaaк” (clear: darker) while the Russian version is “чёткий: Ярче” (clear: brighter). • The Russian pair in distractor (B) чёткий: Ярче” (clear: brighter), can be translated like “тaк, ачык” in Kyrgyz. • The Kyrgyz pair (B) “даана: карарaaк”k. is not the same as the Russian pair - “чёткий: Ярче”r.: “Ярче” (brighter) in Kyrgyz would be “ачык.” • (г) incorrect translation: “ылдам”k. is not “быстрый”r.; “ылдамду = быстрый” – but they used the first version. 238 2.4. Advantage: Russian: Kyrgyz: 2.5. Can these items be reconciled? • Check the translation. • “карарaak”k. should have been “ачыгыраак”k. 
• Possibly change the distractor: However, I am not sure because maybe it was done to maintain the differences between the correct answer and the distractors. Discussion: All: The problem is poor translation and problematic adaptation. 239 Evaluator Rubric (fully coded data) Item 11 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 2.3. Describe Differences in Detail: Culture/Language: • In distractor (б), the pair of Russian words “Включить: выключить” (turn on: turn off) are translated incorrectly into Kyrgyz. The given Kyrgyz pair “үзүү: кошуу,” (break: connection) are equivalent to a different Russian pair, “отрывание: соединение” r. (break, tear off: connection) • Incorrect translation of the Russian word “выключить” r. (turn off) into Kyrgyz in distractor (б). The correct translation is “өчүрүү” k. (turn off) • Incorrect translation of the word “выключить” r. (turn off)” – the correct version would be “өчүрүү” k. (turn off)” • “үзүү” k. (break) is “оборвать” r. (break), not “выключить” r. (turn off). The word needed is “өчүрүү” k. (turn off). “үзүү” k. (break eng) is used when we speak about breaking a thread or a rope. This mistake makes the item difficult for Kyrgyz examinees. Also, in the item stem, the word “күйгүзүү”k. (to light a lamp, or to burn your hand) needs to be replaced with “жак”k. (to light) or “жандуруу”k. (to light a lamp) • (б) “үзүү”k. (break) = “оторвать”r. (break) - “кошуу” k. (add) = “дабавлять” r. (add) • (б) “үзүү”k. is equivalent to “оторвать, рвать”(break); “кошуу”k. is equivalent to “дабавлять”r. (add) • “үзүү”k. means “оторвать, рвать” – выключить = өчүрүү. The incorrect word is used, usually “өчүрүү”k. is used. • In item distractor (г), “угуу: айтуу“k. (listen: speak), the translation from Russian is incorrect. In Russian the pair is “ответить: спросить” (answer: ask) • The distractor (г) also has translation problems “угуу: айтуу“k. (listen: speak) is not what the Russian pair is. “спросить”r. = “суроо”k., “ответить”r. = “жооп берүү”k. 240 2.4. Advantage: Russian: Kyrgyz: 2.5. Can these items be reconciled? • Check the translation • Needed to use different words. Discussion: MK: There are two distractors with translation problems here. Distractor (г) has an obvious incorrect translation; distractor (б) is also not an exact translation. I think this is important because the Russian distractor (б) is attractive, but in Kyrgyz (б) is less attractive due to translation mistake. MD: Maybe they are looking for associations of “like words.” 241 Evaluator Rubric (fully coded data) Item 12 2.1. Difference Levels: Identical Somewhat Similar Somewhat Similar Different Total Diff. Content Format Cult/Ling. Other 2.3. Describe Differences in Detail: Culture/Language: • “жакыныраак”k. (closer) should be used in combination with different words. For example, “жакыныраак кароо”k. (look closer) – depending on the context. • “пристальнee”r. (fixedly, intently) is given as “жакыныраак”k. (closer) which results in an incorrect relationship between the pair of words. “пристально смотреть”r. (stare) is more accurately - “тигилип кароо”r. (lit stare at). • “пристальнee”r.(fixedly) is “тигилип кароо”k. (stare at). For some reason, the incorrect word “жакыныраак”k. (closer) “ближий” is used instead, which means to look closely at something. 2.4. Advantage: Russian: Kyrgyz: 2.5. Can these items be reconciled? • Use two words together. • Use the correct words. 
Discussion: (None) 242 Evaluator Rubric (fully coded data) Item 13 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 2.3. Describe Differences in Detail: Culture/Language: • The item pairs contain mistakes in the distractors. For example, in the answer key (б) the word “жүрүү” in Kyrgyz means “to go.” However, depending on the combination of words used with this word, it can mean either going by foot or by car. In contrast, the word “ездить” in Russian means “going only with transportation – by bus, by car, by taxi, etc. This can change the relationship of the analogy pairs. • In distractor (B), “көчө”k. (street) = “улица”r. (street); “дорога” (road) = “жол” (road) - so there is an incorrect translation here. • In distractor (B), actually “көчө”k. (street) is “улица” r. (street) but the Russian word “дорога” (road)” is used instead. 2.4. Advantage: Russian: Kyrgyz: 2.5. Can these items be reconciled? • The main idea needs to be considered. • The word “жол” (road) needed to be used. 243 Discussion: MD: There are several problems with the distractors and answer key in these analogies. The problem is many words have many meanings, and the choice needs to be determined by the context. AA: the translation is literal. CJ: I see the problem as multiple meaning of words and associations. AA: First, in distractor B the translation of the Kyrgyz word “көчө” is incorrect because it means “street” not “road.” The Russian version uses the word “дорога” (road) instead. I think that “дорога” (road) means something that has asphalt. MD: Well, I don’t necessarily agree with that and don’t believe everyone defines it that way… many villages have “roads” which are not asphalted. RM: The problem is that the distractors are not good. MD: Yes, maybe the main problem is in the answer key. In my comments I wrote that “жүрүү” k. is different from “ездить” r. because “жүрүү”k. can be walking by foot or going by car while “ездить”r. means going by transportation. In Kyrgyz, perhaps “айдоо” (drive) would be a better choice for the pair because it has meaning like the Russian “ездить.” In this case, it is like equating the English “walking” and “go” - in one term the meaning is wider while in the either the meaning is narrow … If they used “айдоо,” (drive) they will get it quickly… I think this is an issue of translation – it is a good item but the direct translation is incorrect. Many of us thought for a very long time about what the correct answer here is to this item… Distractor “Г” appears to be a very good choice here… 244 Evaluator Rubric (fully coded data) Item 14 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 0 2.3. Describe Differences in Detail: Culture/Language: • In all the Kyrgyz pairs of words, the words should be used in singular form. It would be correct in Kyrgyz to use “кол: манжа” (arm: hand), “адам: бут” (person: arm), “тил: тиш” (tongue: tooth) “бет: көз” (face: eye); in the Russian version all the words are used in the plural form. • In distractor (A), “манжалaр”k. is “кистьи”r. (wrist) in Russian but in the Russian version the word “пальцы” (fingers) is used. The word “пальцы”r. (fingers) should be translated as “бармактar”k. 2.4. Advantage: Russian: Kyrgyz: 2.5. Can these items be reconciled? • Use the words “бармактar.” Discussion: ZS: In the Kyrgyz language there are certain nuances and it is important to maintain certain norms in translation. 
In this example, pairs of Kyrgyz words need to be used in their singular form. Instead, the Russian way (plural) is utilized which does not follow the norms and rules of Kyrgyz during the translation. NO: “манжа” is the hand and fingers and includes the wrist, correct? CJ: No, it does not include the wrist, it is only the hand. This is not correct. MD: I think манжа is OK to use in an analogy item it if it is used in the singular form. 245 Evaluator Rubric (fully coded data) Item 15 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 2.3. Describe Differences in Detail: Content: • The word “бейкаруулук”k. (weakness) in distractor (B) is a word used rarely in Kyrgyz. • The word “кубанычсыз”k. (lit. happiness + form for “without”- сыз) in distractor (Г) is not used in Kyrgyz. A better choice would have been “көңгүлсүз”k. (unhappy). • The word “бейкаруулук”k. (weakness) is not often used. • The word “кубанычсыз” k. (lit. happiness + form for “without”- сыз) is a created word (artificially created by test writers?). Culture/Language: • The word “бейкаруулук”k. (weakness) is not widely used in Kyrgyz and its use could result in a lack of understanding. • It is possible that Kyrgyz kids in the city will not understand the word “бейкаруулук” (weakness). • In city schools it is possible that the word “бейкаруулук” (weakness eng) will not be understood as it is not widely used in conversation; it is important to use words that are common in normal speech. • Considering the correct answer, we need to make an accent on the distractor (B). The incorrect words “бейкаруулук – кубанычсыз” contradict logic and the grammar rules of the Kyrgyz language. • The word “слабость”r. (weakness) can be translated as “али жоктук”k. (weakness) 246 2.4. Advantage: Russian: Kyrgyz: 2.5. Can these items be reconciled? • • • • Change the words “кубанычсыз” to “көңгүлсүз”k. (unhappy). Need to use more common words. Use “алсыздык”k. (weakness) instead of “бейкаруулук”k. (no meaning) Use the word “али жок”k. (weakness) Use a synonym. Discussion: MD: The problem here is poor translation and adaptation in two of the distractors (B, Г). NO: Yes, it is necessary to use “active” words which are used in everyday conversation. I finished a Kyrgyz school in Bishkek and “бейкаруулук”k. (weakness) is unfamiliar to me. CJ: I agree with my colleagues that it is important to use commonly used words. MD: Perhaps it would be possible to compare the translated Kyrgyz text with the original Russian text? That is, adjust the Russian text again if the translation into Kyrgyz does not seem to work? KK: Need to use commonly used words instead of literary terminology. Plus, I think there are some outright mistakes here, it is not just a problem of perspective. For example, “кубанычсыз” is just not said. If it is said, then this is a dialect problem. 247 Evaluator Rubric (fully coded data) Item 16 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 2.3. Describe Differences in Detail: Content: • Was it really difficult to find a synonym in Kyrgyz for the Russian word “стул” (chair) in distractor “G”? Culture/Language: • “стул”r. (stool, chair) should not be translated into Kyrgyz as “Диван”k. (divan or couch) • “Диван”k. (divan or couch) is incorrectly translated from the Russian “стул”r. (stool, chair). • The word “стул” (stool or chair) is translated incorrectly into Kyrgyz. 
There is a Kyrgyz equivalent but it is not used. 2.4. Advantage: Russian: Kyrgyz: 2.5. Can these items be reconciled? • Look at the meaning. • Use the word “отургуч”k. (stool) instead of “Диван”k. (divan or couch) Discussion: ZS: I had difficulty answering this item correctly. Not sure what the relationship is between “Автор” (author) - “писатель” (writer) the item seems difficult. 248 MD: I think the relationship is one of general categories to a more specific part – that is, the second word in the pair is a part of the first category. However, I can see how the pair of words in the stem could be considered synonyms and thus make it hard to resolve. That is, I think the answer is “furniture: table” but I can see how they might have selected “journal: book” in Kyrgyz. Also, the Russian distractor “digit: number” is also somewhat attractive. 249 Evaluator Rubric (fully coded data) Item 17 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 0 2.3. Describe Differences in Detail: Format: • “санда жок”k. (be in the minority) is the only compound word (two words) in the pair, which makes it obvious. Culture/Language: • City school (kids) might not understand the word “арзыбаган”k. (insignificant) • The word “санда жок”k. (be in the minority) should have been translated as “кышындай”k. • “арзыбаган”k. (insignificant) is unclear. Perhaps kids will translate it as “нe достойный внимания”r. (not worthy of attention) which is slightly misleading. 2.4. Advantage: Russian: Kyrgyz: 2.5. Can these items be reconciled? • Test the translation if he uses difficult synonyms. • Use different words. • Use different words with related relationships. Discussion: 250 Evaluator Rubric (fully coded data) Item 18 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 2.3. Describe Differences in Detail: Content: • Distractors (б) and (г) both have problems. In (б) “шашма”k. (hurried) does not correspond to the Russian version “откровенный” (open) and “шамдагай”k. (quick, nimble) is not the same as “болтливый”r. (talkative). In other words, neither word in this pair corresponds well to the pair in the other language. This distractor does not work. (г) also has an incorrect adaptation as “колдойгон”k. (attack) is used in simple speech, not as a literary term. Further, the meaning of the pair of words in Kyrgyz does not correspondence well to the meaning of the words in Russian. Culture/Language: • Translation is incorrect- (б). “шашма”k. = “торопливый”r. (hurried eng); “шамдагай”k. = “шустрый”r. (quick, nimble) not “talkative" as is given. • “шашма:шамдагай”k. should be translated into Russian like “торопливый: ловкий” (hurried: dexterious), but not like “откровенный: болтливый” (frank: talkative) as is given. • In distractor (б) “шашма”k. is “торопливый”r. (hurried) not “откровенный” (frank). “шамдагай” should be translated as “ловкий” (dexterious), or “шустрый” (quick, nimble) into Russian because “болтливый” (talkative) is “көп сүлүгөн.” • “откровенный” (frank) is “ачык”k. and “болтливый”r. (talkative) - “көп сүлүгөн”k. or “сайранан”k. • The word “шашма”k. (hurried) is translated incorrectly from the Russian as “откровенный” (frank): the correct translation of “откровенный” (frank) would be “ачык”k. (open) • In distractor (B), “этият”k. is not the same as “осторожный”r. (careful). The translation is not accurate. 
• Answer (г) in the Kyrgyz and Russian versions are completely different. • (г) “колдойгон”k. is “большой”r. (big)”; “жүжүрөгөн”k. – “маленький”r. (small)” does not correspond to the given Russian pair. 251 • The words in (г) Russian are not translated correctly – the correct translation is “бысып келген”(to have come on foot, attacked) and “коргоочу” (defender). 2.4. Advantage: Russian: Kyrgyz: 2.5. Can these items be reconciled? • Test the items first, consider the main idea. • Use “этиятуу” (B). Discussion: ZS: There is incorrect, inaccurate translation in several of the distractors in this item and the use of the incorrect meaning of some words. NO: the problem is related to the specifics and nuances of the Kyrgyz language. The thing is, some words can be used only in combination with each other; in certain contexts they can’t be used individually. Therefore, this issue is poor adaptation. RM: Yes, the problem is that some words must be used in combination. KK: Yes, there is the incorrect use of some words in Kyrgyz. MD: the use of some words out of context makes them impossible to understand, they can not be used individually; the problem is adaptation. RM: But, there are also simply grammar mistakes. The endings of some words are incorrect which means the students will not know what it all means - “тилеген”k. (desired) should not be used! Due to grammar mistakes the endings are not correct and they won’t know what it will mean. On the other hand, the item was not difficult to answer. NO: In distractor (б) the translation is simply incorrect as “шашма”k. is not “откровенный”r. There are many such moments. Also, “мүмкүн”k. (possible, may) in distractor (A) without context is one meaning, used with other words it has different meanings. 252 Evaluator Rubric (fully coded data) Item 19 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 2.3. Describe Differences in Detail: Content: • “тартып алуу”k. in the item stem has two Russian equivalents – “притянуть”r. (pull) and “отбирать”r. (take away). The Russian word used however, is “отдёрнуть”r. (draw back quickly), which is neither one. Format: • Number of letters. Instead of “укпай калуу”k. (go deaf) (correct) one letter is absent from the word. • “укпай калу_”k. (go deaf) – one letter is missing. In Kyrgyz, the infinitive form is created with affixes (“уу” in this word). • There is a typing mistake in distractor (г) - “укпай калуу”k. (go deaf) is missing a letter; And (б) the correct form is “жабышчак”k. (sticky) instead of “жабышкак”k. • In the Kyrgyz language, “липкий”r. (sticky) is correctly translated like “жабышчак”k. Culture/Language: • The word in the item stem, “тартып алуу” has two meanings in Kyrgyz. These meanings are “отбирать”r. (take away) and “притягивать”r. (pull). However, the Russian stem given is actually “отдёрнуть”r. (draw back quickly). In the Kyrgyz stem, in combination with the first word in the pair “ысык”k. (hot), the analogy can be understood as “ысык тартуу”- that is “притягивать тепло” (attract warmth). In this case, the correct answer will be distractor (б). • “жабышчак”k. (sticky) – from “липкий”r. (sticky) is not the best translation. A more accurate one will be “батамашкан, батталуу”k. • Incorrectly translated equivalent of “зажмуриться”r. (tightly closing the eyes) the correct translation is “көздү бекем жүмүү”k. (tightly closing the eyes). 253 2.4. Advantage: Russian: Kyrgyz: 2.5. Can these items be reconciled? 
• More accurate translation. • Add a letter. • Check and recheck the translation. • It can be adapted to the understanding of students. Discussion: RM: the multiple meaning of some words in the item stem means that there needs to be a more careful selection of pairs of words – otherwise, the item misleads and it becomes impossible to find the correct answer. There seem to be some misprints as well. NO: I agree, depending on how they define the terms in the stem they could come to complete opposite meanings of the analogy. The given Russian word is “отдёрнуть”r. (draw back quickly). which implies a “pushing away from” while the Kyrgyz equivalent, “тартып алуу”, can mean “притягивать”r. (pull) and might even be interpreted as “pulling towards.” So, depending on how they interpret the meanings of the words in the stem pair, their answers might be different. MD: Yes, the stem needs to be more clearly defined (contain no double meanings). MK: Absolutely, the stem and distractors should have only one meaningful interpretation. NO: Also, the word used here in the Kyrgyz distractor (б) “жабышкак”k. (sticky) is unknown to me. Perhaps this is some form of dialect? MD: Yes, this is a word but it is not so widely used, and one letter is incorrect. CJ: Yes, but even with the correct spelling this is a word but the problem here is related to dialect use. According to the context though, the Kyrgyz will understand “something sticky.” AA: There is a difference in nuance with the word “зажмуриться”r. (squint) and the Kyrgyz equivalent. MK: Does the Kyrgyz term mean squinting or blinking? AA: In Kyrgyz it reads as “sneaky look” i.e. a dangerous person. MK: I think there is also a difference between blinking vs. squinting and the connotations of these words (negative connotation of evil for the Kyrgyz item). MD: The translation makes it an expressive, stylistic colorization, but in the pair of words – it works – as the relationships are maintained. All: There are two problems with this item– uncommon words as well as words with multiple meanings. There is a difference here between these two analogy pairs and that will probably impact the results. 254 Evaluator Rubric (fully coded data) Item 20 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 0 2.3. Describe Differences in Detail: 2.4. Advantage: Russian: Kyrgyz: 2.5. Can these items be reconciled? Discussion: 255 Evaluator Rubric (fully coded data) Item 21 2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 2.3. Describe Differences in Detail: Content: • The translation was done literally and this resulted in a problem with the Kyrgyz variant. Therefore, for a clearer understanding – the word “жана”k. (and) needs to be changed to “ошону менен”k. (with this, together with this) or “ошол эле убакта”k.(at the same time) or “бирок”k. (but). • The translation was done correctly but the main idea was lost. • It would be better to change the word “болбойт”k. (will not) to “жөнөтөт”k. (send) as before. • “жумушка орношкондон кийин”k. implies he had “возможность” (possibility). It follows that the more correct form of the last phrase in the stem will be “мүмкүнчүлүгү жөнөтөт”k. (possibility to send). • The phrase from Russian “определенное количество”r. (certain quantity) is translated incorrectly into Kyrgyz. The word “канчадыр”k.(some) makes the question unclear and could mean lost time for the student. 
• The word “определенный”r. (determined) is not the same as “канчадыр”k. (some).
Format:
• The spacing of the blanks in the two versions is different, to the advantage of the Russian version. Differences appeared due to the adaptation. The instructions in Kyrgyz are not clear. In the translation, the punctuation is incorrect.
Culture/Language:
• Need to use “же”k. (or) or “бирок”k. (but) instead of “жана”k. (and).
• Literal translation. Instead of the connector “жанa”k. (and) it is necessary to use “ошону менен бирге”k. (together with this), because the connector doesn’t sound Kyrgyz, but sounds Russian.
• It is obvious that this sentence was translated entirely from Russian into Kyrgyz.

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?
• Need to use “же”k. (or) or “бирок”k. (but) instead of “жана”k. (and).
• Need to use “бирок”k. (but) instead of “жана”k. (and).
• Don’t translate literally but make an adaptation – use “Өз убактысын бир бөлүгүн / белгилүү бир бөлүгүн”k. (part of your personal time / part of some time).
• “Канчадыр бир убакытты”k. (for some time) is a very scientific style and could be changed.
• Of course, in order to solve the problems with this item, experienced (the best) translators are needed. If how well the task is completed matters, many factors need to be considered in selecting the text.
• Instructions need to be clear.

Discussion:
MD: I think this item needs to be completely changed, as it will not be easy to simply adapt. The main problem is the incorrect use of the term “жана”k. (and), which is obviously the result of a direct translation from the Russian sentence.
AA: I agree, but the problem is that this is a common usage in Kyrgyz. It’s on the radio, the national TV stations, and other official media sources.
MD: It is common but it is not correct.
AA: I understand.
MD: I believe this usage is one of those “Russianisms” that has crept into Kyrgyz through (ethnic) Kyrgyz, Russian-language speakers. The main problem is that in Kyrgyz we don’t use “and” as a connector when connecting two different verbs. Two verbs put together often convey a different meaning than when they are used singly. The two verbs are simply placed together, without the use of any connectors.
ZS: I agree with MD; villagers don’t use “жана”k. (and) in this sense – i.e., they use Kyrgyz correctly. To me, this raises the issue of adaptation. It seems that this item was clearly adapted from Russian. I think if Kyrgyz original texts had been used, there wouldn’t be this problem. We could avoid syntax problems like this. Our syntax is different and should be Kyrgyz – not Russian.
MD: Well, theoretically, I agree of course. A big problem is that much of our literature in the sciences and the arts is translated in this way – translated directly from Russian. Much of the Russian influence is inevitable. Little is produced in Kyrgyz due to a lack of specialists and resources.
ZS: OK, but what if we had several specialists work on developing the items at the same time and then decide whether they will work or not?
MD: To me, this item raises a bigger question from the perspective of the test translators. Should the items contain only language that is 100% correct, or language that is incorrect but commonly used? Unfortunately, there is often a gap here. The situation and state of the Kyrgyz language is very sad. Further, we have to take into account language as it is used on a daily basis.
Many people in the cities – and not only in the cities – speak Kyrgyz with lots of words and forms taken from Russian. Sometimes the language is simply all mixed up. This is a result of the language environment we live in. We combine Russian and Kyrgyz all the time in a sort of hybrid colloquial language. For example, “канча (how many, kyr.) листов (sheets of paper, rus.)?” or “сиз (you, kyr.) домой (home, rus.) барасызбы (are going, kyr.)?” – lit., “Are you going home?” There are hundreds of ways we do this. This item raises some big issues.

Evaluator Rubric (fully coded data) Item 22

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other

2.3. Describe Differences in Detail:
Content:
• Perhaps the abiturients did not learn at school about the tree “баобаб” (baobab), which does not grow in Kyrgyzstan (especially kids who study in rural areas), and therefore finding the answer might be difficult.
• The antonyms used in the distractors make the item difficult.
• Regional differences should be taken into account.
• These Kyrgyz texts need to be developed very carefully. This is because the negative form given is located in the word itself. The distractors are very different as well – in Kyrgyz they are very long. There is no reason to use this example.
Format:
• Typographical error.
• Poor equivalence of Russian and Kyrgyz distractors. In distractors (A) & (г), “плотные”r. (dense) is not the same as “өтө нык”k. (very strong). In distractors (B) & (б), “пористый” (porous) is not a good match for “майда тешиктүү”k. (with small holes). The word “келген”k. in the stem creates a correct combination with “өтө нык”k. (very strong) in (A) & (г) but is a worse fit with “майда тешиктүү”k. (with small holes) in (B) & (б).
Culture/Language:
• The tree “баобаб” (baobab) in the item stem does not grow in Kyrgyzstan and is unfamiliar to many students. The translation is correct, however.
• The Kyrgyz version takes more attention to solve than the Russian version. The antonyms in the distractors, (б) for example, are longer and more complex in Kyrgyz than in Russian. Please do not use this task.

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?

Discussion:
RM: I found this item to be poorly adapted. In many ways, the Kyrgyz and Russian versions are not equivalent in meaning. Several of the distractors contain words which convey different meanings.
KK: I agree, there is a problem of psychologically understanding the text and some words in Kyrgyz; it does not seem natural. There are stylistic problems here.
MD: There is a problem of a lack of knowledge of some specific words on the part of students and some awkward terminology.
AA: In my opinion, the problem is not a lack of knowledge, as they cover this in school, but instead a lack of attention to language detail on the part of the item writers.
MK: I think the content is also at issue. The word “баобаб” (baobab) will not be understood by many students – I studied the item for several minutes. This seems like a word that only biologists will know – it is too technical.
AA: Actually, I disagree, because the word does not make the kids miss the logic here. They don’t need to know that word to resolve the logic of the sentence completion. Plus, I think they learn this term in biology – it is covered in Kyrgyz schools.

Evaluator Rubric (fully coded data) Item 23

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other
2.3. Describe Differences in Detail:
Content:
• The grammatical mistakes in this sentence demand time to understand the main idea of the sentence. The literal translation of the word “возмещение” (compensation) in the item stem does not fit the given situation (in Kyrgyz) and creates a false association.
• In the Kyrgyz item stem, the words “айдап кетишкен учурда” are not the equivalent of “угон”r. (hijacking, car theft) in Russian. This is a mistake.
• The word “камсыздандыруу” (provision, guarantee, insurance) in the item stem has no real meaning in Kyrgyz.
Culture/Language:
• “камсыздандырылган”k. (be provided for, insured) in all the item distractors is not an accurate translation, but kids can understand it from the context.
• In Kyrgyz, the word “камсыздандырылган” has a wider connotation – closer to “обеспечение” (provision) – than “застрахование”r. (insured) does. The word “страхование”r. – “камсыздандыруу”k. – is typically used amongst specialists.
• The words “застрахованный”r. (insured) and “незастрахованный”r. (uninsured) are incorrectly translated from Russian into Kyrgyz (differences in translation).

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?
• Adapt or translate correctly.
• Change the word “тургузулбайт”k. (will not be provided) to the word “тойлойбойт”k.
• Find a synonym for the word “страховать”r. (to insure).
• Yes, for “угон”r. (hijacking, car theft) use the words “урдап кетуу”k. (stolen) in the Kyrgyz version.

Discussion:
CJ: I had difficulty reading and understanding the Kyrgyz text; then I read the Russian text and I understood. I think that with background knowledge (knowing Russian) they might be able to understand some of the meaning. However, Kyrgyz-only speakers will find it confusing. That is, if the Russian concepts are covered first, then one knows what the Kyrgyz authors meant to say. However, the students do not get this advantage because they do not know the Russian version.
AA: We (item analysts) have an advantage because we can read both items at the same time! And “insurance” is rarely used in Kyrgyz terminology. So the text is not well adapted, and this makes it difficult to comprehend.
KK: Some words used in this item can only be understood in specific contexts. There are both translation and adaptation problems.
RM: The problem is adaptation, not translation.
AA: During task creation, it is necessary to consider many nuances and the commonality of certain words. And also the weak vocabulary of Kyrgyz speakers.
KK: The desire to pass on the main idea of the task was too strong, and there is the possibility of losing the literary nuances of Kyrgyz. It is important to consider the characteristics of the language – in Kyrgyz, sentences are usually short, as words are “complex” (compounded, i.e., agglutinative), and the result of direct translation is that translated texts (from Russian into Kyrgyz) are longer than usual for Kyrgyz speakers. As they become longer, they become more confusing. And the sentences eventually become even longer than the Russian versions. I believe that it is possible to find original texts in Kyrgyz and then translate them into Russian; then you will see the richness of the differences between the languages.
ZS: Yes, the problem is the translation and the adaptation due to stylistic differences.
MK: The problem is the terminological meaning of the words and the difficulty of the sentences in Kyrgyz.
AA: Yes, there are not enough words in Kyrgyz for some of these concepts. Kyrgyz, in general, has fewer words.
On the other hand, if items are limited to only what everybody knows…
MD: This is not a question of the commonality of the concept of “insurance” – which is also new to Russian speakers – but rather of usage… In Russian the item is simply easier… To be honest, many of the evaluators could fully understand only after they read the Russian version of the item.
RM: To me, this indicates that the items are not being prepared with enough consideration for the Kyrgyz language. I think the items should be written in Kyrgyz first, and then the Russian adapted to the Kyrgyz version. Why can’t this be done?
ZS: I agree with RM. One adaptation suggestion would be to not have the Kyrgyz sentences copy the Russian style but to make them “more Kyrgyz.” This means making the sentences shorter, even if it means more sentences. That is the first point. The issue of unknown concepts is also important here. Many new terms are constantly being formed in Kyrgyz, while in Russian the concepts are well known. For example, in Kyrgyz there are four or five completely different ways to say “entertainment center.” People do not know which are correct at this point. Therefore, Kyrgyz often use Russian loan words. Some use Kyrgyz words that are not known. As teachers we see this on a regular basis. Many words are “created” but not yet well known. In this item not everyone knows how to say “uninsured,” especially in rural areas where there is no such thing as insurance.
MD: ZS eje, I agree with your first point – in Kyrgyz we have “long words” but short sentences. This is important to remember in comparison with Russian. Also, in regard to the point about word use, as I said in regard to other items, we often mix languages in such cases. For example, it is common to say “Страховка”r. (insurance) “барбы”k.? (Do you have insurance?). These Russian-Kyrgyz hybridizations are OK in everyday speech but obviously become problematic in testing.
AA: Also, “угон” (hijacking, car theft) is not understood unless the whole context is known. Otherwise in Kyrgyz it is understood as “the car was left open,” not stolen, as is clear from the Russian version.
NO: I disagree, because some other words provide hints. The problem is that it needed to be adapted fully, not simply translated – it is too literal here.

Evaluator Rubric (fully coded data) Item 24

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other

2.3. Describe Differences in Detail:
Content:
• The stem is too long in the Kyrgyz version. Due to agglutination, sentences in Kyrgyz are usually kept shorter than Russian sentences. When direct translation comes from Russian, the result is sometimes high complexity and difficulty in understanding.
• Kyrgyz understand the word “кругозор” (outlook) (a Russian word), but will not understand the attempt to create a new Kyrgyz equivalent.
Culture/Language:
• “ар тараптан”k. – “с разных сторон”r. (from different sides); “ар тараптуу”k. – “разносторонний”r. (multi-sided). It seems to me that the word “өнүт”k. might be misunderstood by the general mass of examinees.

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?
• Shorten the stem in Kyrgyz and refrain from making up new Kyrgyz words.
• Change the word “өнүт” to the word “тарап”k. (side).

Discussion:
AA: The main problem is that uncommon words are used.

Evaluator Rubric (fully coded data) Item 25
2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other

2.3. Describe Differences in Detail:
Content:
• The word “Пас”k. (low, down) is dialect and may not be understood.
• The word “Пас”k. (low, down) is dialect and may not be understood by many.
• During the process of translation a syntactical mistake was made.
Culture/Language:
• The word “Пас”k. (low, down) is dialect.
• In city schools and in the northern regions, the word “Пас”k. (low, down) is not used; in the south they may understand this word.
• The word “Пас”k. (low, down) is dialect which is not known to kids from the north.
• The word “Пас”k. (low, down) is dialect; northern kids may not understand.
Other:
• This question demands specific knowledge. Who knows, maybe scholars have demonstrated that high voices are more difficult to understand, or perhaps the opposite? That’s why in both the Russian and Kyrgyz versions it could be difficult to answer correctly.

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?
• Yes, use: эркектердүн үнүн кабыл алуу жеңил болуш керек эле. Натыйжада аялдардын үнүн кабыл алуу жеңил болуп чыкты.
• Yes, use: Окумутруулар үндүн бийигирээк тембрин кабыл алуусун кыйындатат деген тяанакка келишти. Аялдардын үндөрү демейде эркектердикинен бийигирээк келет, ошондуктан алдардын кебин укканда, анны түшүнүүгө кыйынраак болуп калат.
• Change “Пас” to “төмөн”k. (low, down).
• Maybe use the word “төмөнүрaak”k. (lower).
• Instead of “Пас” use the word “төмөн”k. (low, down).

Discussion:
NO: The problem here is the use of dialect instead of general-use, literary language. In order to understand the task, it is necessary to understand the meaning of the text. It is possible that kids will understand the opposite of what is meant.
MK: The dialect used here is common in the south of the country.

Evaluator Rubric (fully coded data) Item 26

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other

2.3. Describe Differences in Detail:
Content:
• “производство”r. (production) is translated as “өнүмдөрдү чыгаруудан”k., which actually changes the situation and leads to a lack of clarity in the Kyrgyz item stem.
• In distractor (A), the words “чыгашалар көбөйгөн” are incorrect; this should be “… көбөйгөндүктөн”k. (because of the increase in something).
• In the process of translation, there is a mistake in the formulation of (б) too.
• It is not possible to literally translate texts of this size and content. No good result will come from this. It is very important to carefully select the texts. The word “зыяны”k. (harmful) in the stem makes it difficult to understand the text and forces one to read it again.
Format:
• In my opinion, this is an example of an item that was obviously translated from Russian into Kyrgyz.
Culture/Language:
• The words “баш тартуу”k. (refuse, avoid) in the stem have a different meaning from the Russian version.
• The suffix on “себептүү”k. (due to) needs to be changed to a different word (use a suffix to form the possessive). This will result in fewer words and a less difficult sentence, but will not change the meaning (it will improve it).

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?
• Yes, translate the word “производство”r. (production) as “өндүрүш”k. (production).
• When the texts are large, they need to be adapted, not translated directly.
An offered solution: “ишканалар үчүн экологияга зыяны азыраак өнүмдөрдү чыгаруудан баш тартуу максатка ылайык эмес, анткени кирешелерди көбөйтүү үчүн бул аларга пайда көрсөтөт”k. (It is not reasonable for the enterprises to deviate from producing products that would be less harmful for the environment, because they can be useful for increasing their profit).
• An offered solution: Кирешелер көбөйгөн себептүү (because of the increased profit) – кирешелердин көбөйтүүсү (increase of profit).
• An offered solution: 1. “Экологияга зыяны азыырак”k. (less harmful for the environment) – “экологиялуу таза азык-түлүк же…”k. (environmentally safe (“clean”) foods or...). The word “harm” hinders comprehension of the sentence right away, so one needs to read it again. 2. “чыгашалар көбөйгөн себептүү”k. (because of increased expenditures) should be “чыгышалар көбөйгөндуктан”k. (in connection with increased expenditures).

Discussion:
ZS: “себептүү”k. (due to) in the item stem is not needed. It needs a different affix here.
MK: I do not agree… Without this, the item loses the main idea. What will be the correct answer?
ZS: This item is confusing; the translation is not clear in several places.
MD: Hmmm… It seems that the Russian text allows a “double meaning,” and the Kyrgyz only one meaning. However, that meaning (for the Kyrgyz item) leads to a wrong answer. This is due to the way the item is structured.
MK: What is the correct answer to the Kyrgyz item?
MD: The complication is over the meaning of the word “production,” which is quite unclear.
ZS: If we can’t find the correct answer, I don’t think the children will either!

Evaluator Rubric (fully coded data) Item 27

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other

2.3. Describe Differences in Detail:
Content:
• Grammar mistakes in sentence formulation. In distractor (A), replace “анткени”k. (because) with “себеби”k. (because).
Format:
• Typographical error – instead of “арзан”k. (inexpensive), “арзар”k. (no meaning) is written.

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?
• Check the spelling.
• Yes, I propose: «Японияда квалификациялуу эмгек кымбат турат, себеби анын ашыкчылыгы анык, ошондуктан Япония квалификациялуу эмгекти талап кылган арзан товарларды өндүрүп чыгарууга жөндөмдүү эмес». (In Japan, highly qualified jobs are paid appropriately (well), because they are limited; therefore, Japan doesn’t need to develop cheap products necessary for highly qualified jobs).

Discussion:
All: There are a few typographical errors, but from the context you can understand the meaning of the words in both items.

Evaluator Rubric (fully coded data) Item 28

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 0

2.3. Describe Differences in Detail:
Content:
• Incorrect content in distractor (A) because of the incorrect use of the words “байланыштуу”k. (to connect) and “жана”k. (and). The translation should look like this (see below):
• There is a grammar mistake in the item stem of the Kyrgyz version; instead of “прогноздого,” it should be “прогноздорго”k. (prognosis).
Culture/Language:
• In distractor (A) of the Kyrgyz version, “жол бербеген”k. (to hinder) should have been “тоскоол болгон”k. (to hinder).

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?
• The correct version of the Kyrgyz stem is “прогноздорго”k. (prognosis).
• I recommend: “Атмосферада жүрүүчү процесстер өтө татаал болгондуктан, атмосферада аба-ырайына так прогноздорго жол бербеген кубулуштар болушу мүмкүн” (Because processes taking place in our environment (the atmosphere) are very complex, it is hard to make exact prognoses about the weather).

Discussion:
ZS: In the Russian-language school classes there are many Kyrgyz-language students. Many of them speak Kyrgyz at home. They don’t know Russian well, as it is not their native language. I finished a Russian school, but I intuitively understand the Kyrgyz test better. That is why the native language is easier.
MD: ZS eje, well… we are speaking about something else here, because the students answer only one version of the items; they aren’t asked to answer both versions…
ZS: In general, syntax is easier in Kyrgyz than in Russian. We have a straightforward “cause – result.” The structure of Russian is more difficult.
MD: Textbooks are mostly translated. We see the same issues in our textbooks at schools… I think that there are several levels of structure in Russian. In Kyrgyz, it is “single level” – it is this “and” this “and” this. New ideas are “added,” while in Russian there is a different structure. In Kyrgyz it is all part of the same syntactical level.
ZS: My Kyrgyz students also tell me that the Russian constructions are difficult to learn at first. In Russian you have “Due to the fact… Because of the fact…”; in Kyrgyz, more direct statements.
MK: Yes, for example, in Russian you may have… “event/phenomena… which is/that is…” etc. In Kyrgyz we have “this happened” (stop) and “that happened” (stop) and then something else. It is all on “one level.”
ZS: In general, it is easier in Kyrgyz.
MD: The challenge for test writers is that for some Kyrgyz texts it becomes complicated when we try to repeat the Russian syntax and constructs. It becomes complicated when translation is literal. The best way to keep the Kyrgyz intact is to break the Russian sentences into more sentences rather than trying to capture the Russian structure. In Kyrgyz, ideas are built not through one complex sentence, but through a series of many sentences with simpler ideas that, when compounded, express the same idea.
ZS: Related to this, we often start to translate from the end of the sentence because the main idea comes last (the verb is at the end). Word order is different in Kyrgyz than in Russian, which can also cause complexities. Because of the word order, sometimes I translate the literal sentences first and then rearrange them in order. The strategy here is to read sentences several times or to hear them in Russian first, and then piece together the puzzle.

Evaluator Rubric (fully coded data) Item 29

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other

2.3. Describe Differences in Detail:
Content:
• In my opinion, it is necessary to pay attention to the form of the verb “деген экен”k. (turns out).
• In the stem, the pronoun “аларга”k. (to them) is used incorrectly. The correct form would be “аларды”k. (their). There is a similar grammar mistake in the word “адистерди”k. (at the specialists). Here, the more correct form is “адистер менен”k. (with specialists).
Culture/Language:
• In the Kyrgyz version the words “деген экен”k. (turns out) are used, while in the Russian version the word “говорил”r. (he said) is used. “деген экен”k. (turns out) actually means “оказываться”r. (turns out).

2.4. Advantage: Russian: Kyrgyz:
2.5. Can these items be reconciled?
• Don’t use the word “экен”; use the word “айткан” or “айтыртып.”
• Yes, make the following corrections: change “Аларга”k. (to them) to “аларды”k. (their), and “Адистерди”k. (at the specialists) to “адистер менен”k. (with specialists).
• An improved version: Генрид Форд мындай деп айтыптыр – “Эгерде мен өз атаандаштарымдан озуп өтүүнү кааласам, анда аларды жакшы адистер менен камсыз кылмакмын, себеби эң бир мыкты идеядан кемсилик табу жана ошонун аркасы менен анын иш жүзүнө ашырылышына жолтоо болккшкна алардын колунан келет.”k. (Henry Ford said: “If I want to get ahead of my rivals, then I wish them good workers, because they will be able to find weaknesses in the best idea and by this interfere with its realization.”)

Discussion:

Evaluator Rubric (fully coded data) Item 30

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other

2.3. Describe Differences in Detail:
Content:
• There are differences in the use of grammar in the word “болуусу”k. – the correct form is “болууш”k. (official).
• Another grammar mistake is the incorrect form of “Жайлатат” – a better form is “жай кылат.”
Format:
• Orthographical (spelling) mistake: “болууш”.

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?
• “тоскоол”k. (hindrance) instead of “тоскол.”
• Yes, in the Kyrgyz stem, do not use “Болуусу” but “болушу.” Also change “Жайлатат”k. (slow down) to “жай кылат” (do something slowly).
• Another recommended change: “Туруксуз табигый түрлөрдүн пайда болушу эволюциялык процесстерди тай кылат. Түрдүн туруксуздугу анын өнүгүсүүнө тоскол болбойт деген ой-пикир бул фактыга карама-каршы келбейт.”k. (The existence of unstable natural types will create an evolutionary process. The instability of the type will not be an obstacle to its development; such an opinion will not contradict this fact).

Discussion:

Evaluator Rubric (fully coded data) Item 31

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other

2.3. Describe Differences in Detail:
Content:
• In distractor (г) there is a mistake. The Kyrgyz word combination “жээк аймактарында”k. (at the coastal territories/areas) is not used at all.
Format:
• (б) and (B) are not in the same place for the items.
• (б) and (B) are not in the same order.
• (б) and (B) are in different places in the items.
• In Kyrgyz distractor (г), “жашагандарды”k. (people living) should be “жашагандарга”k. (to the people living).

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?
• In order to correct distractor (г), you need to use “жээкде”k. (at the coast) instead of the above.
• Look at the order.

Discussion:

Evaluator Rubric (fully coded data) Item 32

2.1. Difference Levels: No Diff. Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other

2.3. Describe Differences in Detail:
Content:
• In the Russian version a “concrete” task is given – that is, something that can be related to. In the Kyrgyz version, it asks how the ideas are connected (between the sections).
Format:
• Incomplete and incorrect word combination. Need to remove the Russian letter “и”r. (and) from the Kyrgyz version.

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?
• For distractor (Г) in the Kyrgyz version: «Натыйжаларды жана алардын келип чыгуу шарттарын сыпатто»k. (analyze the results and their origins)
Discussion:
KK: The question is not that clear.
ZS: Actually, I don’t agree.
MD: There is a small typo – “и” is there in the Kyrgyz version instead of “жана” – but I don’t think that should impact results. Besides, the Russians answered worse, and this is a problem with the Kyrgyz item.
ZS: I would have projected that the item favors Russians, because the text in Kyrgyz has some problems that we already discussed.
MD: There is a difference in the way the stems are worded, structurally speaking. I mean, the Russian version requires students to “complete the sentence.” The sentence goes on and then stops; the four distractors each represent a possible continuation of the idea. For the Kyrgyz item, it is actually a complete question with a question mark at the end. I think this issue exists. When there is a question–answer format, it might be easier than when you have to “build” a sentence. In some way this might make it easier to solve than the Russian item, but I am not sure about that; the distractors all seem pretty clear. I didn’t notice this the first time we analyzed the items; only now, with more careful inspection, do I see these differences.
ZS: I think “b” is correct. I am not sure if I agree with RM’s proposed change for one distractor, but…
MD: As for KK’s recommendation – I think we should be sensitive to the fact that when you change grammar, you sometimes lose the main point of the item… Actually, I can’t seem to find any explanation for why this item might be harder for Russians except what I mentioned before about differences in the structure of the stem (question) itself – complete question vs. fill in the blank.

Evaluator Rubric (fully coded data) Item 33

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other

2.3. Describe Differences in Detail:
Content:
• In distractor (A), “Подробно”r. (in detail) is translated as “жеке”k. (special, separate), which is incorrect.
Format:
• The form of the question (stem) in Kyrgyz is incorrect.
Culture/Language:
• In distractor (A), “жеке”k. (special, separate) is translated as “частный”r. (private), which is incorrect.
• Distractor (A) uses the word “жеке”k. instead of the word “тагыраак”k. (more precisely). “жеке”k. is translated incorrectly as “частный”r. (private).
• The word “подробно”r. (in detail) is translated as “жеке”k., which could mean “отдельный”r. (separate).

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?
• Examine the translation and change the stem. Remove “Кандай болгонун”k. (about its state) and replace it with “төмөнкүдөй билдирет”k. (states as below).
• Use the word “тагыраак”k. (more precisely) in distractor (A) instead of “жеке”k. (special, separate).
• Use “толук”k. (complete/holistic) in distractor (A).

Discussion:
MK: There is some incorrect translation (as in distractor (A)) with this item, though the overall meaning of the text is similar in Russian and Kyrgyz. The main difficulty arises from the multiple meanings of some words and the difficult word combinations of some long sentences.
ZS: There are many problems with the translation of this difficult text; it is not well adapted. One resolution is to take an original Kyrgyz-language text related closely to this theme and the Russian text, because it is difficult to completely pass on the entire meaning and deeply consider the question in a foreign language.
NI: It is very difficult to find original texts in Kyrgyz.
MD: It is easy to see the lack of connection due to translation issues. I think that the analytical thinking on the part of the Kyrgyz is different.
ZS: I did not find any difficult words or issues with the item itself. Maybe there are some issues with the form of the sentences (constructions, syntax) in the reading text, though. It is clear that the key is (B), but (G) is also an attractive answer.
ZS: There are many translation problems in the reading text. I want to reiterate that test writers should select Kyrgyz texts first and then adapt them. I must say that the Russian text is quite good, as are most of the items in Russian. I can’t find any difficult words, grammar mistakes, etc., but syntax issues might explain any differences.

Evaluator Rubric (fully coded data) Item 34

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other

2.3. Describe Differences in Detail:
Culture/Language:
• The word “каржыларды”k. (finances) in the Kyrgyz distractor (A) would be better translated as “Каражаттарды”k. (means/resources).

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?

Discussion:

Evaluator Rubric (fully coded data) Item 35

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other

2.3. Describe Differences in Detail:
Format:
• (б) and (B) have changed places.
• (б) and (B) are in different places.
• (б) and (B) are in different places in the Kyrgyz version.
• The word “төмөн жакта”k. (below) in the Kyrgyz stem is incorrect. The correct word is “төмөндө”k. (below).
• The distractors (б) and (B) are in different places in the two versions.

2.4. Advantage: Russian: Kyrgyz: No Advantage:

2.5. Can these items be reconciled?
• Examine the items.
• Change “төмөн жакта”k. (below) to “төмөндө”k. (below).

Discussion:

Evaluator Rubric (fully coded data) Item 36

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other

2.3. Describe Differences in Detail:
Content:
• In the task, the sentence is not correctly formed in Kyrgyz. It is long and difficult to understand.
• The translation of the question is poor. The word order is bad, the structure is too complex, and it is difficult to read and understand.

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?
• What did the author want to demonstrate using the example of measuring the temperature of a healthy person?
• The question should be expressed in a different way, reworded.

Discussion:

Evaluator Rubric (fully coded data) Item 37

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 0

2.3. Describe Differences in Detail:
Format:
• There is an extra affix (ending) in the word “эмнеде”k. (at what location).

2.4. Advantage: Russian: Kyrgyz: No Advantage:

2.5. Can these items be reconciled?
• “Эмнеде”k. (at what location) should be just “эмне”k. (what).

Discussion:

Evaluator Rubric (fully coded data) Item 38

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other

2.3. Describe Differences in Detail:
Content:
• The incorrect form of the sentence in this task makes understanding the main idea difficult.
Format:
• A needed letter is missing from one word: instead of “токтуу”k. (no meaning), “токтотуу”k. (to stop) is needed.
• There is no such word as “токтуу”k. (no meaning);
it should have been “токтотуу”k. (to stop).
• In distractor (б) there is a missing letter. It should be “токтотуу”k. (to stop), but “токтуу”k. (no meaning) is written.

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?
• Yes, use: «Автор эмнеден улам геоинжиниринг долбоорлорунун ийгиликтүү жүрө тургандыгынан күмөн санайт?»
• Edit and check again and again.

Discussion:

Evaluator Rubric (fully coded data) Item 39

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 0

2.3. Describe Differences in Detail:

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?

Discussion:

Evaluator Rubric (coded summary data) Item 40

2.1. Difference Levels: Identical Somewhat Similar Somewhat Different Different Total Diff. Content Format Cult/Ling. Other 0

2.3. Describe Differences in Detail:
Format:
• (B) and (г) are not in the same place.
• (B) and (г) are mixed up.
• (B) and (г) are not in the same place.

2.4. Advantage: Russian: Kyrgyz:

2.5. Can these items be reconciled?
• Need to check carefully.

Discussion:

Item Analysis Rubric (Reading Comprehension Text)

• In Kyrgyz, “Global Warming” (the title of the text) is hard to understand.
• In my opinion, attention needs to be paid to the size of the Kyrgyz version of the text, because too big a text can cause difficulties.
• In general, I have only a few comments about the translation of the text. See text lines 1, 20, 40, 45, and 105, where hyphens are missing, which might hinder some understanding.
• Text: there is an extra hyphen between words in some places. In lines 19 and 22, the word “не-гизделген” is written with a hyphen, and “метр-ге”k. is as well; this is incorrect.
• The Kyrgyz text had a lot of words and needs a more careful adaptation.
• The text needs to be adapted; make the Kyrgyz text shorter so that the versions will be equivalent.
• In lines 64-69 of the Kyrgyz version the text is more difficult than the Russian variant; the same lines (55-57 in Russian) are easier to understand.
• In the Kyrgyz variant (157-159) “таппай жатышат” (they cannot find the answer) is written, but in Russian (131-132) “they do not know” is written. They need to use the Kyrgyz words “билбей жатышат.”
• In the Kyrgyz version (line 29) “көбөйүүгө”k. is incorrect. “более тёплый климат”r. (a warmer climate) is translated incorrectly into Kyrgyz.
• In the Kyrgyz version (line 132) “калифорниядагы” is written, which is “калифорнийская” in the Russian version. The Kyrgyz form is a locative case, while the Russian is a possessive (adjectival) form. Need to use the word “калифорниялык.”
• There are more words in the Kyrgyz text.
• The term “Global Warming” does not have an equivalent in the Kyrgyz language.
• The text needs to be selected carefully. Take a text from Kyrgyz first (an original, not a translation). This was clearly written in Russian first, which leads to problems. Perhaps take two different versions from both languages about the same theme. The text is in a scientific format (on the whole). There are many mistakes in the structure of the sentences, primarily due to the literal translation, which results in incorrect word combinations. For example, «прогноз кылуусуз» (not making a prognosis) should not be translated literally from Russian into Kyrgyz, because the result will be poor. This is due to differences in syntax between the Russian and Kyrgyz languages.
• In order for the content to be clear, the translation needs to be accessible.
• About 70-75 percent is difficult to understand.
• Kyrgyz “thinking” is different than Russian “thinking.”
• There is a mistake in the translation: “Many believe that global warming is completely caused by people…” The translation was “Көптөрү глобалдык жылуулук көбөүүгө толугу менен Адам айыпкер деп ишенишет,” but it should have been “Көптөрү глобалдык жылуулуктун жогорулашына толугу менен Адам айыпкер деп ишенет.”
[Results of national scholarship testing and enrollment in university grant places in the kyrgyz republic in 2009]. Bishkek: CEATM. www.testing.kg a CEATM (2010). Rezul’tati obsherespublikanskova testirovaniya i zachisleniya na grantovie mesta vuzov v Kirgizskoi Respubliki v 2010 godu. [Results of national scholarship testing and enrollment in university grant places in the kyrgyz republic in 2010]. Bishkek: CEATM. www.testing.kg b CEATM (2010). Natsional’noye otsenivanie obrazovatel’nix dostizhenii uchashixsya. [National assessment of educational quality]. Bishkek: CEATM. www.testing.kg Census (2010). National statistical committee of the KR: Population and housing census of the kyrgyz republic of 2009. Bishkek: Kyrgyzstan. 291 Clark, N. (2005, December). Education reform in the former soviet union. World Education News and Reviews. WES. http://www.wes.org/ewenr/PF/05dec/pffeature.htm Clauser, B.E., Mazor, K.M., & Hambleton, R.K. (1991). The influence of the criterion variable on the identification of differentially functioning items using the mantel-haenszel statistic. Applied Psychological Measurement, 15(4), 353-359. Clauser, B.E., Mazor, K.M., & Hambleton, R.K. (1993). The effects of purification of the matching criterion on identification of DIF using the mantel-haenszel procedure. Applied Measurement in Education, 6, 269-279. Clauser, B.E., Nugester, R.J., & Swaminathan, H. (1996). Improving the matching for DIF analysis by conditioning on both test score and an educational background variable. Journal of Educational Measurement, 33, 453-464. Clauser, B.E., Nungester, R.J., Mazor, K., & Ripkey, D. (1996). A comparison of alternative matching strategies for DIF detection in tests that are multidimensional. Journal of Educational Measurement, 33, 202-214. Clauser, B. & Mazor, K. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, Spring, 31-44. Coffman, W.E. (1961). Sex differences in response to items in an aptitude test. In E.M. Huddleston (Ed.). The 18th Yearbook of the National Council on Measurement in Education. Ames, IA: The Council. Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159. College Board Report No. 88-2 (1988). Equating the scores of the Prueba de Aptitud Academica and the Scholastic Aptitude Test. New York: Agnoff, W.H. & Cook, L.L. Conference Program (2006, September 19-20). Comments from the minister, black sea conference on admissions in higher education: Promoting fairness and equity in access to higher education. Tbilisi, Georgia. Davidson, D. E. (2003). Prognozirovaniie uspeshnosti studentov pervyx kursov vysshix uchebnyx zavedenii Kyrgyzskoi Respubliki po rezul’tatam obsherespublikanskogo testa 2003 goda: verifikatsia validnosti testa. [Prognosis of first year student achievement in higher education institutions in the Kyrgyz Republic according to the results of the 292 national scholarship test 2003: Verification of the validity of the test]. American Councils for International Education. De Young, A., & Santos, C. (2004). Central asian educational issues and problems. In S. Heyneman & A. De Young (Eds.), The Challenge of Education in Central Asia (pp. 6580). Greenwich, CT: Information Age Publishing. De Young, A., Reeves, M. & Valyaeva, G. (2006). Surviving the transition? Case studies and schooling in the kyrgyz republic since independence. Greenwich, Connecticut: Information Age Publishing. De Young, A. (2007, October). 
Paradoxes of higher education in the kyrgyz republic. Paper presented at the meeting of the Central Eurasian Studies Society Eighth Annual Conference, University of Washington, Seattle, WA. Dienes, L. (1987). Soviet asia: economic development and national policy choices. Boulder and London: Westview Press. Dorans, N.J., & Holland, P.W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P.W. Holland & H.Wainer (Eds.), Differential Item Functioning (pp. 35-66). Hillside, NJ: Lawrence Earlbaum, 35-66. Douglas, J.A., Roussos, L.A., & Stout, W. (1996). Item-bundle DIF hypothesis testing: Identifying suspect bundles and assessing their differential functioning. Journal of Educational Measurement, 33(4), 465-484. Drummond, T. & De Young, A. (2004). Perspectives and problems in education reform in kyrgyzstan: The case of national scholarship testing 2002. In S. Heyneman & A. De Young (Eds.), The Challenge of Education in Central Asia (pp. 225-242). Greenwich, CT: Information Age Publishing, Drummond, T. (2011). Higher education admissions regimes in kazakhstan and kyrgyzstan: Difference makes a difference. In I. Silova (Ed.), Globalization on the Margins: Education and Post-socialist Transformations in Central Asia (pp. 117-144). Charlotte, NC: Information Age Publishing. Duncan, A. (June 14, 2009). “States Will Lead the Way Towards Reform,” Address by the Secretary of Education at the 2009 Governors Education Symposium. http://www.ed.gov/news/speeches/states-will-lead-way-toward-reform 293 Ellis, B.B. (1995). A partial test of hulin’s psychometric theory of measurement equivalence in translated tests. European Journal of Psychological Assessment, 11, 184-193. Engelhard, G., Hansche, L., & Rutledge, K.E. (1990). Accuracy of bias review judges in identifying differential item functioning on teacher certification tests. Applied Measurement in Education, 3, 347-360. Engelhard, G., David, M., & Hanshe, L. (1999). Evaluating the accuracy of judgments obtained from item review committees. Applied Measurement in Education, 12(2), 199-210. Ercikan, K. (2002). Disentangling sources of differential item functioning in multilanguage assessments, International Journal of Testing, 2(3&4), 199-215. Ercikan, K., & McCreith, T. (2002). Effects of adaptations on comparability of test items and test scores. In D. Robitaille & A. Beaton (Eds.), Secondary analysis of the TIMSS results: A synthesis of current research (pp. 391-407). Dordrecht, the Netherlands: Kluwer. Ercikan, K., Gierl, M., McCreith, T., Phan, G., & Koh, K. (2004). Comparability of bilingual versions of assessments: Sources of incomparability of English and French versions of the Canada’s national achievement tests. Applied Measurement in Education, 17(3), 301321. Ercikan, K. & Koh, K. (2005). Examining the construct comparability of the English and French versions of TIMSS. International Journal of Testing, 5(1), 23-35. Faranda, R. & Nolle, D.B. (2010). Boundaries of ethnic identity in central asia: Titular and russian perceptions of ethnic commonalities in kazakhstan and kyrgyzstan. Ethnic and Racial Studies, 34(4), 620–642. Fierman, W. (1991). The soviet transformation of central asia. In W. Fierman (Ed.), Soviet Central Asia: The Failed Transformation (pp. 11-35). Boulder, CO: Westview Press. Fierman, W. (1995). Introduction: The division of linguistic space. Nationalities Papers, 23(3), 507–513. 294 Furr, M., & Bacharach, Y. (2008). Psychometrics: An introduction. Los Angeles: Sage Publications. 
Gafni, N., & Canaan-Yehoshafat, Z. (1993, October). An examination of differential item functioning for hebrew and russian-speaking examinees in Israel. Paper presented at the Conference of the Israeli Psychological Association, Ramat-Gan. Gierl, M., Rogers, W.T., & Klinger, D. (1999, April). Using statistical and judgmental reviews to identify and interpret translation DIF. Paper presented at the Symposium Translation DIF: Advances and Applications, Annual Meeting of the National Council on Measurement in Education (NCME), Montreal, Canada. Gierl, M.J. & Khaliq, S.N. (2001). Identifying sources of differential item and bundle functioning on translated achievement tests. Journal of Educational Measurement, 38, 164-187. Glenn, C. (1995). Educational freedom in eastern europe. Washington, DC: Cato Institute. Grenoble, L. (2003). Language policy in the soviet union. Boston, MA: Kluwer Academic Press. Grisay, A. de Jong, J.H.L., Gebhardt, E., Berezner, A., & Halleux, B. (2006, July). Translation equivalence across PISA countries. Paper presented at the 5th Conference of the International Test Commission, Brussels, Belgium. Grisay, A. & Monseur, C. (2007). Measuring the equivalence of item difficulty in the various versions of an international test. Studies in Educational Evaluation, 33, 69-86. Hambleton, R.K., & Rogers, H.J. (1989). Detecting potentially biased test items: Comparison of IRT and Mantel-Haenszel Methods. Applied Measurement in Education, 2(4), 313-334. Hambleton, R.K., Clauser, B.E., Mazor, K.M., Jones, R.W. (1993). Advances in the detection of differentially functioning test item. European Journal of Psychological Assessment, 9, 118. Hambleton, R. & Kanjee, A. (1995). Increasing the validity of cross-cultural assessments: Use of improved methods for test adaptations. European Journal of Psychological Assessment, 11(3), 147-157. 295 Hambleton, R. (2005). Issues, designs, and technical guidelines for adapting tests into multiple languages and cultures. In R. Hambleton, P. Merenda, & C. Spielberger (Eds.), Adapting Educational and Psychological Tests for Cross-Cultural Assessment. London: Lawrence Erlbaum Associates. Helimskaia, R. I. (1994). Taina chong-tasha. [The Secret of Chong-Tash]. Bishkek: Ilim. Herczynski, J. (2003). Key issues of governance and finance of kyrgyz education. Problems of Economic Transition, 45(10), 58-103. Heyneman, S., Anderson, K. & Nuralieva, N. (2008). The cost of corruption in higher education. Comparative Education Review, 52(1), 1-25. Holland, P. & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Lawrence Earlbaum Associates. Hu, Z. & Imart, G. (1989). A kirghiz reader. Bloomington, IN: Research Institute for Inner Asian Studies. Huskey, E. (1995). The politics of language in kyrgyzstan. Nationalities Papers, 23(1), 549- 572. International Crisis Group, (2003). Youth in central asia: Losing the new generation (Asia Report No. 66). Osh/Brussels. Jodoin, M.G. & Gierl, M.J. (2001). Evaluating type I error and power using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4), 329-349. Johnson, M. (2004). The legacy of russian and soviet education. In S. Heyneman and A. De Young (Eds.), The Challenge of Education in Central Asia (pp. 21-36). Greenwich, CT: Information Age Publishing. Joldersma, K. (2008). Comparability of multi-lingual assessments: An extension of meta-analytic methodology to instrument validation. 
Unpublished doctoral dissertation, Michigan State University, East Lansing. 296 Kirk, R.E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759. Kok, F. (1988). Item bias and test multidimensionality. In R. Langeheine & J. Rost (Eds.), Latent Trait and Latent Class Models. (pp. 263-274). New York: Plenum. Korth, B. (2004). Education and linguistic division in kyrgyzstan. In S. Heyneman and A. De Young (Eds.), The Challenge of Education in Central Asia (pp. 97-112). Greenwich, CT: Information Age Publishing. Korth, B. (2005). Language attitudes towards kyrgyz and russian: Discourse, education and policy in post-soviet Kyrgyzstan. Bern: Peter Lang. Kutueva, A. (2008, May 22). Po dannym antikorruptsionnogo komiteta za 2007 god, Ministerstvo obrazovaniya i nauki Kyrgyzstana stoit na vtorom meste po urovnyu korruptsii. [According to data from the anti-corruption committee in 2007, the ministry of education and science is in second place for highest levels of corruption]. Retrieved from the website of Information Agency 24.KG: http://www.24.kg/community/2008/05/22/85241.html Landau , J. & Kellner-Heinkele, B. (2000). Politics of language in the ex-soviet muslim states. Ann Arbor, MI: University of Michigan Press. Mambetaliev, R. (2003, September 11). Nasha zadacha – dostup i kachestvo obrazovaniya. [Our task - Access and quality of education]. Obshestveni Rating, No. 35 (157). Mazor, K. (1993). An investigation of the effects of conditioning on two ability estimates in DIF analyses when the data are two-dimensional. Unpublished doctoral dissertation, University of Massachusetts, Amherst. Mazor, K.M., Kanjee, A., Clauser, B.E. (1995). Using logistics regression and the mantelhaenszel with multiple ability estimates to detect differential item functioning. Journal of Educational Measurement, 32, 131-144. Mazor, K., Hambelton, R.K., & Clauser, B.E. (1998). The effects of matching on unidimensional subtest scores. Applied Psychological Measurement, 22, 357-367. Mazor, Clauser, Hambleton, (1992). The effect of sample size on the functioning of the mantelhaenszel statistic. Educational and Psychological Measurement, 52, 443-451. 297 McGraw, K.O. & Wong, S.P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30-46. Megoran, N. (2002). The Borders of Eternal Friendship?: The politics and pain of nationalism and identity along the uzbekistan-kyrgyzstan ferghana valley boundary, 1999-2000. Unpublished doctoral dissertation, Cambridge University, Cambridge. Mellenbergh, G.J. (1982). Contingency table models of assessing item bias. Journal of Educational Statistics, 7, 105-118. Messick, S. (1988). The once and future issues of validity: Assessing the meaning and consequences of measurement. In H. Wainer & H.I. Braun (Eds.), Test Validity (pp. 3345). Hillsdale, NJ: Erlbaum. Milroy, J. (2001). Language ideologies and the consequences of standardization. Journal of Sociolinguistics, (5/4), 530-555. Oxford: Blackwell publishers. National Statistical Committee of the USSR (1989). Narodnoye obrazovanie i kultura v SSSR. {Education and Culture in the USSR}. Moscow, USSR. National Statistics Committee (2000). Obrazovanie v kyrgyzskoi respublike. {Education in the Kyrgyz Republic}. Bishkek: Kyrgyzstan. National Governors Association (NGA), Council of Chief State School Officers, & Achieve (2008). Benchmarking for success: Ensuring U.S. students receive a world-class education. 
http://www.achieve.org/BenchmarkingforSuccess Narayanan, P. & Swaminathan, H. (1994). Performance of the mantel-haenszel and simultaneous item bias procedures for detecting differential item functioning. Applied Psychological Measurement, 18, 315-338. Narayanan, P. & Swaminathan, H. (1996). Identification of items that show non-uniform DIF. Applied Psychological Measurement, 20(3), 257-274. Oruzbaeva, B. (1997). Kirgizskii Yazyk: Yazyki Mira, Turkskie Yazyki. [Kyrgyz Language: Languages of the World, Turkic Languages]. Bishkek: Izdatel’skii Dom. 298 OSI (Open Society Institute) (2002). Educational development in kyrgyzstan, tajikistan and uzbekistan: challenges and ways forward. Retrieved January 21, 2011, from http://www.soros.org/initiatives/esp/articles_publications/publications/development_2002 0401 Organization for Economic Cooperation and Development (OECD), (2003). PISA Technical Report 2003. http://www.pisa.oecd.org/dataoecd/49/60/35188570.pdf. pp. 1-426. Osipian, A. (2007, February). Corruption in higher education: Conceptual approaches and measurement techniques. Paper presented at the meeting of the Comparative and International Education Society (CIES), Baltimore, MD. Plake, B.S. (1980). A comparison of statistical and subjective procedures to ascertain validity: one step in the test validation process. Educational and Psychological Measurement, 40, 397- 404. Podol’skaya, D. (03/03/11). V Rossii na Zarabotkax naxodyatsya 548 tysyach Kyrgyzstantsev. [There are 548 thousand Kyrgyzstanis Working in Russia] – ИА «24.kg». http://mirror24.24.kg/parlament/94490-v-rossii-na-zarabotkax-naxodyatsya-548tysyach.html Poortinga, Y.H. (1983). Psychometric approaches to intergroup comparison: The problem of equivalence. In S.H. Irvine and J.W. Berrey (Eds.), Human Assessment and CrossCultural Factors (pp. 237-258). New York: Plenum Press. Poortinga, Y.H. (1989). Equivalence of cross-cultural data: An overview of basic issues. International Journal of Psychology, 24, 737-756. Presidential Decree No. 91. (2002, April 18). “O dal’neyshih merax po obespecheniyu kachestva obrazovaniya i sovershenstvovaniyu upravleniya obrazovatel’nymi protsessami v Kyrgyzskoy Respublike.” [About further measures for ensuring quality education and improving the administration of educational processes in the Kyrgyz Republic]. Reckase, M.D. (1985). The difficulty of items that measure more than one ability. Applied Psychological Measurement, 9, 401. 299 Reckase, M.D., & Kunce, C. (2002). Translation accuracy of a technical credentialing examination. International Journal of Continuing Engineering Education and Lifelong Learning, 12(1-4), 167-180. Reeves, M. (2005). Of credits, kontrakty, and critical thinking: encountering ‘market reforms’ in kyrgyzstani higher education. European Educational Research Journal, 4(1). pp. 5-21. RIA News, Moscow. (2007, February 5). Vvedenie v rossii edinogo gosudarstvennogo ekzamena (EGE) yavlyaetsya oshibkoy, ubezhden spiker sovyeta federatsii sergey mironov. [Speaker of federal Soviet thinks the Unified State Exam (USE) is a mistake]. Retrieved from http://www.spravedlivo.ru/news/section_385/738.smx Robin, F., Sireci, S., & Hambleton, R. (2003). Evaluating the equivalence of different language versions of a credentialing exam. International Journal of Testing, 3(1), 1-20. Roccase, S., & Moshinsky, A. (1997). Factors affecting the difficulty of verbal analogies (NITE Report No. 239). Jerusalem: National Institute for Testing and Evaluation. Rogers, H.J. (1989). 
Rogers, H.J. & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17(2), 105-116.
Roussos, L., & Stout, W. (1993). A multidimensionality-based DIF analysis paradigm. Applied Psychological Measurement, 20, 355-370.
Sait Halman, T. (1981). 201 Turkish verbs fully conjugated in all the tenses. New York: Barron’s Educational Series.
Scheuneman, J.D. (1982). A posteriori analyses of biased items. In R.A. Berk (Ed.), Handbook of Methods for Detecting Test Bias (pp. 180-198). Baltimore, MD: Johns Hopkins Press.
Schmitt, A.P. (1988). Language and cultural characteristics that explain differential item functioning for Hispanic examinees on the Scholastic Aptitude Test. Journal of Educational Measurement, 25(1), 1-13.
Schmitt, A.P., & Bleistein, C.A. (1987). Factors affecting differential item functioning for Black examinees on Scholastic Aptitude Test analogy items (Research Report 87-23). Princeton, NJ: Educational Testing Service.
Shamatov, D. & Niyozov, S. (2010). Teachers surviving to teach: Implications for post-Soviet education and society in Tajikistan and Kyrgyzstan. In J. Zajda (Ed.), Globalization, Ideology and Education Policy Reforms: Globalization, Comparative Education and Policy Research. Springer Science + Business Media B.V.
Shamatov, D. (in press). Everyday realities of a young teacher in post-Soviet Kyrgyzstan: A case of a history teacher from a rural school. In P. Akcali & C.E. Demir (Eds.), Post-Soviet Kyrgyzstan: Political and Social Challenges. Routledge.
Shealy, R., & Stout, W.F. (1993). A model-based standardization approach that separates true DIF/bias from group differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159-194.
Silova, I. (2009). The crisis of the post-Soviet teaching profession in the Caucasus and Central Asia. Research in Comparative and International Education, 4(4). Retrieved from www.wwwords.co.uk/RCIE
Sireci, S.G., & Allalouf, A. (2003). Appraising item equivalence across multiple languages and cultures. Language Testing, 20(2), 148-166.
Sireci, S.G., Patsula, L., & Hambleton, R. (2005). Statistical methods for identifying flaws in the test adaptation process. In R. Hambleton, P. Merenda, & C. Spielberger (Eds.), Adapting Educational and Psychological Tests for Cross-Cultural Assessment. London: Lawrence Erlbaum Associates.
Solano-Flores, G. (2006). Measurement error in the testing of English language learners. Teachers College Record, 108(11), 2354-2379.
Soktoev, I.A. & Usubaliev, E.T. (1982). Vyshee shkola sovetskaya kirgizstana. [Higher education in Soviet Kyrgyzstan]. Frunze: Kyrgyzstan.
Steiner-Khamsi, G., Teleshaliyev, N., Sheripkanova-MacCleod, G., & Moldokmatova, A. (2011). Ten-plus-one ways of coping with teacher shortage in Kyrgyzstan. In I. Silova (Ed.), Globalization on the Margins: Education and Postsocialist Transformations in Central Asia (pp. 203-232). Charlotte, NC: Information Age Publishing.
Subkoviak, M.J., Mack, J.S., Ironson, G.H., & Craig, R.D. (1984). Empirical comparison of selected item bias procedures with bias manipulation. Journal of Educational Measurement, 25, 301-319.
Swaminathan, H. & Rogers, H.J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361-370.
Tittle, C.K. (1982). Use of judgmental methods in item bias studies. In R.A. Berk (Ed.), Handbook of Methods for Detecting Test Bias (pp. 31-63). Baltimore, MD: Johns Hopkins Press.
Toursunov, H. (2010, May 11). Jumping off a sinking ship. Transitions On-Line. Retrieved from http://www.tol.org/client/article/21435-jumping-off-a-sinking-ship.html
USSR Government Committee for Statistics (1989). Narodnoye obrazovanie i kultura v SSSR: Statisticheskii sbornik. [Education and culture in the USSR: Statistical collection]. Moscow: Finance and Statistics.
Valkova, I. (2001). My symphony: Interview with the minister of education of the Kyrgyz Republic, Camilla Sharshekeeva. Thinking Classroom, 6. Vilnius, Lithuania: International Reading Association.
Valkova, I. (2004). Getting ready for the national scholarship test: Study guide for abiturients. Bishkek: CEATM.
Valyaeva, G. (2006, September). Standardized testing for university admissions in Kazakhstan: A step in the right direction? Paper presented at the Central Eurasian Studies Conference, University of Michigan, Ann Arbor, MI.
Van de Vijver, F. & Tanzer, N.K. (1999). Bias and equivalence in cross-cultural assessment: An overview. European Review of Applied Psychology, 47, 263-279.
Van de Vijver, F. & Poortinga, Y. (2005). Conceptual and methodological issues in adapting tests. In R. Hambleton, P. Merenda, & C. Spielberger (Eds.), Adapting Educational and Psychological Tests for Cross-Cultural Assessment. London: Lawrence Erlbaum Associates.
Wright, S. (1999). Kyrgyzstan: The political and linguistic context. Current Issues in Language & Society, 6(1), 85-91.
Yeh, S.S. (2005). Limiting the unintended consequences of high-stakes testing. Education Policy Analysis Archives, 13(43), 1-23.
Zheng, Y., Gierl, M., & Cui, Y. (2005). Using real data to compare DIF detection and effect size measures among Mantel-Haenszel, SIBTEST, and logistic regression procedures. University of Alberta: Centre for Research in Applied Measurement and Evaluation.
Zumbo, B.D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources and Evaluation, Department of National Defense.
Zumbo, B.D. (2003). Does item level DIF manifest itself in scale level analysis? Implications for translating language tests. Language Testing, 20(2), 136-147.