ASSESSING THE VALIDITY OF ACTFL CAN-DO STATEMENTS FOR SPOKEN PROFICIENCY By Sonia Magdalena Tigchelaar A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Second Language Studies—Doctor of Philosophy 2018 ABSTRACT ASSESSING THE VALIDITY OF ACTFL CAN-DO STATEMENTS FOR SPOKEN PROFICIENCY By Sonia Magdalena Tigchelaar The NCSSFL-ACTFL (2015) Can-Do Statements describe what language learners can do in the target language at the various ACTFL Proficiency sublevels. Unlike the proficiency descriptors and corresponding can-do statements associated with the CEFR (Council of Europe, 2001), which have been extensively scaled and refined using Rasch modeling (North, 2000; North & Schneider, 1998), the NCSSFL-ACTFL statements have yet to be empirically tested. Both the scale and its foreign language performance indicators were constructed from language teachers’ beliefs and experiences (Shin, 2013). While this is a logical starting point, concerns include whether the difficulty levels of the skills described in the statements match their assigned ACTFL proficiency levels, and whether each statement accurately measures the underlying construct: language proficiency on the ACTFL subscales. This study addresses these concerns by analyzing a self-assessment instrument composed of fifty NCSSFL-ACTFL (2015) Can-Do Statements targeting spoken language proficiency. American university students of varying proficiency levels in Spanish language classes (N = 382) rated the Can-Do Statements as: 1 (I cannot do this yet), 2 (I can do this with much help), 3 (I can do this with some help), 4 (I can do this). I analyzed their item responses using a Rasch rating scale model (Andrich, 1978; Rasch, 1960/1980). I compared the difficulty levels estimated by the model to the proficiency levels assigned to the statements, and assessed each item’s fit to the model by considering the item’s infit and outfit measures. 
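The infit and outfit statistics mentioned above can be illustrated concretely. The sketch below is not the study's own analysis (which fit a polytomous rating scale model to the 4-point responses); it uses the simpler dichotomous Rasch case with synthetic data, and all names and values are illustrative assumptions.

```python
import numpy as np

def rasch_prob(theta, b):
    """Probability of endorsing an item of difficulty b for persons with abilities theta."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def item_fit(responses, theta, b):
    """Return (infit, outfit) mean-square statistics for one dichotomous item.

    Outfit is the unweighted mean of squared standardized residuals, so it is
    sensitive to unexpected responses from persons far from the item's difficulty;
    infit weights each squared residual by its model variance, emphasizing
    persons targeted near the item.
    """
    p = rasch_prob(theta, b)
    var = p * (1.0 - p)                  # model variance of each response
    sq_resid = (responses - p) ** 2
    outfit = np.mean(sq_resid / var)     # unweighted mean square
    infit = sq_resid.sum() / var.sum()   # information-weighted mean square
    return infit, outfit
```

For data generated from the model itself, both mean squares hover near 1.0; values far from 1 flag potential misfit under Linacre's interpretation of mean-square fit statistics (see Table 3).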
The mean item difficulties estimated by the Rasch model increased in line with the difficulty predicted by the ACTFL scale at the major threshold proficiency levels, and the differences between those levels were statistically significant. The mean item difficulties did not show statistically significant increases across the ACTFL proficiency sublevels, and there was a decrease in difficulty from Advanced Low (M = 1.72, SD = 1.51) to Advanced Mid (M = 1.37, SD = 1.32) items, rather than an increase. The analysis also revealed 14 items that did not fit the model measuring spoken proficiency. In the second phase of the study, I revised the self-assessment instrument based on the findings of the first phase, and the revised assessment was used for a second round of testing. Spanish language learners (N = 886) rated the ACTFL (2015) Can-Do Statements in the revised instrument. I analyzed their item responses using an exploratory factor analysis (EFA) and a Rasch rating scale model. The results of the EFA revealed two possible models of spoken language proficiency as represented by the Can-Do Statements included in the instrument: a unidimensional model in line with ACTFL’s unitary and hierarchical model of spoken proficiency, and a two-factor model. The Rasch analysis revealed that some of the items and some of the test takers did not behave as expected. The analysis also replicated the finding that the mean item difficulties estimated by the Rasch model increased in line with the ACTFL scale at the major threshold proficiency levels, and that the differences between those levels were statistically significant. The mean item difficulties in the revised assessment also ascended according to the ACTFL sublevels: There were significant differences between items from the lower proficiency sublevels, but the instrument did not discriminate well between statements pegged at higher proficiency sublevels. 
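The exploratory factor analysis referred to above essentially asks whether one latent dimension is enough to account for the correlations among the items. A minimal sketch of that logic, using synthetic responses driven by a single latent trait (the data and parameter values below are invented for illustration, not the study's):

```python
import numpy as np

rng = np.random.default_rng(42)
n_persons, n_items = 500, 10

# One latent "speaking proficiency" trait drives all items (synthetic data).
ability = rng.normal(0, 1, n_persons)
loadings = rng.uniform(0.5, 0.9, n_items)
noise = rng.normal(0, 1, (n_persons, n_items))
scores = ability[:, None] * loadings + noise * np.sqrt(1 - loadings**2)

# Eigenvalues of the inter-item correlation matrix: a single dominant first
# eigenvalue (the "elbow" on a scree plot) suggests one underlying dimension.
corr = np.corrcoef(scores, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
print(eigenvalues[0] / eigenvalues[1])  # a ratio well above 1 points to unidimensionality
```

With truly one-factor data like this, the first eigenvalue absorbs most of the shared variance and the rest fall near or below 1, which is the pattern a scree plot makes visible.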
Findings are discussed in terms of how the NCSSFL-ACTFL (2015) Can-Do Statements can be used to self-assess spoken language proficiency, and how the statements should be assessed for content validity and psychometric value. ACKNOWLEDGMENTS No assessment task is entirely satisfactory. Each format has its own weaknesses. Rather than searching for one ideal task type, the assessment designer is better advised to include a reasonable variety in any test or classroom assessment system so that the failings of one format do not extend to the overall system. (Green, 2014, p. 140) As I’ve been working on this study of self-assessment, I have had the opportunity to reflect on my own self-assessments. One of the well-documented failings of self-assessment is the Dunning-Kruger effect, whereby less experienced students tend to overestimate what they are capable of, while students with more experience and knowledge tend to underestimate their skills. As I approach the end of my dissertation project (with far more experience and knowledge than when I first started out), I have sometimes concluded that this research is not worth doing or that my work is not good enough. Thankfully, I have also received a preponderance of other evaluations along the way that have provided very different perspectives: “Really nice job” (S. Gass, personal communication, May 15, 2017); “Your dissertation is good” (C. Polio, personal communication, April 23, 2018); “Fantastic…Really great!!!” (P. Winke, personal communication, April 19, 2018); “Because I plan to include this manuscript in the Fall 2017 issue of Foreign Language Annals, I am writing this afternoon to let you know that I look forward to receiving a revised version that addresses our queries and suggestions as soon as possible” (A. 
Nerenz, personal communication, June 24, 2017); “I am pleased to inform you that your application for the MLJ/NFMLTA Dissertation Support Grant has been recommended for funding by the NFMLTA Dissertation Support Grant Committee members. Congratulations!” (A. Schleicher, personal communication, November 10, 2017). These perspectives have forced me to re-evaluate my self-assessments and provided a more balanced view of my work. I would like to thank a number of my outside evaluators, whose encouragement and support have helped me to stay on track over the course of this project. First, I would like to thank my dissertation co-chairs, Drs. Charlene Polio and Paula Winke. I am grateful for Charlene’s insight and leadership throughout the course of my doctoral career. Thank you, Paula, for your enthusiastic support, encouragement, and ability to see possibilities where I see challenges. I am also grateful to Dr. Ryan Bowles for introducing me to the fascinating world of measurement. To Dr. Sue Gass, thank you (and Paula) for generously making the Flagship Grant data available to me, and for the wonderful opportunity to work together with you on SSLA. More generally, I would like to thank the Second Language Studies program faculty members and past and current SLS and MA TESOL students for your support and camaraderie along the way. On a more personal note, I would also like to thank my family for their support and inspiration. I’m thankful to have grown up with parents and siblings who place great value on discovery and learning. I’m fortunate to have found a similar love of learning and curiosity about life in my in-laws. I am particularly grateful to my brother Evan for his patient, enthusiastic, and loving childcare while I’ve returned to work. Thank you, baby Meriwether, for inspiring me to set aside procrastination and to use my time efficiently, and for motivating me to set an example of hard work and perseverance. 
And last, and most importantly, thank you Joel for your patience, for always reminding me to trust my outside evaluators, and for your unwavering belief in me. TABLE OF CONTENTS LIST OF TABLES ....................................................................................................................... viii LIST OF FIGURES ........................................................................................................................ x INTRODUCTION .......................................................................................................................... 1 BACKGROUND ............................................................................................................................ 3 Defining the construct: Oral proficiency .................................................................................... 3 Dimensions of second language proficiency. ......................................................................... 5 Assessing the validity of oral proficiency self-assessments ....................................................... 8 Factors that threaten the validity of self-assessment items ....................................................... 12 Motivation for the current study ............................................................................................... 15 PHASE I: SPRING 2015 PROFICIENCY TESTING ................................................................. 19 Methods..................................................................................................................................... 19 Participants. ........................................................................................................................... 20 Materials. .............................................................................................................................. 20 Procedure. 
............................................................................................................................. 21 Data analysis. ........................................................................................................................ 23 Results ....................................................................................................................................... 26 Fit to the Rasch model. ......................................................................................................... 26 Item difficulty estimates. ...................................................................................................... 33 Discussion ................................................................................................................................. 39 PHASE II: SPRING 2017 PROFICIENCY TESTING ................................................................ 45 Methods..................................................................................................................................... 46 Participants. ........................................................................................................................... 46 Materials. .............................................................................................................................. 46 Procedure. ............................................................................................................................. 49 Data Analysis. ....................................................................................................................... 50 Exploratory factor analysis (EFA). ....................................................................................... 51 Rasch analysis. ...................................................................................................................... 
52 Results ....................................................................................................................... 52 Factor analysis. ..................................................................................................... 52 Fit to the Rasch model. ......................................................................................................... 57 Item difficulty estimates. ...................................................................................................... 70 Discussion ................................................................................................................................. 74 Dimensions of spoken proficiency........................................................................................ 75 Fit to the Rasch model. ......................................................................................................... 79 Item difficulty. ...................................................................................................................... 81 DISCUSSION AND CONCLUSION .......................................................................................... 85 Conclusions ............................................................................................................................... 90 ENDNOTES ................................................................................................................................. 92 APPENDICES .............................................................................................................................. 94 APPENDIX B: Phase I principal components analysis ............................................................ 98 APPENDIX C: ACTFL OPIc 1-5 levels and revised Can-Do Statements ............................. 100 APPENDIX D: Phase II principal components analysis ........................................................ 
103 REFERENCES ........................................................................................................................... 105 LIST OF TABLES Table 1: Number of participants who completed each level of the self-assessment ......................... 22 Table 2: Distribution of 2015 OPIc ratings by class level ....................................................................... 23 Table 3: Linacre’s interpretation of mean-square fit statistics .............................................................. 24 Table 4: Misfitting items from the 2015 self-assessment questionnaire .............................................. 29 Table 5: Fit statistics for 36 fitting items ...................................................................................................... 31 Table 6: Descriptive statistics for difficulty estimates of ACTFL threshold levels ........................... 33 Table 7: Descriptive statistics for difficulty estimates of ACTFL sublevels ....................................... 34 Table 8: Advanced-level items in order of difficulty .................................................................................. 38 Table 9: 14 misfitting items and their replacements .................................................................................. 47 Table 10: Number of participants who completed each level of the self-assessment....................... 49 Table 11: Distribution of 2017 OPIc ratings by class level .................................................................... 50 Table 12: Factor loadings for the 1-factor and 2-factor models ........................................................... 55 Table 13: Misfitting items from the initial 2017 Rasch model ................................................................ 59 Table 14: More misfitting items ....................................................................................................................... 
60 Table 15: Misfitting people, ratings and response strings (and most unexpected responses) ...... 61 Table 16: Final model fit statistics for the revised self-assessment questionnaire ........................... 65 Table 17: Fit statistics for the fourteen replacement items included in the revised Spring 2017 assessment .................................................................................................................................................... 67 Table 18: Fit statistics for original and revised items .............................................................................. 68 Table 19: Descriptive statistics for 2017 difficulty estimates of ACTFL threshold levels .............. 72 Table 20: Descriptive statistics for 2017 difficulty estimates of ACTFL sublevels .......................... 73 Table 21: Items with large and significant DIF. ......................................................................................... 95 Table 22: ACTFL OPIc level 1 Can-Do Statements .................................................................................. 95 Table 23: ACTFL OPIc level 2 Can-Do Statements .................................................................................. 95 Table 24: ACTFL OPIc level 3 Can-Do Statements .................................................................................. 96 Table 25: ACTFL OPIc level 4 Can-Do Statements .................................................................................. 96 Table 26: ACTFL OPIc level 5 Can-Do Statements .................................................................................. 97 Table 27: First contrast in the original Rasch model ............................................................................... 98 Table 28: First contrast in the final Rasch model ...................................................................................... 
99 Table 29: ACTFL OPIc level 1 Can-Do Statements ................................................................................ 100 Table 30: ACTFL OPIc level 2 Can-Do Statements ................................................................................ 100 Table 31: ACTFL OPIc level 3 Can-Do Statements ................................................................................ 101 Table 32: ACTFL OPIc level 4 Can-Do Statements ................................................................................ 101 Table 33: ACTFL OPIc level 5 Can-Do Statements ................................................................................ 102 Table 34: First contrast in the original Rasch model ............................................................................. 104 Table 35: First contrast in the final Rasch model .................................................................................... 104 LIST OF FIGURES Figure 1: ACTFL (2012) Proficiency levels. Reprinted with permission. ............................................. 3 Figure 2: Test takers’ path................................................................................................................................. 21 Figure 3: Distribution of OPIc ratings in Spring 2015. The numeric OPIc ratings from 1-9 represent the range of ACTFL proficiency levels from Novice Low to Advanced High. .... 23 Figure 4: Wright map of test taker ability and item difficulty. A period indicates one person. A hash mark is equal to three people. ....................................................................................................... 36 Figure 5: Distribution of OPIc ratings in Spring 2017. The numeric OPIc ratings from 1-9 represent the range of ACTFL proficiency levels from Novice Low to Advanced High, and Above Range and Below Range are represented by -1 and 0, respectively. ............................. 50 Figure 6: Scree plot of item-level EFA. 
........................................................................................................ 55 Figure 7: Wright map of test taker ability and item difficulty. ............................................................... 71 Figure 8: 95% confidence intervals of the mean threshold difficulty estimates. 1 = Novice; 2 = Intermediate; 3 = Advanced; 4 = Superior. ......................................................................................... 72 Figure 9: 95% confidence intervals of the mean sublevel difficulty estimates. 1 = NL; 2 = NM; 3 = NH; 4 = IL; 5 = IM; 6 = IH; 7 = AL; 8 = AM; 9 = AH; 10 = S. .............................................. 74 Figure 10: Phase I standardized residual contrast plot. ............................................................................. 98 Figure 11: Phase II standardized residual contrast plot for the original model. .............................. 103 Figure 12: Phase II standardized residual contrast plot for the final model. .................................... 103 INTRODUCTION Traditionally, research on assessment in the field of second language (L2) learning has focused on the assessment of language learning, but recent trends in educational assessment have called for the use of assessment for language learning (Butler, 2016; Lee, 2016; Nikolov, 2016; Purpura & Turner, 2014, 2015; VanPatten, Trego, & Hopkins, 2015). Such a shift requires that L2 learners gain awareness of their language abilities and deficiencies by taking a more active role in their assessment, rather than being the passive recipients of an outsider’s rating or judgment of their proficiency. One way that they can participate in this process is to reflect on their own language use by engaging in self-assessment. To facilitate this process, materials have been developed for language learners to take stock of what they can do in the target language. 
Can-do statements were first published as part of the Swiss European Language Portfolio project (Schneider & North, 2000). The organizations behind widely used proficiency standards, including the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2001), WIDA (2014), and the American Council on the Teaching of Foreign Languages (ACTFL, 2012, 2015, 2017), have since developed can-do statements that relate to their existing proficiency standards. These statements were designed to be used by language learners as “self-assessment checklists...to assess what they ‘can do’ with language” (ACTFL, 2015). Some of the benefits associated with can-do statements are that they are positive, concrete, clear, brief, and can promote independence (Fang, Yang & Zhu, 2011). They are designed to be psychologically affirming, focusing on abilities rather than deficiencies (e.g., I can schedule an appointment). Can-do statements are also clear and brief. They use specific, understandable language that describes functional skills rather than linguistic jargon (e.g., I can describe a childhood experience rather than I can narrate in the past tense) and they divide complex language features into short, simple descriptions (e.g., I can describe a place I have visited). Finally, in line with learner-centered language teaching (Lee, 2016; Little, 2005) and the use of assessment for language learning (Butler, 2016; Purpura & Turner, 2014, 2015), can-do statements are designed to promote language awareness (Sweet, Mack, & Olivero-Agney, in press) and independence by allowing learners to take stock of what they can do in their L2. Experts in language assessment have researched can-do statements and their corresponding language proficiency descriptors to validate them for assessment purposes (e.g., Faez, Majhanovich, Taylor, Smith, & Crowley, 2011; Jones, 2002; North, 2000; Shin, 2013; Weir, 2005; WIDA, 2014). 
One main concern is that the ACTFL (2012) proficiency scales were “constructed according to the shared experiences and beliefs of language teachers and experts” (Shin, 2013, p. 2). While this is a reasonable first step in the construction of language proficiency descriptors, without psychometric validation, it is not certain that the abilities described in the scales and their assigned difficulties are valid or useful for measuring language proficiency. In the case of the European scale (CEFR; Council of Europe, 2001), extensive empirical work has been done to scale the proficiency descriptors using Rasch modeling (North, 2000; North, 2011; North & Schneider, 1998). However, the ACTFL (2012, 2015) scale and its corresponding language descriptors have yet to be empirically tested. Therefore, more work needs to be done to provide empirical evidence for the construct validity of the ACTFL (2015) Can-Do Statements for language proficiency. The purpose of this study was to assess the validity of a selection of ACTFL Can-Do Statements targeting spoken proficiency that were used for self-assessment by university-level Spanish language learners. BACKGROUND Defining the construct: Oral proficiency The ACTFL Proficiency Guidelines for speaking are considered a model of oral language proficiency, in the sense that they provide “a theoretical overview of what we understand by what it means to know and use a language” (Fulcher & Davidson, 2009, p. 126). The Guidelines for speaking describe what language learners can do with language at five major proficiency levels: Novice, Intermediate, Advanced, Superior, and Distinguished. The first three major levels are further divided into Low, Mid, and High sublevels (ACTFL, 2012, p. 3). These proficiency levels are shown in the pyramid in Figure 1, colloquially known as the ACTFL pyramid. Figure 1: ACTFL (2012) Proficiency levels. Reprinted with permission. 
The spoken proficiency guidelines describe some of the linguistic features of second language speech (e.g., accuracy, discourse structure, language functions) at each level. For example, Advanced level speakers are expected to be able to narrate and describe in all the major time frames. This is distinct from Superior level speech, which is marked by the ability to use argumentation and to hypothesize. The Guidelines also specify content and tasks that speakers should be able to accomplish at the major threshold levels (i.e., Novice, Intermediate, Advanced, Superior) and each of the proficiency sublevels. For example, at the Advanced threshold level, speakers should be able to express themselves on concrete and familiar topics, while abstract topics are representative of the content of Superior-level language use. The ACTFL (2012) Guidelines and the NCSSFL-ACTFL (2015, 2017) Can-Do Statements further specified tasks at each of the proficiency sublevels (e.g., Novice Low, Novice Mid, Novice High). In the case of the 2015 Can-Do Statements, ACTFL defined specific performance indicators for two spoken modalities: presentational speaking (“learners present information, concepts, and ideas to inform, explain, persuade, and narrate on a variety of topics using appropriate media and adapting to various audiences of listeners, readers, or viewers,” The National Standards Collaborative Board (NSCB), 2015) and interpersonal communication (“learners interact and negotiate meaning… to share information, reactions, feelings, and opinions,” NSCB, 2015). The 2017 publication also included statements in a third category called intercultural communicative competence. There are between five (Novice Low) and twenty-five (Novice Mid) indicators for each of the eleven proficiency sublevels in both modes: Each indicator is in the form of a Can-Do Statement (e.g., I can tell someone my name, Novice Low, Interpersonal Communication). 
In addition to the qualities of second language speech reviewed above, what distinguishes speech from one proficiency level to the next, according to the ACTFL (2012) Guidelines for speaking, is in part defined by the idea of language quantity. Language users at the High level of a proficiency band (who have more language than those at lower levels) are expected to be able to do some of the tasks associated with the proficiency band directly above, but are unable to sustain performance at this level. For example, an Advanced High speaker may be able to “discuss some topics abstractly, especially those relating to their particular interests and special fields of expertise, but in general, they are more comfortable discussing a variety of topics concretely” (ACTFL, 2012, p. 5). Superior-level speakers, on the other hand, would be expected to sustain discussions on abstract topics. Dimensions of second language proficiency. The model of language proficiency represented by the ACTFL pyramid (see Figure 1) and by the language of the ACTFL (2012) Guidelines strongly suggests that spoken language proficiency grows hierarchically over time (with use, practice, and coursework): The “levels form a hierarchy in which each level subsumes all lower levels” (p. 3). The model depicted in the pyramid also represents a large, unitary construct that individuals can refer to as spoken language proficiency. The NCSSFL-ACTFL (2015, 2017) Can-Do Statements, however, appear to indicate that there are at least two dimensions (or subskills) of speaking ability within the larger construct of speaking ability in general, although these horizontal categorizations (or genres) of the spoken language construct are not indicated in the pyramid. In the original publication of the Can-Do Statements (NCSSFL-ACTFL, 2015) speaking performance was separated into two areas: presentational speaking and interpersonal communication. 
In the revised 2017 version of the Can-Dos, a third dimension was added: intercultural communicative competence. Thus, according to ACTFL, speaking can be considered a unitary construct, but it also has at least two prominent sub-categories, which means spoken language proficiency can also be seen as a multidimensional construct with at least two major categories. A unidimensional model of speaking is in line with many other second language assessments that consider speaking to be one of four skills. The Test of English as a Foreign Language Internet-Based Test (TOEFL iBT), for example, is divided into four modalities: speaking, listening, reading, and writing. In an item factor analysis of the structure of test takers’ performance on the TOEFL iBT, each of the four skills or modalities was found to be unidimensional (Sawaki, Stricker, & Oranje, 2005). McNamara (1991) used items from tasks that were considered to represent distinct aspects of L2 listening (listening comprehension and making inferences) in a Rasch model and showed that it was possible to construct a single dimension for the measurement of listening ability. These findings suggest that spoken language proficiency can be modeled mathematically as a unitary construct. At first glance, these models of L2 proficiency seem overly simplistic when compared to other theoretical models of language use (e.g., Bachman & Palmer, 1996, 2010; Canale & Swain, 1980; Celce-Murcia, Dörnyei, & Thurrell, 1995). Canale and Swain (1980), for example, proposed a longstanding and influential model of communicative language ability that included four dimensions: grammatical competence, discourse competence, sociocultural competence, and strategic competence, which can each be broken into subcategories. 
The domain of grammatical competence, for example, can be further broken down into smaller units (e.g., grammatical complexity, accuracy, lexical knowledge, and fluency), which can again be broken down to the point that the construct of language proficiency has been described as a true “Pandora’s Box” (Canale & Swain, 1980; McNamara, 1995). McNamara (1991), however, argued that “models should be judged in terms of their utility and not merely in terms of their relative approximation of reality” (p. 156). This perspective is widely held in the fields of language assessment and measurement. While no assessment can fully capture the complexity of a given human attribute (for example, IQ or motivation level), useful quantitative estimates can nevertheless be made (Bond & Fox, 2015). A useful illustration of this point is the difference between a two-dimensional map and the section of earth that it is designed to represent: “Even when it is known that the assumption of flatness is incorrect, that is, the model is at variance with what is known of the reality being modelled, such maps are useful and adequate for most purposes” (McNamara, 1991, p. 147). Thus, a measurement model of language proficiency that represents a simplification of what is known about the nature of L2 proficiency and development can still be useful. One last point to make about the nature and measurement of spoken proficiency regards the distinction between general spoken proficiency and an academic register (Bailey, 2007; Lin, 2015). Hulstijn (2007) hypothesized a core language proficiency minimally required to function communicatively in a second language. This core includes the mental representation of phonology and frequent lexical and grammatical constructions, and the ability to process these while speaking. 
Speakers with higher educational backgrounds will have profiles similar to those of speakers with lower educational backgrounds in terms of core language skills, but beyond this core there will be differences in proficiency. Speaking ability beyond the core requires higher order cognition: “explicit, conscious knowledge of all sorts of topics…as well as attention allocation, decision making, inferencing ability, and the like” (Hulstijn, 2007, p. 664). Thus, going beyond core language proficiency may rely on additional factors such as education and intelligence. Hulstijn’s idea is similar to Cummins’ (1979, 2008) basic interpersonal communicative skills (BICS), which he distinguished from cognitive/academic language proficiency (CALP). BICS encompasses speaking as conversational fluency, while CALP refers to a separate dimension of language proficiency that captures language users’ ability to express academic concepts and ideas. Cummins (1980, 1981) showed that these two dimensions of spoken proficiency in L2 English develop at different rates: English language learners who could function well conversationally required several more years to develop academic proficiency. Given this differential growth, in tests of second language oral proficiency, it may be possible to measure a general or conversational language dimension and another dimension that goes beyond conversational language.

Assessing the validity of oral proficiency self-assessments

Researchers who have investigated the self-assessment of oral proficiency have primarily been concerned with the concurrent validity of the self-assessment instruments, and particularly how well self-assessments correlate with other assessments (e.g., Brown, Dewey & Cox, 2014; Malabonga, Kenyon & Carpenter, 2005; Sweet et al., in press; Tigchelaar, in press; Trofimovich, Isaacs, Kennedy, Saito, & Crowther, 2014). This line of research has not produced conclusive results.
In a meta-analysis of self-assessment in language testing, Ross (1998) found that in 29 studies correlating self-assessments with outside measures of speaking skills, the average correlation and effect sizes were smaller than the correlations between self-assessment and reading or listening. He also observed a large range in the correlations between self-assessment and speaking in the studies he surveyed (from r = .09 to .78). He therefore argued, “self-assessment of speaking skill is quite susceptible to extraneous factors in the self-assessment process” (p. 9). Such factors might include the instrument used to conduct the self-assessment, the test taker’s interpretation of the assessment (e.g., assessing communicative intention rather than the success of that communication, Ross, 1998), or the purpose for self-assessment, among others. One factor that may lead researchers in applied linguistics to observe weak correlations between self-assessment and other test scores is the type of self-assessment instruments used in the research. Recently, Trofimovich et al. (2014) compared self- and other-assessments and found a very weak relationship between the two measures. This study was focused on L2 English learners’ ability to assess how native-like and comprehensible their spoken production was. They found a non-significant correlation for accent (r = .06, p = .50) and a very weak correlation for comprehensibility (r = .18, p = .03). It is possible that the weak associations found in this study were due to the fact that language learners are not trained (as applied linguists are) to assess linguistic components of L2 speech. In an earlier study with similar findings, Brantmeier (2006) hypothesized that the instrument she used may explain why self-assessments were poor predictors of other-assessment scores. Her participants assessed their abilities using a simple questionnaire to rate how well they thought they understood a reading passage.
Their ratings did not correlate with the results of an online placement test. She concluded that subsequent research in self-assessment should consider the use of “more contextualized, criterion-referenced self-assessment instruments” (p. 30). More recently, researchers have responded to this call by investigating the validity of proficiency descriptors and related can-do statements for the self-assessment of oral skills. Sweet et al. (in press) compared self- and other-assessment scores by correlating scores on a can-do questionnaire with ACTFL Oral Proficiency Interview by computer (OPIc) scores. They found that the strength of the relationship increased with language learning and self-assessment experience: Participants in the second semester of language study had much lower correlations (r = .37) than students in the sixth through eighth semester (r = .61). Stansfield, Gao, and Rivers (2010) found that self-assessments written with can-do statements led second language users to assess their proficiency more accurately than more general global proficiency statements. They also found a moderate correlation (r = .41) between self-assessments using can-dos and Oral Proficiency Interview (OPI) scores. Brown et al. (2014) found significant small to medium correlations between both pre-study-abroad OPI scores and self-assessments (r = .27) and post-study-abroad OPI scores and self-assessments (r = .21). Comparing the strength of the correlations observed in these studies with those found by Trofimovich et al. (2014) suggests that the type of self-assessment used matters: Researchers find stronger concurrent validity between can-do statement self-assessments and outside proficiency measures. The purpose for self-assessment may also impact the observed relationship between self-assessed proficiency and outside proficiency measures (Oscarson, 1997; Stansfield et al., 2010).
One important use of self-assessment is to establish a starting point for test takers in computer-adaptive test (CAT) contexts (Chalhoub-Deville & Deville, 1999). In ACTFL’s (2012) Oral Proficiency Interview by computer (OPIc), test takers begin by choosing one of five levels at which to begin the speaking test. One issue with this procedure is that if test takers overestimate their abilities in the self-assessment, they may select a task or test form that is too difficult for them, which can prevent raters from being able to rate their proficiency. However, using self-assessments may help to guide test takers toward the appropriate starting point. Malabonga et al. (2005) at the Center for Applied Linguistics designed a short self-assessment to guide computer-adaptive speaking test takers in choosing a starting level for their computerized oral assessment. The self-assessment was in the form of a questionnaire that included 18 questions. Based on their score on the self-assessment, one of four task levels was suggested for examinees to select for their first speaking task. The authors found that 92% of participants accurately used the self-assessment questionnaire and subsequently chose a starting level that was at an appropriate level of difficulty. They also found that the results of the self-assessments correlated strongly (r = .88) with the results of the oral proficiency test. In a similar study, Tigchelaar (in press) investigated the relationship between self-assessment scores and ACTFL OPIc scores. Second language learners of French completed a self-assessment of ACTFL Can-Do Statements designed to guide test takers toward one of five ACTFL OPIc forms. The majority (94%) of the test takers received a proficiency rating, which suggests that they used the self-assessment accurately to choose an appropriate test form.
The remainder of the participants were assigned the rating of above range or below range, which indicated the test taker took the wrong test form, either too easy or too hard, respectively, and the raters could not match the test taker’s performance to the range of ability being assessed by the selected form. The correlations between self-assessments and test scores were also strong (between r = .46 and ρ = .64, depending on the scale used to convert proficiency ratings). One should note that these correlations have a certain amount of collinearity: That is, the outcome variable (the final test score) relied in part on the initial self-assessment outcome. Taken together, these studies suggest that learners may be better equipped to self-assess functional speaking skills than more fine-grained linguistic components of speech, and that the use of criterion-referenced instruments such as can-do statements may facilitate this process. The purpose for self-assessment also impacts how well self- and other-assessments are related. However, while this body of research has provided some preliminary validity evidence by establishing links between can-do statements and criterion references, there has been minimal work looking at other aspects of the validity of these instruments. One exception is Brown et al. (2014), who provided three pieces of validity evidence for a selection of ACTFL (2015) Can-Do Statements targeting spoken proficiency. They assessed the reliability and predictive validity of the statements for OPI scores (reported above). They also assessed whether the items ascended according to the ACTFL (2012) proficiency hierarchy. Their findings indicate that the items ascended in the order predicted, based on a Rasch analysis. However, they only considered whether items followed the order of the major thresholds (i.e., Novice, Intermediate, Advanced and Superior), and not the order of the sublevels (e.g., Novice Low; Novice Mid; Novice High).
Another area that they left unexplored was how well the items fit the Rasch model, which could have provided evidence that the items they assessed were good indicators of spoken proficiency (i.e., construct validity). In particular, more work needs to be done to assess the validity of the individual can-do statements, since little evidence exists that these discrete proficiency descriptors are all important and accurate components of oral proficiency.

Factors that threaten the validity of self-assessment items

Heilenman (1990) called for language testing researchers to identify which factors influence how well self-assessment items function and how assessees respond to them in order to refine questionnaire items. Since this call, researchers have uncovered a number of features in the content of items that enhance or threaten the usefulness of self-assessment questionnaires. Haladyna, Downing, and Rodriguez (2002) synthesized the consensus from educational testing textbooks and research targeting classroom assessment to create a taxonomy of guidelines for item writing. In terms of content, their first suggestion is that “every item should reflect specific content and a single specific mental behavior” (p. 312). Other research has shown that explicitly negative statements (e.g., I cannot…) perform badly in second language self-assessment, as higher ability learners tend to endorse these statements (Heilenman, 1990; Jones, 2002). This has prompted test developers to phrase items in terms of things test takers can do (ACTFL, 2015; Council of Europe, 2001). In terms of how items function, Turner (1984) identified two main factors that contribute to unexpected responses to self-report items: items that are vague or that have ambiguous referents, and items that are irrelevant to a test-taker’s daily life. More recent research has documented how each of these factors can influence self-report responses in the context of language testing.
For example, Jones (2002) found that can-do statements that are brief or vague tend to be easier than expected, in that items that were written to describe upper-level tasks were endorsed by lower-proficiency language learners. Conversely, he identified statements about language use that include highly specific examples, refer to stressful situations, or involve channels such as speaking on the phone as more difficult than expected. Commenting on why the CEFR can-do statements do not distinguish well between adjacent upper levels of language proficiency, Weir (2005) noted, “the likely root cause is that so few contextual parameters or descriptions of successful performance are attached to such ‘Can-do’ statements. Both the context and the quality of performance may be needed to ground these distinctions” (p. 288). Not surprisingly, research on the self-assessment of language proficiency has confirmed that items addressing skills and contexts that are relevant to test-takers’ lives perform better than those that are less relevant (Butler & Lee, 2006; Ross, 1998; Suzuki, 2015). Ross (1998) found that self-assessment items that matched instructional content correlated more strongly with outside measures than more abstract proficiency-based items did. Building on this finding, Butler and Lee (2006) found that the more recently learners had engaged in a self-assessment task in the classroom, the more accurate their scores were. Suzuki (2015) found that learners with more experience in the target language in naturalistic settings, measured by length of residency, showed less discrepancy between their self-assessed proficiency and that assessed by outside measures. Clearly, the contexts addressed in can-do statements have an impact on how test-takers respond to them, and items with contexts that are unfamiliar to the test-taking population are likely not useful for measuring language proficiency.
The studies reviewed above suggest that when constructing items for self-assessment of L2 proficiency, items function best when they address a single mental skill, when they include content that is specific, and when they address contexts and content that are familiar to the test-taking population. These qualities provide some general guidelines for self-assessment test design. However, well-constructed self-assessment items of L2 proficiency do not guarantee that language learners will respond to these items exactly as test designers and researchers might expect. The expectation is that items that represent the most difficult skills should only be endorsed by people with higher proficiency levels, while the easiest items should be endorsed by every test taker. In other words, in order for items to be valid for measuring L2 proficiency, variation in item responses should be caused by variation in test takers’ L2 proficiency (Borsboom, Mellenbergh, & van Heerden, 2004). An excellent test of this conception of validity is the use of a Rasch model, which hypothesizes a single dimension (or scale) of item difficulty and person ability. An analysis of test takers’ item responses is a test of this hypothesis: items fit the Rasch measurement scale if they generate item responses that align with what the model predicts (e.g., the most difficult items will only be endorsed by test-takers with the highest ability). Items that elicit unexpected responses from test takers misfit the model (e.g., an item with a low difficulty estimate that is not endorsed by a high ability test-taker). In the case of misfit to the model, several conclusions can be drawn: In relation to items, it may indicate (1) that the item is poorly constructed; (2) that if the item is well constructed, it does not form part of the same dimension as defined by other items in the test, and is therefore measuring a different construct or trait.
In relation to persons, it may indicate (1) that the performance on a particular item was not indicative of the candidate’s ability in general, and may have been the result of irrelevant factors such as fatigue, inattention, failure to take the test item seriously…; (2) there is a heterogeneous test population in terms of the hypothesis under consideration. (McNamara, 1991, p. 143). On the other hand, if the parameters of dimensionality and model fit are met, Rasch measurement provides evidence that the items (and the people) considered in the analysis are useful for the construction of measurement (Wright, 1991). Namely, when items and people fit the model, there is evidence of construct validity, which is widely considered the most important step in assessing the validity of a test (AERA, APA & NCME, 2014; Cumming & Berwick, 1996). The idea is that if the test worked in measuring the skill of the people who were given the test (as evidenced through model fit with Rasch analysis), then the test will work for other people (in the future) who are also given the test. This assumption holds as long as the people who are given the test in the future are similar to the population originally tested. Thus, Rasch modeling is often used as a norming procedure in educational measurement.

Motivation for the current study

This study draws on data that were collected as part of the National Security Education Program’s (NSEP, 2016) Language Flagship Proficiency Initiative, which provided ACTFL language tests to L2 learners at three American universities over the course of three academic years (Fall 2014 until Spring 2017). All students in the project took the ACTFL OPIc in the second language they were learning. At the current study’s institution, each student was able to self-select which of five OPIc level tests to take, which ranged in proficiency from Novice Low to Superior.
The ACTFL OPIc is a standardized speaking test that measures second language learners’ functional speaking ability. According to the test specifications, the OPIc assesses the Interpersonal mode of communication (ACTFL, 2014): “learners interact and negotiate meaning…to share information, reactions, feelings, and opinions” (NSCB, 2015). It should be noted that other dimensions of spoken proficiency defined by ACTFL, namely presentational speaking and intercultural competence, are not included in the test’s scope. The test is administered over the Internet by an avatar that delivers questions to the test taker. The test can be considered somewhat adaptive, as the test form generated depends on which level an examinee chooses after completing a simple self-assessment of his or her oral proficiency. Test takers interact with the avatar for 20-30 minutes, and the resulting speech sample is recorded and rated by a certified ACTFL rater, who compares the OPIc performance to the ACTFL Proficiency Guidelines (2012). Ratings are given in terms of four major levels, or thresholds: Novice, Intermediate, Advanced, and Superior. The first three thresholds are further subdivided into Low, Mid, and High sublevels. In the first semester of the Language Flagship project, students were told to self-select which level of OPIc to take by responding to ACTFL’s quick self-assessment of language proficiency that was provided on the starting page of the ACTFL OPIc. On the test website, the language learners were presented with five short descriptions, which were matched to the five levels of the test. For example, description 1 was: “I can name basic objects, colors, days of the week, foods, clothing items, numbers, etc. I cannot always make a complete sentence or ask simple questions.” (All five short descriptors are in Appendix A.) If the student believed that this description represented his or her proficiency level, he or she was instructed to take OPIc level 1.
The level 1 test assessed Novice Low through Novice High skills, and level 5 assessed Advanced Low to Superior level language skills. Thus, selecting the wrong test level could result in the non-scoring of the learners’ responses. If a test taker took a level test that was too difficult, he or she would receive a result of “below range” instead of a score, meaning that the raters were unable to rate the speech provided.1 A result of “above range” meant the test taker took a form that was too easy, and raters were likewise unable to rate the speech. After the first semester of testing, it became clear that the undergraduate students needed a more nuanced way to self-select their ACTFL OPIc test level: Many students received “below range” scores because they were taking level tests that were too hard. Therefore, the principal investigator and the project team created a more nuanced, 50-statement self-assessment of language proficiency using the NCSSFL-ACTFL (2015) Can-do statements, which are aligned with ACTFL’s model of second language proficiency. Language testers are primarily concerned with establishing the validity of the tests that they use. At the heart of this concern is determining whether a test measures what it claims to measure. The ACTFL (2015) Can-Do Statements, which are items that claim to be performance indicators of second language proficiency, were developed based on language teachers’ common experiences and beliefs (Shin, 2013). Unlike their European equivalent (CEFR, Council of Europe, 2001), these performance indicators have not been subject to thorough empirical analysis that provides evidence that they can indeed be useful as a measure of language proficiency. Brown et al. (2014) provided some initial evidence for the scaling of the ACTFL Can-Do Statements to ACTFL’s hierarchical model of spoken proficiency.
However, their analysis was limited to whether the Rasch difficulty estimates followed the same hierarchy as the four major threshold levels (i.e., Novice, Intermediate, Advanced, and Superior) of the ACTFL scale. The current study continues this line of research by comparing item difficulties with both the levels and sublevels (i.e., Low, Mid, High) of the ACTFL (2012) scale, seeking evidence for construct validity (i.e., item and person fit to the model) and searching for possible explanations for why misfitting items may not successfully measure language proficiency. Phase one of this study is an analysis of Language Flagship Proficiency test takers’ responses to the original 50-statement self-assessment of language proficiency that the Flagship team created and administered in Spring 2015. The purpose of this analysis was to determine whether the self-assessment items were productive for the construction of a measurement of second language proficiency. The following questions guided the study:

1. Do the individual Can-do Statements fit ACTFL’s unitary and hierarchical model of spoken language proficiency when used for self-assessment? If they do not fit, can a reason for the misfit be identified in the content of the statements?

2. To what extent do the difficulty levels of the Can-do Statements match the hierarchy of the statements’ assigned ACTFL (2012) levels and sublevels?

PHASE I: SPRING 2015 PROFICIENCY TESTING

Methods

This study draws on data that were collected as part of the National Security Education Program’s (NSEP, 2016) Language Flagship Proficiency Initiative,2 which provided ACTFL language tests to L2 learners at three American universities over the course of three academic years (Fall 2014 until Spring 2017). All students in the project took the ACTFL Oral Proficiency Interview by computer (OPIc) in the language they were learning.
The participants in the current study were Spanish students at one university who took a 50-statement self-assessment prior to taking the ACTFL OPIc in Spring 2015 (phase one) and in Spring 2017 (phase two). Although the ACTFL (2012) Guidelines were designed to measure proficiency in any second language, I chose to consider only learners in the Spanish program. A differential item functioning (DIF) analysis3 revealed that some of the items may not function equally across language groups: some of the statements were significantly easier for one language group than for another. Another main reason for considering only the Spanish learners was that these students had experience self-assessing their language proficiency because their classroom instruction included curriculum-specific can-do statements (for a description, see VanPatten et al., 2015). Since the ability to self-assess improves with experience using self-assessments (Sweet et al., in press), this made the Spanish learners the best choice of sample. This group also had the largest number of test takers, and the largest number of high-proficiency learners. Inclusion of the other L2 learners’ data (i.e., item responses from Chinese, French, and Russian learners) would have added more numbers, but primarily at lower levels of proficiency. This also may have added language learners to the sample who had less experience self-assessing their language proficiency. Since previous research has shown that lower proficiency learners who do not have experience with self-assessment tend to over-estimate their ability (Trofimovich et al., 2014), I chose not to consider these learners.

Participants.

The students in the first phase of the study were in the first year of university-level Spanish (N=113), the second year (N=132), the third year (N=102), and the fourth year (N=35), for a total of 382 Spanish test takers.

Materials.
The materials included a computer-adaptive self-assessment questionnaire that was divided into five testlets, or sets of ten Can-Do Statements. The 50 statements were selected from the fuller list of NCSSFL-ACTFL (2015) Can-Do Statements to represent the five levels of the ACTFL OPIc by one of the principal investigators (PI Paula Winke) of the Language Flagship Proficiency Initiative and the project’s research team at the university at which the study was conducted. The PI and the project team consulted with ACTFL officials and received feedback on earlier versions of the computer-adaptive self-assessment. The questionnaire was modified according to this feedback so that the chosen statements represented the ACTFL OPIc levels as accurately as possible. Each statement was followed by a Likert scale: Participants rated their ability to execute the task described in each statement on a scale ranging from one to four: 1 (I cannot do this yet), 2 (I can do this with much help), 3 (I can do this with some help), 4 (Yes, I can do this well). The 50 statements that appeared in the Spring 2015 self-assessment, as well as information on the ACTFL sublevels from which the 50 statements were selected and the way in which the statements were arranged into the five testlets, are presented in Appendix A. Each testlet increased in difficulty, and the testlets were presented in a computer-adaptive way: After the first set of items, subsequent testlets were presented according to the test takers’ self-assessment responses to the previous set of Can-Do Statements.

Procedure.

As part of the larger project, the Spanish language learners came into computer labs during their intact classes (which were 50 minutes in duration) to take the self-assessment with the 50 Can-do statements, and then the ACTFL OPIc (ACTFL, 2012b). If a learner did not indicate that he or she could do well on nine or more Can-do statements in a set of ten, he or she was recommended to take that level’s OPIc.
If the learner indicated he or she could do nine or more of the ten Can-do statements well, he or she moved on to the next set of ten Can-dos. At level 5, if the learner indicated he or she could do eight out of the ten very well, he or she was recommended to take level 5; otherwise, he or she was recommended to take level 4. Figure 2 shows the test takers’ path through the sets of can-do statements, which led to the generation of an OPIc form targeted to each participant’s approximate proficiency level.

Figure 2: Test takers’ path.

Table 1 shows the number of participants who completed each of the levels of the self-assessment questionnaire and how many students were recommended to take which levels of the OPIc.

Table 1: Number of participants who completed each level of the self-assessment

Statements   Corresponding   N of test takers who responded   N recommended to take
             OPIc level      to statements at that level      that level of OPIc
1-10         1               382                              241
11-20        2               141                              57
21-30        3               84                               33
31-40        4               51                               32
41-50        5               38                               19

Since OPIcs are official ACTFL tests, they were rated by certified ACTFL raters according to the ACTFL (2012) proficiency guidelines for speaking. Of the 382 Spanish test takers, 15 (4%) did not receive an OPIc rating: some of these test takers did not produce enough speech for the ACTFL raters to assess, and others selected a test form that was too difficult for them, so they were unable to perform the speech tasks asked of them. The remaining 367 (96%) received OPIc ratings that ranged from Novice Low to Advanced High, which were distributed as shown in Figure 3. The distribution by class level is shown in Table 2.

Figure 3: Distribution of OPIc ratings in Spring 2015. The numeric OPIc ratings from 1-9 represent the range of ACTFL proficiency levels from Novice Low to Advanced High.
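The routing rules described above can be sketched as a short function. This is a hypothetical reconstruction for illustration only: the function and variable names are mine, and the project's actual implementation is not shown in the text.

```python
def recommend_opic_level(testlet_responses):
    """Route a test taker through up to five 10-statement testlets.

    testlet_responses: a list of testlets (in order taken), each a list of
    ten ratings on the 1-4 scale, where 4 means "Yes, I can do this well".
    Returns the recommended OPIc level (1-5).
    """
    for level, ratings in enumerate(testlet_responses, start=1):
        n_well = sum(1 for r in ratings if r == 4)
        if level < 5:
            if n_well < 9:          # fewer than 9 of 10 endorsed "well":
                return level        # recommend this level's OPIc
            # otherwise, continue on to the next testlet
        else:
            return 5 if n_well >= 8 else 4  # special rule at level 5
    return len(testlet_responses)   # fallback for incomplete data
```

For example, a learner who endorses all ten statements in testlets 1 and 2 but only six in testlet 3 would be recommended the level 3 OPIc.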
Table 2: Distribution of 2015 OPIc ratings by class level

OPIc Rating                102 (N=114)   202 (N=135)   300-level (N=97)   400-level (N=36)
Novice Low (N=22)          4             18            -                  -
Novice Mid (N=80)          25            45            8                  2
Novice High (N=70)         21            26            15                 8
Intermediate Low (N=79)    16            22            31                 10
Intermediate Mid (N=84)    31            13            28                 12
Intermediate High (N=19)   8             4             5                  2
Advanced Low (N=9)         3             1             3                  2
Advanced Mid (N=3)         1             1             1                  -
Advanced High (N=1)        -             1             -                  -

Note: A dash indicates that no test takers at that class level received that rating.

Data analysis.

I analyzed the test takers’ item responses to the self-assessment questionnaire using a Rasch model (Rasch, 1960/80). One of the specifications of item response modeling is that the items of a test measure a single latent trait (Embretson & Reise, 2000; Wright, 1991). In order to test this hypothesis using a Rasch model, researchers should provide evidence of unidimensionality and item fit to the model. Linacre (2016b) suggested that evidence for multiple dimensions can be assessed using a Principal Components Analysis (PCA) of residuals. A secondary dimension (i.e., evidence that threatens the assumption of unidimensionality) is indicated by an eigenvalue greater than 2.0 for the first factor in the PCA and a disattenuated correlation substantially less than 1.00 between separately calibrated theta scores. The second assumption is that all items in a test fit the model. An item is said to fit if it generates item responses that align with what the model predicts (e.g., the most difficult Can-Do Statements will only be endorsed by test-takers with the highest ability). Items that elicit unexpected responses from test takers misfit the model (e.g., an item with a low difficulty estimate that is not endorsed by a high ability test-taker). Fit to the Rasch model is assessed by measures of infit and outfit. Infit is “sensitive to unexpected patterns of observations by persons on items that are roughly targeted on them,” while outfit is “more sensitive to unexpected observations by persons on items that are relatively very easy or very hard for them” (Linacre, 2016b).
The expected value for these two measures is 1.00; higher values indicate more error than expected, while lower values indicate more redundancy than expected. Wright and Linacre (1994) suggested a fit analysis cutoff between .6 and 1.4 for rating scale models. In an addendum to the reasonable mean-square fit value publication (Wright & Linacre, 1994), Linacre also proposed four mean-square ranges for interpreting fit statistics, shown in Table 3.

Table 3: Linacre’s interpretation of mean-square fit statistics

Statistic   Interpretation
> 2.0       Degrades the measurement system
1.5 - 2.0   Unproductive for construction of measurement, but not degrading
0.5 - 1.5   Productive for measurement
< 0.5       Less productive for measurement, but not degrading

If the parameters for dimensionality and model fit are met, Rasch measurement provides evidence of construct validity. The item responses to the Spring 2015 self-assessment by 382 test-takers in the current study were analyzed using WINSTEPS Version 3.92 (Linacre, 2016a). Responses for participants who did not complete all of the self-assessment items were entered as missing data. A Rasch rating scale model (Andrich, 1978) was selected for the analysis because each item of the questionnaire was rated on the same scale.4 This model is expressed in equation (1):

\[
P(X_{ni} = k) = \frac{\exp \sum_{j=0}^{k} \left( \theta_n - \beta_i - \tau_j \right)}{\sum_{m=0}^{K} \exp \sum_{j=0}^{m} \left( \theta_n - \beta_i - \tau_j \right)}, \qquad \tau_0 \equiv 0 \qquad (1)
\]

The model gives the probability that test taker n responds in category k on item i, taking into consideration the test taker’s ability and the item’s difficulty. In the equation, θ represents test taker ability (i.e., language proficiency), β represents item difficulty, and the τ terms are the category thresholds. In the case of the 4-point rating scale used in this study, there are three category thresholds. To test the assumption that the self-assessment items used in this study are useful for measuring spoken language proficiency (research question 1), I analyzed the rating scale model for dimensionality and item fit as described above.
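As a concrete illustration of the rating scale model in equation (1), the category probabilities can be computed directly from an ability estimate, an item difficulty, and the thresholds. This is an illustrative sketch only: the function names are mine, and the study's actual estimation was performed in WINSTEPS.

```python
import math

def rsm_category_probs(theta, beta, taus):
    """Return P(response = 0..K) under the Andrich rating scale model.

    theta: person ability in logits; beta: item difficulty in logits;
    taus: the category thresholds (three of them for a 4-point scale).
    """
    numerators = [1.0]          # category 0: exp of the empty sum, i.e., exp(0)
    cum = 0.0
    for tau in taus:            # cumulative sum of (theta - beta - tau_j)
        cum += theta - beta - tau
        numerators.append(math.exp(cum))
    total = sum(numerators)
    return [n / total for n in numerators]

def rsm_expected_score(theta, beta, taus):
    """Model-expected rating; residuals for fit statistics compare
    observed ratings against this expectation."""
    probs = rsm_category_probs(theta, beta, taus)
    return sum(k * p for k, p in enumerate(probs))
```

For example, with thresholds of -1, 0, and 1 logits, a test taker whose ability exactly matches an item's difficulty is equally likely to respond in the two middle categories, and raising theta shifts probability mass toward the top category.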
Since one of the objectives of the study was to construct an assessment instrument that was productive for measuring language proficiency by self-assessment, I considered fit values between 0.5 and 1.5 as acceptable, following Linacre's suggestion (Wright & Linacre, 1994). To answer the second research question, I evaluated the model by conducting a difficulty analysis of the items to compare the item difficulty estimated by the Rasch model with the ACTFL proficiency level associated with these items. I plotted the item difficulties on a Wright map and calculated the mean item difficulty for items at each threshold level (i.e., Novice, Intermediate, Advanced, Superior) and sublevel (e.g., Novice Low, Novice Mid, Novice High) to determine whether the items ascended in the order of difficulty defined by the ACTFL scale.

Results

Fit to the Rasch model. The first research question addressed whether the individual Can-Do statements fit ACTFL's model of spoken language proficiency when used for self-assessment. It was answered by evaluating the dimensionality and model fit of the 50 Can-Do self-assessment statements describing spoken language proficiency. The initial Rasch rating scale model had a person reliability of .95, indicating that the self-assessment instrument discriminated well between test takers of varying proficiency levels. To test the assumption of unidimensionality, the model was analyzed using a Principal Components Analysis (PCA) of the residuals. The aim of the PCA was to determine whether the instrument under consideration was measuring multiple dimensions. The Rasch dimension explained 59.9% of the variance, while the largest secondary dimension, or first contrast in the residuals, explained 2.4% of the variance. The eigenvalue of this first contrast was 3.04, or the strength of approximately three items.
However, the disattenuated correlation for person measures was 1.00, and the contrast plot (shown in Appendix B) did not show a group of three outlying items at the bottom or top of the plot. Inspection of the items with the highest and lowest factor loadings did not reveal any obvious contrasts in content: both clusters contained items from both the interactional and presentational modes of speaking, and both covered different topics and proficiency levels (also shown in Appendix B). Finally, according to Holmes (1982), "an item set may be considered unidimensional if the first eigenvalue from the analysis is large compared to the second, and all eigenvalues other than the first are the same size" (p. 141). This was exactly the case for the values in this analysis: the first eigenvalue was large (the Rasch dimension explained 59.9% of the variance), and the first through fifth contrasts were similar in size, ranging from 3.04 to 2.18. Therefore, it appears that the unexplained variance may be due to random noise. In summary, the PCA did not provide any evidence of multidimensionality, which indicates that this set of items can be considered unidimensional and is appropriate for use with Rasch analysis. With sufficient evidence that the items were measuring a single dimension, spoken language proficiency, the next step of the Rasch analysis was to determine how well the items measured the underlying trait. A fit analysis cutoff between .5 and 1.5, targeting items that are productive for measurement (Wright & Linacre, 1994), was adopted as the criterion for determining fit to the Rasch model. Fourteen items, presented in Table 4, displayed outfit values that were outside the cutoff range. These misfit values suggest that these items did not contribute to the unidimensional measurement of spoken proficiency. They are, in other words, psychometrically problematic items.
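The dimensionality check reported above, a PCA of standardized residuals with inspection of the first contrast, can be sketched as follows. This is illustrative Python on simulated unidimensional data, using the generating parameters in place of Rasch estimates; WINSTEPS performs the real computation internally:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate dichotomous responses from a unidimensional Rasch model:
# 200 persons x 20 items
theta = rng.normal(0, 1.5, size=(200, 1))    # person abilities (logits)
beta = np.linspace(-2, 2, 20)                # item difficulties (logits)
p = 1 / (1 + np.exp(-(theta - beta)))        # model-expected probabilities
x = (rng.random(p.shape) < p).astype(float)  # observed responses

# Standardized residuals: (observed - expected) / model standard deviation
resid = (x - p) / np.sqrt(p * (1 - p))

# Eigenvalues of the residual correlation matrix, largest first;
# the largest one is the "first contrast"
eigvals = np.linalg.eigvalsh(np.corrcoef(resid.T))[::-1]
first_contrast = eigvals[0]

# Rule of thumb from the text: a first contrast well above 2.0 (the
# strength of about two items), standing out from the later contrasts,
# hints at a secondary dimension; for truly unidimensional data the
# leading residual eigenvalues are similar in size.
print(round(first_contrast, 2))
```
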
After identifying misfit, the next step in the Rasch analysis was to examine the construction (McNamara, 1991) and the content of the misfitting items to determine why they may not be good measures of the construct. In the case of the 14 items identified in Table 4, three common features were identified as problematic: the items were vague (e.g., I can ask for help), described experiences that many college students might not have (e.g., I can explain an injury or illness and manage to get help), described multiple skills in one item (e.g., I can say the date and day of the week), or displayed a combination of these features. The research team discussed the misfitting items and reached consensus as to which misfit category each item belonged to. These problematic item features are shown in the last three columns of Table 4 for each of the misfitting items. In addition to the above issues that were common to multiple questionnaire items, a couple of other item features are of note. First, Item 32 (I can give a presentation about my interests, hobbies, lifestyle, or preferred activities) may be a skill that test takers would not have experience doing because this genre does not reflect real-life language use. Secondly, Items 12 (I can describe a place I have visited or want to visit) and 33 (I can ask for and provide descriptions of places I know and also places I would like to visit) describe language use that might not be possible (i.e., describing a place one has not yet visited).

Table 4: Misfitting items from the 2015 self-assessment questionnaire

Statement | Difficulty estimate (S.E.) | Infit MNSQ (z-std) | Outfit MNSQ (z-std) | Vague | Exper. depend. | Multiple skills
1. I can say the date and the day of the week. | -4.68 (.14) | 1.14 (1.4) | 1.94 (2.4) | - | - | √
2. I can list the months and seasons. | -3.43 (.12) | 1.06 (0.8) | 2.53 (4.1) | - | - | √
5. I can state my favorite foods and drinks and those I do not like. | -4.45 (.13) | 1.23 (2.3) | 2.96 (4.3) | - | - | √
8. I can list my classes and tell what time they start and end. | -2.43 (.11) | 1.03 (0.4) | 1.70 (2.5) | - | - | √
12. I can describe a place I have visited or want to visit. | -1.87 (.26) | 0.89 (-0.5) | 0.45 (-1.0) | - | - | √
13. I can ask for help at school, work, or in the community. | -2.16 (.28) | 0.91 (-0.4) | 0.39 (-1.1) | √ | - | √
14. I can talk about my daily routine. | -2.17 (.28) | 1.01 (0.1) | 0.48 (-0.6) | √ | - | -
15. I can talk about my interests and hobbies. | -3.07 (.36) | 1.07 (0.4) | 0.38 (-0.9) | √ | - | -
18. I can plan an outing with a group of friends. | -1.43 (.24) | 0.84 (-0.9) | 0.49 (-1.1) | - | √ | -
23. I can describe a childhood or past experience. | -0.49 (.36) | 1.04 (0.3) | 2.97 (2.3) | √ | - | √
30. I can explain an injury or illness and manage to get help. | 1.90 (.23) | 1.16 (1.0) | 1.53 (2.2) | - | √ | √
32. I can give a presentation about my interests, hobbies, lifestyle, or preferred activities. | -0.66 (.75) | 0.95 (0.1) | 0.42 (-0.2) | - | √ | √
33. I can ask for and provide descriptions of places I know and also places I would like to visit. | -0.66 (.75) | 0.95 (0.1) | 0.42 (-0.2) | - | - | √
38. I can exchange general information about leisure and travel, such as the world's most visited sites or most beautiful places to visit. | 0.42 (.51) | 0.66 (-0.9) | 0.34 (-1.0) | - | √ | √

A second Rasch analysis was conducted after deleting the 14 items with the greatest misfit. This resulted in a final version that included 36 items with both infit and outfit mean-square values that fell within 0.5 and 1.5, shown in Table 5. This revised model had a person reliability of .94. In terms of dimensionality, 64% of the variance was explained by the Rasch dimension, while 3% of the variance was explained by the largest secondary dimension. The eigenvalue of the first contrast decreased to 2.92, the disattenuated correlation remained at 1.00, and there was still no evidence from the content of the items in the first contrast that a secondary dimension was at play (see Appendix B).
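The fit screening behind this pruning step can be sketched numerically. The recipe below is illustrative Python on simulated dichotomous data with one deliberately noisy item; it is not the study's actual computation, which used WINSTEPS on the 4-category ratings:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated dichotomous Rasch data: 400 persons x 5 items
theta = rng.normal(0, 1.5, size=(400, 1))      # person abilities (logits)
beta = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # item difficulties (hypothetical)
p = 1 / (1 + np.exp(-(theta - beta)))          # model-expected probabilities
x = (rng.random(p.shape) < p).astype(float)    # observed responses

# Make item 2 behave erratically to force misfit: replace its responses
# with coin flips that ignore ability altogether
x[:, 2] = (rng.random(400) < 0.5).astype(float)

w = p * (1 - p)                                # model variance per response
z2 = (x - p) ** 2 / w                          # squared standardized residuals

outfit = z2.mean(axis=0)                       # unweighted mean square
infit = ((x - p) ** 2).sum(axis=0) / w.sum(axis=0)  # information-weighted

# Keep only items whose mean squares sit in the productive 0.5-1.5 band
keep = [i for i in range(5)
        if 0.5 <= infit[i] <= 1.5 and 0.5 <= outfit[i] <= 1.5]
print(np.round(infit, 2), np.round(outfit, 2), keep)
```

The noisy item produces inflated infit and outfit values and drops out of `keep`, mirroring how the 14 misfitting statements were removed.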
These findings provide evidence that the remaining 36 items are useful for measuring spoken language proficiency among college-level Spanish learners.

Table 5: Fit statistics for 36 fitting items

Statement | Difficulty estimate (S.E.) | Infit MNSQ (z-std) | Outfit MNSQ (z-std)
3. I can say which sports I like and don't like. | -6.24 (.15) | 1.23 (2.2) | 1.41 (1.2)
4. I can list my favorite free-time activities and those I don't like. | -5.89 (.14) | 0.88 (-1.3) | 0.61 (-1.3)
6. I can talk about my school or where I work. | -4.53 (.12) | 0.86 (-1.8) | 0.78 (-0.8)
7. I can talk about my room or office and what I have in it. | -2.80 (.11) | 0.91 (-1.1) | 1.41 (1.7)
9. I can answer questions about where I'm going or where I went. | -3.74 (.12) | 0.95 (-0.6) | 0.71 (-1.2)
10. I can present information about something I learned in a class or at work. | -2.41 (.11) | 1.09 (1.1) | 0.93 (0.1)
11. I can describe a school or workplace. | -2.43 (.25) | 0.99 (0.0) | 1.45 (1.3)
16. I can schedule an appointment. | -0.43 (.19) | 1.19 (1.5) | 1.14 (0.5)
17. I can talk about my family history. | -0.25 (.19) | 1.16 (1.0) | 0.70 (-0.8)
19. I can explain why I was late to class or absent from work and arrange to make up the lost time. | -0.86 (.20) | 1.01 (0.1) | 1.18 (0.8)
20. I can tell a friend how I'm going to replace an item that I borrowed and broke/lost. | 0.21 (.19) | 1.15 (1.1) | 0.96 (0.3)
21. I can give some information about activities I did. | -2.42 (.54) | 1.02 (0.2) | 1.33 (0.6)
22. I can talk about my favorite music, movies, and sports. | -2.75 (.61) | 1.00 (0.2) | 0.70 (-0.5)
24. I can ask for and follow directions to get from one place to another. | -0.18 (.29) | 0.87 (-0.8) | 0.84 (-0.4)
25. I can return an item I have purchased to a store. | 0.71 (.26) | 0.79 (-1.3) | 0.68 (-0.6)
26. I can arrange for a make-up exam or reschedule an appointment. | -0.18 (.29) | 0.86 (-0.8) | 1.11 (0.4)
27. I can present an overview about my school, community, or workplace. | -0.82 (.33) | 1.00 (0.1) | 0.71 (-0.7)
28. I can compare different jobs and study programs in a conversation with a peer. | 0.36 (.27) | 0.76 (-1.5) | 0.93 (0.1)
29. I can discuss future plans, such as where I want to live and what I will be doing in the next few years. | -1.30 (.37) | 0.84 (-0.6) | —
31. I can present ideas about something I have learned, such as a historical event, a famous person, or a current environmental issue. | 1.88 (.33) | 1.16 (0.9) | 1.45 (1.4)
34. I can explain how life has changed since I was a child and respond to questions on the topic. | -0.14 (.51) | 1.01 (0.3) | 0.97 (-0.1)
35. I can discuss what is currently going on in another community or country. | 1.40 (.36) | 0.70 (-0.8) | 0.82 (-0.4)
36. I can provide a rationale for the importance of certain classes, subjects, or training programs. | 1.53 (.35) | 0.89 (-0.5) | 1.09 (0.3)
37. I can talk about present challenges in my school or work life, such as paying for classes or dealing with difficult colleagues. | -0.43 (.56) | 0.81 (-0.4) | 0.89 (0.2)
39. I can give a presentation about cultural influences on society. | 1.88 (.33) | 1.02 (0.2) | 0.86 (-0.4)
40. I can participate in conversations on social or cultural questions relevant to speakers of this language. | 1.77 (.34) | 1.05 (0.3) | 1.20 (0.7)
41. I can interview for a job or service opportunity related to my field of expertise. | 3.85 (.37) | 0.77 (-0.9) | 0.80 (-0.7)
42. I can present an explanation for a social or community project or policy. | 2.73 (.39) | 0.84 (-0.6) | 0.78 (-0.8)
43. I can present reasons for or against a position on a political or social issue. | 3.30 (.37) | 0.94 (-0.2) | 0.90 (-0.3)
44. I can give a clear and detailed story about childhood memories, such as what happened during vacations or memorable events, and answer questions about my story. | 1.93 (.42) | 0.94 (-0.2) | 0.90 (-0.3)
45. I can exchange general information about my community, such as demographic information and points of interest. | 1.93 (.42) | 0.93 (-0.2) | 0.93 (0.0)
46. I can exchange factual information about social and environmental questions, such as retirement, recycling, or pollution. | 2.42 (.40) | 0.83 (-0.7) | 0.87 (-0.3)
47. I can usually defend my views in a debate. | 2.43 (.40) | 0.64 (-1.8) | 0.56 (-1.6)
48. I can exchange complex information about my academic studies, such as why I chose the field, course requirements, projects, internship opportunities, and new advances in my field. | 2.73 (.39) | 1.07 (0.4) | 0.99 (-1.6)
49. I can provide a balance of explanations and examples on a complex topic. | 3.30 (.37) | 1.01 (0.1) | 0.96 (-0.1)
50. I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. | 3.44 (.37) | 1.36 (1.4) | 1.44 (1.6)

Item difficulty estimates. The second research question assessed the extent to which the difficulty of the statements (estimated by the Rasch model) matched the ACTFL (2012) proficiency levels associated with the Can-Do statements. I calculated the mean logit scores from the Rasch analysis for the statements belonging to each of the threshold levels (i.e., Novice, Intermediate, Advanced, Superior) and sublevels (e.g., Novice Low, Novice Mid, Novice High) to evaluate whether they ascended according to the hierarchy of the ACTFL (2012) scale. The mean difficulty estimates for each of the major threshold levels, presented in Table 6, ascended in the expected order: Novice statements were the easiest, followed by Intermediate and Advanced, with Superior statements being the most difficult. In other words, when grouped by major threshold level, the statements acted as expected. Furthermore, the 95% confidence intervals for each threshold did not overlap, which suggests that the mean statement difficulty at each of the major ACTFL levels differs to a statistically significant degree.
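The standard errors and confidence intervals reported alongside these means follow from SE = SD/√N and the t distribution; a quick arithmetic check in Python with the Novice threshold values from Table 6 (N = 10, M = -3.33, SD = 1.25; 2.26 is the two-tailed .05 t critical value for 9 degrees of freedom):

```python
import math

# Values reported for the Novice threshold level in Table 6
n, mean, sd = 10, -3.33, 1.25

se = sd / math.sqrt(n)   # standard error of the mean: ~0.395 (.39 in the table)
ci = (mean - 2.26 * se, mean + 2.26 * se)   # approximate 95% CI, df = 9

print(round(se, 3), [round(v, 2) for v in ci])  # CI reproduces [-4.22, -2.44]
```
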
Table 6: Descriptive statistics for difficulty estimates of ACTFL threshold levels

ACTFL threshold     N    Mean logit score (SD)   SE    95% CI
1 - Novice         10        -3.33 (1.25)        .39   [-4.22, -2.44]
2 - Intermediate   17        -0.68 (1.46)        .36   [-1.43, 0.07]
3 - Advanced       21         1.78 (1.34)        .29   [1.18, 2.39]
4 - Superior        2         3.72 (0.08)        .06   [2.96, 4.48]

Note. N = number of statements.

The mean difficulty estimates for the ACTFL sublevels, presented in Table 7, ascended as anticipated for the most part (although the 95% confidence intervals for adjacent means all overlap, so the differences between adjacent sublevels cannot be considered statistically significant). However, there was one notable exception: there was a decrease in difficulty from Advanced Low (M = 1.72, SD = 1.51) to Advanced Mid (M = 1.37, SD = 1.32), rather than an increase.

Table 7: Descriptive statistics for difficulty estimates of ACTFL sublevels

ACTFL sublevel     N    Mean logit score (SD)   SE    95% CI
1 - Nov-Low         2        -4.06 (0.88)        .62   [-12.00, 3.89]
2 - Nov-Mid         7        -3.39 (1.19)        .45   [-4.49, -2.29]
3 - Nov-High        1        -1.42¹               -          -
4 - Int-Low         3        -1.82 (0.36)        .21   [-2.73, -0.91]
5 - Int-Mid         6        -1.48 (1.24)        .51   [-2.77, -0.18]
6 - Int-High        8         0.34 (1.22)        .43   [-0.68, 1.36]
7 - Adv-Low        10         1.72 (1.51)        .48   [0.65, 2.81]
8 - Adv-Mid         7         1.37² (1.32)       .50   [0.15, 2.58]
9 - Adv-High        4         2.65 (0.42)        .21   [1.98, 3.32]
10 - Superior       2         3.72 (0.08)        .06   [2.95, 4.48]

Notes: 1. Because there was only one statement at the Novice High level in the instrument, the mean logit score was not included in the difficulty analysis. 2. The mean logit score at this level falls out of the expected order.

Inspection of the Wright map, in Figure 4, sheds some light on why the Advanced Low and Mid levels did not ascend in the anticipated order. The right half of the map displays each item and its intended ACTFL sublevel in order of difficulty based on item responses. Items at the top of the map were rated the most difficult and items at the bottom were the easiest.
The range of item difficulty for the Advanced Low and Advanced Mid sublevels spans a wide distance (from -0.03 to 4.14 logits for Advanced Low and from -0.49 to 2.85 logits for Advanced Mid). These ranges also closely overlap: the Advanced Mid range includes the items with the easiest difficulty estimates of the two levels, while the Advanced Low range includes the most difficult items in the analysis. These very difficult items could have skewed the Advanced Low mean difficulty estimate higher than the Advanced Mid mean. The most difficult items addressed topics (Item 42: I can present an explanation for a social or community project or policy) or experiences (Item 41: I can interview for a job or service opportunity related to my field of expertise) that many college students may not be familiar with. It is possible that for this population, the skills described by the most difficult Advanced Low items may represent higher level proficiency skills than Advanced Low. Another possibility is that items belonging to these two sublevels may not form two distinguishable groups when used by this population.

[Figure 4: Wright map of test taker ability and item difficulty. Test takers are plotted on the left (a period indicates one person; a hash mark is equal to three people) and items, labeled with their intended ACTFL sublevels, on the right of a shared logit scale. Brackets on the map mark the Advanced Low item range (-0.03 to 4.14 logits) and the Advanced Mid item range (-0.49 to 2.85 logits).]
As mentioned above, the items at both the Advanced Low and Mid levels had wide ranges of difficulty. The ranges for the two Superior level items and the Advanced High items were much smaller: the Advanced High items ranged from 2.25 to 3.14 logits and the Superior level items ranged from 3.66 to 3.78 logits. The difficulty range and item content for the Advanced and Superior level items are shown in Table 8. Inspection of the content of the four most difficult Advanced Low items indicates why these items were as difficult as or more difficult than items at the Superior level. Items 42, 43 and 46 require speakers to provide policy explanations, reasons for a position, and rationales. Recalling that speakers at the Advanced level are expected to be able to use language for narration and description, while Superior level speakers should be able to use argumentation, hypothesize, and discuss abstract topics, the language functions in Items 42, 43, and 46 may better describe Superior-level language use. If these items were modified to use descriptive language on concrete topics (e.g., I can describe a community project), they may better reflect Advanced proficiency as defined by the ACTFL (2012) Guidelines. The most difficult item, Item 41 (I can interview for a job or service opportunity), requires high-stakes language use that would likely require speakers to hypothesize about how they would perform in a job or service opportunity (e.g., If I were to teach a Research Methods course, I would include the following topics…). Again, this type of language use aligns better with ACTFL's definition of Superior language proficiency. The content of the remaining Advanced Low and Mid items appears to require narration (e.g., I can give a clear and detailed story about childhood memories) and description (e.g., I can compare different jobs and study programs in a conversation with a peer).
Table 8: Advanced-level items in order of difficulty (Rasch difficulty in logits in parentheses)

Advanced Low:
19. I can explain why I was late to class or absent from work and arrange to make up the lost time. (-0.03)
27. I can present an overview about my school, community, or workplace. (0.14)
34. I can explain how life has changed since I was a child and respond to questions on the topic. (0.42)
28. I can compare different jobs and study programs in a conversation with a peer. (0.42)
20. I can tell a friend how I'm going to replace an item that I borrowed and broke/lost. (1.90)
35. I can discuss what is currently going on in another community or country. (2.38)
46. I can provide a rationale for the importance of certain classes, subjects, or training programs. (2.85)
42. I can present an explanation for a social or community project or policy. (3.14)
43. I can present reasons for or against a position on a political or social issue. (3.66)
41. I can interview for a job or service opportunity related to my field of expertise. (4.14)

Advanced Mid:
29. I can discuss future plans, such as where I want to live and what I will be doing in the next few years. (-0.49)
37. I can talk about present challenges in my school or work life, such as paying for classes. (0.14)
44. I can give a clear and detailed story about childhood memories and answer questions about my story. (2.38)
45. I can exchange general information about my community, such as demographic information and points of interest. (2.38)
36. I can exchange factual information about social and environmental questions, such as retirement, recycling, or pollution. (2.85)

Advanced High:
40. I can participate in conversations on social or cultural questions relevant to speakers of this language. (2.25)
39. I can give a presentation about cultural influences on society. (2.36)
47. I can usually defend my views in a debate. (2.85)
48. I can exchange complex information about my academic studies. (3.14)

Superior:
49. I can provide a balance of explanations and examples on a complex topic. (3.66)
50. I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. (3.78)

Discussion

In the first phase of this study I analyzed 50 ACTFL (2015) Can-Do Statements targeting spoken language proficiency that were included in a computer-adaptive self-assessment constructed for the Language Flagship Initiative with input from ACTFL and LTI. The first research question assessed how well the fifty items included in the analysis fit the Rasch model measuring spoken language proficiency. Fourteen items did not fit the model, and when these items were deleted, there was evidence that the remaining 36 items did, in fact, measure a single latent trait. Specifically, the remaining items had fit values that fell within an acceptable range and did not display any obvious evidence of multidimensionality. As for the 14 misfitting items, the discussion that follows outlines three possible reasons why these items elicited unexpected response patterns from test takers in the current study and also offers recommended revisions. According to Jones (2002), discrepancies in item difficulty between what is observed and what is predicted can often be explained by considering the features of the misfitting items. The items in the present study that did not fit the model displayed a number of features that have been documented in the literature on response patterns to self-assessment items (Haladyna et al., 2002; Heilenman, 1990; Jones, 2002; Ross, 1998; Suzuki, 2015; Turner, 1984): items that addressed multiple skills, items that depended on specific experiences using the target language, and items that were vague. These problematic item features are shown in the last three columns of Table 4 for each of the misfitting items.
The research team discussed the misfitting items and reached consensus as to which misfit category each item belonged to. The majority of the misfitting items (11 out of 14) included one or more coordinating conjunctions (i.e., and, or). Combining two skills that may have different degrees of difficulty and for which learners have different degrees of mastery may have influenced the items' fit to the model. For example, Item 1 (I can say the date and day of the week) involves two skills with different degrees of difficulty. Saying the day of the week simply requires vocabulary knowledge of the days of the week, whereas saying the date requires both vocabulary knowledge of the months and numbers and knowledge of how to combine this information. Although coordinating conjunctions were also included in items that did fit the Rasch model, one possibility for why these items fit is that the skills they combine have similar degrees of difficulty. Further, it is generally accepted that assessment items should target only one skill (Haladyna et al., 2002). The unexpected item response patterns for some items that contained coordinating conjunctions may be due to the disparate nature of the skills being assessed, a problem that is exacerbated when multiple coordinating conjunctions are used in the same Can-Do Statement, as in Items 5, 8, 30, and 33. Another feature of items that did not fit the model was that they were brief and vague. In the present analysis, four items fit this description (Items 13, 14, 15, 23). These items are all noticeably shorter than the other statements and include topics that are not specific ("asking for help", "daily routines", "interests", and "past experiences"). A daily routine, for example, could be anything from what a person does (e.g., working eight hours per day) to a more detailed, grammar-focused (i.e., reflexive verbs) list of a person's morning routine that is found in many foreign language textbooks and curricula (e.g., Garcia & Asención, 2001).
The finding corroborates Jones' (2002) analysis of the CEFR Can-Do statements, which showed that items that were brief and included topics that were vague tended to elicit response patterns that were easier than expected. Five of the statements that did not fit the model made reference to experiences or knowledge that the typical college Spanish student would likely not have encountered (Items 18, 30, 32, 38, 47). Research has clearly shown that self-assessment items that are not relevant to students' lives tend to perform less well than items that describe skills that students have experienced in the classroom or in naturalistic settings (Butler & Lee, 2006; Ross, 1998; Suzuki, 2015; Turner, 1984). Among these items, 18 and 30 describe experiences that likely would require having spent time abroad, having made friends who speak the target language, or having experienced a medical emergency. Without this kind of experience, it would be very difficult to evaluate whether one can effectively carry out these tasks in the target language. Items 32 and 47, on the other hand, include modes of communication that could be more common in a language classroom. However, without the experience of giving a presentation specifically about one's interests or defending one's views in a debate format, it might not be possible to self-judge these skills. Item 38 requires very specific travel knowledge, which could have made this statement more difficult for assessees to endorse as something they can do (Jones, 2002). Without experience performing these skills, college-aged examinees would likely be unable to judge how difficult the tasks actually are, causing them to guess and leading to measurement error (Haladyna et al., 2002). In addition to the above issues that were common to multiple questionnaire items, Items 33 and 12 were poorly constructed because they included language use that might not be possible (i.e., describing a place one has not yet visited).
The second research question addressed the extent to which the difficulty of the items considered in this study would follow the hierarchy predicted in the ACTFL scale. Comparison of the mean Rasch difficulty estimates for items at each of the major threshold levels of the ACTFL scale revealed that the mean logit scores ascended in the anticipated order and that mean differences were statistically significant. This finding is in line with Brown et al. (2014), who found that the ACTFL Can-Do statements which they modeled ascended in the same order of difficulty as the threshold levels in the scale (although not to a statistically significant degree). The current study took the analysis one step further by considering the mean difficulty estimates for the sublevels at each proficiency level, which revealed that most sublevel means ascended in the predicted order. However, the mean difficulty score for items at the Advanced Low level was higher than that for items at the Advanced Mid level, and the ranges of these categories overlapped considerably. One explanation for this unexpected item difficulty is that some of the Advanced Low items described language use that may go beyond ACTFL's expectations of Advanced proficiency. Specifically, Items 41, 42, 43, and 46 included language use that would require speakers to discuss abstract topics and hypothetical situations, which reflect Superior-level language functions. The finding that the order of difficulty did not match the sublevel identifications exactly is not completely unexpected. For example, Shin (2013, p. 4) warned that learners' performances may not always indicate their levels of proficiency because language learners may differ in their interactions with the tasks and/or may have unstable language abilities because they are, in fact, learners. Further, it may not be possible to create Can-Do statements that precisely discriminate at the sublevel.
The findings of this study showed significant differences in difficulty between items at the major threshold levels, as expected. But the sublevel identifications were less accurate. This may not be unexpected because in theory the sublevels describe examinee performance between two adjacent levels: barely sustained performance of the floor level is associated with the Low sublevel, strong performance of the floor level with some success at the ceiling level indicates Mid sublevel proficiency, and performance that nearly sustains the ceiling level indicates High proficiency. These theoretical interpretations of Low, Mid, and High suggest that researchers should not expect to find unique differences in the difficulty of items at the sublevels. One might also speculate based on the data reported here that items from the Advanced Low level address topics or experiences that many college students are not familiar with. For example, the most difficult items from this sublevel clustered with the items from the Superior level and addressed topics (Item 42: I can present an explanation for a social or community project or policy) or experiences (Item 41: I can interview for a job or service opportunity related to my field of expertise) for which many college students may not have personal experiences. It is likely that most college students have not interviewed for a job in the target language, making it very difficult for them to accurately judge the difficulty of this and other tasks with which they do not yet have personal experience, as they may not recognize that these tasks require higher level proficiency skills. An alternate explanation is that for this population, the skills described in these Advanced items may represent language use that better describes Superior level language proficiency. The current study has informed the larger Language Flagship Proficiency Initiative by signaling which items in the self-assessment were not good indicators of spoken language proficiency.
For the second phase of this dissertation, the research team revised the misfitting items identified in this analysis for subsequent data collection and analyses. The aim was to create a self-assessment instrument that would elicit item responses from test takers that might better fit the model measuring spoken proficiency. One of the items that included coordinating conjunctions was revised by separating the conjoined skills into distinct statements to be evaluated separately: Item 1 was changed to two items: 1. I can say the day of the week. 2. I can say the date. To address the misfitting items in the assessment instrument that were brief and vague, these items were removed and replaced with more specific statements from the bank of NCSSFL-ACTFL (2015) Can-Dos. Another option for revision would have been to make these misfitting items more specific by including an exemplification (Weir, 2005). For example, Item 15, I can talk about my interests and hobbies, could be modified to give specific examples of topics that a test taker might address when they discuss their interests (e.g., I can talk about my interests, such as sports, music, parties). However, the aim was to include statements that were as true to the original ACTFL (2015) statements as possible. Finally, several misfitting items included in the self-assessment addressed experiences that college students may not have had. To increase the content validity of the revised instrument, these items were replaced with skills that are targeted to the population's typical foreign language experiences (e.g., describing summer plans or talking about weekend activities). The findings of the current study also appear to have informed a revision of the Can-Do Statements by ACTFL and NCSSFL.
Shortly after phase one of this study was fast-tracked to publication in ACTFL's journal, Foreign Language Annals (Tigchelaar, Bowles, Winke, & Gass, 2017), NCSSFL-ACTFL (2017) released a new set of Can-Dos that includes revisions to the previous statements and additional statements targeting intercultural communication.

PHASE II: SPRING 2017 PROFICIENCY TESTING

For the second phase of this dissertation, I revised the self-assessment instrument that was administered in the first phase of the study in collaboration with the Language Flagship research team. The motivation for this revision came from the finding that 14 of the items misfit the model measuring spoken proficiency. In addition, 4% of the test takers from the Spring 2015 round of testing over- or under-assessed their Spanish proficiency using the self-assessment. The same self-assessment was used in Spring 2016, and 9% of the Spanish test takers over- or under-assessed their proficiency. The goal for the revised instrument, then, was to create a more refined self-assessment that might elicit item responses from test takers that would better fit the model measuring spoken proficiency and that would guide even more test takers toward an appropriate OPIc test form. To do this, we kept the items that fit the model in the first phase of the study, removed the misfitting items that were identified, and replaced them with new Can-Do Statements. We then administered the revised self-assessment instrument to test takers who took the Language Flagship Proficiency testing in Spring 2017. In order to compare the results of the second phase of the study to the findings of the first phase, I formulated similar research questions (RQ2 and RQ3). I also added a question (RQ1) to explore in more detail the dimensionality of the measurement model of spoken proficiency. The questions that guided phase two of the study were: 1. How many factors do the Can-Do Statements for spoken proficiency measure?
Can they be used to measure a unitary dimension? The hypothesis, based on the ACTFL Guidelines (2012), is that there is one factor, speaking ability in general. An alternative hypothesis, based on the NCSSFL-ACTFL (2015) Can-Do Statements, is that there are two factors, presentational speaking and interactional communication.
2. Do the individual Can-Do Statements in the revised self-assessment instrument fit ACTFL's (2012) unitary and hierarchical model of spoken language proficiency when used for self-assessment?
2a. If they do not fit, can a reason for the misfit be identified in the content of the statements or in the characteristics of the test takers?
3. Do the difficulty levels of the Can-Do Statements in the revised self-assessment instrument match the hierarchy of the statements' assigned ACTFL (2012) levels and sublevels?

Methods

For the second phase of the study, I used data from the Language Flagship Proficiency testing project that was administered in Spring 2017. Participants. The participants in the second phase of the study were university-level Spanish students who took the revised 50-statement self-assessment and the ACTFL OPIc in Spring 2017. The students were in the first year of university-level Spanish (N=137), the second year (N=265), the third year (N=386), and the fourth year (N=98), for a total of 886 Spanish test takers. Materials. The materials included the revised computer-adaptive self-assessment questionnaire, composed of five testlets, or sets of ten Can-Do Statements, which were organized in the same way as in the first round of testing. This self-assessment included 36 items from the original Spring 2015 assessment that were found to have mean-square fit values that were productive for measurement (i.e., between 0.5 and 1.5) in the first phase of the study. I also selected 14 new items to replace the misfitting items from the first round of testing.
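The mean-square criteria used for item selection can be captured in a small helper. This is an illustrative sketch of the cutoffs named in the text (0.5 to 1.5 productive for measurement; above 2.0 degrading), not code used in the study:

```python
def classify_fit(mnsq):
    """Classify an infit/outfit mean-square statistic using the
    cutoffs cited in the text (cf. Linacre's guidelines).

    Items between 0.5 and 1.5 were retained from the 2015 instrument;
    in phase two, only values above 2.0 were treated as degrading.
    """
    if mnsq > 2.0:
        return "distorts measurement"
    if 0.5 <= mnsq <= 1.5:
        return "productive for measurement"
    return "unproductive, but not distorting"
```

Under these cutoffs, the 36 retained items all fall in the productive band, while the extreme outfit values reported later in this chapter would be flagged as distorting.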
I revised one of the original items that assessed multiple skills (I can say the date and day of the week) into two items (I can say the day of the week and I can say the date). I revised another of the original items by simplifying it (I can describe a place I have visited or want to visit became I can describe a place I have visited). I selected 11 other Can-Do Statements from NCSSFL-ACTFL's (2015) large bank of items that fit the following criteria: items that (a) were specific, (b) described language use relevant to college-level test takers, and (c) included a single language task. For the most part, I kept the original content and structure of the ACTFL Can-Do Statements. However, in addition to the three revised items described above, I eliminated some of the language in the replacement items that was not geared to college students' experiences. This text is shown in brackets in Table 9.

Table 9: 14 misfitting items and their replacements. Each misfitting item was categorized as vague, experience dependent, and/or assessing multiple skills.

Misfitting item: I can say the date and the day of the week. -NL
Replacement: I can say the day of the week. -NL

Misfitting item: I can list the months and seasons. -NL
Replacement: I can say the date. -NL

Misfitting item: I can state my favorite foods and drinks and those I don't like. -NM
Replacement: I can say what someone looks like. -NM

Misfitting item: I can list my classes and tell what time they start and end. -NM
Replacement: I can talk about what I do on the weekends. -NM

Misfitting item: I can describe a place I have visited or want to visit. -IL
Replacement: I can describe a place I have visited. -IL

Misfitting item: I can ask for help at school, work, or in the community. -IL
Replacement: I can describe what my summer plans are. -IL

Misfitting item: I can talk about my daily routine. -IM
Replacement: I can report on a social event that I attended. -IM

Misfitting item: I can talk about my interests and hobbies. -IM
Replacement: I can bring a conversation to a close. -IM

Misfitting item: I can plan an outing with a group of friends. -IH
Replacement: I can explain a series of steps needed to complete a task [or experiment].1 -IH

Misfitting item: I can describe a childhood or past experience. -IM
Replacement: I can give a short presentation on a current event. -IM

Misfitting item: I can explain an injury or illness and manage to get help. -AM
Replacement: I can describe in detail a social event [or local celebration]. -AM

Misfitting item: I can give a presentation about my interests, hobbies, lifestyle, or preferred activities. -IH
Replacement: I can make a presentation about an interesting person. -IH

Misfitting item: I can ask for and provide descriptions of places I know and also places I would like to visit. -IH
Replacement: I can explain to someone who was absent what took place in class [or on the job]. -IH

Misfitting item: I can exchange general information about leisure and travel, such as the world's most visited sites or most beautiful places to visit. -AM
Replacement: I can recount the details of a historical event. -AM

1. Note: The original NCSSFL-ACTFL (2015) statements include the bracketed text. This language was eliminated to reduce the statements to a single topic that is geared toward college students' knowledge or experiences.

Appendix C presents the revised 50-statement questionnaire that appeared in the Spring 2017 self-assessment, as well as information on both the ACTFL sublevels from which the 50 statements were selected and the way in which the statements were arranged into the five testlets. As in the first round of data collection, each testlet increased in difficulty, and the assessment was presented in a computer-adaptive format. After the first set of items, subsequent testlets were presented according to the test takers' self-assessment responses to the previous set of Can-Do Statements. Participants rated more difficult items if they said they could do the majority of the previous set of items.

Procedure. The test takers from 2017 followed the same procedure as those in 2015: They took the revised self-assessment with the 50 Can-Do Statements, followed by the ACTFL OPIc (ACTFL, 2012b).
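The adaptive routing through the testlets, using the pass thresholds described in the Procedure section (advance after endorsing nine of ten statements; eight of ten on the final set), can be sketched as follows. This is a hypothetical re-implementation for illustration, not the actual Flagship testing code:

```python
def recommend_opic_form(testlets):
    """Route a test taker through the five 10-item testlets.

    testlets: list of five lists of ratings (1-4), one list per
    testlet, in order of increasing difficulty. A rating of 4
    ("I can do this") counts as being able to do the statement well.
    Returns the recommended OPIc form (1-5).
    """
    for level, ratings in enumerate(testlets, start=1):
        done_well = sum(1 for r in ratings if r == 4)
        if level < 5:
            if done_well < 9:      # fewer than 9 of 10 done well:
                return level       # take this level's OPIc
            # otherwise continue to the next, harder testlet
        else:
            # final testlet uses an 8-of-10 threshold
            return 5 if done_well >= 8 else 4
```

For example, a test taker who endorses every statement in the first two testlets but only half of the third would be routed to OPIc form 3.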
If a learner indicated that he or she could not do nine or more of the ten Can-Do Statements in a set well, he or she was recommended to take that level's OPIc. If the learner indicated he or she could do nine or more of the ten Can-Do Statements well, he or she moved on to the next set of ten Can-Dos. At level 5, if the learner indicated he or she could do eight out of the ten very well, he or she was recommended to take level 5; otherwise, he or she was recommended to take level 4. Table 10 shows the number of participants who completed each of the levels of the self-assessment questionnaire and how many students were recommended to take which levels.

Table 10: Number of participants who completed each level of the self-assessment

Questionnaire statements  Corresponding OPIc level  N of test takers who responded at that level  N recommended to take that level of OPIc
1-10                      1                         886                                           501
11-20                     2                         385                                           228
21-30                     3                         157                                            52
31-40                     4                         105                                            70
41-50                     5                          73                                            35

The OPIcs were rated by certified ACTFL raters according to the ACTFL (2012) proficiency guidelines for speaking. Of the 886 Spanish test takers, 63 (7%) did not receive an OPIc rating because they either over-assessed (i.e., took a test that was too difficult; n=59) or under-assessed (i.e., took a test that was too easy; n=4) their speaking ability. Thus, 823 (93%) received OPIc ratings, which were distributed as shown in Figure 5. The distribution by class level is in Table 11.

Figure 5: Distribution of OPIc ratings in Spring 2017. The numeric OPIc ratings from 1-9 represent the range of ACTFL proficiency levels from Novice Low to Advanced High, and Above Range and Below Range are represented by -1 and 0, respectively.

Table 11: Distribution of 2017 OPIc ratings by class level

OPIc Rating                102 (N=137)  202 (N=265)  300-level (N=386)  400-level (N=98)
Above Range (N=4)           0            1             3                  0
Below Range (N=59)          3            7            36                 13
Novice Low (N=10)           9            1             0                  0
Novice Mid (N=64)          51           13             0                  0
Novice High (N=136)        61           50            22                  3
Intermediate Low (N=237)   10          107           104                 16
Intermediate Mid (N=255)    3           78           149                 25
Intermediate High (N=92)    0            3            60                 29
Advanced Low (N=21)         0            5             8                  8
Advanced Mid (N=7)          0            0             3                  4
Advanced High (N=1)         0            0             1                  0

Data Analysis. I used two statistical procedures for the second round of data analysis. I analyzed the test takers' item responses using a Rasch model and assessed the items' difficulty estimates, dimensionality, and fit to the model. The Rasch model hypothesizes that person ability and item difficulty can be measured on a unidimensional scale (Embretson & Reise, 2000; McNamara, 1991; Wright, 1991). The item response analysis of item fit and dimensionality represents a test of this hypothesis in relation to the data. To further explore the specification of item response modeling that the items in the assessment can measure a single construct, I also conducted an exploratory factor analysis (EFA).

Exploratory factor analysis (EFA). EFA allows researchers to assess the underlying structure of a test with no a priori hypothesis of how the items group into factors. The model identifies how the items form factors, or underlying latent traits, and estimates the strength of association of each item to a factor. To address the question of the dimensionality of the self-assessment instrument, I ran an EFA on the test takers' responses to the revised set of 50 items using Mplus Version 8 (Muthén & Muthén, 2017). Because of the adaptive nature of the assessment, many of the test takers did not provide item responses for all of the items. Given the amount of missing data, I followed Muthén and Muthén's (2017) suggestion to conduct the analysis using full information maximum likelihood (FIML) estimation. If there is more than one factor within the construct of speaking, I assumed that the factors would be strongly correlated (and when factors are believed to be correlated, they are called oblique).
So, based on that theoretical assumption, I performed the EFA using Geomin rotation, which is one of the standard oblique rotation methods available in Mplus. Data are rotated in EFA to "further analyze initial… EFA results with the goal of making the pattern of loadings clearer, or more pronounced" (Brown, 2009, p. 20), and researchers have the choice of oblique rotations or orthogonal rotations, depending on whether the factors are assumed to be correlated or not, respectively. I performed the EFA with data from 1,268 observations: 886 test takers' item responses to the 50-item, revised self-assessment from the second round of data collection, plus 382 test takers' item responses from the first round of data collection to the 36 items that were common to both datasets. I considered factor loadings of .45 to be fair, loadings of .55 to be good, and loadings of .71 and above to be excellent (Comrey & Lee, 1992).

Rasch analysis. As in the first phase of the study, I analyzed the test takers' item responses to the self-assessment questionnaire using a Rasch rating scale model (Rasch, 1960/80; Andrich, 1978). To evaluate whether the self-assessment items used in the second phase of the study are useful for measuring spoken language proficiency, I tested the hypotheses of dimensionality and item fit to the Rasch model. In the second data analysis, I flagged as problematic only items that had fit statistic values greater than 2.0, since these items could be degrading to the measurement system. Because the objective of this second data analysis was no longer construction of the assessment but quality control, this more liberal cutoff was adopted. To answer the third research question, I evaluated the model by conducting a difficulty analysis of the items to compare the item difficulty estimated by the Rasch model and the ACTFL proficiency level associated with these items.
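For reference, the Andrich (1978) rating scale model underlying these analyses can be written in terms of the log-odds of responding in rating category k rather than k−1, where θ_n is person n's ability, δ_i is item i's difficulty, and τ_k is the threshold between adjacent categories (shared across items in the rating scale model):

```latex
\ln\!\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - \delta_i - \tau_k
```

The infit and outfit statistics discussed below summarize how far observed responses deviate from this model's expectations.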
I plotted the item difficulties on a Wright map and calculated the mean item difficulty for items at each threshold level (i.e., Novice, Intermediate, Advanced, Superior) and sublevel (e.g., Novice Low, Novice Mid, Novice High) to determine whether the items ascended in the order of difficulty predicted by the ACTFL scale.

Results

Factor analysis. To answer the first research question regarding the factor structure of the Can-Do Statements included in the revised self-assessment, I began by performing an EFA. Even though the ACTFL (2012) Guidelines "are not based on any particular theory" (p. 3) of language proficiency, the model of language proficiency represented by the ACTFL pyramid (see Figure 1) and by the language of the ACTFL (2012) Guidelines strongly suggests that spoken language proficiency grows hierarchically through time and that speaking represents a single construct. The Can-Do Statements, however, appear to indicate that there are at least two dimensions of speaking ability within the larger construct of speaking ability in general. The 2015 statements are separated into two areas of speaking performance: presentational speaking and interpersonal communication. The 2017 statements include an intercultural communicative competence dimension. Thus, according to ACTFL, speaking is either a unitary construct, a general construct with two prominent sub-categories, or a multidimensional construct with at least two major categories. Therefore, I chose an EFA to explore how many dimensions might be represented by the data I collected. I also chose this method because I wanted to check the assumption of unidimensionality for using Rasch analysis, since unidimensionality is a requirement for data to be useful for latent trait measurement (Wright, 1991). In my analyses, both a one-factor model and a two-factor model converged (meaning that both are possible models).
The first, unidimensional model of speaking accounted for 19.14% of the variance in self-assessment scores. The second, two-dimensional model of speaking accounted for 34.08% of the variance: 19.14% in factor 1 and 14.94% in factor 2. The eigenvalues for each of the factors (used to calculate the variance) are shown in the scree plot in Figure 6. Although these models leave a lot of unexplained variance, previous studies have shown that "when item-level data are factor analyzed, it is not uncommon to see a relatively low percentage of variance accounted for by the first factor" (Young et al., 2008, p. 177). This is the case even when a one- or two-factor solution is selected as the most parsimonious model. Inspection of the factor loadings in Table 12 shows that the one-factor solution had excellent factor loadings (i.e., greater than .7; Comrey & Lee, 1992) that were all statistically significant. The two-factor model, on the other hand, included some items that had significant cross-loadings on both factors. All of the items from the Novice Low to Intermediate Low levels loaded onto the first factor. The second factor included the most difficult items in the analysis (i.e., the Advanced Mid to Superior level items). Items from the Intermediate Mid to Advanced Mid proficiency levels were less clear: Items 26, 29, 32, 34-36, and 38 had significant factor loadings that were fair (> .45) or good (> .55) on both factors. The remainder of the items from these Intermediate/Advanced proficiency levels were split between factors one and two. This pattern of factor loadings may be consistent with a difficulty factor; the pattern could also be interpreted as two factors that represent basic or core language proficiency at the lower levels and academic speaking at the upper levels (Cummins, 1979; Hulstijn, 2007). This was a rather unexpected finding, as ACTFL does not distinguish between these constructs in its model.
In other words, two hierarchical factors are not directly represented by the ACTFL pyramid. Further, the content of the items did not form any of the proposed distinct patterns of language use (e.g., presentational versus interactional speaking; description versus argumentation). If researchers want to keep the theoretical model of speaking based on the ACTFL Guidelines, the clear pattern of the loadings in the one-factor model is preferable. The second, two-factor model challenges the commonly accepted, ACTFL-based model of speaking proficiency as represented by the pyramid: It suggests that speaking at the lower levels of proficiency is qualitatively (and measurably) different (i.e., a different construct) from speaking at the upper levels of proficiency, at least for the college-level language learners in this study. This interpretation does not negate the ACTFL model, but it does suggest that the model could be refined; I will discuss this challenge further in the discussion section of this dissertation.

Figure 6: Scree plot of item-level EFA.

Table 12: Factor loadings for the 1-factor and 2-factor models

Item 1. I can say the date. -NL 2. I can say the day of the week. -NL 3. I can say which sports I like and don't like. -NM 4. I can list my favorite free-time activities and those I don't like. -NM 5. I can say what someone looks like. -NM 1-factor model: 0.723* 0.740* 0.782* 0.860* 0.819* 6. I can talk about my school or where I work. -NM 0.878* 7. I can talk about my room or office and what I have in it. -NM 0.832* 8. I can talk about what I do on the weekends. -NM 0.901* 9. I can answer questions about where I'm going or where I went. -NM 0.844* 2-factor model: Factor 1 Factor 2 0.567* 0.385* 0.502* 0.533* 0.754* 0.087 0.884* -0.032 0.831* 0.004 0.937* -0.111 0.968* -0.249* 1.031* -0.267* 0.963* -0.238 10. I can present information about something I learned in a class or at work. -NH 11. I can describe a school or workplace. -IL 12.
I can describe a place I have visited. -IL 0.798* 0.911* -0.218 0.776* 0.863* 1.040* -0.298* 1.008* -0.169 13. I can describe what my summer plans are. -IL 14. I can report on a social event that I attended. -IM 15. I can bring a conversation to a close. -IM 16. I can schedule an appointment. -IM 17. I can talk about my family history. -IH 18. I can explain a series of steps needed to complete a task. -IH 19. I can explain why I was late to class or absent from work and arrange to make up the lost time. -AL 20. I can tell a friend how I'm going to replace an item that I borrowed and broke/lost. -AL 21. I can give some information about activities I did. -IM 22. I can talk about my favorite music, movies, and sports. -IM 23. I can give a short presentation on a current event. -IM 24. I can ask for and follow directions to get from one place to another. -IH 25. I can return an item I have purchased to a store. -IH 26. I can arrange for a makeup exam or reschedule an appointment. -IH 27. I can present an overview about my school, community, or workplace. -AL 28. I can compare different jobs and study programs in a conversation with a peer. -AL 29. I can discuss future plans, such as where I want to live and what I will be doing in the next few years. -AM 30. I can describe in detail a social event. -AM 31. I can present ideas about something I have learned, such as a historical event, a famous person, or a current environmental issue. -IH 32. I can make a presentation about an interesting person. -IH 33. I can explain to someone who was absent what took place in class. -IH 34. I can explain how life has changed since I was a child and respond to questions on the topic. -AL 35. I can discuss what is currently going on in another community or country. -AL 36. I can provide a rationale for the importance of certain classes, subjects, or training programs. -AL 37.
I can talk about present challenges in my school or work life, such as paying for classes or dealing with difficult colleagues. -AM 0.876* 0.906* 0.853* 0.898* 0.881* 0.943* 0.960* 0.862* 0.030 0.841* 0.096 0.831* 0.036 0.787* 0.181 0.768* 0.168 0.773* 0.266 0.782* 0.288 0.939* 0.803* 0.237 0.931* 0.860* 0.955* 0.902* 0.214 0.754* 0.279 0.568* 0.703* 0.367* 0.647* 0.371* 0.917* 0.941* 0.735* 0.302 0.586* 0.465* 0.945* 0.652* 0.423* 0.961* 0.647* 0.436* 0.929* 0.549* 0.471* 0.978* 0.955* 0.655* 0.436* 0.287 0.654* 0.981* 0.904* 0.608* 0.499* 0.750* 0.323 0.974* 0.583* 0.524* 0.961* 0.501* 0.594* 0.978* 0.534* 0.585* 0.973* 0.327 0.725* 38. I can recount the details of a historical event. -AM 39. I can give a presentation about cultural influences on society. -AH 40. I can participate in conversations on social or cultural questions relevant to speakers of this language. -AH 0.975* 0.985* 0.483* 0.611* 0.318 0.746* 0.973* 0.281 0.728* 41. I can interview for a job or service opportunity related to my field of expertise. -AL 0.984* 0.315 0.756* 42. I can present an explanation for a social or community project or policy. -AL 43. I can present reasons for or against a position on a political or social issue. -AL 44. I can give a clear and detailed story about childhood memories, such as what happened during vacations or memorable events, and answer questions about my story. -AM 45. I can exchange general information about my community, such as demographic information and points of interest. -AM 46. I can exchange factual information about social and environmental questions, such as retirement, recycling, or pollution. -AM 47. I can usually defend my views in a debate. -AH 48. I can exchange complex information about my academic studies, such as why I chose the field, course requirements, projects, internship opportunities, and new advances in my field. -AH 49. I can provide a balance of explanations and examples on a complex topic. -S 50.
I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. -S Geomin factor correlations 0.993* 0.069 0.923* 0.992* 0.069 0.923* 0.993* 0.005 0.949* 0.992* -0.008 0.944* 0.993* -0.093 0.997* 0.997* 0.991* 0.153 -0.32 0.897* 1.078* 0.996* -0.202 1.072* 0.997* -0.086 1.024* 1.00 .520* Note: * = significant at the .05 level; good factor loadings (> 0.55) are highlighted in bold.

Fit to the Rasch model. The second research question addressed whether the individual Can-Do Statements fit ACTFL's model of spoken language proficiency when used for self-assessment. The question was answered by evaluating the dimensionality and fit to the model of the 50 Can-Do self-assessment statements describing spoken language proficiency. The initial Rasch rating scale model had a person reliability of .96, indicating that the self-assessment instrument discriminated well between test takers of varying proficiency levels. To re-examine the assumption of unidimensionality, the model was analyzed using a Principal-Components Analysis (PCA) of the residuals. The aim of the PCA was to further explore whether the instrument under consideration was measuring multiple dimensions. The Rasch dimension explained 53.8% of the variance, while the largest secondary dimension, or first contrast in the residuals, explained 3.3% of the variance. The eigenvalue of this first contrast was 3.80, or the strength of approximately four items. The disattenuated correlation for person measures was .452. Because this value is much lower than 1.00, I had some concerns that the instrument might be multidimensional. Further, the contrast plot (shown in Appendix D) showed a group of five outlying items at the top of the plot. Inspection of the items showed that these statements were from the final set of Can-Dos (Items 44, 46, 48, 49, 50).
One possibility, as described above, is that these items might be measuring an academic or higher-order speaking dimension, at least with this group of test takers. One of the items in the first contrast requires speakers to give a clear and detailed story about childhood memories, which does not necessarily describe academic speaking, but may still fit within a dimension that goes beyond a core language proficiency in speaking (Hulstijn, 2007). Beyond difficulty, these items share little in common: They represented both the interactional and presentational modes of speaking and covered different topics (also shown in Appendix D). Thus, these items may form an interpretable secondary dimension, but this dimension is not what is predicted by ACTFL. Next, I evaluated the item fit to the Rasch model, considering items that are productive for the construction of measurement (i.e., with fit statistics between 0.5 and 1.5) and looking for items that would distort the measurement (i.e., items with fit statistics > 2.0). In addition, I considered (a) the items that were selected to replace the misfitting items from the Spring 2015 self-assessment and (b) the items that were revised from the original assessment for the Spring 2017 version. Analysis of a Rasch model of the Spring 2017 test takers' responses revealed seven items that had very large outfit values, which indicates the items were not working (a mean-square of 1.0 and a standardized value of zero represent perfect fit, and large departures from these values indicate misfit), shown in Table 13. Inspection of the content of these items showed that two items (3 and 4) described likes and dislikes in the same statement, which could have distorted the measurement. Item 35, I can discuss what is currently going on in another community or country, stands out because this statement likely relies more on intercultural competence than it does on spoken proficiency, and also because the two objects (community and country) are so vastly different.
Test takers thinking of one or the other may have extremely different responses, which would contribute to item noise, or the item's inability to measure one thing well.

Table 13: Misfitting items from the initial 2017 Rasch model

Statement                                                                          Difficulty estimate (S.E.)  Infit MNSQ (z-std)  Outfit MNSQ (z-std)
3. I can say which sports I like and don't like. (NM)                              -5.51 (.10)                 1.06 (0.8)          9.90 (9.9)
4. I can list my favorite free time activities and those I don't like. (NM)        -5.15 (.09)                 0.78 (-3.7)         4.50 (9.9)
5. I can say what someone looks like. (NM)                                         -3.89 (.08)                 0.94 (-1.0)         3.35 (9.1)
7. I can talk about my room or office and what I have in it. (NM)                  -2.64 (.07)                 0.92 (-1.6)         5.32 (9.9)
11. I can describe a school or workplace. (IL)                                     -2.34 (.14)                 1.19 (2.0)          9.90 (9.9)
17. I can talk about my family history. (IH)                                        0.07 (.11)                 1.17 (2.0)          4.51 (9.7)
35. I can discuss what is currently going on in another community or country. (AL)  1.79 (.26)                 1.11 (0.7)          2.07 (2.0)

Deleting these items resulted in more misfitting items (unlike the analysis in phase one of the study, where deleting misfitting items resulted in the remainder of the items having fit values within the acceptable range). After deleting the above items, seven more items had fit values greater than 2.0, shown in Table 14. Inspection of this model's residuals revealed an even weaker disattenuated correlation of .151. Thus, deletion of all the items that initially misfit the model does not seem to result in a set of items that all contribute to the measurement of spoken proficiency.

Table 14: More misfitting items

Statement                                                                          Difficulty estimate (S.E.)  Infit MNSQ (z-std)  Outfit MNSQ (z-std)
2. I can say the day of the week.                                                  -7.78 (.13)                 1.16 (2.5)          2.32 (4.0)
10. I can present information about something I learned in a class or at work.     -3.06 (.07)                 1.04 (0.7)          3.35 (9.9)
16. I can schedule an appointment.                                                 -0.80 (.11)                 2.06 (3.7)          0.86 (-1.7)
19. I can explain why I was late to class or absent from work and arrange to make up the lost time.  -1.49 (.12)  1.00 (0.1)      2.51 (4.5)
20. I can tell a friend how I'm going to replace an item that I borrowed and broke or lost.          -0.10 (.11)  1.01 (.2)       3.25 (7.1)
24. I can ask for and follow directions to get from one place to another.          -0.26 (.23)                 1.06 (0.5)          3.88 (3.4)

These findings suggest that the misfit may not be a result of the characteristics of the items, or at least not of the items alone. The magnitude of the outfit values highlights that the items received unexpected item responses from test takers whose ability was far from the item difficulty level. For example, Items 3-5 (Novice Mid level items with very low difficulty estimates) likely received item responses from high-ability test takers who said they could not do these things, which is unexpected. Inspection of the person fit to the model showed that this was in fact the case, as thirteen test takers had outfit mean-square values of 9.90. The item responses of these test takers show that the majority of these people rated all 50 Can-Do Statements, which suggests that they have high spoken proficiency levels. Each of the response strings of these misfitting people (shown in Table 15) includes at least one statement from either the first form (statements 1-10) or second form (statements 11-20) that was highly unexpected, as indicated by Winsteps. For example, test taker 698, who received an Advanced Mid proficiency rating, indicated that they would need some help to list favorite free time activities (Item 4). This rating is indeed quite unexpected, because listing favorite free time activities should be something that someone with Advanced proficiency can do independently and easily.
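The leverage that a single wildly unexpected response exerts on outfit (but much less on infit) can be illustrated with a minimal sketch. For simplicity this uses the dichotomous Rasch model rather than the rating scale model from the study, and the function names are illustrative:

```python
import math

def rasch_p(theta, b):
    """P(success) under the dichotomous Rasch model, a simplification
    of the rating scale model used in the study."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def infit_outfit(responses):
    """responses: list of (theta, b, x) tuples with x in {0, 1}.

    Returns (infit MNSQ, outfit MNSQ). Both have an expected value of
    1.0 under the model; outfit is an unweighted mean of squared
    standardized residuals, so it is highly sensitive to responses
    far from a person's ability level.
    """
    z2, variances, raw2 = [], [], []
    for theta, b, x in responses:
        p = rasch_p(theta, b)
        v = p * (1.0 - p)            # model variance of the response
        z2.append((x - p) ** 2 / v)  # squared standardized residual
        raw2.append((x - p) ** 2)    # squared raw residual
        variances.append(v)
    outfit = sum(z2) / len(z2)       # unweighted mean
    infit = sum(raw2) / sum(variances)  # information-weighted mean
    return infit, outfit

# A high-ability person (theta = 3) failing one very easy item
# (b = -4) while succeeding on moderate items inflates outfit
# far more than infit.
infit_mnsq, outfit_mnsq = infit_outfit(
    [(3.0, -4.0, 0)] + [(3.0, 0.0, 1)] * 9
)
```

A single such response is enough to push outfit well past the 2.0 cutoff, which is the pattern seen for the misfitting persons in Table 15.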
All of the items that misfit in the original model (except Item 35) are among these most unexpected item responses from the test takers who had very high fit statistics. These responses appear in the response strings (which represent ratings from 1-4 for each of the 50 items in the self-assessment), and the item numbers of the most unexpected responses are shown in parentheses after each string.

Table 15: Misfitting people, ratings and response strings (and most unexpected responses)

Person  Outfit MNSQ (z-std)  OPIc form  Levels assessed  Proficiency rating  Response string (most unexpected items)
715     9.90 (3.4)           5          AM - S           BR                  44444444444444443444444444444444444444444444444444 (17)
512     9.90 (5.1)           5          AM - S           BR                  44444434444444444443444433444444444444443444444443 (7)
257     9.90 (9.9)           4          IH - AM          BR                  44344444444444444433444444444444443444443334434434 (3)
185     9.90 (7.5)           5          AM - S           BR                  44444434444444444443444344444444444444444444444444 (7)
149     9.90 (3.4)           5          AM - S           BR                  44444444444444443444444444444444444444444444444444 (17)
652     9.90 (9.9)           5          AM - S           AR                  44444444443444444444444444444444444444444444444444 (11)
119     9.90 (4.7)           5          AM - S           IH                  44444444434444434444443444444434444444443444444444 (10, 16)
509     9.90 (9.9)           4          IH - AM          IM                  44344444444444444444444444434444443443443434433433 (3)
698     9.90 (9.9)           4          IH - AM          AM                  44434444444444444443444444334444444443443334443343 (4)
616     9.90 (7.1)           4          IH - AM          IH                  44443434444444444444444444444434444344443334344443 (5, 7)
065     9.90 (4.2)           4          IH - AM          IH                  4444343444444444444444443444444343344444 (5, 7)
006     9.90 (9.4)           4          IH - AM          IH                  44344434444444434444444444443444444444443333332233 (3, 7)
237     9.90 (6.2)           3          IM - AL          IM                  434444444444443444444433333344 (2)

Note: the most unexpected item responses are indicated in parentheses.

Of these people, five (715, 512, 257, 185, and 149) did not receive a rating (BR) because they took an OPIc test that was too difficult for them.
Interestingly, two test takers (119 and 509) selected an OPIc form that assessed proficiency levels that were above their final rating: Person 119 selected form 5, which is designed to elicit speech at the Advanced Mid to Superior levels, and was rated as Intermediate High (i.e., two levels below what the form is designed to measure). Person 509 selected form 4, which is designed to elicit speech at the Intermediate High to Advanced Mid levels, and was rated as Intermediate Mid. These seven test takers over-assessed their spoken proficiency using the self-assessment instrument. One highly misfitting test taker under-assessed their ability: They selected the highest OPIc form after saying they could do nearly all of the statements in the self-assessment. However, the ACTFL rating indicated that their proficiency was higher than form 5 could assess (i.e., higher than Superior-level proficiency). In fact, this test taker was a heritage speaker or a native speaker of Spanish (they spoke Spanish at home), as I found out after inquiring about the student's background (without personal identifiers) from the Flagship team. Five other test takers had extreme outfit values and received official ACTFL ratings that corresponded to the OPIc test forms that they selected. Their proficiency ratings ranged from Intermediate Mid to Advanced Mid. Interestingly, none of the highly misfitting test takers selected a test form that tested the lowest four proficiency levels (i.e., Novice Low to Intermediate Low). Deletion of these highly misfitting test takers (outfit MNSQ of 9.90) and Item 35 (which seems to rely more on intercultural competence than on spoken proficiency) resulted in a final model that had a person reliability of .96, with evidence of unidimensionality, and fewer misfitting items. With these deletions, the Rasch dimension explained 57.7% of the variance (an increase from 53.8%), and the disattenuated correlation between measures increased to .86.
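For reference, a disattenuated correlation is the observed correlation between two measures corrected for their unreliability (Spearman's correction for attenuation). A minimal sketch, with reliability values invented purely for illustration (they are not the study's):

```python
import math

def disattenuated_r(r_observed, reliability_x, reliability_y):
    """Spearman's correction for attenuation: the correlation two
    measures would show if both were perfectly reliable."""
    return r_observed / math.sqrt(reliability_x * reliability_y)

# Hypothetical example: an observed correlation of .77 between two
# measures that each have reliability .90 disattenuates to about .86.
r_true = disattenuated_r(0.77, 0.90, 0.90)
```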
Further, there was no longer an obvious cluster of items at the top of the contrast plot (shown in Appendix 5), which lends support for a unidimensional model of spoken proficiency. The final model (shown in Table 16) still includes some misfitting items, indicating that scale improvement beyond a certain point may not be possible with these data. This may not be surprising given that this is a self-assessment, which I will discuss in the discussion section.

Table 16: Final model fit statistics for the revised self-assessment questionnaire

Item | Estimate (S.E.) | Infit MNSQ (z-std) | Outfit MNSQ (z-std)
1. I can say the date. -NL | -5.61 (.09) | 1.18 (2.8) | 1.30 (1.5)
2. I can say the day of the week. -NL | -7.26 (.13) | 1.29 (3.1) | 0.67 (-1.4)
3. I can say which sports I like and don’t like. -NM | -6.19 (.10) | 1.06 (0.9) | 0.84 (-0.7)
4. I can list my favorite free time activities and those I don’t like. -NM | -5.80 (.09) | 0.79 (-3.6) | 0.58 (-2.5)
5. I can say what someone looks like. -NM | -4.49 (.08) | 0.97 (-0.6) | 1.50 (2.8)
6. I can talk about my school or where I work. -NM | -4.77 (.08) | 0.77 (-4.4) | 0.57 (-3.0)
7. I can talk about my room or office and what I have in it. -NM | -3.20 (.07) | 0.92 (-1.4) | 1.00 (0.1)
8. I can talk about what I do on the weekends. -NM | -4.93 (.08) | 0.74 (-5.0) | 0.58 (-2.7)
9. I can answer questions about where I’m going or where I went. -NM | -3.93 (.08) | 0.90 (-1.9) | 0.89 (-0.7)
10. I can present information about something I learned in a class or at work. -NH | -2.69 (.07) | 1.14 (2.6) | 1.45 (3.5)
11. I can describe a school or workplace. -IL | -2.80 (.14) | 1.21 (2.2) | 2.56* (4.2)
12. I can describe a place I have visited. -IL | -2.87 (.15) | 0.98 (-0.2) | 0.75 (-0.9)
13. I can describe what my summer plans are. -IL | -3.05 (.15) | 1.09 (0.9) | 0.74 (-0.9)
14. I can report on a social event that I attended. -IM | -1.77 (.12) | 0.93 (-0.9) | 0.80 (-0.8)
15. I can bring a conversation to a close. -IM | -2.16 (.13) | 1.06 (0.7) | 0.79 (-0.9)
16. I can schedule an appointment. -IM | -0.54 (.11) | 0.98 (-0.2) | 1.77 (3.1)
17. I can talk about my family history. -IH | -0.30 (.11) | 1.23 (2.6) | 2.60* (5.5)
18. I can explain a series of steps needed to complete a task. -IH | -0.94 (.12) | 1.08 (0.9) | 1.20 (1.0)
19. I can explain why I was late to class and arrange to make up the lost time. -AL | -1.17 (.12) | 0.90 (-1.3) | 0.70 (-1.4)
20. I can tell a friend how I’m going to replace an item that I borrowed and broke/lost. -AL | 0.16 (.11) | 1.03 (0.4) | 2.45* (5.2)
21. I can give some information about activities I did. -IM | -2.08 (.47) | 0.94 (0.0) | 0.39 (-1.0)
22. I can talk about my favorite music, movies, and sports. -IM | -1.72 (.40) | 0.97 (0.0) | 0.54 (-0.6)
23. I can give a short presentation on a current event. -IM | 0.77 (.22) | 1.00 (0.0) | 1.77 (1.4)
24. I can ask for and follow directions to get from one place to another. -IH | -0.11 (.25) | 1.07 (0.5) | 1.58 (1.1)
25. I can return an item I have purchased to a store. -IH | 1.25 (.21) | 1.01 (0.2) | 2.64* (2.6)
26. I can arrange for a makeup exam or reschedule an appointment. -IH | 0.23 (.23) | 1.06 (0.5) | 0.73 (-0.4)
27. I can present an overview about my school, community, or workplace. -AL | -0.36 (.26) | 0.78 (-1.4) | 0.46 (-1.1)
28. I can compare different jobs and study programs in a conversation with a peer. -AL | 1.19 (.21) | 0.84 (-1.2) | 0.60 (-0.8)
29. I can discuss future plans, such as where I want to live and what I will be doing in the next few years. -AM | -0.30 (.26) | 1.02 (0.2) | 1.05 (0.3)
30. I can describe in detail a social event. -AM | 0.63 (.22) | 0.74 (-2.3) | 0.60 (-0.7)
31. I can present ideas about something I have learned, such as a historical event, a famous person, or a current environmental issue. -IH | 2.34 (.29) | 1.02 (0.2) | 2.96* (2.5)
32. I can make a presentation about an interesting person. -IH | 0.43 (.41) | 0.71 (-1.1) | 0.31 (-0.7)
33. I can explain to someone who was absent what took place in class. -IH | -0.16 (.49) | 1.10 (0.4) | 1.43 (0.7)
34. I can explain how life has changed since I was a child and respond to questions on the topic. -AL | 0.99 (.35) | 0.90 (-0.4) | 0.58 (-0.3)
36. I can provide a rationale for the importance of certain classes, subjects, or training programs. -AL | 2.09 (.29) | 1.03 (0.3) | 2.03 (1.5)
37. I can talk about present challenges in my school or work life, such as paying for classes or dealing with difficult colleagues. -AM | 1.11 (.34) | 0.86 (-0.6) | 0.91 (0.2)
38. I can recount the details of a historical event. -AM | 3.10 (.27) | 1.15 (0.8) | 1.89 (2.0)
39. I can give a presentation about cultural influences on society. -AH | 2.19 (.29) | 0.92 (-0.4) | 1.01 (0.2)
40. I can participate in conversations on social or cultural questions relevant to speakers of this language. -AH | 2.81 (.27) | 1.17 (0.9) | 1.47 (1.1)
41. I can interview for a job or service opportunity related to my field of expertise. -AL | 5.54 (.32) | 1.00 (0.1) | 1.12 (0.6)
42. I can present an explanation for a social or community project or policy. -AL | 4.89 (.34) | 0.75 (-1.1) | 0.68 (-1.2)
43. I can present reasons for or against a position on a political or social issue. -AL | 5.74 (.32) | 0.91 (-0.3) | 0.89 (-0.4)
44. I can give a clear and detailed story about childhood memories, such as what happened during vacations or memorable events, and answer questions about my story. -AM | 3.40 (.41) | 1.05 (0.3) | 0.70 (-0.4)
45. I can exchange general information about my community, such as demographic information and points of interest. -AM | 3.56 (.40) | 0.99 (0.1) | 0.76 (-0.3)
46. I can exchange factual information about social and environmental questions, such as retirement, recycling, or pollution. -AM | 5.54 (.32) | 0.91 (-0.3) | 0.85 (-0.6)
47. I can usually defend my views in a debate. -AH | 5.54 (.32) | 0.62 (-1.9) | 0.60 (-1.9)
48. I can exchange complex information about my academic studies, such as why I chose the field, course requirements, projects, internship opportunities, and new advances in my field. -AH | 5.01 (.34) | 1.14 (0.7) | 1.24 (0.9)
49. I can provide a balance of explanations and examples on a complex topic. -S | 5.54 (.32) | 0.78 (-1.0) | 0.75 (-1.1)
50. I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. -S | 5.22 (.33) | 0.63 (-1.8) | 0.55 (-2.0)

Replacement items. The fourteen items that were selected to replace the misfitting items identified in Spring 2015, and their fit statistics from the second round of testing, are presented in Table 17. These items were selected preferentially so that they were specific (avoiding statements that were vague to the extent possible), included a single spoken task (avoiding the description of multiple skills in the same item), and addressed language use that college-level test takers would have experience with.

Table 17: Fit statistics for the fourteen replacement items included in the revised Spring 2017 assessment

Item | Difficulty estimate (S.E.) | Infit MNSQ (z-std) | Outfit MNSQ (z-std)
1. I can say the date. -NL | -5.61 (.09) | 1.18 (2.8) | 1.30 (1.5)
2. I can say the day of the week. -NL | -7.26 (.13) | 1.29 (3.1) | 0.67 (-1.4)
5. I can say what someone looks like. -NM | -4.49 (.08) | 0.97 (-0.6) | 1.50 (2.8)
8. I can talk about what I do on the weekends. -NM | -4.93 (.08) | 0.74 (-5.0) | 0.58 (-2.7)
12. I can describe a place I have visited. -IL | -2.87 (.15) | 0.98 (-0.2) | 0.75 (-0.9)
13. I can describe what my summer plans are. -IL | -3.05 (.15) | 1.09 (0.9) | 0.74 (-0.9)
14. I can report on a social event that I attended. -IM | -1.77 (.12) | 0.93 (-0.9) | 0.80 (-0.8)
15. I can bring a conversation to a close. -IM | -2.16 (.13) | 1.06 (0.7) | 0.79 (-0.9)
18. I can explain a series of steps needed to complete a task. -IH | -0.94 (.12) | 1.08 (0.9) | 1.20 (1.0)
23. I can give a short presentation on a current event. -IM | 0.77 (.22) | 1.00 (0.0) | 1.77 (1.4)
30. I can describe in detail a social event. -AM | 0.63 (.22) | 0.74 (-2.3) | 0.60 (-0.7)
32. I can make a presentation about an interesting person. -IH | 0.43 (.41) | 0.71 (-1.1) | 0.31 (-0.7)
33. I can explain to someone who was absent what took place in class. -IH | -0.16 (.49) | 1.10 (0.4) | 1.43 (0.7)
38. I can recount the details of a historical event. -AM | 3.10 (.27) | 1.15 (0.8) | 1.89 (2.0)

Twelve of the fourteen replacement items had fit statistics that fell between 0.5 and 1.5, indicating that they are productive for measuring spoken language proficiency. Two items (23 and 38) had fit values that were between 1.5 and 2.0, indicating that these items are unproductive for the construction of measurement, but not degrading to the instrument.

Revised items. After the first round of testing, two items were revised and reassessed by test takers in the second round of testing. These items and their fit statistics are shown in Table 18.

Table 18: Fit statistics for original and revised items

Item | Difficulty estimate (S.E.) | Infit MNSQ (z-std) | Outfit MNSQ (z-std)
Original (2015): 1. I can say the date and day of the week. | -4.68 (.14) | 1.14 (1.4) | 1.94* (2.4)
Revision (2017): 1. I can say the date. | -5.61 (.09) | 1.18 (2.8) | 1.30 (1.5)
Revision (2017): 2. I can say the day of the week. | -7.26 (.13) | 1.29 (3.1) | 0.67 (-1.4)
Original (2015): 12. I can describe a place I have visited or want to visit. | -1.87 (.26) | 0.89 (-0.5) | 0.45* (-1.0)
Revision (2017): 12. I can describe a place I have visited. | -2.87 (.15) | 0.98 (-0.2) | 0.75 (-0.9)

The first and easiest item, I can say the date and day of the week, misfit the model (outfit MNSQ = 1.94) in the first round of testing. While this value does not indicate that the item would be degrading for measurement, it is likely not productive for the construction of a test measuring spoken proficiency (Wright & Linacre, 1994).
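The interpretive bands applied above to mean-square fit values (cited from Wright & Linacre, 1994) can be summarized in a small helper; the band labels follow the wording used in this section:

```python
def mnsq_band(mnsq):
    """Interpretation bands for infit/outfit mean-square values,
    following the ranges cited from Wright and Linacre (1994)."""
    if mnsq > 2.0:
        return "distorts or degrades the measurement system"
    if mnsq > 1.5:
        return "unproductive for construction of measurement, but not degrading"
    if mnsq >= 0.5:
        return "productive for measurement"
    return "less productive for measurement, but not degrading"

# Item 23 (outfit 1.77) and the original combined Item 1 (outfit 1.94)
# both fall in the unproductive-but-not-degrading band.
```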
The original item includes two skills, and thus it was separated and included as two items in the modified instrument that was administered in the second round of testing. The item responses in the Spring 2017 round of testing give some indication of why the original item may have misfit the model: there is a difference in difficulty of nearly two logits for the two skills, where saying the day of the week was rated as much easier (-7.26 logits) than saying the date (-5.61 logits). This is not surprising given that saying the day of the week requires simply memorizing seven vocabulary words, whereas saying the date includes month vocabulary, numbers, and the knowledge of how to combine this information. Test takers’ responses in the first round of testing resulted in a difficulty estimate of -4.68 for the item combining the two skills, which was similar to the estimate for saying the date in the second round of testing. Some test takers may have ignored the part of the item that referred to saying the day of the week, while others may have rated whether they could do both of the skills. These potential differences in interpretation may have caused the original item to misfit the model.

Item 12 originally misfit the model and contained two types of descriptions: places I have visited and places I want to visit. In the Spring 2017 round of testing, the item was modified to remove the description of a place someone would like to visit (describing a place one has never seen may not be possible). This change resulted in an item that was productive for measurement, with fit values between 0.5 and 1.5.

To summarize the findings for the second research question, which addressed fit to the Rasch model: when all items and all test takers were included in the model, there were seven misfitting items and evidence that the items in the instrument did not contribute to a unitary measurement of spoken proficiency.
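To give the two-logit gap between the split skills a concrete interpretation, a simple dichotomous Rasch sketch (a simplification of the rating scale model actually used; the learner's ability value is hypothetical) shows how much more likely endorsement becomes:

```python
import math

def p_endorse(theta, b):
    """Dichotomous Rasch probability that a person of ability theta
    endorses an item of difficulty b (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

theta = -5.61                      # hypothetical learner sitting exactly at Item 1's difficulty
p_date = p_endorse(theta, -5.61)   # 0.50 by construction
p_day = p_endorse(theta, -7.26)    # about 0.84: the same learner is far
                                   # more likely to endorse the easier skill
```

A gap of nearly two logits thus moves a borderline learner from a coin flip to a strong expectation of endorsement, which is why combining the two skills in one statement invited inconsistent interpretations.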
Deleting thirteen test takers with very high outfit values (people who either over-assessed their spoken proficiency and selected a test form that was too difficult, may not have been second language learners, or interacted with the self-assessment in an unexpected way), and Item 35, which appears to rely more on intercultural competence than on spoken language proficiency, or which may be a compound question asking (problematically) two very different questions, resulted in a final model that had fewer misfitting items and evidence that the items in the model contributed to a unitary measurement of spoken language proficiency. The items that were revised or selected as replacements for the 2017 round of data collection all fit the model.

Item difficulty estimates. The third research question addressed the extent to which the difficulty of the statements (estimated by the Rasch model) matched the ACTFL (2012) proficiency levels associated with the revised Can-Do Statements. The items are plotted in order of difficulty in the Wright map shown in Figure 7.

[Figure 7: Wright map of test taker ability and item difficulty. Item difficulties span roughly -7 logits (Item 2, Novice Low) at the bottom of the map to above +5 logits at the top, where the Advanced Low Items 41 and 43 sit alongside the Advanced High and Superior items.]
I calculated the mean logit scores from the Rasch analysis for the 50 statements of the revised instrument at each of the threshold levels (i.e., Novice, Intermediate, Advanced, Superior) and sublevels (e.g., Novice Low, Novice Mid, Novice High) to evaluate whether they ascended according to the hierarchy of the ACTFL (2012) scale. The mean difficulty estimates for each of the major threshold levels, presented in Table 19, ascended in the expected order: Novice statements were the easiest (M = -4.27, SD = 1.35), followed by Intermediate (M = -0.56, SD = 1.38) and Advanced (M = 2.08, SD = 1.66), with Superior statements being the most difficult (M = 4.31, SD = 0.11). The 95% CIs for the thresholds did not overlap (shown in Figure 8), which suggests that the differences in mean statement difficulty between each of the major ACTFL levels were statistically significant.

Table 19: Descriptive statistics for 2017 difficulty estimates of ACTFL threshold levels

ACTFL threshold | N | Mean logit score (SD) | SE | 95% CI
1 - Novice | 10 | -4.27 (1.35) | .43 | -5.25, -3.31
2 - Intermediate | 17 | -0.56 (1.38) | .33 | -1.27, 0.15
3 - Advanced | 21 | 2.08 (1.66) | .36 | 1.33, 2.84
4 - Superior | 2 | 4.31 (0.11) | .08 | 3.29, 5.33

Note. N = number of statements.

Figure 8: 95% confidence intervals of the mean threshold difficulty estimates. 1 = Novice; 2 = Intermediate; 3 = Advanced; 4 = Superior.

The mean difficulty estimates for the ACTFL sublevels, presented in Table 20, ascended as anticipated. Inspection of the 95% CIs, plotted in Figure 9, showed some statistical differences between item difficulties at the sublevels. The intervals at the Novice Mid, Intermediate Low and Intermediate Mid levels did not overlap. No interval could be calculated for the Novice High level since only one item was included from this level. The interval at the Novice Low level is very wide because this level only included two items; this interval overlaps with all other intervals.
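The confidence intervals in Tables 19 and 20 behave exactly as t-based intervals on small samples predict; a minimal sketch (with Student's t critical values hard-coded to keep it dependency-free) reproduces the very wide Novice Low interval:

```python
import math

# 97.5th-percentile Student's t critical values for selected df
T_CRIT = {1: 12.706, 2: 4.303, 6: 2.447, 9: 2.262}

def ci95(mean, sd, n):
    """Two-sided 95% confidence interval for a mean (t-based)."""
    half_width = T_CRIT[n - 1] * sd / math.sqrt(n)
    return (mean - half_width, mean + half_width)

def intervals_overlap(a, b):
    """True if two (low, high) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

# Novice Low (Table 20): only two items, so df = 1 and t = 12.706,
# which produces the enormous interval noted in the text.
nov_low = ci95(-5.80, 1.21, 2)   # roughly (-16.7, 5.1)
nov_mid = ci95(-4.15, 0.99, 7)   # roughly (-5.1, -3.2)
# The Novice Low interval swallows the Novice Mid one entirely:
assert intervals_overlap(nov_low, nov_mid)
```

With only two statements per sublevel, even a tiny SD yields an interval too wide to be informative, which is why the Novice Low and Superior comparisons carry little statistical weight.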
The 95% CIs for adjacent means at the Intermediate Mid level and above all overlapped. Thus, the differences between higher proficiency sublevels are not necessarily statistically significant. Also of note are the mean difficulty estimates for the Advanced Low (M = 1.84) and Advanced Mid (M = 1.89) levels. The means and 95% CIs are very similar, suggesting that the items at these two proficiency levels do not discriminate well at the sublevel.

Table 20: Descriptive statistics for 2017 difficulty estimates of ACTFL sublevels

ACTFL sublevel | N | Mean logit score (SD) | SE | 95% CI
1 - Nov-Low | 2 | -5.80 (1.21) | .86 | -16.66, 5.07
2 - Nov-Mid | 7 | -4.15 (0.99) | .37 | -5.06, -3.23
3 - Nov-High | 1 | -2.16 (Note 1) | - | -
4 - Int-Low | 3 | -2.44 (0.12) | .07 | -2.73, -2.14
5 - Int-Mid | 6 | -0.94 (1.10) | .45 | -2.09, 0.21
6 - Int-High | 8 | 0.43 (0.86) | .30 | -0.29, 1.15
7 - Adv-Low | 10 | 1.84 (1.91) | .60 | 0.47, 3.21
8 - Adv-Mid | 7 | 1.89 (1.50) | .56 | 0.51, 3.27
9 - Adv-High | 4 | 3.01 (1.21) | .60 | 1.08, 4.94
10 - Superior | 2 | 4.31 (0.11) | .08 | 3.29, 5.33

Notes: 1. Because there was only one item at the Novice High level in the instrument, the mean logit score was not included in the difficulty analysis.

Figure 9: 95% confidence intervals of the mean sublevel difficulty estimates. 1 = NL; 2 = NM; 3 = NH; 4 = IL; 5 = IM; 6 = IH; 7 = AL; 8 = AM; 9 = AH; 10 = S.

The items at both the Advanced Low and Advanced Mid levels had very wide ranges of difficulty (AL: -1.17 – 5.54; AM: -0.30 – 3.56). As in the analysis in phase one of the study, the two most difficult items (41 and 43) were from the Advanced Low level. These items shared similar difficulty levels with the two Superior-level items in the revised assessment. As noted in phase one, the likely cause for this unexpected item difficulty lies in the content of the four most difficult Advanced Low items. These items require speakers to provide policy explanations, reasons for a position, and rationales, and to perform an interview.
These are language functions that better describe Superior-level language use, as they require speakers to use argumentation and to hypothesize.

Discussion

In the second phase of this study I analyzed the revised computer-adaptive self-assessment constructed for the Language Flagship Proficiency Initiative. The instrument included 50 ACTFL (2015) Can-Do Statements targeting spoken language proficiency: 36 items from the original self-assessment and 14 items that were either new ACTFL (2015) Can-Do selections or revised items.

Dimensions of spoken proficiency. The first research question addressed the number of factors of spoken proficiency that are measured by the revised 50-item self-assessment. The hypothesis, based on the ACTFL Guidelines, was that speaking can be measured as a unitary construct. An alternative hypothesis, based on the NCSSFL-ACTFL (2015, 2017) Can-Do Statements, was that spoken proficiency includes multiple dimensions: a presentational speaking factor and an interpersonal communication factor. The most recent publication of the Can-Do Statements (ACTFL, 2017) also includes a possible third factor: intercultural communicative competence. To test these hypotheses, I performed an EFA and also considered the results of a PCA of the residuals of a Rasch model of test takers’ item responses. The EFA provided evidence for both a unidimensional model of spoken proficiency and a model with two factors. The unidimensional model had a very clear pattern of significant, strong factor loadings, which is in line with the theoretical model of speaking represented by the ACTFL pyramid.
A Rasch model that included 49 of the 50 self-assessment items (excluding Item 35, which appeared to rely more on intercultural competence than on spoken proficiency) also provided evidence for a unidimensional model: the PCA of the model residuals showed no separate clusters of items in the contrast plot, and items in the first and third contrasts did not form any obvious patterns of difficulty or speaking mode (i.e., presentational versus interpersonal communication). These contrasts were also strongly correlated (.85), suggesting that the items contributed to a unidimensional measurement model. If researchers want to keep the traditional theoretical model of speaking represented by ACTFL and by other language assessments, such as the speaking portion of the TOEFL (Sawaki et al., 2005), the clear pattern of the loadings in the one-factor model is preferable. The current analysis showed that the ability to speak about current events in other communities or countries may be measured differently than the other items we included as measures of spoken proficiency, since removal of these topics resulted in a unidimensional model. This finding lends support for NCSSFL-ACTFL’s (2017) decision to create a new category for intercultural communicative competence in the revised publication of the Can-Do Statements.

In the two-factor model, all of the Novice Low to Intermediate Low level items clearly loaded onto the first factor, while all of the Advanced Mid to Superior level items clearly loaded onto the second factor. Items at the proficiency levels in between (i.e., Intermediate Mid to Advanced Low) were less clear: some loaded onto both factors, while others loaded clearly onto factor one or factor two.
A similar pattern of factor loadings was also observed in the Rasch model that included all 50 Can-Do Statements: there was a cluster of five items (44, 46, 48, 49, 50) from the Advanced Mid, Advanced High, and Superior levels at the top of the contrast plot that were separate from the rest of the items, and that did not correlate strongly with items in the third contrast. This suggests that these upper-level speaking skills may be measured differently than the easier items in the analysis. These findings were unexpected, as they challenge the commonly accepted, ACTFL-based model of speaking proficiency and the two- and three-factor models detailed in the NCSSFL-ACTFL (2015, 2017) Can-Do Statements. Two factors, presentational speaking and interpersonal communication, are posited in the original Can-Do Statement publication. However, the items that had been designated to these two modes of speaking did not form factors in this way. This may be because presentation and interaction are not easily disentangled. For example, saying the date (Item 1) involves presenting information, but presumably to another interlocutor (i.e., in interaction).

The two-factor model in the current analysis suggests that speaking at the lower levels of proficiency may be qualitatively (and measurably) different (i.e., a different construct) than speaking at the upper levels of proficiency. This interpretation does not negate the single-factor ACTFL model of speaking, but it does suggest ways in which it could be enriched. At the Intermediate level, language learners can “create with the language when talking about familiar topics related to their daily life,” while Advanced-level speakers can speak about “topics of community, national, or international interest” and Superior-level speakers can handle all of these topics “in formal and informal settings from both concrete and abstract perspectives” (ACTFL, 2012, pp. 5-7).
According to these definitions, in order to achieve the upper levels of proficiency, it is also necessary to have knowledge about national and global topics (Advanced) and to be able to discuss these topics from abstract perspectives (Superior). This may require a level of education, experience, or intelligence that is acquired separately from the ability to speak a second language. Speaking, as a construct, may thus change with growth in proficiency: at lower levels of proficiency, anyone may be able to function regardless of level of education, whereas beyond a certain threshold, certain explicit knowledge or education may be necessary to add to speaking skills.

In the two-factor model, factor one may represent a core language proficiency (Hulstijn, 2007), or basic interpersonal communication skills (Cummins, 1979), as it includes the most basic conversational speaking tasks (e.g., I can describe what my summer plans are; I can report on a social event that I attended; I can bring a conversation to a close). Factor two may represent spoken proficiency beyond a general core, as it includes the most difficult speaking tasks, which are more academic or may require higher order cognition (e.g., I can recount the details of a historical event; I can present reasons for or against a position on a political or social issue; I can exchange factual information about social and environmental questions). Speakers at Advanced and Superior proficiency levels may use their core language skill set differently than Novice and Intermediate speakers, who are still learning to convey and interpret basic meaning. The threshold between the two, not surprisingly, is not entirely clear, but may lie somewhere between the Intermediate Mid and Advanced Low proficiency levels (i.e., the proficiency levels of items that loaded onto both factors in this analysis).
A similar threshold has been documented both in language learning research and in task analyses of language use in the workplace. For example, DeKeyser (2010) found that students in a Spanish study abroad program did not benefit very much from their time abroad if they had not acquired an adequate baseline proficiency level. This was the case for the majority of the students he observed, as they had only taken two years of college-level Spanish courses. As a result, they were effectively unable to interact with native speakers and gained very little in terms of linguistic accuracy and self-perceived language improvement. Those with better preparation (i.e., more automatized language skills) gained the most. For the workplace, ACTFL (2012) reports that, based on task analyses, at least Intermediate Mid proficiency is required to function in the most basic professions (e.g., Cashier, Tour Guide). Professions that require more specialized training (e.g., Teacher, Nurse, Translator, Lawyer) require at least Advanced Low proficiency. This implies that in practice, the ACTFL (2012) proficiency levels are used as a two-dimensional scale: one scale to reach basic, working proficiency and one scale to describe proficiency above this threshold.

Based on the findings of the EFA, two possible conclusions can be drawn: either it is possible to create a unidimensional measurement of spoken proficiency with the revised instrument, or the items separate into two different dimensions of speaking. In the case of the two-factor solution, a two-parameter item response model would be more appropriate for measurement. If we accept the one-factor solution, a single-parameter Rasch model is suitable for evaluating how well the people and items in the analysis fit the model of spoken language proficiency. This model serves as a proxy for the unitary and hierarchical model of proficiency represented by the ACTFL scale descriptors.

Fit to the Rasch model.
Analysis of the single-parameter Rasch model of the test takers’ responses to the revised self-assessment questionnaire revealed that some of the items and some of the people in the analysis did not behave as expected. Thirteen people had very high misfit values (outfit MNSQ = 9.90) and seven items had misfit values greater than 2.0, indicating that these items would distort the measurement. Following McNamara (1991), several interpretations can be made based on these misfit values. The person misfit suggests that either test performance did not reflect the misfitting participants’ true ability, or that they may not belong to the intended test-taking population. Inspection of these test takers’ item response patterns and proficiency ratings showed evidence for both of these interpretations. Several of the test takers in the analysis who had extreme misfit values were also test takers who selected an OPIc form that assessed proficiency levels above their level. This indicates that they over-assessed their ability using the self-assessment. The other misfitting test takers were people who had high ability levels and gave unexpected ratings to Can-Do Statements in the first set of items (i.e., they assessed that they could not do one of these items well). They may not have performed these lower-level Can-Do tasks recently, or they may have been overly modest about their abilities. One possibility for further exploring this misfit would be to use the CUTLO option in Winsteps, which would remove highly unexpected responses on items that are far from the ability level of the test takers. The person fit analysis also revealed one test taker who was not a language learner.

For the misfitting Can-Do Statements, since the items appeared to be well constructed, I assessed whether any items might be measuring a dimension other than spoken proficiency. Inspection of the content of the misfitting items revealed that Item 35 may be addressing another dimension, discussed above.
These item and person characteristics justified removing them from the Rasch model, which resulted in a final model that had evidence of unidimensionality and only five misfitting items.

After analyzing fifty items in the first phase of this study and fourteen additional items in the second phase, forty-four items were identified that fit the unitary model measuring spoken proficiency. These items are psychometrically productive for the self-assessment of L2 speaking for the college-level test-taking population. Thirty items fit the model in both phases of the study. The 14 items added to the revised questionnaire all fit the model in phase two. These items were selected preferentially to represent speaking tasks that were specific and that described language use for college language learners. In addition, they included a single speaking task per statement. Three of the items were revised versions of items included in the first round of testing that originally misfit the measurement model. The revised items were simplified versions of the original items, reducing them to a single statement. The revised versions resulted in items that fit the model. Further, when Items 1 and 2 were split, they showed large differences in difficulty estimates. These findings provide support for the suggestion that items are more productive for measurement when they address a single skill (Haladyna et al., 2002).

Of the thirty-six items that fit the original model in phase one, five did not fit the final model in phase two. One explanation for why these items fit the first time and then misfit may be that the samples of Spanish language learners were not identical. There may have been slight differences in the samples in terms of proficiency level, number of heritage speakers, and the way the test takers interacted with the instrument.
For example, fewer test takers over-assessed their ability in the first phase (4%) than in the second phase (7%), and one participant who was not a language learner was identified in the second round of testing. Another difference was that there were more than twice as many test takers in the second round of data collection. The revised self-assessment also had a slightly higher person reliability (.96) than the original instrument (.94). One might speculate, based on these findings, that if we were to control for a more homogeneous test-taking population (e.g., eliminate heritage learners from the sample), provide better self-assessment training (Sweet et al., in press), and eliminate test-taking fatigue, better model fit might be observed. One should also expect that using Can-Do Statements to self-assess spoken proficiency will never produce a perfect estimate of person ability and item difficulty. As Green (2014) highlighted,

No assessment task is entirely satisfactory. Each format has its own weaknesses. Rather than searching for one ideal task type, the assessment designer is better advised to include a reasonable variety in any test or classroom assessment system so that the failings of one format do not extend to the overall system. (p. 140)

Therefore, although the final model of L2 Spanish learners’ responses to the self-assessment items included in this study is not perfect, it provides a reasonable estimate of language proficiency that can be included in a broader assessment system.

Item difficulty. The third research question of the second phase of the study addressed the extent to which the item difficulties of the Can-Do Statements in the revised self-assessment instrument ascended according to ACTFL’s hierarchy of proficiency levels.
Since the ACTFL (2012) Guidelines are used to measure proficiency both at the major threshold levels (i.e., Novice, Intermediate, Advanced, Superior) and at the subdivisions of the first three major levels (i.e., Novice Low, Novice Mid, Novice High; Intermediate Low, Intermediate Mid, Intermediate High; Advanced Low, Advanced Mid, Advanced High), I evaluated whether the Can-Do Statements distinguished proficiency at both the threshold level and the sublevel. Comparison of the mean Rasch difficulty estimates for items at each of the major threshold levels of the ACTFL scale revealed that the mean logit scores ascended in the anticipated order and that the mean differences were statistically significant. This finding is in line with Brown et al. (2014) and replicated the results of the first phase of my dissertation. Brown et al. (2014) found that the ACTFL Can-Do Statements that they modeled ascended in the same order of difficulty as the threshold levels in the scale, although the differences were not statistically significant. I found that the statements included in the original self-assessment instrument also ascended in the order of the threshold levels, and that the differences in mean item difficulty between major proficiency levels were significant. Taken together, these findings suggest that the ACTFL (2015) Can-Do Statements can be useful at least for estimating second language proficiency in broad brush strokes (i.e., at the major thresholds of proficiency) when used for self-assessment.

I also considered the mean difficulty estimates of the revised Can-Do Statements for each of the proficiency sublevels. In the second round of testing, I found that each of the sublevels ascended in the predicted order. In phase one, the assessment items also ascended according to the expected sublevel difficulty, except at the Advanced Low and Advanced Mid levels.
This result was mirrored in the revised self-assessment in phase two: the mean difficulty estimates at the Advanced Low and Advanced Mid levels were nearly identical, and the most difficult items in the analysis were from the Advanced Low level. Analysis of the content of these items revealed a mismatch between their required language use and the ACTFL proficiency level descriptors, as these Advanced items would require speakers to perform Superior-level tasks.

In phase two I found that the difference between items written for the Novice Mid and Intermediate Low proficiency levels was significant, and that the difference in difficulty between Intermediate Low and Intermediate Mid items was also significant. These findings differ slightly from the findings of phase one of the study. In the original assessment, none of the differences in mean difficulty estimates were statistically significant at the sublevel, but the revised self-assessment did discriminate at some of the Novice and Intermediate sublevels. This suggests that the revised instrument may allow for more accurate self-assessments at lower proficiency sublevels. At the Intermediate High level and above, however, neither the original nor the revised self-assessment instrument distinguished well between proficiency sublevels. This is similar to Jones (2002), who had difficulty matching the CEFR (Council of Europe, 2001) can-do statements to the upper levels of proficiency on the CEFR scale. He noted, "[o]ne problem is that in the current analysis the highest level (C2) statements are not well distinguished from the level below (C1)" (p. 177). Weir (2005) suggested that in order to achieve better discrimination between language use at upper levels of proficiency, inclusion of the communicative context and the quality of the performance may be necessary.
Considering performance on Item 36 (I can provide a rationale for the importance of certain classes, subjects, or training programs, Advanced Low) provides a good illustration. A test taker might be able to provide a very simple rationale for the importance of a class to a peer: "It is important for me to take Spanish 202 so that I can complete my language requirement." A speaker at a higher proficiency level, on the other hand, might be able to give a more formal presentation of a rationale for a more abstract subject (e.g., the importance of freedom of speech) and elaborate in greater detail. These two performances would both match the content of the Can-Do Statement, but they represent different levels of spoken proficiency. In order for upper-level Can-Do Statements to be useful, then, it might be necessary to include the context (e.g., specification of the interlocutor) and the quality (e.g., length of text or amount of detail provided) of the language task.

Another consideration is that, of the total sample of 886 test takers, 157 participants responded to the testlets that included Advanced-level statements. Of these, 39 participants received official ACTFL OPIc ratings at the Advanced level. A sample including more participants at higher proficiency levels might improve the accuracy of the Advanced-level statements. Language researchers may lack accurate descriptions of Advanced-level proficiency because few university learners reach this level (Byrnes & Ortega, 2008; Soneson & Tarone, in press). The flip side of this coin, however, is that descriptions of functional language use at the Novice and Intermediate levels of proficiency are becoming more refined. As mentioned previously, given ACTFL's characterization of emerging ability at the Low sublevels and sustained ability at the High sublevels, the finding that the revised self-assessment instrument did not discriminate well between all of the proficiency sublevels is not unexpected.
These interpretations of Low, Mid, and High suggest that researchers should not expect to find unique differences in the difficulty of items at the sublevels. The finding that there were significant differences in item difficulty between some of the Novice and Intermediate sublevels is therefore in fact unexpected. This suggests that language proficiency on the ACTFL subscales may develop in more incremental steps at the initial levels, followed by the phenomenon of emerging and sustaining performance at the upper levels. In their current form, Can-Do Statements used for self-assessment may be more accurate at the Novice and Intermediate proficiency levels than at the upper levels of proficiency.

DISCUSSION AND CONCLUSION

The aim of this study was to assess the validity of a selection of NCSSFL-ACTFL (2015) Can-Do Statements for use with postsecondary Spanish language learners. These statements were assembled in a self-assessment instrument whose intended purpose was to guide test takers toward an appropriate OPIc form. Analysis of the Can-Do Statements included in the original self-assessment highlighted several items that could be improved in terms of content validity (i.e., required language use and contexts) and item construction. In the revised version of the instrument, these items were removed and replaced with items that were well constructed and that described language use relevant to college test takers' experience. Using the original instrument, 4% of test takers in the 2015 test administration and 9% in the 2016 administration over- or under-assessed their language ability. Using the revised instrument, 7% of the test takers under- or over-assessed their ability in the 2017 round of testing. These findings suggest that it may not be possible to refine the Can-Do Statements to improve self-assessment accuracy beyond a certain point.
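The over- and under-assessment rates reported above can be computed by comparing each test taker's self-assessed level with his or her official OPIc rating. The sketch below is a minimal illustration with hypothetical data; the integer level coding and the arrays are illustrative assumptions, not values from the study.

```python
import numpy as np

# Hypothetical data: ACTFL levels coded as ordered integers
# (e.g., 1 = Novice Low ... 9 = Advanced High).
self_assessed = np.array([3, 4, 2, 5, 4, 3, 6, 2])  # level chosen via Can-Do self-assessment
official = np.array([3, 3, 2, 5, 5, 3, 6, 2])       # official OPIc rating

over_rate = np.mean(self_assessed > official)      # proportion who over-assessed
under_rate = np.mean(self_assessed < official)     # proportion who under-assessed
accurate_rate = np.mean(self_assessed == official) # proportion whose levels matched

print(f"over: {over_rate:.1%}, under: {under_rate:.1%}, accurate: {accurate_rate:.1%}")
```

In practice the comparison would use the routing decision produced by the self-assessment and the rating assigned by certified OPIc raters, aggregated per test administration.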
When interpreting the results of this dissertation, it should be pointed out that in this study the NCSSFL-ACTFL (2015) Can-Do Statements were analyzed as a means of approximating college learners' L2 proficiency. However, this is not one of the intended uses of the statements. Rather, two purposes are articulated for the Can-Do Statements: "for programs, the statements provide learning targets for curriculum and unit design, serving as performance indicators; for language learners, the statements provide a way to chart their progress through incremental steps" (ACTFL, 2015, p. 1). Despite this slight difference in usage, the item response analysis of the self-assessment statements as a measure of foreign language proficiency in this study provides a picture of a) whether the skills articulated in the statements are important indicators or learning targets for foreign language proficiency, and b) whether the incremental steps documented in the Can-Do Statements match actual gains in self-assessed language proficiency.

The current study has implications for the self-assessment of Spanish learners' language proficiency at the college level. The self-assessment instrument under consideration was normed on postsecondary Spanish language learners over two rounds of testing. The results of the study showed that the instrument did not discriminate well between statements pegged at higher proficiency levels. The analysis also showed that the ability to speak at Advanced levels of proficiency may develop differently than language proficiency at lower levels, and may therefore be considered a construct separate from lower-level speaking proficiency. However, this may not be a serious problem for testing college-level test takers, since this population rarely reaches Advanced levels of proficiency (Byrnes & Ortega, 2008).
The median proficiency level of Spanish test takers in both 2015 and 2017 was Intermediate Low; only 3.4% (2015) and 3.3% (2017) of the learners in the study received Advanced-level proficiency ratings. Since the items in the current study were normed on Spanish language learners, the results may not be generalizable across languages. The DIF analysis of the test takers' responses in this study suggested that some of the items may not behave the same way for French and Spanish students.

Although it seems premature to make pedagogical recommendations, this study has implications for foreign language assessment. Can-Do Statements are ripe for use as a self-assessment tool, but teachers should be selective when choosing specific statements for classroom use. The data in this study suggest that some of the Can-Do Statements may not be relevant for use with all populations and may not be interpreted in the same way by all language learners. Teachers may want to work with students by conducting a needs analysis to identify what types of language use and performance the students anticipate needing, and then match those needs with Can-Do Statements that describe them. This study also suggests that the statements can be used with college language learners to estimate their language proficiency at the major ACTFL threshold levels and at the lower sublevels of L2 proficiency. The descriptors of proficiency at the Advanced level and above may be less accurate.

This study also has implications for the development and categorization of descriptors of language proficiency. First, this study provides further evidence that the content and construction of Can-Do Statements affect the way learners can evaluate their proficiency. Including language tasks that have differing degrees of difficulty in the same self-assessment item can interfere with language learners' ability to accurately evaluate how well they can accomplish the tasks.
Test users also stand a better chance of accurate self-assessment when the content of the items is in line with their experience using the language. The finding that some of the content of the Advanced-level statements required Superior-level language use implies that future development of performance indicators should be more carefully aligned with current descriptions of language proficiency. Descriptors of higher levels of L2 proficiency may also require specifications of the quality of language production required for upper-level performance that are not currently included in the Can-Do Statements.

Another consideration for the development and categorization of individual descriptors of language proficiency regards the factors of spoken proficiency that test designers and researchers seek to measure. ACTFL has created two measures of spoken language proficiency: the OPIc and the Can-Do Statements for speaking. The OPIc is described as an assessment of interpersonal communication (ACTFL, 2014), while the Can-Do Statements are designed to describe interpersonal communication, presentational speaking (ACTFL, 2015), and intercultural competence (ACTFL, 2017). The factor structure of spoken proficiency identified in the Can-Do Statements in the current study did not show a clear distinction between these modes of speaking. Instead, two possible factor structures emerged: a unidimensional model and a two-factor model of general/academic speaking. This finding makes the relationship between official ACTFL ratings of spoken proficiency (i.e., OPIc ratings) and the speaking tasks described in the Can-Do Statements unclear. One possibility, based on the unitary model, is that the OPIc and the Can-Do Statements both measure a single dimension of spoken proficiency.
Another possibility is that the OPIc is a measure of core spoken proficiency, and that some of the Can-Do tasks belong to this same core, while other statements rely on content knowledge (e.g., world knowledge, intercultural competence, travel experience, course or curriculum content) that goes beyond core language proficiency. These possibilities present a challenge for language assessment researchers to continue to refine construct definitions of language proficiency and to create stronger ties to theoretical models of L2 proficiency (e.g., Canale & Swain, 1980; Cummins, 1979; Bachman & Palmer, 1996) and development (e.g., Pienemann, 1998). The results of the factor analysis also have implications for the measurement of spoken language proficiency. ACTFL’s descriptors of functional language use may form a single factor, but it may also be possible to separate out multiple factors. If the performance indicators defined in the Can-Do Statements do form multiple dimensions of spoken proficiency, it may be possible to measure different profiles of language users. For example, a speaker who has highly automatized language ability but little world knowledge or education may be considered highly proficient in terms of core language ability, but lacking in the higher order cognition required to accomplish the language tasks that have been assigned to the higher levels of proficiency in the ACTFL model. Speakers who are highly educated in their first language may be able to tackle abstract topics of global interest, which represent Advanced- and Superior-level language use, while lacking in basic interpersonal communication skills. Thus, it may be more appropriate to measure these constructs on different scales. Other factors such as age and intercultural competence may further challenge the possibility of measuring spoken proficiency as a unitary and hierarchical construct. 
These challenges for spoken proficiency measurement models provide many avenues for future research. Because the current study was limited to language learners' self-assessments of their ability to perform spoken language tasks, and these performances can be unstable (Shin, 2013), future research should include outside ratings of learners' ability to perform each of these tasks. This would allow for further exploration of the factor structure of the tasks described in the Can-Do Statements and performance descriptors. Only a selection of the Can-Do Statements for spoken proficiency has been analyzed in this study and by Brown et al. (2014), so inclusion of more statements and the revised ACTFL (2017) performance indicators may provide a more complete picture. Although the ACTFL (2012) Guidelines "are intended to be used for global assessment in academic and workplace settings," this study showed that some items may not be well targeted for college-aged test takers, let alone test takers in secondary settings. Younger learners (such as students with K-12 experience in dual immersion settings), working adults, and college students may have different enough foreign language experiences and needs that they should be considered distinct populations, requiring test items that are normed accordingly. The differences between the item difficulties estimated from postsecondary Spanish students' responses and the predicted ACTFL difficulty levels merit further study so that empirical evidence can be provided for the difficulty of the descriptors for all age groups (i.e., young children, children, teens, college students, and working adults). The current research is limited in that the majority of the test takers in this study had Novice- and Intermediate-level speaking proficiency.
As Byrnes and Ortega (2008) highlighted, advanced language learners are under-researched, and future research on self-assessment of speaking abilities should include more focus on this population so that the descriptors of functional language for Advanced proficiency can be refined. A stratified sample with equal numbers of language learners from all proficiency levels would allow for more accurate descriptions of proficiency standards at all levels (Crocker & Algina, 1986). The current study was limited to Spanish learners' use of the statements, as there was some indication of differential item functioning when the Spanish learners' responses were compared to those of learners of French. Therefore, another avenue for future research is to explore further whether the statements can be used in the same way for learners of all languages.

Conclusions

In the first phase of my dissertation, I identified 14 misfitting items in a self-assessment of ACTFL (2015) Can-Do Statements. The suspected reason for the misfit lay in the construction of the items. Specifically, these items assessed multiple skills in a single Can-Do Statement, were not specific, or included experiences that were not relevant to college test takers' typical experiences. In the second phase of the study, I revised the self-assessment to include Can-Do Statements that were well constructed and that targeted language use for college language learners. The items that were selected preferentially to fit these criteria were found to be useful for measuring L2 Spanish spoken language proficiency. Not all of the statements fit the model, and therefore the first conclusion is that the revised self-assessment instrument is not perfect. It can be concluded, however, that the revised instrument is an improvement over the original, based on the results of the difficulty analysis.
The revised assessment discriminated well between the ACTFL threshold levels and the lower sublevels (up to Intermediate Low) of proficiency. While the original self-assessment did not show any significant differences between the difficulty of items at the ACTFL proficiency sublevels, the revised instrument showed some significant differences. In particular, the Novice Mid items were significantly easier than the Intermediate Low items, and the Intermediate Low items were significantly easier than the Intermediate Mid items. These proficiency levels match the official OPIc ratings of the majority of the test takers in the 2017 test administration. Therefore, the revised instrument appears to be useful for estimating major threshold proficiency levels and for discriminating language proficiency at the sublevel for college language learners.

The analysis of the revised self-assessment also resulted in two possible conclusions on the dimensionality of spoken proficiency. Either it is possible to create a unidimensional measurement of spoken proficiency using the ACTFL Can-Do Statements in the revised instrument, and the majority of these items are useful for measuring spoken proficiency, or the items separate into two different dimensions of speaking. In the latter case, a 2PL model would be more appropriate for the measurement of spoken proficiency using ACTFL's descriptors of spoken L2 proficiency.

ENDNOTES

1. OPIc raters use floor and ceiling scoring, meaning they need to hear speech that is sustained at one of the major levels and find evidence of linguistic breakdown at the next major level. Thus, if a Novice High examinee took a test that did not have a Novice floor, this examinee would be in constant breakdown and the rater would be unable to determine the examinee's major level.

2.
This research was funded by the National Security Education Program's Language Proficiency Flagship Initiative (Grant # 2340-MSU-7-PI-093-PO1) awarded to principal investigators Drs. Paula Winke and Susan Gass.

3. Differential item functioning (DIF) tests whether a test measures a latent trait in the same way for all subgroups. In a research report by Educational Testing Service (ETS; Zwick, 2012), DIF items are split into three categories. "No DIF" items have a Mantel-Haenszel (MH) chi-square statistic that is not significant at the .05 level and a DIF contrast of less than 0.43 logits; items with no DIF are considered to measure the construct in the same way for all groups. "Affirmative DIF" items have DIF contrasts greater than 0.64 logits and a significant (i.e., p < .05) MH chi-square statistic, and may measure the construct differentially. "Neutral DIF" items are those that meet the criteria of neither "no DIF" nor "affirmative DIF." Another common DIF cutoff for Rasch analysis is -0.5 < x < 0.5, with contrasts inside this range indicating neutral to no DIF. If an item shows (affirmative) DIF, researchers must make a determination about the source of the difference by examining the content of the item. It is possible that the DIF is not real (e.g., a Type I error has occurred), that the DIF might not be interpretable, or that the DIF is expected, especially if the subgroups are expected to perform differently due to known influencing differences (Zumbo, 1999, 2007). According to Davey and Wendler (2001), the minimum sample size for a DIF analysis is 200 for the smaller group and 500 in total for the construction of a test. In the Spring 2015 data, there were 220 French test takers and 382 Spanish test takers, making these two groups ideal for testing the hypothesis that the ACTFL (2012, 2015) performance indicators for speaking measure language proficiency in the same way irrespective of language.
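The ETS classification described in this note can be expressed as a small decision rule. The sketch below encodes the thresholds quoted above (p < .05 for the MH chi-square, and DIF contrasts of 0.43 and 0.64 logits); the function name and inputs are illustrative, not part of the ETS report.

```python
def classify_dif(mh_p_value: float, dif_contrast: float) -> str:
    """Classify an item into the ETS DIF categories (Zwick, 2012).

    mh_p_value:   p-value of the Mantel-Haenszel chi-square statistic
    dif_contrast: difference in item difficulty between groups, in logits
    """
    size = abs(dif_contrast)
    if mh_p_value >= 0.05 and size < 0.43:
        return "no DIF"           # measures the construct the same way for all groups
    if mh_p_value < 0.05 and size > 0.64:
        return "affirmative DIF"  # may measure the construct differentially
    return "neutral DIF"          # meets neither set of criteria

# Illustrative call: a significant MH statistic with a 0.69-logit contrast
print(classify_dif(0.005, 0.69))  # affirmative DIF
```

Any item flagged as affirmative DIF would still require the content review described below before concluding that the DIF is real and interpretable.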
In other words, are the Can-Do Statements equally useful for learners of Spanish and French, as evaluated using a DIF detection method? To answer this question, I performed a DIF analysis using the Mantel-Haenszel procedure based on the comparison of Spanish learners (reference group) and French learners (focal group). I considered items that had large DIF contrasts (greater than 0.5 logits in absolute value) that were statistically significant. Of the fifty items included in the self-assessment, three items, shown in Table 21, exhibited DIF across L2 language groups. Item 2 was easier for the Spanish learners than the French learners, while Items 3 and 4 were easier for the French learners. To determine whether this DIF is interpretable, I considered the linguistic and psycholinguistic features of each item. Listing months and seasons (Item 2) and free time activities (Item 4) requires learning vocabulary items (which are relatively similar across the two languages), and is therefore likely equally difficult in both languages. Talking about likes, on the other hand, requires the gustar structure in Spanish, which is considered to be linguistically, developmentally, and psycholinguistically complex for L1 English speakers (Cerezo, Caras, & Leow, 2016). To talk about likes in French requires a regular present tense verb construction, which can be considered a linguistically simple form, as it only requires a single transformation (Spada & Tomita, 2010). Thus, it might be possible to interpret the DIF exhibited in Item 3, which was significantly easier for French learners than Spanish learners.

Table 21: Items with large and significant DIF.

Item | Spanish DIF measure | French DIF measure | DIF contrast | t test | sig.
2. I can list the months and seasons. (NL) | -4.23 | -3.54 | 0.69 | t(465) = 3.60 | p = .005*
3. I can say which sports I like and don't like. (NM) | -4.32 | -4.91 | -0.59 | t(527) = -2.84 | p = .017*
4. I can list my favorite free time activities and those I do not like. (NM) | -4.05 | -4.59 | -0.54 | t(516) = -2.69 | p = .046*

In this analysis, I considered whether any of the 50 Can-Do Statements included in the self-assessment showed differential item functioning when comparing Spanish and French language learners. Three items showed statistically significant DIF. Of these, one item may include language use that is significantly more difficult for Spanish language learners than French language learners: talking about likes and dislikes. This finding is interesting because the great majority of the items appear to measure language proficiency in the same way for learners of Spanish and French. This is in line with the way the ACTFL (2012) Proficiency Guidelines were intended to be used: to describe and evaluate functional language use in any second language. However, the finding that Item 3 was more difficult for Spanish learners than French learners, and that this difficulty may be attributable to linguistic and acquisitional features of the language use required, may call ACTFL's global description of language proficiency into question. Talking about likes and dislikes has been assigned to the Novice Mid proficiency level. This may be appropriate for French language learners, since it requires a linguistically simple construction. A learner of Spanish, on the other hand, is not likely to acquire this construction until later in their language learning (Cerezo, Caras, & Leow, 2016). Thus, this descriptor of language proficiency may not be suitable for describing functional language ability in all languages.

4. Sufficient evidence that the rating scale worked as it was designed was observed. First, the average measure increased from category one to four: 1 (M = -1.39), 2 (M = -0.03), 3 (M = 2.63), 4 (M = 5.69). In addition, the distribution of the item responses was unimodal and peaked in category four (Linacre, 2002).
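The category check described in note 4, that the average person measure should increase across the rating categories, can be sketched as follows. The response matrix and person measures below are hypothetical, and the check is a simplified version of Linacre's (2002) guideline.

```python
import numpy as np

# Hypothetical Rasch person measures (logits) and their 1-4 item responses
# (rows = persons, columns = items).
person_measures = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
responses = np.array([
    [1, 1, 1],
    [1, 2, 2],
    [2, 2, 3],
    [3, 3, 4],
    [4, 4, 4],
])

# Average person measure over the observations in each rating category.
avg_measures = []
for category in (1, 2, 3, 4):
    rows, _ = np.where(responses == category)
    avg_measures.append(person_measures[rows].mean())

# The categories function as intended if the averages strictly increase.
monotonic = all(a < b for a, b in zip(avg_measures, avg_measures[1:]))
print(avg_measures, monotonic)
```

A full diagnosis would also examine the category frequency distribution and step (threshold) ordering, which this sketch omits.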
93 APPENDICES 94 APPENDIX A: ACTFL OPIc 1-5 levels and Can-Do Statements Table 22: ACTFL OPIc level 1 Can-Do Statements I can name basic objects, colors, days of the week, foods, clothing items, numbers, etc. I cannot always make a complete sentence or ask simple questions. Can-do statements ACTFL Levels Mode ❑ I can say the date and the day of the week. ❑ I can list the months and seasons. ❑ I can say which sports I like and don’t like. ❑ I can list my favorite free-time activities and those I don’t like. ❑ I can state my favorite foods and drinks and those I don’t like. ❑ I can talk about my school or where I work. ❑ I can talk about my room or office and what I have in it. ❑ I can list my classes and tell what time they start and end. ❑ I can answer questions about where I’m going or where I went. ❑ I can present information about something I learned in a class NL NL NM NM NM NM NM NM NM NH PS PS PS PS PS PS PS PS IC PS or at work. Table 23: ACTFL OPIc level 2 Can-Do Statements I can give some basic information about myself, work, familiar people and places, and daily routines speaking in simple sentences. I can ask some simple questions. Can-do statements ❑ I can describe a school or workplace. ❑ I can describe a place I have visited or want to visit. ❑ I can ask for help at school, work, or in the community. ❑ I can talk about my daily routine. ❑ I can talk about my interests and hobbies. ❑ I can schedule an appointment. ❑ I can talk about my family history. ❑ I can plan an outing with a group of friends. ❑ I can explain why I was late to class or absent from work and arrange to make up the lost time. ❑ I can tell a friend how I’m going to replace an item that I borrowed and broke/lost. ACTFL Levels Mode IL IL IL IM IM IM IH IH AL AL PS PS IC IC IC IC IC IC IC IC 95 Table 24: ACTFL OPIc level 3 Can-Do Statements I can participate in simple conversations about familiar topics and routines. 
I can talk about things that have happened, but sometimes my forms are incorrect. I can handle a range of everyday transactions to get what I need.

Can-do statements (ACTFL Level; Mode)
❑ I can give some information about activities I did. (IM; IC)
❑ I can talk about my favorite music, movies, and sports. (IM; IC)
❑ I can describe a childhood or past experience. (IM; PS)
❑ I can ask for and follow directions to get from one place to another. (IH; IC)
❑ I can return an item I have purchased to a store. (IH; IC)
❑ I can arrange for a make-up exam or reschedule an appointment. (IH; IC)
❑ I can present an overview about my school, community, or workplace. (AL; PS)
❑ I can compare different jobs and study programs in a conversation with a peer. (AL; IC)
❑ I can discuss future plans, such as where I want to live and what I will be doing in the next few years. (AM; IC)
❑ I can explain an injury or illness and manage to get help. (AM; IC)

Table 25: ACTFL OPIc level 4 Can-Do Statements

I can participate fully and confidently in all conversations about topics and activities related to home, work/school, personal and community interests. I can speak in connected discourse about things that have happened, are happening, and will happen. I can explain and elaborate when asked. I can handle routine situations, even when there may be an unexpected complication.

Can-do statements (ACTFL Level; Mode)
❑ I can present ideas about something I have learned, such as a historical event, a famous person, or a current environmental issue. (IH; PS)
❑ I can give a presentation about my interests, hobbies, lifestyle, or preferred activities. (IH; PS)
❑ I can ask for and provide descriptions of places I know and also places I would like to visit. (IH; IC)
❑ I can explain how life has changed since I was a child and respond to questions on the topic. (AL; IC)
❑ I can discuss what is currently going on in another community or country. (AL; IC)

Table 25 (cont'd)
❑ I can provide a rationale for the importance of certain classes, subjects, or training programs. (AL; PS)
❑ I can talk about present challenges in my school or work life, such as paying for classes or dealing with difficult colleagues. (AM; IC)
❑ I can exchange general information about leisure and travel, such as the world's most visited sites or most beautiful places to visit. (AM; IC)
❑ I can give a presentation about cultural influences on society. (AH; PS)
❑ I can participate in conversations on social or cultural questions relevant to speakers of this language. (AH; IC)

Table 26: ACTFL OPIc level 5 Can-Do Statements

I can engage in all informal and formal discussions on issues related to personal, general or professional interests. I can deal with these issues abstractly, support my opinion, and construct hypotheses to explore alternatives. I am able to elaborate at length and in detail on most topics with a high level of accuracy and a wide range of precise vocabulary.

Can-do statements (ACTFL Level; Mode)
❑ I can interview for a job or service opportunity related to my field of expertise. (AL; IC)
❑ I can present an explanation for a social or community project or policy. (AL; PS)
❑ I can present reasons for or against a position on a political or social issue. (AL; PS)
❑ I can give a clear and detailed story about childhood memories, such as what happened during vacations or memorable events, and answer questions about my story. (AM; IC)
❑ I can exchange general information about my community, such as demographic information and points of interest. (AM; IC)
❑ I can exchange factual information about social and environmental questions, such as retirement, recycling, or pollution. (AM; IC)
❑ I can usually defend my views in a debate. (AH; IC)
❑ I can exchange complex information about my academic studies, such as why I chose the field, course requirements, projects, internship opportunities, and new advances in my field. (AH; IC)
❑ I can provide a balance of explanations and examples on a complex topic. (S; PS)
❑ I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. (S; IC)

APPENDIX B: Phase I principal components analysis

Figure 10: Phase I standardized residual contrast plot.

Table 27: First contrast in the original Rasch model

Cluster 1
35. I can discuss what is currently going on in another community or country. (AL; IC)
36. I can provide a rationale for the importance of certain classes, subjects, or training programs. (AL; PS)
29. I can discuss future plans, such as where I want to live and what I will be doing in the next few years. (AM; IC)

Cluster 3
32. I can give a presentation about my interests, hobbies, lifestyle, or preferred activities. (IH; PS)
33. I can ask for and provide descriptions of places I know and also places I would like to visit. (IH; IC)
49. I can provide a balance of explanations and examples on a complex topic. (S; PS)

Table 28: First contrast in the final Rasch model

Cluster 3
20. I can tell a friend how I'm going to replace an item that I borrowed and broke/lost. (AL; IC)
35. I can discuss what is currently going on in another community or country. (AL; IC)
36. I can provide a rationale for the importance of certain classes, subjects, or training programs. (AL; PS)

Cluster 1
48. I can exchange complex information about my academic studies, such as why I chose the field, course requirements, projects, internship opportunities, and new advances in my field. (AH; IC)
49. I can provide a balance of explanations and examples on a complex topic. (S; PS)
50. I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements.
(S; IC)

APPENDIX C: ACTFL OPIc 1-5 levels and revised Can-Do Statements

Table 29: ACTFL OPIc level 1 Can-Do Statements

I can name basic objects, colors, days of the week, foods, clothing items, numbers, etc. I cannot always make a complete sentence or ask simple questions.

Can-do statements (ACTFL Level; Mode)
1. I can say the date. (NL; PS)
2. I can say the day of the week. (NL; PS)
3. I can say which sports I like and don't like. (NM; PS)
4. I can list my favorite free-time activities and those I don't like. (NM; PS)
5. I can say what someone looks like. (NM; PS)
6. I can talk about my school or where I work. (NM; PS)
7. I can talk about my room or office and what I have in it. (NM; PS)
8. I can talk about what I do on the weekends. (NM; PS)
9. I can answer questions about where I'm going or where I went. (NM; IC)
10. I can present information about something I learned in a class or at work. (NH; PS)

Table 30: ACTFL OPIc level 2 Can-Do Statements

I can give some basic information about myself, work, familiar people and places, and daily routines, speaking in simple sentences. I can ask some simple questions.

Can-do statements (ACTFL Level; Mode)
11. I can describe a school or workplace. (IL; PS)
12. I can describe a place I have visited. (IL; PS)
13. I can describe what my summer plans are. (IL; PS)
14. I can report on a social event that I attended. (IM; PS)
15. I can bring a conversation to a close. (IM; IC)
16. I can schedule an appointment. (IM; IC)
17. I can talk about my family history. (IH; IC)
18. I can explain a series of steps needed to complete a task. (IH; PS)
19. I can explain why I was late to class or absent from work and arrange to make up the lost time. (AL; IC)
20. I can tell a friend how I'm going to replace an item that I borrowed and broke/lost. (AL; IC)

Table 31: ACTFL OPIc level 3 Can-Do Statements

I can participate in simple conversations about familiar topics and routines. I can talk about things that have happened, but sometimes my forms are incorrect. I can handle a range of everyday transactions to get what I need.

Can-do statements (ACTFL Level; Mode)
21. I can give some information about activities I did. (IM; IC)
22. I can talk about my favorite music, movies, and sports. (IM; IC)
23. I can give a short presentation on a current event. (IM; PS)
24. I can ask for and follow directions to get from one place to another. (IH; IC)
25. I can return an item I have purchased to a store. (IH; IC)
26. I can arrange for a make-up exam or reschedule an appointment. (IH; IC)
27. I can present an overview about my school, community, or workplace. (AL; PS)
28. I can compare different jobs and study programs in a conversation with a peer. (AL; IC)
29. I can discuss future plans, such as where I want to live and what I will be doing in the next few years. (AM; IC)
30. I can describe in detail a social event. (AM; PS)

Table 32: ACTFL OPIc level 4 Can-Do Statements

I can participate fully and confidently in all conversations about topics and activities related to home, work/school, personal and community interests. I can speak in connected discourse about things that have happened, are happening, and will happen. I can explain and elaborate when asked. I can handle routine situations, even when there may be an unexpected complication.

Can-do statements (ACTFL Level; Mode)
31. I can present ideas about something I have learned, such as a historical event, a famous person, or a current environmental issue. (IH; PS)
32. I can make a presentation about an interesting person. (IH; PS)
33. I can explain to someone who was absent what took place in class. (IH; PS)
34. I can explain how life has changed since I was a child and respond to questions on the topic. (AL; IC)
35. I can discuss what is currently going on in another community or country. (AL; IC)
36. I can provide a rationale for the importance of certain classes, subjects, or training programs. (AL; PS)

Table 32 (cont'd)
37. I can talk about present challenges in my school or work life, such as paying for classes or dealing with difficult colleagues. (AM; IC)
38. I can recount the details of a historical event. (AM; PS)
39. I can give a presentation about cultural influences on society. (AH; PS)
40. I can participate in conversations on social or cultural questions relevant to speakers of this language. (AH; IC)

Table 33: ACTFL OPIc level 5 Can-Do Statements

I can engage in all informal and formal discussions on issues related to personal, general or professional interests. I can deal with these issues abstractly, support my opinion, and construct hypotheses to explore alternatives. I am able to elaborate at length and in detail on most topics with a high level of accuracy and a wide range of precise vocabulary.

Can-do statements (ACTFL Level; Mode)
41. I can interview for a job or service opportunity related to my field of expertise. (AL; IC)
42. I can present an explanation for a social or community project or policy. (AL; PS)
43. I can present reasons for or against a position on a political or social issue. (AL; PS)
44. I can give a clear and detailed story about childhood memories, such as what happened during vacations or memorable events, and answer questions about my story. (AM; IC)
45. I can exchange general information about my community, such as demographic information and points of interest. (AM; IC)
46. I can exchange factual information about social and environmental questions, such as retirement, recycling, or pollution. (AM; IC)
47. I can usually defend my views in a debate. (AH; IC)
48. I can exchange complex information about my academic studies, such as why I chose the field, course requirements, projects, internship opportunities, and new advances in my field. (AH; IC)
49. I can provide a balance of explanations and examples on a complex topic. (S; PS)
50. I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. (S; IC)

APPENDIX D: Phase II principal components analysis

Figure 11: Phase II standardized residual contrast plot for the original model.

Figure 12: Phase II standardized residual contrast plot for the final model.

Table 34: First contrast in the original Rasch model

Cluster 1
50. I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. (S; IC)
44. I can give a clear and detailed story about childhood memories, such as what happened during vacations or memorable events and answer questions about my story. (AM; IC)
49. I can provide a balance of explanations and examples on a complex topic. (S; PS)

Cluster 3
20. I can tell a friend how I'm going to replace an item that I borrowed and broke/lost. (AL; IC)
25. I can return an item I have purchased to a store. (IH; IC)
41. I can interview for a job or service opportunity related to my field of expertise. (AL; IC)

Table 35: First contrast in the final Rasch model

Cluster 1
48. I can exchange complex information about my academic studies, such as why I chose the field, course requirements, projects, internship opportunities, and new advances in my field. (AH; IC)
49. I can provide a balance of explanations and examples on a complex topic. (S; PS)
50. I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. (S; IC)

Cluster 3
25. I can return an item I have purchased to a store. (IH; IC)
33. I can explain to someone who was absent what took place in class. (IH; PS)
41. I can interview for a job or service opportunity related to my field of expertise. (AL; IC)

REFERENCES

ACTFL. (2012). ACTFL proficiency guidelines - speaking. Retrieved December 12, 2016 from http://www.actfl.org

ACTFL. (2014). ACTFL OPIc familiarization manual.
Retrieved May 14, 2018 from https://www.languagetesting.com/pub/media/wysiwyg/manuals/actfl-fam-manual-opic.pdf

ACTFL. (2015). NCSSFL-ACTFL can-do statements. Retrieved December 12, 2016 from http://www.actfl.org/global_statements

ACTFL. (2017). NCSSFL-ACTFL can-do statements. Retrieved April 27, 2018 from https://www.actfl.org/sites/default/files/CanDos/Can-Do%20Introduction.pdf

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.

Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford: Oxford University Press.

Bailey, A. L. (2007). Introduction: Teaching and assessing students learning English in school. In A. L. Bailey (Ed.), The language demands of school: Putting academic English to the test (pp. 1–26). New Haven, CT: Yale University Press.

Bond, T., & Fox, C. (2015). Applying the Rasch model: Fundamental measurement in the human sciences. New York/London: Routledge.

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061–1071.

Brantmeier, C. (2006). Advanced L2 learners and reading placement: Self-assessment, CBT, and subsequent performance. System, 34, 15–35.

Brown, J. D. (2009). Choosing the right type of rotation in PCA and EFA. JALT Testing & Evaluation SIG Newsletter, 13(3), 20–25. Available from http://hosted.jalt.org/test/PDF/Brown31.pdf

Brown, N. A., Dewey, D. P., & Cox, T. L. (2014). Assessing the validity of can-do statements in retrospective (then-now) self-assessment. Foreign Language Annals, 47, 261.

Butler, Y. G. (2016). Self-assessment of and for young learners' foreign language learning. In M. Nikolov (Ed.), Assessing young learners of English: Global and local perspectives (pp. 291–315). New York, NY: Springer International Publishing.

Butler, Y. G., & Lee, J. (2006). On-task versus off-task self-assessments among Korean elementary school students studying English. The Modern Language Journal, 90, 506–518.

Byrnes, H., & Ortega, L. (2008). The longitudinal study of advanced L2 capacities. New York, NY: Routledge.

Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1, 1–47.

Celce-Murcia, M., Dörnyei, Z., & Thurrell, S. (1995). Communicative competence: A pedagogically motivated model with content specifications. Issues in Applied Linguistics, 6, 5–35.

Cerezo, L., Caras, A., & Leow, R. (2016). The effectiveness of guided induction versus deductive instruction on the development of complex Spanish gustar structures: An analysis of learning outcomes and processes. Studies in Second Language Acquisition, 38, 265–291.

Chalhoub-Deville, M., & Deville, C. (1999). Computer adaptive testing in second language contexts. Annual Review of Applied Linguistics, 19, 273–299.

Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge, UK: Cambridge University Press.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart and Winston.

Cumming, A. H., & Berwick, R. (1996). Validation in language testing. Clevedon, England: Multilingual Matters.

Cummins, J. (1979). Cognitive/academic language proficiency, linguistic interdependence, the optimum age question and some other matters. Working Papers on Bilingualism, 19, 121–129.

Cummins, J. (1980). Psychological assessment of immigrant children: Logic or intuition? Journal of Multilingual and Multicultural Development, 1, 97–111.

Cummins, J. (1981). Age on arrival and immigrant second language learning in Canada: A reassessment. Applied Linguistics, 1, 132–149.

Cummins, J. (2008). BICS and CALP: Empirical and theoretical status of the distinction. In B. Street & N. H. Hornberger (Eds.), Encyclopedia of language and education, Vol. 2: Literacy (2nd ed., pp. 71–83). New York: Springer Science + Business Media.

Davey, T., & Wendler, C. (2001). DIF best practices in statistical analysis [ETS internal memorandum]. Princeton, NJ: ETS.

DeKeyser, R. (2010). Monitoring processes in Spanish as a second language during a study abroad program. Foreign Language Annals, 43, 80–92.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.

Faez, F., Majhanovich, S., Taylor, S., Smith, M., & Crowley, K. (2011). The power of "can do" statements: Teachers' perceptions of CEFR-informed instruction in French as a second language classrooms in Ontario. Canadian Journal of Applied Linguistics/Revue canadienne de linguistique appliquée, 14, 1–19.

Fang, X., Yang, H., & Zhu, Z. (2011). The background of and approach to Can-Do description of language ability: Taking CEFR as an example. Shijie Hanyu Jiaoxue / Chinese Teaching in the World, 25, 246–257.

Garcia, P., & Asención, Y. (2001). Interlanguage development of Spanish learners: Comprehension, production, and interaction. Canadian Modern Language Review, 57, 377–401.

Green, A. (2014). Exploring language assessment and testing. New York, NY: Routledge.

Haladyna, T., Downing, S., & Rodriguez, M. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15, 309–334.

Heilenman, L. K. (1990). Self-assessment of second language ability: The role of response effects. Language Testing, 7, 174–201.

Holmes, S. E. (1982). Unidimensionality and vertical equating with the Rasch model. Journal of Educational Measurement, 19, 139–147.

Hulstijn, J. (2007). The shaky ground beneath the CEFR: Quantitative and qualitative dimensions of language proficiency. The Modern Language Journal, 91, 663–667.

Jones, N. (2002). Relating the ALTE framework to the Common European Framework of Reference. In Council of Europe (Ed.), Case studies on the use of the Common European Framework of Reference (pp. 167–183). Cambridge: Cambridge University Press.

Linacre, J. M. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3, 85–106.

Linacre, J. M. (2016a). WINSTEPS (Version 3.92) [Computer program]. Chicago: MESA Press.

Linacre, J. M. (2016b). A user's guide to WINSTEPS [Computer software manual]. Retrieved December 12, 2016 from http://www.winsteps.com/winman/principalcomponents.htm

Little, D. (2005). The Common European Framework and the European Language Portfolio: Involving learners and their judgments in the assessment process. Language Testing, 22, 321–336.

Malabonga, V. M., Kenyon, D. M., & Carpenter, H. (2005). Self-assessment, preparation and response time on a computerized oral proficiency test. Language Testing, 22, 59–92.

McNamara, T. F. (1991). Test dimensionality: IRT analysis of an ESP listening test. Language Testing, 8, 139–159.

McNamara, T. F. (1995). Modelling performance: Opening Pandora's box. Applied Linguistics, 16, 159–179.

Muthén, L. K., & Muthén, B. O. (2017). Mplus user's guide (8th ed.). Los Angeles, CA: Muthén & Muthén.

National Security Education Program. (n.d.). The Language Flagship. Retrieved December 12, 2016 from http://www.nsep.gov/content/language-flagship

Nikolov, M. (2016). A framework for young EFL learners' diagnostic assessment: 'Can do statements' and task types. In M. Nikolov (Ed.), Assessing young learners of English: Global and local perspectives (pp. 65–92). New York, NY: Springer International Publishing.

North, B. (2000). The development of a common framework scale of language proficiency. New York, NY: Peter Lang.

North, B. (2011). Describing language levels. In B. O'Sullivan (Ed.), Language testing: Theories and practices (pp. 33–59). London, England: Palgrave Macmillan.

North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15, 217–263.

Oscarson, M. (1997). Self-assessment of foreign and second language proficiency. In C. Clapham & D. Corson (Eds.), The encyclopedia of language and education, Vol. 7: Language testing and assessment (pp. 175–187). Dordrecht, The Netherlands: Kluwer Academic Publishers.

Pienemann, M. (1998). Language processing and second language development. Amsterdam/Philadelphia: John Benjamins.

Purpura, J. E., & Turner, C. E. (2014). A learning-oriented assessment approach to understanding the complexities of classroom-based language assessment. Paper presented at the Teachers College, Columbia University Roundtable in Second Language Studies: Roundtable on Learning-Oriented Assessment in Language Classrooms and Large-Scale Assessment Contexts, October 10, 2014, New York, NY. Retrieved from http://www.tc.columbia.edu/tccrisls/

Purpura, J. E., & Turner, C. E. (2015). Learning-oriented assessment in second and foreign language classrooms. In D. Tsagari & J. Banerjee (Eds.), Handbook of second language assessment. Boston, MA: De Gruyter Mouton.

Rasch, G. (1960/80). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.

Ross, S. (1998). Self-assessment in second language testing: A meta-analysis and analysis of experiential factors. Language Testing, 15, 1–20.

Schneider, G., & North, B. (2000). Fremdsprachen können – was heisst das? [Knowing foreign languages – what does that mean?]. Chur/Zürich: Rüegger.

Shin, S.-Y. (2013). Proficiency scales. In C. A. Chapelle (Ed.), The encyclopedia of applied linguistics (pp. 1–7). Oxford, UK: Wiley-Blackwell.

Soneson, D., & Tarone, E. (in press). Picking up the PACE: Proficiency assessment for curricular enhancement. In P. Winke & S. Gass (Eds.), Foreign language proficiency in higher education. New York: Springer.

Spada, N., & Tomita, Y. (2010). Interactions between type of instruction and type of language feature: A meta-analysis. Language Learning, 60, 263–308.

Stansfield, C. W., Gao, J., & Rivers, W. P. (2010). A concurrent validity study of self-assessment and the federal interagency roundtable Oral Proficiency Interview. Russian Language Journal, 60, 299–315. http://www.jstor.org/stable/43669189

Suzuki, Y. (2015). Self-assessment of Japanese as a second language: The role of experiences in the naturalistic acquisition. Language Testing, 32, 63–81.

Sweet, G., Mack, S., & Olivero-Agney, A. (in press). Where am I? Where am I going, and how do I get there?: Increasing learner agency through large-scale self-assessment in language learning. In P. Winke & S. Gass (Eds.), Foreign language proficiency in higher education. New York: Springer.

The National Standards Collaborative Board. (2015). World-readiness standards for learning languages (4th ed.). Alexandria, VA: Author.

Tigchelaar, M. (in press). Exploring the relationship between self-assessments and OPIc ratings of oral proficiency in French. In P. Winke & S. Gass (Eds.), Foreign language proficiency in higher education. New York: Springer.

Tigchelaar, M., Bowles, R. P., Winke, P., & Gass, S. (2017). Assessing the validity of ACTFL can-do statements for spoken proficiency: A Rasch analysis. Foreign Language Annals, 50, 584–600. DOI: 10.1111/flan.12286

Trofimovich, P., Isaacs, T., Kennedy, S., Saito, K., & Crowther, D. (2014). Flawed self-assessment: Investigating self- and other-perception of second language speech. Bilingualism: Language and Cognition, 19, 1–19.

Turner, C. F. (1984). Why do surveys disagree? Some preliminary hypotheses and some disagreeable examples. In C. F. Turner & E. Martin (Eds.), Surveying subjective phenomena, Vol. 2. New York: Russell Sage Foundation.

VanPatten, B., Trego, D., & Hopkins, W. (2015). In-class vs. online testing in university-level language courses: A research report. Foreign Language Annals, 48, 659–668.

Weir, C. (2005). Limitations of the Common European Framework for developing comparable examinations and tests. Language Testing, 22, 281–300.

WIDA. (2014). WIDA can do descriptors. Retrieved May 10, 2018 from http://www.wida.us/standards/CAN_DOs/

Wright, B. (1991). Scores, reliabilities and assumptions. Rasch Measurement Transactions, 5, 157–158.

Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago, IL: MESA Press.

Young, J., Cho, Y., Ling, G., Cline, F., Steinberg, J., & Stone, E. (2008). Validity and fairness of state standards-based assessments for English language learners. Educational Assessment, 13, 170–192. DOI: 10.1080/10627190802394388

Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense.

Zumbo, B. D. (2007). Three generations of DIF analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4, 223–233.

Zwick, R. (2012). A review of ETS differential item functioning assessment procedures: Flagging rules, minimum sample size requirements, and criterion refinement [ETS RR-12-08]. Princeton, NJ: ETS.