ASSESSING THE VALIDITY OF ACTFL CAN-DO STATEMENTS FOR SPOKEN PROFICIENCY By Sonia Magdalena Tigchelaar A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Second Language Studies—Doctor of Philosophy 2018 ABSTRACT ASSESSING THE VALIDITY OF ACTFL CAN-DO STATEMENTS FOR SPOKEN PROFICIENCY By Sonia Magdalena Tigchelaar The NCSSFL-ACTFL (2015) Can-Do Statements describe what language learners can do in the target language at the various ACTFL Proficiency sublevels. Unlike the proficiency descriptors and corresponding can-do statements associated with the CEFR (Council of Europe, 2001), which have been extensively scaled and refined using Rasch modeling (North, 2000; North & Schneider, 1998), the NCSSFL-ACTFL statements have yet to be empirically tested. Both the scale and its foreign language performance indicators were constructed from language teachers’ beliefs and experiences (Shin, 2013). While this is a logical starting point, concerns include whether the difficulty levels of the skills described in the statements match their assigned ACTFL proficiency levels, and whether each statement accurately measures the underlying construct: language proficiency on the ACTFL subscales. This study addresses these concerns by analyzing a self-assessment instrument composed of fifty NCSSFL-ACTFL (2015) Can-Do Statements targeting spoken language proficiency. American university students of varying proficiency levels in Spanish language classes (N = 382) rated the Can-Do Statements as: 1 (I cannot do this yet), 2 (I can do this with much help), 3 (I can do this with some help), 4 (I can do this). I analyzed their item responses using a Rasch rating scale model (Andrich, 1978; Rasch, 1960/1980). I compared the difficulty levels estimated by the model to the proficiency levels assigned to the statements, and assessed each item’s fit to the model by considering the item’s infit and outfit measures. 
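The infit and outfit statistics mentioned above can be illustrated concretely. The sketch below is not the study's own analysis (which fit a polytomous rating scale model to the 4-point responses); it uses the simpler dichotomous Rasch case with synthetic data, and all names and values are illustrative assumptions.

```python
import numpy as np

def rasch_prob(theta, b):
    """Probability of endorsing an item of difficulty b for persons with abilities theta."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def item_fit(responses, theta, b):
    """Return (infit, outfit) mean-square statistics for one dichotomous item.

    Outfit is the unweighted mean of squared standardized residuals, so it is
    sensitive to unexpected responses from persons far from the item's difficulty;
    infit weights each squared residual by its model variance, emphasizing
    persons targeted near the item.
    """
    p = rasch_prob(theta, b)
    var = p * (1.0 - p)                  # model variance of each response
    sq_resid = (responses - p) ** 2
    outfit = np.mean(sq_resid / var)     # unweighted mean square
    infit = sq_resid.sum() / var.sum()   # information-weighted mean square
    return infit, outfit
```

For data generated from the model itself, both mean squares hover near 1.0; values far from 1 flag potential misfit under Linacre's interpretation of mean-square fit statistics (see Table 3).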
The mean item difficulties estimated by the Rasch model increased in line with the difficulty predicted by the ACTFL scale at the major threshold proficiency levels, and the differences between those levels were statistically significant. The mean item difficulties did not show statistically significant increases across the ACTFL proficiency sublevels, and there was a decrease in difficulty from Advanced Low (M = 1.72, SD = 1.51) to Advanced Mid (M = 1.37, SD = 1.32) items, rather than an increase. The analysis also revealed 14 items that did not fit the model measuring spoken proficiency. In the second phase of the study, I revised the self-assessment instrument based on the findings of the first phase, and the revised assessment was used for a second round of testing. Spanish language learners (N = 886) rated the ACTFL (2015) Can-Do Statements in the revised instrument. I analyzed their item responses using an exploratory factor analysis (EFA) and a Rasch rating scale model. The results of the EFA revealed two possible models of spoken language proficiency as represented by the Can-Do Statements included in the instrument: a unidimensional model in line with ACTFL’s unitary and hierarchical model of spoken proficiency, and a two-factor model. The Rasch analysis revealed that some of the items and some of the test takers did not behave as expected. The analysis also replicated the finding that the mean item difficulties estimated by the Rasch model increased in line with the ACTFL scale at the major threshold proficiency levels, and that the differences between those levels were statistically significant. The mean item difficulties in the revised assessment also ascended according to the ACTFL sublevels: There were significant differences between items from the lower proficiency sublevels, but the instrument did not discriminate well between statements pegged at higher proficiency sublevels. 
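The exploratory factor analysis referred to above essentially asks whether one latent dimension is enough to account for the correlations among the items. A minimal sketch of that logic, using synthetic responses driven by a single latent trait (the data and parameter values below are invented for illustration, not the study's):

```python
import numpy as np

rng = np.random.default_rng(42)
n_persons, n_items = 500, 10

# One latent "speaking proficiency" trait drives all items (synthetic data).
ability = rng.normal(0, 1, n_persons)
loadings = rng.uniform(0.5, 0.9, n_items)
noise = rng.normal(0, 1, (n_persons, n_items))
scores = ability[:, None] * loadings + noise * np.sqrt(1 - loadings**2)

# Eigenvalues of the inter-item correlation matrix: a single dominant first
# eigenvalue (the "elbow" on a scree plot) suggests one underlying dimension.
corr = np.corrcoef(scores, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
print(eigenvalues[0] / eigenvalues[1])  # a ratio well above 1 points to unidimensionality
```

With truly one-factor data like this, the first eigenvalue absorbs most of the shared variance and the rest fall near or below 1, which is the pattern a scree plot makes visible.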
Findings are discussed in terms of how the NCSSFL-ACTFL (2015) Can-Do Statements can be used to self-assess spoken language proficiency, and how the statements should be assessed for content validity and psychometric value. ACKNOWLEDGMENTS No assessment task is entirely satisfactory. Each format has its own weaknesses. Rather than searching for one ideal task type, the assessment designer is better advised to include a reasonable variety in any test or classroom assessment system so that the failings of one format do not extend to the overall system. (Green, 2014, p. 140) As I’ve been working on this study of self-assessment, I have had the opportunity to reflect on my own self-assessments. One of the well-documented failings of self-assessment is the Dunning-Kruger effect, whereby less experienced students tend to overestimate what they are capable of, while students with more experience and knowledge tend to underestimate their skills. As I approach the end of my dissertation project (with far more experience and knowledge than when I first started out), I have sometimes concluded that this research is not worth doing or that my work is not good enough. Thankfully, I have also received a preponderance of other evaluations along the way that have provided very different perspectives: “Really nice job” (S. Gass, personal communication, May 15, 2017); “Your dissertation is good” (C. Polio, personal communication, April 23, 2018); “Fantastic…Really great!!!” (P. Winke, personal communication, April 19, 2018); “Because I plan to include this manuscript in the Fall 2017 issue of Foreign Language Annals, I am writing this afternoon to let you know that I look forward to receiving a revised version that addresses our queries and suggestions as soon as possible” (A. 
Nerenz, personal communication, June 24, 2017); “I am pleased to inform you that your application for the MLJ/NFMLTA Dissertation Support Grant has been recommended for funding by the NFMLTA Dissertation Support Grant Committee members. Congratulations!” (A. Schleicher, personal communication, November 10, 2017). These perspectives have forced me to re-evaluate my self-assessments and provided a more balanced view of my work. I would like to thank a number of my outside evaluators, whose encouragement and support have helped me to stay on track over the course of this project. First, I would like to thank my dissertation co-chairs, Drs. Charlene Polio and Paula Winke. I am grateful for Charlene’s insight and leadership throughout the course of my doctoral career. Thank you, Paula, for your enthusiastic support, encouragement, and ability to see possibilities where I see challenges. I am also grateful to Dr. Ryan Bowles for introducing me to the fascinating world of measurement. To Dr. Sue Gass, thank you (and Paula) for generously making the Flagship Grant data available to me, and for the wonderful opportunity to work together with you on SSLA. More generally, I would like to thank the Second Language Studies program faculty members and past and current SLS and MA TESOL students for your support and camaraderie along the way. On a more personal note, I would also like to thank my family for their support and inspiration. I’m thankful to have grown up with parents and siblings who place great value on discovery and learning. I’m fortunate to have found a similar love of learning and curiosity about life in my in-laws. I am particularly grateful to my brother Evan for his patient, enthusiastic, and loving childcare while I’ve returned to work. Thank you, baby Meriwether, for inspiring me to set aside procrastination and to use my time efficiently, and for motivating me to set an example of hard work and perseverance. 
And last, and most importantly, thank you Joel for your patience, for always reminding me to trust my outside evaluators, and for your unwavering belief in me. TABLE OF CONTENTS LIST OF TABLES ....................................................................................................................... viii LIST OF FIGURES ........................................................................................................................ x INTRODUCTION .......................................................................................................................... 1 BACKGROUND ............................................................................................................................ 3 Defining the construct: Oral proficiency .................................................................................... 3 Dimensions of second language proficiency. ......................................................................... 5 Assessing the validity of oral proficiency self-assessments ....................................................... 8 Factors that threaten the validity of self-assessment items ....................................................... 12 Motivation for the current study ............................................................................................... 15 PHASE I: SPRING 2015 PROFICIENCY TESTING ................................................................. 19 Methods..................................................................................................................................... 19 Participants. ........................................................................................................................... 20 Materials. .............................................................................................................................. 20 Procedure. 
............................................................................................................................. 21 Data analysis. ........................................................................................................................ 23 Results ....................................................................................................................................... 26 Fit to the Rasch model. ......................................................................................................... 26 Item difficulty estimates. ...................................................................................................... 33 Discussion ................................................................................................................................. 39 PHASE II: SPRING 2017 PROFICIENCY TESTING ................................................................ 45 Methods..................................................................................................................................... 46 Participants. ........................................................................................................................... 46 Materials. .............................................................................................................................. 46 Procedure. ............................................................................................................................. 49 Data Analysis. ....................................................................................................................... 50 Exploratory factor analysis (EFA). ....................................................................................... 51 Rasch analysis. ...................................................................................................................... 
52 Results ....................................................................................................................... 52 Factor analysis. ..................................................................................................... 52 Fit to the Rasch model. ......................................................................................................... 57 Item difficulty estimates. ...................................................................................................... 70 Discussion ................................................................................................................................. 74 Dimensions of spoken proficiency........................................................................................ 75 Fit to the Rasch model. ......................................................................................................... 79 Item difficulty. ...................................................................................................................... 81 DISCUSSION AND CONCLUSION .......................................................................................... 85 Conclusions ............................................................................................................................... 90 ENDNOTES ................................................................................................................................. 92 APPENDICES .............................................................................................................................. 94 APPENDIX B: Phase I principal components analysis ............................................................ 98 APPENDIX C: ACTFL OPIc 1-5 levels and revised Can-Do Statements ............................. 100 APPENDIX D: Phase II principal components analysis ........................................................ 
103 REFERENCES ........................................................................................................................... 105 LIST OF TABLES Table 1: Number of participants who completed each level of the self-assessment ......................... 22 Table 2: Distribution of 2015 OPIc ratings by class level ....................................................................... 23 Table 3: Linacre’s interpretation of mean-square fit statistics .............................................................. 24 Table 4: Misfitting items from the 2015 self-assessment questionnaire .............................................. 29 Table 5: Fit statistics for 36 fitting items ...................................................................................................... 31 Table 6: Descriptive statistics for difficulty estimates of ACTFL threshold levels ........................... 33 Table 7: Descriptive statistics for difficulty estimates of ACTFL sublevels ....................................... 34 Table 8: Advanced-level items in order of difficulty .................................................................................. 38 Table 9: 14 misfitting items and their replacements .................................................................................. 47 Table 10: Number of participants who completed each level of the self-assessment....................... 49 Table 11: Distribution of 2017 OPIc ratings by class level .................................................................... 50 Table 12: Factor loadings for the 1-factor and 2-factor models ........................................................... 55 Table 13: Misfitting items from the initial 2017 Rasch model ................................................................ 59 Table 14: More misfitting items ....................................................................................................................... 
60 Table 15: Misfitting people, ratings and response strings (and most unexpected responses) ...... 61 Table 16: Final model fit statistics for the revised self-assessment questionnaire ........................... 65 Table 17: Fit statistics for the fourteen replacement items included in the revised Spring 2017 assessment .................................................................................................................................................... 67 Table 18: Fit statistics for original and revised items .............................................................................. 68 Table 19: Descriptive statistics for 2017 difficulty estimates of ACTFL threshold levels .............. 72 Table 20: Descriptive statistics for 2017 difficulty estimates of ACTFL sublevels .......................... 73 Table 21: Items with large and significant DIF. ......................................................................................... 95 Table 22: ACTFL OPIc level 1 Can-Do Statements .................................................................................. 95 Table 23: ACTFL OPIc level 2 Can-Do Statements .................................................................................. 95 Table 24: ACTFL OPIc level 3 Can-Do Statements .................................................................................. 96 Table 25: ACTFL OPIc level 4 Can-Do Statements .................................................................................. 96 Table 26: ACTFL OPIc level 5 Can-Do Statements .................................................................................. 97 Table 27: First contrast in the original Rasch model ............................................................................... 98 Table 28: First contrast in the final Rasch model ...................................................................................... 
99 Table 29: ACTFL OPIc level 1 Can-Do Statements ................................................................................ 100 Table 30: ACTFL OPIc level 2 Can-Do Statements ................................................................................ 100 Table 31: ACTFL OPIc level 3 Can-Do Statements ................................................................................ 101 Table 32: ACTFL OPIc level 4 Can-Do Statements ................................................................................ 101 Table 33: ACTFL OPIc level 5 Can-Do Statements ................................................................................ 102 Table 34: First contrast in the original Rasch model ............................................................................. 104 Table 35: First contrast in the final Rasch model .................................................................................... 104 LIST OF FIGURES Figure 1: ACTFL (2012) Proficiency levels. Reprinted with permission. ............................................. 3 Figure 2: Test takers’ path................................................................................................................................. 21 Figure 3: Distribution of OPIc ratings in Spring 2015. The numeric OPIc ratings from 1-9 represent the range of ACTFL proficiency levels from Novice Low to Advanced High. .... 23 Figure 4: Wright map of test taker ability and item difficulty. A period indicates one person. A hash mark is equal to three people. ....................................................................................................... 36 Figure 5: Distribution of OPIc ratings in Spring 2017. The numeric OPIc ratings from 1-9 represent the range of ACTFL proficiency levels from Novice Low to Advanced High, and Above Range and Below Range are represented by -1 and 0, respectively. ............................. 50 Figure 6: Scree plot of item-level EFA. 
........................................................................................................ 55 Figure 7: Wright map of test taker ability and item difficulty. ............................................................... 71 Figure 8: 95% confidence intervals of the mean threshold difficulty estimates. 1 = Novice; 2 = Intermediate; 3 = Advanced; 4 = Superior. ......................................................................................... 72 Figure 9: 95% confidence intervals of the mean sublevel difficulty estimates. 1 = NL; 2 = NM; 3 = NH; 4 = IL; 5 = IM; 6 = IH; 7 = AL; 8 = AM; 9 = AH; 10 = S. .............................................. 74 Figure 10: Phase I standardized residual contrast plot. ............................................................................. 98 Figure 11: Phase II standardized residual contrast plot for the original model. .............................. 103 Figure 12: Phase II standardized residual contrast plot for the final model. .................................... 103 INTRODUCTION Traditionally, research on assessment in the field of second language (L2) learning has focused on the assessment of language learning, but recent trends in educational assessment have called for the use of assessment for language learning (Butler, 2016; Lee, 2016; Nikolov, 2016; Purpura & Turner, 2014, 2015; VanPatten, Trego, & Hopkins, 2015). Such a shift requires that L2 learners gain awareness of their language abilities and deficiencies by taking a more active role in their assessment, rather than being the passive recipients of an outsider’s rating or judgment of their proficiency. One way that they can participate in this process is to reflect on their own language use by engaging in self-assessment. To facilitate this process, materials have been developed for language learners to take stock of what they can do in the target language. 
Can-do statements were first published as part of the Swiss European Language Portfolio project (Schneider & North, 2000). The organizations behind widely used proficiency standards, including the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2001), WIDA (2014), and the American Council on the Teaching of Foreign Languages (ACTFL, 2012, 2015, 2017), have since developed can-do statements that relate to their existing proficiency standards. These statements were designed to be used by language learners as “self-assessment checklists...to assess what they ‘can do’ with language” (ACTFL, 2015). Some of the benefits associated with can-do statements are that they are positive, concrete, clear, brief, and can promote independence (Fang, Yang & Zhu, 2011). They are designed to be psychologically affirming, focusing on abilities rather than deficiencies (e.g., I can schedule an appointment). Can-do statements are also clear and brief. They use specific, understandable language that describes functional skills rather than linguistic jargon (e.g., I can describe a childhood experience rather than I can narrate in the past tense) and they divide complex language features into short, simple descriptions (e.g., I can describe a place I have visited). Finally, in line with learner-centered language teaching (Lee, 2016; Little, 2005) and the use of assessment for language learning (Butler, 2016; Purpura & Turner, 2014, 2015), can-do statements are designed to promote language awareness (Sweet, Mack, & Olivero-Agney, in press) and independence by allowing learners to take stock of what they can do in their L2. Experts in language assessment have researched can-do statements and their corresponding language proficiency descriptors to validate them for assessment purposes (e.g., Faez, Majhanovich, Taylor, Smith, & Crowley, 2011; Jones, 2002; North, 2000; Shin, 2013; Weir, 2005; WIDA, 2014). 
One main concern is that the ACTFL (2012) proficiency scales were “constructed according to the shared experiences and beliefs of language teachers and experts” (Shin, 2013, p. 2). While this is a reasonable first step in the construction of language proficiency descriptors, without psychometric validation, it is not certain that the abilities described in the scales and their assigned difficulties are valid or useful for measuring language proficiency. In the case of the European scale (CEFR; Council of Europe, 2001), extensive empirical work has been done to scale the proficiency descriptors using Rasch modeling (North, 2000; North, 2011; North & Schneider, 1998). However, the ACTFL (2012, 2015) scale and its corresponding language descriptors have yet to be empirically tested. Therefore, more work needs to be done to provide empirical evidence for the construct validity of the ACTFL (2015) Can-Do Statements for language proficiency. The purpose of this study was to assess the validity of a selection of ACTFL Can-Do Statements targeting spoken proficiency that were used for self-assessment by university-level Spanish language learners. BACKGROUND Defining the construct: Oral proficiency The ACTFL Proficiency Guidelines for speaking are considered a model of oral language proficiency, in the sense that they provide “a theoretical overview of what we understand by what it means to know and use a language” (Fulcher & Davidson, 2009, p. 126). The Guidelines for speaking describe what language learners can do with language at five major proficiency levels: Novice, Intermediate, Advanced, Superior, and Distinguished. The first three major levels are further divided into Low, Mid, and High sublevels (ACTFL, 2012, p. 3). These proficiency levels are shown in the pyramid in Figure 1, colloquially known as the ACTFL pyramid. Figure 1: ACTFL (2012) Proficiency levels. Reprinted with permission. 
The spoken proficiency guidelines describe some of the linguistic features of second language speech (e.g., accuracy, discourse structure, language functions) at each level. For example, Advanced level speakers are expected to be able to narrate and describe in all the major time frames. This is distinct from Superior level speech, which is marked by the ability to use argumentation and to hypothesize. The Guidelines also specify content and tasks that speakers should be able to accomplish at the major threshold levels (i.e., Novice, Intermediate, Advanced, Superior) and each of the proficiency sublevels. For example, at the Advanced threshold level, speakers should be able to express themselves on concrete and familiar topics, while abstract topics are representative of the content of Superior-level language use. The ACTFL (2012) Guidelines and the NCSSFL-ACTFL (2015, 2017) Can-Do Statements further specified tasks at each of the proficiency sublevels (e.g., Novice Low, Novice Mid, Novice High). In the case of the 2015 Can-Do Statements, ACTFL defined specific performance indicators for two spoken modalities: presentational speaking (“learners present information, concepts, and ideas to inform, explain, persuade, and narrate on a variety of topics using appropriate media and adapting to various audiences of listeners, readers, or viewers,” The National Standards Collaborative Board (NSCB), 2015) and interpersonal communication (“learners interact and negotiate meaning… to share information, reactions, feelings, and opinions,” NSCB, 2015). The 2017 publication also included statements in a third category called intercultural communicative competence. There are between five (Novice Low) and twenty-five (Novice Mid) indicators for each of the eleven proficiency sublevels in both modes: Each indicator is in the form of a Can-Do Statement (e.g., I can tell someone my name, Novice Low, Interpersonal Communication). 
In addition to the qualities of second language speech reviewed above, what distinguishes speech from one proficiency level to the next, according to the ACTFL (2012) Guidelines for speaking, is in part defined by the idea of language quantity. Language users at the High level of a proficiency band (who have more language than those at lower levels) are expected to be able to do some of the tasks associated with the proficiency band directly above, but are unable to sustain performance at this level. For example, an Advanced High speaker may be able to “discuss some topics abstractly, especially those relating to their particular interests and special fields of expertise, but in general, they are more comfortable discussing a variety of topics concretely” (ACTFL, 2012, p. 5). Superior-level speakers, on the other hand, would be expected to sustain discussions on abstract topics. Dimensions of second language proficiency. The model of language proficiency represented by the ACTFL pyramid (see Figure 1) and by the language of the ACTFL (2012) Guidelines strongly suggests that spoken language proficiency grows hierarchically over time (with use, practice, and coursework): The “levels form a hierarchy in which each level subsumes all lower levels” (p. 3). The model depicted in the pyramid also represents a large, unitary construct that individuals can refer to as spoken language proficiency. The NCSSFL-ACTFL (2015, 2017) Can-Do Statements, however, appear to indicate that there are at least two dimensions (or subskills) of speaking ability within the larger construct of speaking ability in general, although these horizontal categorizations (or genres) of the spoken language construct are not indicated in the pyramid. In the original publication of the Can-Do Statements (NCSSFL-ACTFL, 2015) speaking performance was separated into two areas: presentational speaking and interpersonal communication. 
In the revised 2017 version of the Can-Dos, a third dimension was added: intercultural communicative competence. Thus, according to ACTFL, speaking can be considered a unitary construct, but it also has at least two prominent sub-categories, which means spoken language proficiency can also be seen as a multidimensional construct with at least two major categories. A unidimensional model of speaking is in line with many other second language assessments that consider speaking to be one of four skills. The Test of English as a Foreign Language Internet-Based Test (TOEFL iBT), for example, is divided into four modalities: speaking, listening, reading, and writing. In an item factor analysis of the structure of test takers’ performance on the TOEFL iBT, each of the four skills or modalities was found to be unidimensional (Sawaki, Stricker, & Oranje, 2005). McNamara (1991) used items from tasks that were considered to represent distinct aspects of L2 listening (listening comprehension and making inferences) in a Rasch model and showed that it was possible to construct a single dimension for the measurement of listening ability. These findings suggest that spoken language proficiency can be modeled mathematically as a unitary construct. At first glance, these models of L2 proficiency seem overly simplistic when compared to other theoretical models of language use (e.g., Bachman & Palmer, 1996, 2010; Canale & Swain, 1980; Celce-Murcia, Dörnyei, & Thurrell, 1995). Canale and Swain (1980), for example, proposed a longstanding and influential model of communicative language ability that included four dimensions: grammatical competence, discourse competence, sociocultural competence, and strategic competence, which can each be broken into subcategories. 
The domain of grammatical competence, for example, can be further broken down into smaller units (e.g., grammatical complexity, accuracy, lexical knowledge, and fluency), which can again be broken down to the point that the construct of language proficiency has been described as a true “Pandora’s Box” (Canale & Swain, 1980; McNamara, 1995). McNamara (1991), however, argued that “models should be judged in terms of their utility and not merely in terms of their relative approximation of reality” (p. 156). This perspective is widely held in the fields of language assessment and measurement. While no assessment can fully capture the complexity of a given human attribute (for example, IQ or motivation level), useful quantitative estimates can nevertheless be made (Bond & Fox, 2015). A useful illustration of this point is the difference between a two-dimensional map and the section of earth that it is designed to represent: “Even when it is known that the assumption of flatness is incorrect, that is, the model is at variance with what is known of the reality being modelled, such maps are useful and adequate for most purposes” (McNamara, 1991, p. 147). Thus, a measurement model of language proficiency that represents a simplification of what is known about the nature of L2 proficiency and development can still be useful. One last point to make about the nature and measurement of spoken proficiency regards the distinction between general spoken proficiency and an academic register (Bailey, 2007; Lin, 2015). Hulstijn (2007) hypothesized a core language proficiency minimally required to function communicatively in a second language. This core includes the mental representation of phonology and frequent lexical and grammatical constructions, and the ability to process these while speaking. 
Speakers with higher educational backgrounds will have profiles similar to those of speakers with lower educational backgrounds in terms of core language skills, but beyond this core there will be differences in proficiency. Speaking ability beyond the core requires higher order cognition: “explicit, conscious knowledge of all sorts of topics…as well as attention allocation, decision making, inferencing ability, and the like” (Hulstijn, 2007, p. 664). Thus, going beyond core language proficiency may rely on additional factors such as education and intelligence. Hulstijn’s idea is similar to Cummins’ (1979, 2008) basic interpersonal communicative skills (BICS), which he distinguished from cognitive/academic language proficiency (CALP). BICS encompasses speaking as conversational fluency, while CALP refers to a separate dimension of language proficiency that captures language users’ ability to express academic concepts and ideas. Cummins (1980, 1981) showed that these two dimensions of spoken proficiency in L2 English develop at different rates: English language learners who could function well conversationally required several more years to develop academic proficiency. Given this differential growth, in tests of second language oral proficiency, it may be possible to measure a general or conversational language dimension and another dimension that goes beyond conversational language.

Assessing the validity of oral proficiency self-assessments

Researchers who have investigated the self-assessment of oral proficiency have primarily been concerned with the concurrent validity of the self-assessment instruments, and particularly how well self-assessments correlate with other assessments (e.g., Brown, Dewey & Cox, 2014; Malabonga, Kenyon & Carpenter, 2005; Sweet et al., in press; Tigchelaar, in press; Trofimovich, Isaacs, Kennedy, Saito, & Crowther, 2014). This line of research has not produced conclusive results.
In a meta-analysis of self-assessment in language testing, Ross (1998) found that in 29 studies correlating self-assessments with outside measures of speaking skills, the average correlation and effect sizes were smaller than the correlations between self-assessment and reading or listening. He also observed a large range in the correlations between self-assessment and speaking in the studies he surveyed (from r = .09 to .78). He therefore argued, “self-assessment of speaking skill is quite susceptible to extraneous factors in the self-assessment process” (p. 9). Such factors might include the instrument used to conduct the self-assessment, the test taker’s interpretation of the assessment (e.g., assessing communicative intention rather than the success of that communication, Ross, 1998), or the purpose for self-assessment, among others. One factor that may lead researchers in applied linguistics to observe weak correlations between self-assessment and other test scores is the type of self-assessment instruments used in the research. Recently, Trofimovich et al. (2014) compared self- and other-assessments and found a very weak relationship between the two measures. This study was focused on L2 English learners’ ability to assess how native-like and comprehensible their spoken production was. They found a non-significant correlation for accent (r = .06, p = .50) and a very weak correlation for comprehensibility (r = .18, p = .03). It is possible that the weak associations found in this study were due to the fact that language learners are not trained (as applied linguists are) to assess linguistic components of L2 speech. In an earlier study with similar findings, Brantmeier (2006) hypothesized that the instrument she used may explain why self-assessments were poor predictors of other-assessment scores. Her participants assessed their abilities using a simple questionnaire to rate how well they thought they understood a reading passage.
Their ratings did not correlate with the results of an online placement test. She concluded that subsequent research in self-assessment should consider the use of “more contextualized, criterion-referenced self-assessment instruments” (p. 30). More recently, researchers have responded to this call by investigating the validity of proficiency descriptors and related can-do statements for the self-assessment of oral skills. Sweet et al. (in press) compared self- and other-assessment scores by correlating scores on a can-do questionnaire with ACTFL Oral Proficiency Interview by computer (OPIc) scores. They found that the strength of the relationship increased with language learning and self-assessment experience: Participants in the second semester of language study had much lower correlations (r = .37) than students in the sixth through eighth semester (r = .61). Stansfield, Gao, and Rivers (2010) found that self-assessments written with can-do statements led second language users to assess their proficiency more accurately than more general global proficiency statements. They also found a moderate correlation (r = .41) between self-assessments using can-dos and Oral Proficiency Interview (OPI) scores. Brown et al. (2014) found significant small to medium correlations between both pre-study-abroad OPI scores and self-assessments (r = .27) and post-study-abroad OPI scores and self-assessments (r = .21). Comparing the strength of the correlations observed in these studies with those found by Trofimovich et al. (2014) suggests that the type of self-assessment used matters: Researchers find stronger concurrent validity between can-do statement self-assessments and outside proficiency measures. The purpose for self-assessment may also impact the observed relationship between self-assessed proficiency and outside proficiency measures (Oscarson, 1997; Stansfield et al., 2010).
One important use of self-assessment is to establish a starting point for test takers in computer-adaptive test (CAT) contexts (Chalhoub-Deville & Deville, 1999). In ACTFL’s (2012) Oral Proficiency Interview by computer (OPIc), test takers begin by choosing one of five levels at which to begin the speaking test. One issue with this procedure is that if test takers overestimate their abilities in the self-assessment, they may select a task or test form that is too difficult for them, which can prevent raters from being able to rate their proficiency. However, using self-assessments may help to guide test takers toward the appropriate starting point. Malabonga et al. (2005) at the Center for Applied Linguistics designed a short self-assessment to guide computer-adaptive speaking test takers in choosing a starting level for their computerized oral assessment. The self-assessment was in the form of a questionnaire that included 18 questions. Based on their score on the self-assessment, one of four task levels was suggested for examinees to select for their first speaking task. The authors found that 92% of participants accurately used the self-assessment questionnaire and subsequently chose a starting level that was at an appropriate level of difficulty. They also found that the results of the self-assessments correlated strongly (r = .88) with the results of the oral proficiency test. In a similar study, Tigchelaar (in press) investigated the relationship between self-assessment scores and ACTFL OPIc scores. Second language learners of French completed a self-assessment of ACTFL Can-Do Statements designed to guide test takers toward one of five ACTFL OPIc forms. The majority (94%) of the test takers received a proficiency rating, which suggests that they used the self-assessment accurately to choose an appropriate test form.
The remainder of the participants were assigned the rating of above range or below range, which indicated the test taker took the wrong test form, either too easy or too hard, respectively, and the raters could not match the test taker’s performance to the range of ability being assessed by the selected form. The correlations between self-assessments and test scores were also strong (between r = .46 and ρ = .64, depending on the scale used to convert proficiency ratings). One should note that these correlations have a certain amount of collinearity: That is, the outcome variable (the final test score) relied in part on the initial self-assessment outcome. Taken together, these studies suggest that learners may be better equipped to self-assess functional speaking skills than more fine-grained linguistic components of speech, and that the use of criterion-referenced instruments such as can-do statements may facilitate this process. The purpose for self-assessment also impacts how well self- and other-assessments are related. However, while this body of research has provided some preliminary validity evidence by establishing links between can-do statements and criterion references, there has been minimal work looking at other aspects of the validity of these instruments. One exception is Brown et al. (2014), who provided three pieces of validity evidence for a selection of ACTFL (2015) Can-Do Statements targeting spoken proficiency. They assessed the reliability and predictive validity of the statements for OPI scores (reported above). They also assessed whether the items ascended according to the ACTFL (2012) proficiency hierarchy. Their findings indicate that the items ascended in the order predicted, based on a Rasch analysis. However, they only considered whether items followed the order of the major thresholds (i.e., Novice, Intermediate, Advanced and Superior), and not the order of the sublevels (e.g., Novice Low; Novice Mid; Novice High).
Another area that they left unexplored was how well the items fit the Rasch model, which could have provided evidence that the items they assessed were good indicators of spoken proficiency (i.e., construct validity). In particular, more work needs to be done to assess the validity of the individual can-do statements, since little evidence exists that these discrete proficiency descriptors are all important and accurate components of oral proficiency.

Factors that threaten the validity of self-assessment items

Heilenman (1990) called for language testing researchers to identify which factors influence how well self-assessment items function and how assessees respond to them in order to refine questionnaire items. Since this call, researchers have uncovered a number of features in the content of items that enhance or threaten the usefulness of self-assessment questionnaires. Haladyna, Downing, and Rodriguez (2002) synthesized the consensus from educational testing textbooks and research targeting classroom assessment to create a taxonomy of guidelines for item writing. In terms of content, their first suggestion is that “every item should reflect specific content and a single specific mental behavior” (p. 312). Other research has shown that explicitly negative statements (e.g., I cannot…) perform badly in second language self-assessment, as higher ability learners tend to endorse these statements (Heilenman, 1990; Jones, 2002). This has prompted test developers to phrase items in terms of things test takers can do (ACTFL, 2015; Council of Europe, 2001). In terms of how items function, Turner (1984) identified two main factors that contribute to unexpected responses to self-report items: items that are vague or that have ambiguous referents, and items that are irrelevant to a test-taker’s daily life. More recent research has documented how each of these factors can influence self-report responses in the context of language testing.
For example, Jones (2002) found that can-do statements that are brief or vague tend to be easier than expected, in that items that were written to describe upper-level tasks were endorsed by lower-proficiency language learners. Conversely, he identified statements about language use that include highly specific examples, refer to stressful situations, or involve channels such as speaking on the phone as more difficult than expected. Commenting on why the CEFR can-do statements do not distinguish well between adjacent upper levels of language proficiency, Weir (2005) noted, “the likely root cause is that so few contextual parameters or descriptions of successful performance are attached to such ‘Can-do’ statements. Both the context and the quality of performance may be needed to ground these distinctions” (p. 288). Not surprisingly, research on the self-assessment of language proficiency has confirmed that items addressing skills and contexts that are relevant to test-takers’ lives perform better than those that are less relevant (Butler & Lee, 2006; Ross, 1998; Suzuki, 2015). Ross (1998) found that self-assessment items that matched instructional content correlated more strongly with outside measures than more abstract proficiency-based items did. Building on this finding, Butler and Lee (2006) found that the more recently learners had engaged in a self-assessment task in the classroom, the more accurate their scores were. Suzuki (2015) found that learners with more experience in the target language in naturalistic settings, measured by length of residency, showed less discrepancy between their self-assessed proficiency and that assessed by outside measures. Clearly, the contexts addressed in can-do statements have an impact on how test-takers respond to them, and items with contexts that are unfamiliar to the test-taking population are likely not useful for measuring language proficiency.
The studies reviewed above suggest that when constructing items for self-assessment of L2 proficiency, items function best when they address a single mental skill, when they include content that is specific, and when they address contexts and content that are familiar to the test-taking population. These qualities provide some general guidelines for self-assessment test design. However, well-constructed self-assessment items of L2 proficiency do not guarantee that language learners will respond to these items exactly as test designers and researchers might expect. The expectation is that items that represent the most difficult skills should only be endorsed by people with higher proficiency levels, while the easiest items should be endorsed by every test taker. In other words, in order for items to be valid for measuring L2 proficiency, variation in item responses should be caused by variation in test takers’ L2 proficiency (Borsboom, Mellenbergh, & van Heerden, 2004). An excellent test of this conception of validity is the use of a Rasch model, which hypothesizes a single dimension (or scale) of item difficulty and person ability. An analysis of test takers’ item responses is a test of this hypothesis: items fit the Rasch measurement scale if they generate item responses that align with what the model predicts (e.g., the most difficult items will only be endorsed by test-takers with the highest ability). Items that elicit unexpected responses from test takers misfit the model (e.g., an item with a low difficulty estimate that is not endorsed by a high ability test-taker). In the case of misfit to the model, several conclusions can be drawn: In relation to items, it may indicate (1) that the item is poorly constructed; (2) that if the item is well constructed, it does not form part of the same dimension as defined by other items in the test, and is therefore measuring a different construct or trait.
In relation to persons, it may indicate (1) that the performance on a particular item was not indicative of the candidate’s ability in general, and may have been the result of irrelevant factors such as fatigue, inattention, failure to take the test item seriously…; (2) there is a heterogeneous test population in terms of the hypothesis under consideration. (McNamara, 1991, p. 143). On the other hand, if the parameters of dimensionality and model fit are met, Rasch measurement provides evidence that the items (and the people) considered in the analysis are useful for the construction of measurement (Wright, 1991). Namely, when items and people fit the model, there is evidence of construct validity, which is widely considered the most important step in assessing the validity of a test (AERA, APA & NCME, 2014; Cumming & Berwick, 1996). The idea is that if the test worked in measuring the skill of the people who were given the test (as evidenced through model fit with Rasch analysis), then the test will work for other people (in the future) who are also given the test. This assumption holds as long as the people who are given the test in the future are similar to the population originally tested. Thus, Rasch modeling is often used as a norming procedure in educational measurement.

Motivation for the current study

This study draws on data that were collected as part of the National Security Education Program’s (NSEP, 2016) Language Flagship Proficiency Initiative, which provided ACTFL language tests to L2 learners at three American universities over the course of three academic years (Fall 2014 until Spring 2017). All students in the project took the ACTFL OPIc in the second language they were learning. At the current study’s institution, each student was able to self-select which of five OPIc level tests to take, which ranged in proficiency from Novice Low to Superior.
The ACTFL OPIc is a standardized speaking test that measures second language learners’ functional speaking ability. According to the test specifications, the OPIc assesses the Interpersonal mode of communication (ACTFL, 2014): “learners interact and negotiate meaning…to share information, reactions, feelings, and opinions” (NSCB, 2015). It should be noted that other dimensions of spoken proficiency defined by ACTFL, namely presentational speaking and intercultural competence, are not included in the test’s scope. The test is administered over the Internet by an avatar that delivers questions to the test taker. The test can be considered somewhat adaptive, as the test form generated depends on which level an examinee chooses after completing a simple self-assessment of his or her oral proficiency. Test takers interact with the avatar for 20-30 minutes, and the resulting speech sample is recorded and rated by a certified ACTFL rater, who compares the OPIc performance to the ACTFL Proficiency Guidelines (2012). Ratings are given in terms of four major levels, or thresholds: Novice, Intermediate, Advanced, and Superior. The first three thresholds are further subdivided into Low, Mid, and High sublevels. In the first semester of the Language Flagship project, students were told to self-select which level of OPIc to take by responding to ACTFL’s quick self-assessment of language proficiency that was provided on the starting page of the ACTFL OPIc. On the test website, the language learners were presented with five short descriptions, which were matched to the five levels of the test. For example, description 1 was: “I can name basic objects, colors, days of the week, foods, clothing items, numbers, etc. I cannot always make a complete sentence or ask simple questions.” (All five short descriptors are in Appendix A.) If the student believed that this description represented his or her proficiency level, he or she was instructed to take OPIc level 1.
The level 1 test assessed Novice Low through Novice High skills, and level 5 assessed Advanced Low to Superior level language skills. Thus, selecting the wrong test level could result in the non-scoring of the learners’ responses. If a test taker took a level test that was too difficult, he or she would receive a result of “below range” instead of a score, meaning that the raters were unable to rate the speech provided.1 A result of “above range” meant the test taker took a form that was too easy, and raters were likewise unable to rate the speech. After the first semester of testing, it became clear that the undergraduate students needed a more nuanced way to self-select their ACTFL OPIc test level: Many students received “below range” scores because they were taking level tests that were too hard. Therefore, the principal investigator and the project team created a more nuanced, 50-statement self-assessment of language proficiency using the NCSSFL-ACTFL (2015) Can-do statements, which are aligned with ACTFL’s model of second language proficiency. Language testers are primarily concerned with establishing the validity of the tests that they use. At the heart of this concern is determining whether a test measures what it claims to measure. The ACTFL (2015) Can-Do Statements, which are items that claim to be performance indicators of second language proficiency, were developed based on language teachers’ common experiences and beliefs (Shin, 2013). Unlike their European equivalent (CEFR, Council of Europe, 2001), these performance indicators have not been subject to thorough empirical analysis that provides evidence that they can indeed be useful as a measure of language proficiency. Brown et al. (2014) provided some initial evidence for the scaling of the ACTFL Can-Do Statements to ACTFL’s hierarchical model of spoken proficiency.
However, their analysis was limited to whether the Rasch difficulty estimates followed the same hierarchy as the four major threshold levels (i.e., Novice, Intermediate, Advanced, and Superior) of the ACTFL scale. The current study continues this line of research by comparing item difficulties with both the levels and sublevels (i.e., Low, Mid, High) of the ACTFL (2012) scale, seeking evidence for construct validity (i.e., item and person fit to the model) and searching for possible explanations for why misfitting items may not successfully measure language proficiency. Phase one of this study is an analysis of Language Flagship Proficiency test takers’ responses to the original 50-statement self-assessment of language proficiency that the Flagship team created and administered in Spring 2015. The purpose of this analysis was to determine whether the self-assessment items were productive for the construction of a measurement of second language proficiency. The following questions guided the study:

1. Do the individual Can-do Statements fit ACTFL’s unitary and hierarchical model of spoken language proficiency when used for self-assessment? If they do not fit, can a reason for the misfit be identified in the content of the statements?

2. To what extent do the difficulty levels of the Can-do Statements match the hierarchy of the statements’ assigned ACTFL (2012) levels and sublevels?

PHASE I: SPRING 2015 PROFICIENCY TESTING

Methods

This study draws on data that were collected as part of the National Security Education Program’s (NSEP, 2016) Language Flagship Proficiency Initiative,2 which provided ACTFL language tests to L2 learners at three American universities over the course of three academic years (Fall 2014 until Spring 2017). All students in the project took the ACTFL Oral Proficiency Interview by computer (OPIc) in the language they were learning.
The participants in the current study were Spanish students at one university who took a 50-statement self-assessment prior to taking the ACTFL OPIc in Spring 2015 (phase one) and in Spring 2017 (phase two). Although the ACTFL (2012) Guidelines were designed to measure proficiency in any second language, I chose to consider only learners in the Spanish program. A differential item functioning (DIF) analysis3 revealed that some of the items may not function equally across language groups: some of the statements were significantly easier for one language group than for another. Another main reason for considering only the Spanish learners was that these students had experience self-assessing their language proficiency because their classroom instruction included curriculum-specific can-do statements (for a description, see VanPatten et al., 2015). Since the ability to self-assess improves with experience using self-assessments (Sweet et al., in press), this made the Spanish learners the best choice of sample. This group also had the largest number of test takers, and the largest number of high-proficiency learners. Inclusion of the other L2 learners’ data (i.e., item responses from Chinese, French, and Russian learners) would have added more numbers, but primarily at lower levels of proficiency. This also may have added language learners to the sample who had less experience self-assessing their language proficiency. Since previous research has shown that lower proficiency learners who do not have experience with self-assessment tend to over-estimate their ability (Trofimovich et al., 2014), I chose not to consider these learners.

Participants.

The students in the first phase of the study were in the first year of university-level Spanish (N=113), the second year (N=132), the third year (N=102), and the fourth year (N=35), for a total of 382 Spanish test takers.

Materials.
The materials included a computer-adaptive self-assessment questionnaire that was divided into five testlets, or sets of ten Can-Do Statements. The 50 statements were selected from the fuller list of NCSSFL-ACTFL (2015) Can-Do Statements to represent the five levels of the ACTFL OPIc by one of the principal investigators (PI Paula Winke) of the Language Flagship Proficiency Initiative and the project’s research team at the university at which the study was conducted. The PI and the project team consulted with ACTFL officials and received feedback on earlier versions of the computer-adaptive self-assessment. The questionnaire was modified according to this feedback so that the chosen statements represented the ACTFL OPIc levels as accurately as possible. Each statement was followed by a Likert scale: Participants rated their ability to execute the task described in each statement on a scale ranging from one to four: 1 (I cannot do this yet), 2 (I can do this with much help), 3 (I can do this with some help), 4 (Yes, I can do this well). The 50 statements that appeared in the Spring 2015 self-assessment, as well as information on the ACTFL sublevels from which the 50 statements were selected and the way in which the statements were arranged into the five testlets, are presented in Appendix A. Each testlet increased in difficulty, and the testlets were presented in a computer-adaptive way: After the first set of items, subsequent testlets were presented according to the test takers’ self-assessment responses to the previous set of Can-Do Statements.

Procedure.

As part of the larger project, the Spanish language learners came into computer labs during their intact classes (which were 50 minutes in duration) to take the self-assessment with the 50 Can-do statements, and then the ACTFL OPIc (ACTFL, 2012b). If a learner did not indicate that he or she could do well on nine or more Can-do statements in a set of ten, he or she was recommended to take that level’s OPIc.
If the learner indicated he or she could do nine or more of the ten Can-do statements well, he or she moved on to the next set of ten Can-dos. At level 5, if the learner indicated he or she could do eight out of the ten very well, he or she was recommended to take level 5; otherwise, he or she was recommended to take level 4. Figure 2 shows the test takers’ path through the sets of can-do statements, which led to the generation of an OPIc form targeted to each participant’s approximate proficiency level.

Figure 2: Test takers’ path.

Table 1 shows the number of participants who completed each of the levels of the self-assessment questionnaire and how many students were recommended to take which levels of the OPIc.

Table 1: Number of participants who completed each level of the self-assessment

Statements   Corresponding   N of test takers who responded   N recommended to take
             OPIc level      to statements at that level      that level of OPIc
1-10         1               382                              241
11-20        2               141                              57
21-30        3               84                               33
31-40        4               51                               32
41-50        5               38                               19

Since OPIcs are official ACTFL tests, they were rated by certified ACTFL raters according to the ACTFL (2012) proficiency guidelines for speaking. Of the 382 Spanish test takers, 15 (4%) did not receive an OPIc rating: some of these test takers did not produce enough speech for the ACTFL raters to assess, and others selected a test form that was too difficult for them, so they were unable to perform the speech tasks asked of them. The remaining 367 (96%) received OPIc ratings that ranged from Novice Low to Advanced High, which were distributed as shown in Figure 3. The distribution by class level is shown in Table 2.

Figure 3: Distribution of OPIc ratings in Spring 2015. The numeric OPIc ratings from 1-9 represent the range of ACTFL proficiency levels from Novice Low to Advanced High.
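The routing rules described above can be sketched as a short function. This is a hypothetical reconstruction for illustration only: the function and variable names are mine, and the project's actual implementation is not shown in the text.

```python
def recommend_opic_level(testlet_responses):
    """Route a test taker through up to five 10-statement testlets.

    testlet_responses: a list of testlets (in order taken), each a list of
    ten ratings on the 1-4 scale, where 4 means "Yes, I can do this well".
    Returns the recommended OPIc level (1-5).
    """
    for level, ratings in enumerate(testlet_responses, start=1):
        n_well = sum(1 for r in ratings if r == 4)
        if level < 5:
            if n_well < 9:          # fewer than 9 of 10 endorsed "well":
                return level        # recommend this level's OPIc
            # otherwise, continue on to the next testlet
        else:
            return 5 if n_well >= 8 else 4  # special rule at level 5
    return len(testlet_responses)   # fallback for incomplete data
```

For example, a learner who endorses all ten statements in testlets 1 and 2 but only six in testlet 3 would be recommended the level 3 OPIc.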
Table 2: Distribution of 2015 OPIc ratings by class level

OPIc Rating                102 (N=114)   202 (N=135)   300-level (N=97)   400-level (N=36)
Novice Low (N=22)          4             18            -                  -
Novice Mid (N=80)          25            45            8                  2
Novice High (N=70)         21            26            15                 8
Intermediate Low (N=79)    16            22            31                 10
Intermediate Mid (N=84)    31            13            28                 12
Intermediate High (N=19)   8             4             5                  2
Advanced Low (N=9)         3             1             3                  2
Advanced Mid (N=3)         1             1             1                  -
Advanced High (N=1)        -             1             -                  -

Note: A dash indicates that no test takers at that class level received that rating.

Data analysis.

I analyzed the test takers’ item responses to the self-assessment questionnaire using a Rasch model (Rasch, 1960/80). One of the specifications of item response modeling is that the items of a test measure a single latent trait (Embretson & Reise, 2000; Wright, 1991). In order to test this hypothesis using a Rasch model, researchers should provide evidence of unidimensionality and item fit to the model. Linacre (2016b) suggested that evidence for multiple dimensions can be assessed using a Principal Components Analysis (PCA) of residuals. A secondary dimension (i.e., evidence that threatens the assumption of unidimensionality) is indicated by an eigenvalue greater than 2.0 for the first factor in the PCA and a disattenuated correlation substantially less than 1.00 between separately calibrated theta scores. The second assumption is that all items in a test fit the model. An item is said to fit if it generates item responses that align with what the model predicts (e.g., the most difficult Can-Do Statements will only be endorsed by test-takers with the highest ability). Items that elicit unexpected responses from test takers misfit the model (e.g., an item with a low difficulty estimate that is not endorsed by a high ability test-taker). Fit to the Rasch model is assessed by measures of infit and outfit. Infit is “sensitive to unexpected patterns of observations by persons on items that are roughly targeted on them,” while outfit is “more sensitive to unexpected observations by persons on items that are relatively very easy or very hard for them” (Linacre, 2016b).
The expected value for these two measures is 1.00; higher values indicate more error than expected, while lower values indicate more redundancy than expected. Wright and Linacre (1994) suggested a fit analysis cutoff between .6 and 1.4 for rating scale models. In an addendum to the reasonable mean-square fit value publication (Wright & Linacre, 1994), Linacre also proposed four mean-square ranges for interpreting fit statistics, shown in Table 3.

Table 3: Linacre’s interpretation of mean-square fit statistics

Statistic   Interpretation
> 2.0       Degrades the measurement system
1.5 - 2.0   Unproductive for construction of measurement, but not degrading
0.5 - 1.5   Productive for measurement
< 0.5       Less productive for measurement, but not degrading

If the parameters for dimensionality and model fit are met, Rasch measurement provides evidence of construct validity. The item responses to the Spring 2015 self-assessment by 382 test-takers in the current study were analyzed using WINSTEPS Version 3.92 (Linacre, 2016a). Responses for participants who did not complete all of the self-assessment items were entered as missing data. A Rasch rating scale model (Andrich, 1978) was selected for the analysis because each item of the questionnaire was rated on the same scale.4 This model is expressed in equation (1):

\[
P(X_{ni} = k) = \frac{\exp \sum_{j=0}^{k} \left( \theta_n - \beta_i - \tau_j \right)}{\sum_{m=0}^{K} \exp \sum_{j=0}^{m} \left( \theta_n - \beta_i - \tau_j \right)}, \qquad \tau_0 \equiv 0 \qquad (1)
\]

The model gives the probability that test taker n responds in category k on item i, taking into consideration the test taker’s ability and the item’s difficulty. In the equation, θ represents test taker ability (i.e., language proficiency), β represents item difficulty, and the τ terms are the category thresholds. In the case of the 4-point rating scale used in this study, there are three category thresholds. To test the assumption that the self-assessment items used in this study are useful for measuring spoken language proficiency (research question 1), I analyzed the rating scale model for dimensionality and item fit as described above.
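As a concrete illustration of the rating scale model in equation (1), the category probabilities can be computed directly from an ability estimate, an item difficulty, and the thresholds. This is an illustrative sketch only: the function names are mine, and the study's actual estimation was performed in WINSTEPS.

```python
import math

def rsm_category_probs(theta, beta, taus):
    """Return P(response = 0..K) under the Andrich rating scale model.

    theta: person ability in logits; beta: item difficulty in logits;
    taus: the category thresholds (three of them for a 4-point scale).
    """
    numerators = [1.0]          # category 0: exp of the empty sum, i.e., exp(0)
    cum = 0.0
    for tau in taus:            # cumulative sum of (theta - beta - tau_j)
        cum += theta - beta - tau
        numerators.append(math.exp(cum))
    total = sum(numerators)
    return [n / total for n in numerators]

def rsm_expected_score(theta, beta, taus):
    """Model-expected rating; residuals for fit statistics compare
    observed ratings against this expectation."""
    probs = rsm_category_probs(theta, beta, taus)
    return sum(k * p for k, p in enumerate(probs))
```

For example, with thresholds of -1, 0, and 1 logits, a test taker whose ability exactly matches an item's difficulty is equally likely to respond in the two middle categories, and raising theta shifts probability mass toward the top category.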
Since one of the objectives of the study was to construct an assessment instrument that was productive for measuring language proficiency by self-assessment, I considered fit values between 0.5 and 1.5 as acceptable, following Linacre's suggestion (Wright & Linacre, 1994). To answer the second research question, I evaluated the model by conducting a difficulty analysis of the items to compare the item difficulty estimated by the Rasch model with the ACTFL proficiency level associated with these items. I plotted the item difficulties on a Wright map and calculated the mean item difficulty for items at each threshold level (i.e., Novice, Intermediate, Advanced, Superior) and sublevel (e.g., Novice Low, Novice Mid, Novice High) to determine whether the items ascended in the order of difficulty defined by the ACTFL scale.

Results

Fit to the Rasch model. The first research question addressed whether the individual Can-Do statements fit ACTFL's model of spoken language proficiency when used for self-assessment. It was answered by evaluating the dimensionality and model fit of the 50 Can-Do self-assessment statements describing spoken language proficiency. The initial Rasch rating scale model had a person reliability of .95, indicating that the self-assessment instrument discriminated well between test takers of varying proficiency levels. To test the assumption of unidimensionality, the model was analyzed using a Principal Components Analysis (PCA) of the residuals. The aim of the PCA was to determine whether the instrument under consideration was measuring multiple dimensions. The Rasch dimension explained 59.9% of the variance, while the largest secondary dimension, or first contrast in the residuals, explained 2.4% of the variance. The eigenvalue of this first contrast was 3.04, or the strength of approximately three items.
However, the disattenuated correlation for person measures was 1.00, and the contrast plot (shown in Appendix B) did not show a group of three outlying items at the bottom or top of the plot. Inspection of the items with the highest and lowest factor loadings did not reveal any obvious contrasts in content: both clusters contained items from both the interactional and presentational modes of speaking, and both covered different topics and proficiency levels (also shown in Appendix B). Finally, according to Holmes (1982), "an item set may be considered unidimensional if the first eigenvalue from the analysis is large compared to the second, and all eigenvalues other than the first are the same size" (p. 141). This was exactly the case for the values in this analysis: the first eigenvalue was large (the Rasch dimension explained 59.9% of the variance), and the first through fifth contrasts were similar in size, ranging from 3.04 to 2.18. Therefore, it appears that the unexplained variance may be due to random noise. In summary, the PCA did not provide any evidence of multidimensionality, which indicates that this set of items can be considered unidimensional and is appropriate for use with Rasch analysis. With sufficient evidence that the items were measuring a single dimension, spoken language proficiency, the next step of the Rasch analysis was to determine how well the items measured the underlying trait. A fit analysis cutoff between .5 and 1.5, targeting items that are productive for measurement (Wright & Linacre, 1994), was adopted as the criterion for determining fit to the Rasch model. Fourteen items, presented in Table 4, displayed outfit values that were outside the cutoff range. These misfit values suggest that these items did not contribute to the unidimensional measurement of spoken proficiency. They are, in other words, psychometrically problematic items.
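The dimensionality check reported above, a PCA of standardized residuals with inspection of the first contrast, can be sketched as follows. This is illustrative Python on simulated unidimensional data, using the generating parameters in place of Rasch estimates; WINSTEPS performs the real computation internally:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate dichotomous responses from a unidimensional Rasch model:
# 200 persons x 20 items
theta = rng.normal(0, 1.5, size=(200, 1))    # person abilities (logits)
beta = np.linspace(-2, 2, 20)                # item difficulties (logits)
p = 1 / (1 + np.exp(-(theta - beta)))        # model-expected probabilities
x = (rng.random(p.shape) < p).astype(float)  # observed responses

# Standardized residuals: (observed - expected) / model standard deviation
resid = (x - p) / np.sqrt(p * (1 - p))

# Eigenvalues of the residual correlation matrix, largest first;
# the largest one is the "first contrast"
eigvals = np.linalg.eigvalsh(np.corrcoef(resid.T))[::-1]
first_contrast = eigvals[0]

# Rule of thumb from the text: a first contrast well above 2.0 (the
# strength of about two items), standing out from the later contrasts,
# hints at a secondary dimension; for truly unidimensional data the
# leading residual eigenvalues are similar in size.
print(round(first_contrast, 2))
```
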
After identifying misfit, the next step in the Rasch analysis was to examine the construction (McNamara, 1991) and the content of the misfitting items to determine why they may not be good measures of the construct. In the case of the 14 items identified in Table 4, three common features were identified as problematic: the items were vague (e.g., I can ask for help), described experiences that many college students might not have (e.g., I can explain an injury or illness and manage to get help), described multiple skills in one item (e.g., I can say the date and day of the week), or displayed a combination of these features. The research team discussed the misfitting items and reached consensus as to which misfit category each item belonged to. These problematic item features are shown in the last three columns of Table 4 for each of the misfitting items. In addition to the above issues that were common to multiple questionnaire items, a couple of other item features are of note. First, Item 32 (I can give a presentation about my interests, hobbies, lifestyle, or preferred activities) may be a skill that test takers would not have experience doing because this genre does not reflect real-life language use. Secondly, Items 12 (I can describe a place I have visited or want to visit) and 33 (I can ask for and provide descriptions of places I know and also places I would like to visit) describe language use that might not be possible (i.e., describing a place one has not yet visited).

Table 4: Misfitting items from the 2015 self-assessment questionnaire

Statement | Difficulty estimate (S.E.) | Infit MNSQ (z-std) | Outfit MNSQ (z-std) | Vague | Exper. depend. | Multiple skills
1. I can say the date and the day of the week. | -4.68 (.14) | 1.14 (1.4) | 1.94 (2.4) | - | - | √
2. I can list the months and seasons. | -3.43 (.12) | 1.06 (0.8) | 2.53 (4.1) | - | - | √
5. I can state my favorite foods and drinks and those I do not like. | -4.45 (.13) | 1.23 (2.3) | 2.96 (4.3) | - | - | √
8. I can list my classes and tell what time they start and end. | -2.43 (.11) | 1.03 (0.4) | 1.70 (2.5) | - | - | √
12. I can describe a place I have visited or want to visit. | -1.87 (.26) | 0.89 (-0.5) | 0.45 (-1.0) | - | - | √
13. I can ask for help at school, work, or in the community. | -2.16 (.28) | 0.91 (-0.4) | 0.39 (-1.1) | √ | - | √
14. I can talk about my daily routine. | -2.17 (.28) | 1.01 (0.1) | 0.48 (-0.6) | √ | - | -
15. I can talk about my interests and hobbies. | -3.07 (.36) | 1.07 (0.4) | 0.38 (-0.9) | √ | - | -
18. I can plan an outing with a group of friends. | -1.43 (.24) | 0.84 (-0.9) | 0.49 (-1.1) | - | √ | -
23. I can describe a childhood or past experience. | -0.49 (.36) | 1.04 (0.3) | 2.97 (2.3) | √ | - | √
30. I can explain an injury or illness and manage to get help. | 1.90 (.23) | 1.16 (1.0) | 1.53 (2.2) | - | √ | √
32. I can give a presentation about my interests, hobbies, lifestyle, or preferred activities. | -0.66 (.75) | 0.95 (0.1) | 0.42 (-0.2) | - | √ | √
33. I can ask for and provide descriptions of places I know and also places I would like to visit. | -0.66 (.75) | 0.95 (0.1) | 0.42 (-0.2) | - | - | √
38. I can exchange general information about leisure and travel, such as the world's most visited sites or most beautiful places to visit. | 0.42 (.51) | 0.66 (-0.9) | 0.34 (-1.0) | - | √ | √

A second Rasch analysis was conducted after deleting the 14 items with the greatest misfit. This resulted in a final version that included 36 items with both infit and outfit mean-square values that fell within 0.5 and 1.5, shown in Table 5. This revised model had a person reliability of .94. In terms of dimensionality, 64% of the variance was explained by the Rasch dimension, while 3% of the variance was explained by the largest secondary dimension. The eigenvalue of the first contrast decreased to 2.92, the disattenuated correlation remained at 1.00, and there was still no evidence from the content of the items in the first contrast that a secondary dimension was at play (see Appendix B).
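The fit screening behind this pruning step can be sketched numerically. The recipe below is illustrative Python on simulated dichotomous data with one deliberately noisy item; it is not the study's actual computation, which used WINSTEPS on the 4-category ratings:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated dichotomous Rasch data: 400 persons x 5 items
theta = rng.normal(0, 1.5, size=(400, 1))      # person abilities (logits)
beta = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # item difficulties (hypothetical)
p = 1 / (1 + np.exp(-(theta - beta)))          # model-expected probabilities
x = (rng.random(p.shape) < p).astype(float)    # observed responses

# Make item 2 behave erratically to force misfit: replace its responses
# with coin flips that ignore ability altogether
x[:, 2] = (rng.random(400) < 0.5).astype(float)

w = p * (1 - p)                                # model variance per response
z2 = (x - p) ** 2 / w                          # squared standardized residuals

outfit = z2.mean(axis=0)                       # unweighted mean square
infit = ((x - p) ** 2).sum(axis=0) / w.sum(axis=0)  # information-weighted

# Keep only items whose mean squares sit in the productive 0.5-1.5 band
keep = [i for i in range(5)
        if 0.5 <= infit[i] <= 1.5 and 0.5 <= outfit[i] <= 1.5]
print(np.round(infit, 2), np.round(outfit, 2), keep)
```

The noisy item produces inflated infit and outfit values and drops out of `keep`, mirroring how the 14 misfitting statements were removed.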
These findings provide evidence that the remaining 36 items are useful for measuring spoken language proficiency among college-level Spanish learners.

Table 5: Fit statistics for 36 fitting items

Statement | Difficulty estimate (S.E.) | Infit MNSQ (z-std) | Outfit MNSQ (z-std)
3. I can say which sports I like and don't like. | -6.24 (.15) | 1.23 (2.2) | 1.41 (1.2)
4. I can list my favorite free-time activities and those I don't like. | -5.89 (.14) | 0.88 (-1.3) | 0.61 (-1.3)
6. I can talk about my school or where I work. | -4.53 (.12) | 0.86 (-1.8) | 0.78 (-0.8)
7. I can talk about my room or office and what I have in it. | -2.80 (.11) | 0.91 (-1.1) | 1.41 (1.7)
9. I can answer questions about where I'm going or where I went. | -3.74 (.12) | 0.95 (-0.6) | 0.71 (-1.2)
10. I can present information about something I learned in a class or at work. | -2.41 (.11) | 1.09 (1.1) | 0.93 (0.1)
11. I can describe a school or workplace. | -2.43 (.25) | 0.99 (0.0) | 1.45 (1.3)
16. I can schedule an appointment. | -0.43 (.19) | 1.19 (1.5) | 1.14 (0.5)
17. I can talk about my family history. | -0.25 (.19) | 1.16 (1.0) | 0.70 (-0.8)
19. I can explain why I was late to class or absent from work and arrange to make up the lost time. | -0.86 (.20) | 1.01 (0.1) | 1.18 (0.8)
20. I can tell a friend how I'm going to replace an item that I borrowed and broke/lost. | 0.21 (.19) | 1.15 (1.1) | 0.96 (0.3)
21. I can give some information about activities I did. | -2.42 (.54) | 1.02 (0.2) | 1.33 (0.6)
22. I can talk about my favorite music, movies, and sports. | -2.75 (.61) | 1.00 (0.2) | 0.70 (-0.5)
24. I can ask for and follow directions to get from one place to another. | -0.18 (.29) | 0.87 (-0.8) | 0.84 (-0.4)
25. I can return an item I have purchased to a store. | 0.71 (.26) | 0.79 (-1.3) | 0.68 (-0.6)
26. I can arrange for a make-up exam or reschedule an appointment. | -0.18 (.29) | 0.86 (-0.8) | 1.11 (0.4)
27. I can present an overview about my school, community, or workplace. | -0.82 (.33) | 1.00 (0.1) | 0.71 (-0.7)
28. I can compare different jobs and study programs in a conversation with a peer. | 0.36 (.27) | 0.76 (-1.5) | 0.93 (0.1)
29. I can discuss future plans, such as where I want to live and what I will be doing in the next few years. | -1.30 (.37) | 0.84 (-0.6) | —
31. I can present ideas about something I have learned, such as a historical event, a famous person, or a current environmental issue. | 1.88 (.33) | 1.16 (0.9) | 1.45 (1.4)
34. I can explain how life has changed since I was a child and respond to questions on the topic. | -0.14 (.51) | 1.01 (0.3) | 0.97 (-0.1)
35. I can discuss what is currently going on in another community or country. | 1.40 (.36) | 0.70 (-0.8) | 0.82 (-0.4)
36. I can provide a rationale for the importance of certain classes, subjects, or training programs. | 1.53 (.35) | 0.89 (-0.5) | 1.09 (0.3)
37. I can talk about present challenges in my school or work life, such as paying for classes or dealing with difficult colleagues. | -0.43 (.56) | 0.81 (-0.4) | 0.89 (0.2)
39. I can give a presentation about cultural influences on society. | 1.88 (.33) | 1.02 (0.2) | 0.86 (-0.4)
40. I can participate in conversations on social or cultural questions relevant to speakers of this language. | 1.77 (.34) | 1.05 (0.3) | 1.20 (0.7)
41. I can interview for a job or service opportunity related to my field of expertise. | 3.85 (.37) | 0.77 (-0.9) | 0.80 (-0.7)
42. I can present an explanation for a social or community project or policy. | 2.73 (.39) | 0.84 (-0.6) | 0.78 (-0.8)
43. I can present reasons for or against a position on a political or social issue. | 3.30 (.37) | 0.94 (-0.2) | 0.90 (-0.3)
44. I can give a clear and detailed story about childhood memories, such as what happened during vacations or memorable events, and answer questions about my story. | 1.93 (.42) | 0.94 (-0.2) | 0.90 (-0.3)
45. I can exchange general information about my community, such as demographic information and points of interest. | 1.93 (.42) | 0.93 (-0.2) | 0.93 (0.0)
46. I can exchange factual information about social and environmental questions, such as retirement, recycling, or pollution. | 2.42 (.40) | 0.83 (-0.7) | 0.87 (-0.3)
47. I can usually defend my views in a debate. | 2.43 (.40) | 0.64 (-1.8) | 0.56 (-1.6)
48. I can exchange complex information about my academic studies, such as why I chose the field, course requirements, projects, internship opportunities, and new advances in my field. | 2.73 (.39) | 1.07 (0.4) | 0.99 (-1.6)
49. I can provide a balance of explanations and examples on a complex topic. | 3.30 (.37) | 1.01 (0.1) | 0.96 (-0.1)
50. I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. | 3.44 (.37) | 1.36 (1.4) | 1.44 (1.6)

Item difficulty estimates. The second research question assessed the extent to which the difficulty of the statements (estimated by the Rasch model) matched the ACTFL (2012) proficiency levels associated with the Can-Do statements. I calculated the mean logit scores from the Rasch analysis for the statements belonging to each of the threshold levels (i.e., Novice, Intermediate, Advanced, Superior) and sublevels (e.g., Novice Low, Novice Mid, Novice High) to evaluate whether they ascended according to the hierarchy of the ACTFL (2012) scale. The mean difficulty estimates for each of the major threshold levels, presented in Table 6, ascended in the expected order: Novice statements were the easiest, followed by Intermediate and Advanced, with Superior statements being the most difficult. In other words, when grouped by major threshold level, the statements acted as expected. Furthermore, the 95% confidence intervals for each threshold did not overlap, which suggests that the mean statement difficulty at each of the major ACTFL levels differs to a statistically significant degree.
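The standard errors and confidence intervals reported alongside these means follow from SE = SD/√N and the t distribution; a quick arithmetic check in Python with the Novice threshold values from Table 6 (N = 10, M = -3.33, SD = 1.25; 2.26 is the two-tailed .05 t critical value for 9 degrees of freedom):

```python
import math

# Values reported for the Novice threshold level in Table 6
n, mean, sd = 10, -3.33, 1.25

se = sd / math.sqrt(n)   # standard error of the mean: ~0.395 (.39 in the table)
ci = (mean - 2.26 * se, mean + 2.26 * se)   # approximate 95% CI, df = 9

print(round(se, 3), [round(v, 2) for v in ci])  # CI reproduces [-4.22, -2.44]
```
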
Table 6: Descriptive statistics for difficulty estimates of ACTFL threshold levels

ACTFL threshold     N    Mean logit score (SD)   SE    95% CI
1 - Novice         10        -3.33 (1.25)        .39   [-4.22, -2.44]
2 - Intermediate   17        -0.68 (1.46)        .36   [-1.43, 0.07]
3 - Advanced       21         1.78 (1.34)        .29   [1.18, 2.39]
4 - Superior        2         3.72 (0.08)        .06   [2.96, 4.48]

Note. N = number of statements.

The mean difficulty estimates for the ACTFL sublevels, presented in Table 7, ascended as anticipated for the most part (although the 95% confidence intervals for adjacent means all overlap, so the differences between adjacent sublevels cannot be considered statistically significant). However, there was one notable exception: there was a decrease in difficulty from Advanced Low (M = 1.72, SD = 1.51) to Advanced Mid (M = 1.37, SD = 1.32), rather than an increase.

Table 7: Descriptive statistics for difficulty estimates of ACTFL sublevels

ACTFL sublevel     N    Mean logit score (SD)   SE    95% CI
1 - Nov-Low         2        -4.06 (0.88)        .62   [-12.00, 3.89]
2 - Nov-Mid         7        -3.39 (1.19)        .45   [-4.49, -2.29]
3 - Nov-High        1        -1.42¹               -          -
4 - Int-Low         3        -1.82 (0.36)        .21   [-2.73, -0.91]
5 - Int-Mid         6        -1.48 (1.24)        .51   [-2.77, -0.18]
6 - Int-High        8         0.34 (1.22)        .43   [-0.68, 1.36]
7 - Adv-Low        10         1.72 (1.51)        .48   [0.65, 2.81]
8 - Adv-Mid         7         1.37² (1.32)       .50   [0.15, 2.58]
9 - Adv-High        4         2.65 (0.42)        .21   [1.98, 3.32]
10 - Superior       2         3.72 (0.08)        .06   [2.95, 4.48]

Notes: 1. Because there was only one statement at the Novice High level in the instrument, the mean logit score was not included in the difficulty analysis. 2. The mean logit score at this level falls out of the expected order.

Inspection of the Wright map, in Figure 4, sheds some light on why the Advanced Low and Mid levels did not ascend in the anticipated order. The right half of the map displays each item and its intended ACTFL sublevel in order of difficulty based on item responses. Items at the top of the map were rated the most difficult and items at the bottom were the easiest.
The range of item difficulty for the Advanced Low and Advanced Mid sublevels spans a wide distance (from -0.03 to 4.14 logits for Advanced Low and from -0.49 to 2.85 logits for Advanced Mid). These ranges also closely overlap: the Advanced Mid range includes the items with the easiest difficulty estimates of the two levels, while the Advanced Low range includes the most difficult items in the analysis. These very difficult items could have skewed the Advanced Low mean difficulty estimate higher than the Advanced Mid mean. The most difficult items addressed topics (Item 42: I can present an explanation for a social or community project or policy) or experiences (Item 41: I can interview for a job or service opportunity related to my field of expertise) that many college students may not be familiar with. It is possible that for this population, the skills described by the most difficult Advanced Low items may represent higher level proficiency skills than Advanced Low. Another possibility is that items belonging to these two sublevels may not form two distinguishable groups when used by this population.

[Figure 4: Wright map of test taker ability and item difficulty. Test takers are plotted on the left (a period indicates one person; a hash mark is equal to three people) and items, labeled with their intended ACTFL sublevels, on the right of a shared logit scale. Brackets on the map mark the Advanced Low item range (-0.03 to 4.14 logits) and the Advanced Mid item range (-0.49 to 2.85 logits).]
As mentioned above, the items at both the Advanced Low and Mid levels had wide ranges of difficulty. The ranges for the two Superior level items and the Advanced High items were much smaller: the Advanced High items ranged from 2.25 to 3.14 logits and the Superior level items ranged from 3.66 to 3.78 logits. The difficulty range and item content for the Advanced and Superior level items are shown in Table 8. Inspection of the content of the four most difficult Advanced Low items indicates why these items were as difficult as or more difficult than items at the Superior level. Items 42, 43 and 46 require speakers to provide policy explanations, reasons for a position, and rationales. Recalling that speakers at the Advanced level are expected to be able to use language for narration and description, while Superior level speakers should be able to use argumentation, hypothesize, and discuss abstract topics, the language functions in Items 42, 43, and 46 may better describe Superior-level language use. If these items were modified to use descriptive language on concrete topics (e.g., I can describe a community project), they may better reflect Advanced proficiency as defined by the ACTFL (2012) Guidelines. The most difficult item, Item 41 (I can interview for a job or service opportunity), requires high-stakes language use that would likely require speakers to hypothesize about how they would perform in a job or service opportunity (e.g., If I were to teach a Research Methods course, I would include the following topics…). Again, this type of language use aligns better with ACTFL's definition of Superior language proficiency. The content of the remaining Advanced Low and Mid items appears to require narration (e.g., I can give a clear and detailed story about childhood memories) and description (e.g., I can compare different jobs and study programs in a conversation with a peer).
Table 8: Advanced-level items in order of difficulty (Rasch difficulty in logits in parentheses)

Advanced Low:
19. I can explain why I was late to class or absent from work and arrange to make up the lost time. (-0.03)
27. I can present an overview about my school, community, or workplace. (0.14)
34. I can explain how life has changed since I was a child and respond to questions on the topic. (0.42)
28. I can compare different jobs and study programs in a conversation with a peer. (0.42)
20. I can tell a friend how I'm going to replace an item that I borrowed and broke/lost. (1.90)
35. I can discuss what is currently going on in another community or country. (2.38)
46. I can provide a rationale for the importance of certain classes, subjects, or training programs. (2.85)
42. I can present an explanation for a social or community project or policy. (3.14)
43. I can present reasons for or against a position on a political or social issue. (3.66)
41. I can interview for a job or service opportunity related to my field of expertise. (4.14)

Advanced Mid:
29. I can discuss future plans, such as where I want to live and what I will be doing in the next few years. (-0.49)
37. I can talk about present challenges in my school or work life, such as paying for classes. (0.14)
44. I can give a clear and detailed story about childhood memories and answer questions about my story. (2.38)
45. I can exchange general information about my community, such as demographic information and points of interest. (2.38)
36. I can exchange factual information about social and environmental questions, such as retirement, recycling, or pollution. (2.85)

Advanced High:
40. I can participate in conversations on social or cultural questions relevant to speakers of this language. (2.25)
39. I can give a presentation about cultural influences on society. (2.36)
47. I can usually defend my views in a debate. (2.85)
48. I can exchange complex information about my academic studies. (3.14)

Superior:
49. I can provide a balance of explanations and examples on a complex topic. (3.66)
50. I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. (3.78)

Discussion

In the first phase of this study I analyzed 50 ACTFL (2015) Can-Do Statements targeting spoken language proficiency that were included in a computer-adaptive self-assessment constructed for the Language Flagship Initiative with input from ACTFL and LTI. The first research question assessed how well the fifty items included in the analysis fit the Rasch model measuring spoken language proficiency. Fourteen items did not fit the model, and when these items were deleted, there was evidence that the remaining 36 items did, in fact, measure a single latent trait. Specifically, the remaining items had fit values that fell within an acceptable range and did not display any obvious evidence of multidimensionality. As for the 14 misfitting items, the discussion that follows outlines three possible reasons why these items elicited unexpected response patterns from test takers in the current study and also offers recommended revisions. According to Jones (2002), discrepancies in item difficulty between what is observed and what is predicted can often be explained by considering the features of the misfitting items. The items in the present study that did not fit the model displayed a number of features that have been documented in the literature on response patterns to self-assessment items (Haladyna et al., 2002; Heilenman, 1990; Jones, 2002; Ross, 1998; Suzuki, 2015; Turner, 1984): items that addressed multiple skills, items that depended on specific experiences using the target language, and items that were vague. These problematic item features are shown in the last three columns of Table 4 for each of the misfitting items.
The research team discussed the misfitting items and reached consensus as to which misfit category each item belonged to. The majority of the misfitting items (11 out of 14) included one or more coordinating conjunctions (i.e., and, or). Combining two skills that may have different degrees of difficulty and for which learners have different degrees of mastery may have influenced the items' fit to the model. For example, Item 1 (I can say the date and day of the week) involves two skills with different degrees of difficulty. Saying the day of the week simply requires vocabulary knowledge of the days of the week, whereas saying the date requires both vocabulary knowledge of the months and numbers and knowledge of how to combine this information. Although coordinating conjunctions were also included in items that did fit the Rasch model, one possibility for why these items fit is that the skills they combine have similar degrees of difficulty. Further, it is generally accepted that assessment items should target only one skill (Haladyna et al., 2002). The unexpected item response patterns for some items that contained coordinating conjunctions may be due to the disparate nature of the skills being assessed, a problem that is exacerbated when multiple coordinating conjunctions are used in the same Can-Do Statement, as in Items 5, 8, 30, and 33. Another feature of items that did not fit the model was that they were brief and vague. In the present analysis, four items fit this description (Items 13, 14, 15, 23). These items are all noticeably shorter than the other statements and include topics that are not specific ("asking for help", "daily routines", "interests", and "past experiences"). A daily routine, for example, could be anything from what a person does (e.g., working eight hours per day) to a more detailed, grammar-focused (i.e., reflexive verbs) list of a person's morning routine that is found in many foreign language textbooks and curricula (e.g., Garcia & Asención, 2001).
The finding corroborates Jones' (2002) analysis of the CEFR Can-Do statements, which showed that items that were brief and included topics that were vague tended to elicit response patterns that were easier than expected. Five of the statements that did not fit the model made reference to experiences or knowledge that the typical college Spanish student would likely not have encountered (Items 18, 30, 32, 38, 47). Research has clearly shown that self-assessment items that are not relevant to students' lives tend to perform less well than items that describe skills that students have experienced in the classroom or in naturalistic settings (Butler & Lee, 2006; Ross, 1998; Suzuki, 2015; Turner, 1984). Among these items, 18 and 30 describe experiences that likely would require having spent time abroad, having made friends who speak the target language, or having experienced a medical emergency. Without this kind of experience, it would be very difficult to evaluate whether one can effectively carry out these tasks in the target language. Items 32 and 47, on the other hand, include modes of communication that could be more common in a language classroom. However, without the experience of giving a presentation specifically about one's interests or defending one's views in a debate format, it might not be possible to self-judge these skills. Item 38 requires very specific travel knowledge, which could have made this statement more difficult for assessees to endorse as something they can do (Jones, 2002). Without experience performing these skills, college-aged examinees would likely be unable to judge how difficult the tasks actually are, causing them to guess and leading to measurement error (Haladyna et al., 2002). In addition to the above issues that were common to multiple questionnaire items, Items 33 and 12 were poorly constructed because they included language use that might not be possible (i.e., describing a place one has not yet visited).
The second research question addressed the extent to which the difficulty of the items considered in this study would follow the hierarchy predicted in the ACTFL scale. Comparison of the mean Rasch difficulty estimates for items at each of the major threshold levels of the ACTFL scale revealed that the mean logit scores ascended in the anticipated order and that mean differences were statistically significant. This finding is in line with Brown et al. (2014), who found that the ACTFL Can-Do statements which they modeled ascended in the same order of difficulty as the threshold levels in the scale (although not to a statistically significant degree). The current study took the analysis one step further by considering the mean difficulty estimates for the sublevels at each proficiency level, which revealed that most sublevel means ascended in the predicted order. However, the mean difficulty score for items at the Advanced Low level was higher than that for items at the Advanced Mid level, and the ranges of these categories overlapped considerably. One explanation for this unexpected item difficulty is that some of the Advanced Low items described language use that may go beyond ACTFL's expectations of Advanced proficiency. Specifically, Items 41, 42, 43, and 46 included language use that would require speakers to discuss abstract topics and hypothetical situations, which reflect Superior-level language functions. The finding that the order of difficulty did not match the sublevel identifications exactly is not completely unexpected. For example, Shin (2013, p. 4) warned that learners' performances may not always indicate their levels of proficiency because language learners may differ in their interactions with the tasks and/or may have unstable language abilities because they are, in fact, learners. Further, it may not be possible to create Can-Do statements that precisely discriminate at the sublevel.
The findings of this study showed significant differences in difficulty between items at the major threshold levels, as expected. But the sublevel identifications were less accurate. This may not be unexpected because in theory the sublevels describe examinee performance between two adjacent levels: barely sustained performance of the floor level is associated with the Low sublevel, strong performance of the floor level with some success at the ceiling level indicates Mid sublevel proficiency, and performance that nearly sustains the ceiling level indicates High proficiency. These theoretical interpretations of Low, Mid, and High suggest that researchers should not expect to find unique differences in the difficulty of items at the sublevels. One might also speculate based on the data reported here that items from the Advanced Low level address topics or experiences that many college students are not familiar with. For example, the most difficult items from this sublevel clustered with the items from the Superior level and addressed topics (Item 42: I can present an explanation for a social or community project or policy) or experiences (Item 41: I can interview for a job or service opportunity related to my field of expertise) for which many college students may not have personal experiences. It is likely that most college students have not interviewed for a job in the target language, making it very difficult for them to accurately judge the difficulty of this and other tasks with which they do not yet have personal experience, as they may not recognize that these tasks require higher level proficiency skills. An alternate explanation is that for this population, the skills described in these Advanced items may represent language use that better describes Superior level language proficiency. The current study has informed the larger Language Flagship Proficiency Initiative by signaling which items in the self-assessment were not good indicators of spoken language proficiency.
For the second phase of this dissertation, the research team revised the misfitting items identified in this analysis for subsequent data collection and analyses. The aim was to create a self-assessment instrument that would elicit item responses from test takers that might better fit the model measuring spoken proficiency. One of the items that included coordinating conjunctions was revised by separating the conjoined skills into distinct statements to be evaluated separately: Item 1 was changed to two items: 1. I can say the day of the week. 2. I can say the date. To address the misfitting items in the assessment instrument that were brief and vague, these items were removed and replaced with more specific statements from the bank of NCSSFL-ACTFL (2015) Can-Dos. Another option for revision would have been to make these misfitting items more specific by including an exemplification (Weir, 2005). For example, Item 15, I can talk about my interests and hobbies, could be modified to give specific examples of topics that a test taker might address when they discuss their interests (e.g., I can talk about my interests, such as sports, music, parties). However, the aim was to include statements that were as true to the original ACTFL (2015) statements as possible. Finally, several misfitting items included in the self-assessment addressed experiences that college students may not have had. To increase the content validity of the revised instrument, these items were replaced with skills that are targeted to the population's typical foreign language experiences (e.g., describing summer plans or talking about weekend activities). The findings of the current study also appear to have informed a revision of the Can-Do Statements by ACTFL and NCSSFL.
Shortly after phase one of this study was fast-tracked to publication in ACTFL's journal, Foreign Language Annals (Tigchelaar, Bowles, Winke, & Gass, 2017), NCSSFL-ACTFL (2017) released a new set of Can-Dos that includes revisions to the previous statements and additional statements targeting intercultural communication.

PHASE II: SPRING 2017 PROFICIENCY TESTING

For the second phase of this dissertation, I revised the self-assessment instrument that was administered in the first phase of the study in collaboration with the Language Flagship research team. The motivation for this revision came from the finding that 14 of the items misfit the model measuring spoken proficiency. In addition, 4% of the test takers from the Spring 2015 round of testing over- or under-assessed their Spanish proficiency using the self-assessment. The same self-assessment was used in Spring 2016, and 9% of the Spanish test takers over- or under-assessed their proficiency. The goal for the revised instrument, then, was to create a more refined self-assessment that might elicit item responses from test takers that would better fit the model measuring spoken proficiency and that would guide even more test takers toward an appropriate OPIc test form. To do this, we kept the items that fit the model in the first phase of the study, removed the misfitting items that were identified, and replaced them with new Can-Do Statements. We then administered the revised self-assessment instrument to test takers who took the Language Flagship Proficiency testing in Spring 2017. In order to compare the results of the second phase of the study to the findings of the first phase, I formulated similar research questions (RQ2 and RQ3). I also added a question (RQ1) to explore in more detail the dimensionality of the measurement model of spoken proficiency. The questions that guided phase two of the study were: 1. How many factors do the Can-Do Statements for spoken proficiency measure?
Can they be used to measure a unitary dimension? The hypothesis, based on the ACTFL Guidelines (2012), is that there is one factor, speaking ability in general. An alternative hypothesis, based on the NCSSFL-ACTFL (2015) Can-Do Statements, is that there are two factors, presentational speaking and interactional communication.
2. Do the individual Can-Do Statements in the revised self-assessment instrument fit ACTFL's (2012) unitary and hierarchical model of spoken language proficiency when used for self-assessment?
2a. If they do not fit, can a reason for the misfit be identified in the content of the statements or in the characteristics of the test takers?
3. Do the difficulty levels of the Can-Do Statements in the revised self-assessment instrument match the hierarchy of the statements' assigned ACTFL (2012) levels and sublevels?

Methods

For the second phase of the study, I used data from the Language Flagship Proficiency testing project that was administered in Spring 2017. Participants. The participants in the second phase of the study were university-level Spanish students who took the revised 50-statement self-assessment and the ACTFL OPIc in Spring 2017. The students were in the first year of university-level Spanish (N=137), the second year (N=265), the third year (N=386), and the fourth year (N=98), for a total of 886 Spanish test takers. Materials. The materials included the revised computer-adaptive self-assessment questionnaire, composed of five testlets, or sets of ten Can-Do Statements, which were organized in the same way as in the first round of testing. This self-assessment included 36 items from the original Spring 2015 assessment that were found to have mean-square fit values that were productive for measurement (i.e., between 0.5 and 1.5) in the first phase of the study. I also selected 14 new items to replace the misfitting items from the first round of testing.
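The mean-square criteria used for item selection can be captured in a small helper. This is an illustrative sketch of the cutoffs named in the text (0.5 to 1.5 productive for measurement; above 2.0 degrading), not code used in the study:

```python
def classify_fit(mnsq):
    """Classify an infit/outfit mean-square statistic using the
    cutoffs cited in the text (cf. Linacre's guidelines).

    Items between 0.5 and 1.5 were retained from the 2015 instrument;
    in phase two, only values above 2.0 were treated as degrading.
    """
    if mnsq > 2.0:
        return "distorts measurement"
    if 0.5 <= mnsq <= 1.5:
        return "productive for measurement"
    return "unproductive, but not distorting"
```

Under these cutoffs, the 36 retained items all fall in the productive band, while the extreme outfit values reported later in this chapter would be flagged as distorting.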
I revised one of the original items that assessed multiple skills (I can say the date and day of the week) into two items (I can say the day of the week and I can say the date). I revised another of the original items by simplifying it (I can describe a place I have visited or want to visit became I can describe a place I have visited). I selected 11 other Can-Do Statements from NCSSFL-ACTFL's (2015) large bank of items that fit the following criteria: items that (a) were specific, (b) described language use relevant to college-level test takers, and (c) included a single language task. For the most part, I kept the original content and structure of the ACTFL Can-Do Statements. However, in addition to the three revised items described above, I eliminated some of the language in the replacement items that was not geared to college students' experiences. This text is shown in brackets in Table 9.

Table 9: 14 misfitting items and their replacements. Each misfitting item was categorized as vague, experience dependent, and/or assessing multiple skills.

Misfitting item: I can say the date and the day of the week. -NL
Replacement: I can say the day of the week. -NL

Misfitting item: I can list the months and seasons. -NL
Replacement: I can say the date. -NL

Misfitting item: I can state my favorite foods and drinks and those I don't like. -NM
Replacement: I can say what someone looks like. -NM

Misfitting item: I can list my classes and tell what time they start and end. -NM
Replacement: I can talk about what I do on the weekends. -NM

Misfitting item: I can describe a place I have visited or want to visit. -IL
Replacement: I can describe a place I have visited. -IL

Misfitting item: I can ask for help at school, work, or in the community. -IL
Replacement: I can describe what my summer plans are. -IL

Misfitting item: I can talk about my daily routine. -IM
Replacement: I can report on a social event that I attended. -IM

Misfitting item: I can talk about my interests and hobbies. -IM
Replacement: I can bring a conversation to a close. -IM

Misfitting item: I can plan an outing with a group of friends. -IH
Replacement: I can explain a series of steps needed to complete a task [or experiment].1 -IH

Misfitting item: I can describe a childhood or past experience. -IM
Replacement: I can give a short presentation on a current event. -IM

Misfitting item: I can explain an injury or illness and manage to get help. -AM
Replacement: I can describe in detail a social event [or local celebration]. -AM

Misfitting item: I can give a presentation about my interests, hobbies, lifestyle, or preferred activities. -IH
Replacement: I can make a presentation about an interesting person. -IH

Misfitting item: I can ask for and provide descriptions of places I know and also places I would like to visit. -IH
Replacement: I can explain to someone who was absent what took place in class [or on the job]. -IH

Misfitting item: I can exchange general information about leisure and travel, such as the world's most visited sites or most beautiful places to visit. -AM
Replacement: I can recount the details of a historical event. -AM

1. Note: The original NCSSFL-ACTFL (2015) statements include the bracketed text. This language was eliminated to reduce the statements to a single topic that is geared toward college students' knowledge or experiences.

Appendix C presents the revised 50-statement questionnaire that appeared in the Spring 2017 self-assessment, as well as information on both the ACTFL sublevels from which the 50 statements were selected and the way in which the statements were arranged into the five testlets. As in the first round of data collection, each testlet increased in difficulty, and the assessment was presented in a computer-adaptive format. After the first set of items, subsequent testlets were presented according to the test takers' self-assessment responses to the previous set of Can-Do Statements. Participants rated more difficult items if they said they could do the majority of the previous set of items.

Procedure. The test takers from 2017 followed the same procedure as those in 2015: They took the revised self-assessment with the 50 Can-Do Statements, followed by the ACTFL OPIc (ACTFL, 2012b).
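The adaptive routing through the testlets, using the pass thresholds described in the Procedure section (advance after endorsing nine of ten statements; eight of ten on the final set), can be sketched as follows. This is a hypothetical re-implementation for illustration, not the actual Flagship testing code:

```python
def recommend_opic_form(testlets):
    """Route a test taker through the five 10-item testlets.

    testlets: list of five lists of ratings (1-4), one list per
    testlet, in order of increasing difficulty. A rating of 4
    ("I can do this") counts as being able to do the statement well.
    Returns the recommended OPIc form (1-5).
    """
    for level, ratings in enumerate(testlets, start=1):
        done_well = sum(1 for r in ratings if r == 4)
        if level < 5:
            if done_well < 9:      # fewer than 9 of 10 done well:
                return level       # take this level's OPIc
            # otherwise continue to the next, harder testlet
        else:
            # final testlet uses an 8-of-10 threshold
            return 5 if done_well >= 8 else 4
```

For example, a test taker who endorses every statement in the first two testlets but only half of the third would be routed to OPIc form 3.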
If a learner indicated that he or she could not do nine or more of the ten Can-Do Statements in a set well, he or she was recommended to take that level's OPIc. If the learner indicated he or she could do nine or more of the ten Can-Do Statements well, he or she moved on to the next set of ten Can-Dos. At level 5, if the learner indicated he or she could do eight out of the ten very well, he or she was recommended to take level 5; otherwise, he or she was recommended to take level 4. Table 10 shows the number of participants who completed each of the levels of the self-assessment questionnaire and how many students were recommended to take which levels.

Table 10: Number of participants who completed each level of the self-assessment

Questionnaire statements  Corresponding OPIc level  N of test takers who responded at that level  N recommended to take that level of OPIc
1-10                      1                         886                                           501
11-20                     2                         385                                           228
21-30                     3                         157                                            52
31-40                     4                         105                                            70
41-50                     5                          73                                            35

The OPIcs were rated by certified ACTFL raters according to the ACTFL (2012) proficiency guidelines for speaking. Of the 886 Spanish test takers, 63 (7%) did not receive an OPIc rating because they either over-assessed (i.e., took a test that was too difficult; n=59) or under-assessed (i.e., took a test that was too easy; n=4) their speaking ability. Thus, 823 (93%) received OPIc ratings, which were distributed as shown in Figure 5. The distribution by class level is in Table 11.

Figure 5: Distribution of OPIc ratings in Spring 2017. The numeric OPIc ratings from 1-9 represent the range of ACTFL proficiency levels from Novice Low to Advanced High, and Above Range and Below Range are represented by -1 and 0, respectively.

Table 11: Distribution of 2017 OPIc ratings by class level

OPIc Rating                102 (N=137)  202 (N=265)  300-level (N=386)  400-level (N=98)
Above Range (N=4)           0            1             3                  0
Below Range (N=59)          3            7            36                 13
Novice Low (N=10)           9            1             0                  0
Novice Mid (N=64)          51           13             0                  0
Novice High (N=136)        61           50            22                  3
Intermediate Low (N=237)   10          107           104                 16
Intermediate Mid (N=255)    3           78           149                 25
Intermediate High (N=92)    0            3            60                 29
Advanced Low (N=21)         0            5             8                  8
Advanced Mid (N=7)          0            0             3                  4
Advanced High (N=1)         0            0             1                  0

Data Analysis. I used two statistical procedures for the second round of data analysis. I analyzed the test takers' item responses using a Rasch model and assessed the items' difficulty estimates, dimensionality, and fit to the model. The Rasch model hypothesizes that person ability and item difficulty can be measured on a unidimensional scale (Embretson & Reise, 2000; McNamara, 1991; Wright, 1991). The item response analysis of item fit and dimensionality represents a test of this hypothesis in relation to the data. To further explore the specification of item response modeling that the items in the assessment can measure a single construct, I also conducted an exploratory factor analysis (EFA).

Exploratory factor analysis (EFA). EFA allows researchers to assess the underlying structure of a test with no a priori hypothesis of how the items group into factors. The model identifies how the items form factors, or underlying latent traits, and estimates the strength of association of each item to a factor. To address the question of the dimensionality of the self-assessment instrument, I ran an EFA on the test takers' responses to the revised set of 50 items using Mplus Version 8 (Muthén & Muthén, 2017). Because of the adaptive nature of the assessment, many of the test takers did not provide item responses for all of the items. Given the amount of missing data, I followed Muthén and Muthén's (2017) suggestion to conduct the analysis using full information maximum likelihood (FIML) estimation. If there is more than one factor within the construct of speaking, I assumed that the factors would be strongly correlated (and when factors are believed to be correlated, they are called oblique).
So, based on that theoretical assumption, I performed the EFA using Geomin rotation, which is one of the standard oblique rotation methods available in Mplus. Data are rotated in EFA to "further analyze initial… EFA results with the goal of making the pattern of loadings clearer, or more pronounced" (Brown, 2009, p. 20), and researchers have the choice of oblique rotations or orthogonal rotations, depending on whether the factors are assumed to be correlated or not, respectively. I performed the EFA with data from 1,268 observations: 886 test takers' item responses to the 50-item, revised self-assessment from the second round of data collection, plus 382 test takers' item responses from the first round of data collection to the 36 items that were common to both datasets. I considered factor loadings of .45 to be fair, loadings of .55 to be good, and loadings of .71 and above to be excellent (Comrey & Lee, 1992).

Rasch analysis. As in the first phase of the study, I analyzed the test takers' item responses to the self-assessment questionnaire using a Rasch rating scale model (Rasch, 1960/80; Andrich, 1978). To evaluate whether the self-assessment items used in the second phase of the study are useful for measuring spoken language proficiency, I tested the hypotheses of dimensionality and item fit to the Rasch model. In the second data analysis, I flagged as problematic only items that had fit statistic values greater than 2.0, since these items could be degrading to the measurement system. Because the objective of this second data analysis was no longer construction of the assessment but quality control, this more liberal cutoff was adopted. To answer the third research question, I evaluated the model by conducting a difficulty analysis of the items to compare the item difficulty estimated by the Rasch model and the ACTFL proficiency level associated with these items.
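For reference, the Andrich (1978) rating scale model underlying these analyses can be written in terms of the log-odds of responding in rating category k rather than k−1, where θ_n is person n's ability, δ_i is item i's difficulty, and τ_k is the threshold between adjacent categories (shared across items in the rating scale model):

```latex
\ln\!\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - \delta_i - \tau_k
```

The infit and outfit statistics discussed below summarize how far observed responses deviate from this model's expectations.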
I plotted the item difficulties on a Wright map and calculated the mean item difficulty for items at each threshold level (i.e., Novice, Intermediate, Advanced, Superior) and sublevel (e.g., Novice Low, Novice Mid, Novice High) to determine whether the items ascended in the order of difficulty predicted by the ACTFL scale.

Results

Factor analysis. To answer the first research question regarding the factor structure of the Can-Do Statements included in the revised self-assessment, I began by performing an EFA. Even though the ACTFL (2012) Guidelines "are not based on any particular theory" (p. 3) of language proficiency, the model of language proficiency represented by the ACTFL pyramid (see Figure 1) and by the language of the ACTFL (2012) Guidelines strongly suggests that spoken language proficiency grows hierarchically through time and that speaking represents a single construct. The Can-Do Statements, however, appear to indicate that there are at least two dimensions of speaking ability within the larger construct of speaking ability in general. The 2015 statements are separated into two areas of speaking performance: presentational speaking and interpersonal communication. The 2017 statements include an intercultural communicative competence dimension. Thus, according to ACTFL, speaking is either a unitary construct, a general construct with two prominent sub-categories, or a multidimensional construct with at least two major categories. Therefore, I chose an EFA to explore how many dimensions might be represented by the data I collected. I also chose this method because I wanted to check the assumption of unidimensionality for using Rasch analysis, since unidimensionality is a requirement for data to be useful for latent trait measurement (Wright, 1991). In my analyses, both a one-factor model and a two-factor model converged (meaning that both are possible models).
The first, unidimensional model of speaking accounted for 19.14% of the variance in self-assessment scores. The second, two-dimensional model of speaking accounted for 34.08% of the variance: 19.14% in factor 1 and 14.94% in factor 2. The eigenvalues for each of the factors (used to calculate the variance) are shown in the scree plot in Figure 6. Although these models leave a lot of unexplained variance, previous studies have shown that "when item-level data are factor analyzed, it is not uncommon to see a relatively low percentage of variance accounted for by the first factor" (Young et al., 2008, p. 177). This is the case even when a one- or two-factor solution is selected as the most parsimonious model. Inspection of the factor loadings in Table 12 shows that the one-factor solution had excellent factor loadings (i.e., greater than .7; Comrey & Lee, 1992) that were all statistically significant. The two-factor model, on the other hand, included some items that had significant cross-loadings on both factors. All of the items from the Novice Low to Intermediate Low levels loaded onto the first factor. The second factor included the most difficult items in the analysis (i.e., the Advanced Mid to Superior level items). Items from the Intermediate Mid to Advanced Mid proficiency levels were less clear: Items 26, 29, 32, 34-36, and 38 had significant factor loadings that were fair (> .45) or good (> .55) on both factors. The remainder of the items from these Intermediate/Advanced proficiency levels were split between factors one and two. This pattern of factor loadings may be consistent with a difficulty factor; the pattern could also be interpreted as two factors that represent basic or core language proficiency at the lower levels and academic speaking at the upper levels (Cummins, 1979; Hulstijn, 2007). This was a rather unexpected finding, as ACTFL does not distinguish between these constructs in its model.
In other words, two hierarchical factors are not directly represented by the ACTFL pyramid. Further, the content of the items did not form any of the proposed distinct patterns of language use (e.g., presentational versus interactional speaking; description versus argumentation). If researchers want to keep the theoretical model of speaking based on the ACTFL Guidelines, the clear pattern of the loadings in the one-factor model is preferable. The second, two-factor model challenges the commonly accepted, ACTFL-based model of speaking proficiency as represented by the pyramid: It suggests that speaking at the lower levels of proficiency is qualitatively (and measurably) different (i.e., a different construct) from speaking at the upper levels of proficiency, at least for the college-level language learners in this study. This interpretation does not negate the ACTFL model, but it does suggest that the model could be refined; I will discuss this challenge further in the discussion section of this dissertation.

Figure 6: Scree plot of item-level EFA.

Table 12: Factor loadings for the 1-factor and 2-factor models

Item 1. I can say the date. -NL 2. I can say the day of the week. -NL 3. I can say which sports I like and don't like. -NM 4. I can list my favorite free-time activities and those I don't like. -NM 5. I can say what someone looks like. -NM 1-factor model: 0.723* 0.740* 0.782* 0.860* 0.819* 6. I can talk about my school or where I work. -NM 0.878* 7. I can talk about my room or office and what I have in it. -NM 0.832* 8. I can talk about what I do on the weekends. -NM 0.901* 9. I can answer questions about where I'm going or where I went. -NM 0.844* 2-factor model: Factor 1 Factor 2 0.567* 0.385* 0.502* 0.533* 0.754* 0.087 0.884* -0.032 0.831* 0.004 0.937* -0.111 0.968* -0.249* 1.031* -0.267* 0.963* -0.238 10. I can present information about something I learned in a class or at work. -NH 11. I can describe a school or workplace. -IL 12.
I can describe a place I have visited. -IL 0.798* 0.911* -0.218 0.776* 0.863* 1.040* -0.298* 1.008* -0.169 13. I can describe what my summer plans are. -IL 14. I can report on a social event that I attended. -IM 15. I can bring a conversation to a close. -IM 16. I can schedule an appointment. -IM 17. I can talk about my family history. -IH 18. I can explain a series of steps needed to complete a task. -IH 19. I can explain why I was late to class or absent from work and arrange to make up the lost time. -AL 20. I can tell a friend how I'm going to replace an item that I borrowed and broke/lost. -AL 21. I can give some information about activities I did. -IM 22. I can talk about my favorite music, movies, and sports. -IM 23. I can give a short presentation on a current event. -IM 24. I can ask for and follow directions to get from one place to another. -IH 25. I can return an item I have purchased to a store. -IH 26. I can arrange for a makeup exam or reschedule an appointment. -IH 27. I can present an overview about my school, community, or workplace. -AL 28. I can compare different jobs and study programs in a conversation with a peer. -AL 29. I can discuss future plans, such as where I want to live and what I will be doing in the next few years. -AM 30. I can describe in detail a social event. -AM 31. I can present ideas about something I have learned, such as a historical event, a famous person, or a current environmental issue. -IH 32. I can make a presentation about an interesting person. -IH 33. I can explain to someone who was absent what took place in class. -IH 34. I can explain how life has changed since I was a child and respond to questions on the topic. -AL 35. I can discuss what is currently going on in another community or country. -AL 36. I can provide a rationale for the importance of certain classes, subjects, or training programs. -AL 37.
I can talk about present challenges in my school or work life, such as paying for classes or dealing with difficult colleagues. -AM 0.876* 0.906* 0.853* 0.898* 0.881* 0.943* 0.960* 0.862* 0.030 0.841* 0.096 0.831* 0.036 0.787* 0.181 0.768* 0.168 0.773* 0.266 0.782* 0.288 0.939* 0.803* 0.237 0.931* 0.860* 0.955* 0.902* 0.214 0.754* 0.279 0.568* 0.703* 0.367* 0.647* 0.371* 0.917* 0.941* 0.735* 0.302 0.586* 0.465* 0.945* 0.652* 0.423* 0.961* 0.647* 0.436* 0.929* 0.549* 0.471* 0.978* 0.955* 0.655* 0.436* 0.287 0.654* 0.981* 0.904* 0.608* 0.499* 0.750* 0.323 0.974* 0.583* 0.524* 0.961* 0.501* 0.594* 0.978* 0.534* 0.585* 0.973* 0.327 0.725* 38. I can recount the details of a historical event. -AM 39. I can give a presentation about cultural influences on society. -AH 40. I can participate in conversations on social or cultural questions relevant to speakers of this language. -AH 0.975* 0.985* 0.483* 0.611* 0.318 0.746* 0.973* 0.281 0.728* 41. I can interview for a job or service opportunity related to my field of expertise. -AL 0.984* 0.315 0.756* 42. I can present an explanation for a social or community project or policy. -AL 43. I can present reasons for or against a position on a political or social issue. -AL 44. I can give a clear and detailed story about childhood memories, such as what happened during vacations or memorable events, and answer questions about my story. -AM 45. I can exchange general information about my community, such as demographic information and points of interest. -AM 46. I can exchange factual information about social and environmental questions, such as retirement, recycling, or pollution. -AM 47. I can usually defend my views in a debate. -AH 48. I can exchange complex information about my academic studies, such as why I chose the field, course requirements, projects, internship opportunities, and new advances in my field. -AH 49. I can provide a balance of explanations and examples on a complex topic. -S 50.
I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. -S Geomin factor correlations 0.993* 0.069 0.923* 0.992* 0.069 0.923* 0.993* 0.005 0.949* 0.992* -0.008 0.944* 0.993* -0.093 0.997* 0.997* 0.991* 0.153 -0.32 0.897* 1.078* 0.996* -0.202 1.072* 0.997* -0.086 1.024* 1.00 .520* Note: * = significant at the .05 level; good factor loadings (> 0.55) are highlighted in bold.

Fit to the Rasch model. The second research question addressed whether the individual Can-Do Statements fit ACTFL's model of spoken language proficiency when used for self-assessment. The question was answered by evaluating the dimensionality and fit to the model of the 50 Can-Do self-assessment statements describing spoken language proficiency. The initial Rasch rating scale model had a person reliability of .96, indicating that the self-assessment instrument discriminated well between test takers of varying proficiency levels. To re-examine the assumption of unidimensionality, the model was analyzed using a Principal-Components Analysis (PCA) of the residuals. The aim of the PCA was to further explore whether the instrument under consideration was measuring multiple dimensions. The Rasch dimension explained 53.8% of the variance, while the largest secondary dimension, or first contrast in the residuals, explained 3.3% of the variance. The eigenvalue of this first contrast was 3.80, or the strength of approximately four items. The disattenuated correlation for person measures was .452. Because this value is much lower than 1.00, I had some concerns that the instrument might be multidimensional. Further, the contrast plot (shown in Appendix D) showed a group of five outlying items at the top of the plot. Inspection of the items showed that these statements were from the final set of Can-Dos (Items 44, 46, 48, 49, 50).
One possibility, as described above, is that these items might be measuring an academic or higher-order speaking dimension, at least with this group of test takers. One of the items in the first contrast requires speakers to give a clear and detailed story about childhood memories, which does not necessarily describe academic speaking, but may still fit within a dimension that goes beyond a core language proficiency in speaking (Hulstijn, 2007). Beyond difficulty, these items share little in common: They represented both the interactional and presentational modes of speaking and covered different topics (also shown in Appendix D). Thus, these items may form an interpretable secondary dimension, but this dimension is not what is predicted by ACTFL. Next, I evaluated the item fit to the Rasch model, considering items that are productive for the construction of measurement (i.e., with fit statistics between 0.5 and 1.5) and looking for items that would distort the measurement (i.e., items with fit statistics > 2.0). In addition, I considered (a) the items that were selected to replace the misfitting items from the Spring 2015 self-assessment and (b) the items that were revised from the original assessment for the Spring 2017 version. Analysis of a Rasch model of the Spring 2017 test takers' responses revealed seven items that had very large outfit values, which indicates the items were not working (a mean-square of 1.0 and a standardized value of zero represent perfect fit, and large departures from these values indicate misfit), shown in Table 13. Inspection of the content of these items showed that two items (3 and 4) described likes and dislikes in the same statement, which could have distorted the measurement. Item 35, I can discuss what is currently going on in another community or country, stands out because this statement likely relies more on intercultural competence than it does on spoken proficiency, and also because the two objects (community and country) are so vastly different.
Test takers thinking of one or the other may have extremely different responses, which would contribute to item noise, or the item's inability to measure one thing well.

Table 13: Misfitting items from the initial 2017 Rasch model

Statement                                                                          Difficulty estimate (S.E.)  Infit MNSQ (z-std)  Outfit MNSQ (z-std)
3. I can say which sports I like and don't like. (NM)                              -5.51 (.10)                 1.06 (0.8)          9.90 (9.9)
4. I can list my favorite free time activities and those I don't like. (NM)        -5.15 (.09)                 0.78 (-3.7)         4.50 (9.9)
5. I can say what someone looks like. (NM)                                         -3.89 (.08)                 0.94 (-1.0)         3.35 (9.1)
7. I can talk about my room or office and what I have in it. (NM)                  -2.64 (.07)                 0.92 (-1.6)         5.32 (9.9)
11. I can describe a school or workplace. (IL)                                     -2.34 (.14)                 1.19 (2.0)          9.90 (9.9)
17. I can talk about my family history. (IH)                                        0.07 (.11)                 1.17 (2.0)          4.51 (9.7)
35. I can discuss what is currently going on in another community or country. (AL)  1.79 (.26)                 1.11 (0.7)          2.07 (2.0)

Deleting these items resulted in more misfitting items (unlike the analysis in phase one of the study, where deleting misfitting items resulted in the remainder of the items having fit values within the acceptable range). After deleting the above items, seven more items had fit values greater than 2.0, shown in Table 14. Inspection of this model's residuals revealed an even weaker disattenuated correlation of .151. Thus, deletion of all the items that initially misfit the model does not seem to result in a set of items that all contribute to the measurement of spoken proficiency.

Table 14: More misfitting items

Statement                                                                          Difficulty estimate (S.E.)  Infit MNSQ (z-std)  Outfit MNSQ (z-std)
2. I can say the day of the week.                                                  -7.78 (.13)                 1.16 (2.5)          2.32 (4.0)
10. I can present information about something I learned in a class or at work.     -3.06 (.07)                 1.04 (0.7)          3.35 (9.9)
16. I can schedule an appointment.                                                 -0.80 (.11)                 2.06 (3.7)          0.86 (-1.7)
19. I can explain why I was late to class or absent from work and arrange to make up the lost time.  -1.49 (.12)  1.00 (0.1)      2.51 (4.5)
20. I can tell a friend how I'm going to replace an item that I borrowed and broke or lost.          -0.10 (.11)  1.01 (.2)       3.25 (7.1)
24. I can ask for and follow directions to get from one place to another.          -0.26 (.23)                 1.06 (0.5)          3.88 (3.4)

These findings suggest that the misfit may not be a result of the characteristics of the items, or at least not of the items alone. The magnitude of the outfit values highlights that the items received unexpected item responses from test takers whose ability was far from the item difficulty level. For example, Items 3-5 (Novice Mid level items with very low difficulty estimates) likely received item responses from high-ability test takers who said they could not do these things, which is unexpected. Inspection of the person fit to the model showed that this was in fact the case, as thirteen test takers had outfit mean-square values of 9.90. The item responses of these test takers show that the majority of these people rated all 50 Can-Do Statements, which suggests that they have high spoken proficiency levels. Each of the response strings of these misfitting people (shown in Table 15) includes at least one statement from either the first form (statements 1-10) or second form (statements 11-20) that was highly unexpected, as indicated by Winsteps. For example, test taker 698, who received an Advanced Mid proficiency rating, indicated that they would need some help to list favorite free time activities (Item 4). This rating is indeed quite unexpected, because listing favorite free time activities should be something that someone with Advanced proficiency can do independently and easily.
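The leverage that a single wildly unexpected response exerts on outfit (but much less on infit) can be illustrated with a minimal sketch. For simplicity this uses the dichotomous Rasch model rather than the rating scale model from the study, and the function names are illustrative:

```python
import math

def rasch_p(theta, b):
    """P(success) under the dichotomous Rasch model, a simplification
    of the rating scale model used in the study."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def infit_outfit(responses):
    """responses: list of (theta, b, x) tuples with x in {0, 1}.

    Returns (infit MNSQ, outfit MNSQ). Both have an expected value of
    1.0 under the model; outfit is an unweighted mean of squared
    standardized residuals, so it is highly sensitive to responses
    far from a person's ability level.
    """
    z2, variances, raw2 = [], [], []
    for theta, b, x in responses:
        p = rasch_p(theta, b)
        v = p * (1.0 - p)            # model variance of the response
        z2.append((x - p) ** 2 / v)  # squared standardized residual
        raw2.append((x - p) ** 2)    # squared raw residual
        variances.append(v)
    outfit = sum(z2) / len(z2)       # unweighted mean
    infit = sum(raw2) / sum(variances)  # information-weighted mean
    return infit, outfit

# A high-ability person (theta = 3) failing one very easy item
# (b = -4) while succeeding on moderate items inflates outfit
# far more than infit.
infit_mnsq, outfit_mnsq = infit_outfit(
    [(3.0, -4.0, 0)] + [(3.0, 0.0, 1)] * 9
)
```

A single such response is enough to push outfit well past the 2.0 cutoff, which is the pattern seen for the misfitting persons in Table 15.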
All of the items that misfit in the original model (except Item 35) are among these most unexpected item responses from the test takers who had very high fit statistics. These responses appear in the response strings (which represent ratings from 1-4 for each of the 50 items in the self-assessment), and the item numbers of the most unexpected responses are shown in parentheses after each string.

Table 15: Misfitting people, ratings and response strings (and most unexpected responses)

Person  Outfit MNSQ (z-std)  OPIc form  Levels assessed  Proficiency rating  Response string (most unexpected items)
715     9.90 (3.4)           5          AM - S           BR                  44444444444444443444444444444444444444444444444444 (17)
512     9.90 (5.1)           5          AM - S           BR                  44444434444444444443444433444444444444443444444443 (7)
257     9.90 (9.9)           4          IH - AM          BR                  44344444444444444433444444444444443444443334434434 (3)
185     9.90 (7.5)           5          AM - S           BR                  44444434444444444443444344444444444444444444444444 (7)
149     9.90 (3.4)           5          AM - S           BR                  44444444444444443444444444444444444444444444444444 (17)
652     9.90 (9.9)           5          AM - S           AR                  44444444443444444444444444444444444444444444444444 (11)
119     9.90 (4.7)           5          AM - S           IH                  44444444434444434444443444444434444444443444444444 (10, 16)
509     9.90 (9.9)           4          IH - AM          IM                  44344444444444444444444444434444443443443434433433 (3)
698     9.90 (9.9)           4          IH - AM          AM                  44434444444444444443444444334444444443443334443343 (4)
616     9.90 (7.1)           4          IH - AM          IH                  44443434444444444444444444444434444344443334344443 (5, 7)
065     9.90 (4.2)           4          IH - AM          IH                  4444343444444444444444443444444343344444 (5, 7)
006     9.90 (9.4)           4          IH - AM          IH                  44344434444444434444444444443444444444443333332233 (3, 7)
237     9.90 (6.2)           3          IM - AL          IM                  434444444444443444444433333344 (2)

Note: the most unexpected item responses are indicated in parentheses.

Of these people, five (715, 512, 257, 185, and 149) did not receive a rating (BR) because they took an OPIc test that was too difficult for them.
Interestingly, two test takers (119 and 509) selected an OPIc form that assessed proficiency levels that were above their final rating: Person 119 selected form 5, which is designed to elicit speech at the Advanced Mid to Superior levels, and was rated as Intermediate High (i.e., two levels below what the form is designed to measure). Person 509 selected form 4, which is designed to elicit speech at the Intermediate High to Advanced Mid levels, and was rated as Intermediate Mid. These seven test takers over-assessed their spoken proficiency using the self-assessment instrument. One highly misfitting test taker under-assessed their ability: They selected the highest OPIc form after saying they could do nearly all of the statements in the self-assessment. However, the ACTFL rating indicated that their proficiency was higher than form 5 could assess (i.e., higher than Superior-level proficiency). In fact, this test taker was a heritage speaker or a native speaker of Spanish (they spoke Spanish at home), as I found out after inquiring about the student's background (without personal identifiers) from the Flagship team. Five other test takers had extreme outfit values and received official ACTFL ratings that corresponded to the OPIc test forms that they selected. Their proficiency ratings ranged from Intermediate Mid to Advanced Mid. Interestingly, none of the highly misfitting test takers selected a test form that tested the lowest four proficiency levels (i.e., Novice Low to Intermediate Low). Deletion of these highly misfitting test takers (outfit MNSQ of 9.90) and Item 35 (which seems to rely more on intercultural competence than on spoken proficiency) resulted in a final model that had a person reliability of .96, with evidence of unidimensionality, and fewer misfitting items. With these deletions, the Rasch dimension explained 57.7% of the variance (an increase from 53.8%), and the disattenuated correlation between measures increased to .86.
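For reference, a disattenuated correlation is the observed correlation between two measures corrected for their unreliability (Spearman's correction for attenuation). A minimal sketch, with reliability values invented purely for illustration (they are not the study's):

```python
import math

def disattenuated_r(r_observed, reliability_x, reliability_y):
    """Spearman's correction for attenuation: the correlation two
    measures would show if both were perfectly reliable."""
    return r_observed / math.sqrt(reliability_x * reliability_y)

# Hypothetical example: an observed correlation of .77 between two
# measures that each have reliability .90 disattenuates to about .86.
r_true = disattenuated_r(0.77, 0.90, 0.90)
```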
Further, there was no longer an obvious cluster of items at the top of the contrast plot (shown in Appendix 5), which lends support for a unidimensional model of spoken proficiency. The final model (shown in Table 16) still includes some misfitting items, indicating that scale improvement beyond a certain point may not be possible with these data. This may not be surprising given that this is a self-assessment, which I will discuss in the discussion section.

Table 16: Final model fit statistics for the revised self-assessment questionnaire

Item | Estimate (S.E.) | Infit MNSQ (z-std) | Outfit MNSQ (z-std)
1. I can say the date. -NL | -5.61 (.09) | 1.18 (2.8) | 1.30 (1.5)
2. I can say the day of the week. -NL | -7.26 (.13) | 1.29 (3.1) | 0.67 (-1.4)
3. I can say which sports I like and don’t like. -NM | -6.19 (.10) | 1.06 (0.9) | 0.84 (-0.7)
4. I can list my favorite free time activities and those I don’t like. -NM | -5.80 (.09) | 0.79 (-3.6) | 0.58 (-2.5)
5. I can say what someone looks like. -NM | -4.49 (.08) | 0.97 (-0.6) | 1.50 (2.8)
6. I can talk about my school or where I work. -NM | -4.77 (.08) | 0.77 (-4.4) | 0.57 (-3.0)
7. I can talk about my room or office and what I have in it. -NM | -3.20 (.07) | 0.92 (-1.4) | 1.00 (0.1)
8. I can talk about what I do on the weekends. -NM | -4.93 (.08) | 0.74 (-5.0) | 0.58 (-2.7)
9. I can answer questions about where I’m going or where I went. -NM | -3.93 (.08) | 0.90 (-1.9) | 0.89 (-0.7)
10. I can present information about something I learned in a class or at work. -NH | -2.69 (.07) | 1.14 (2.6) | 1.45 (3.5)
11. I can describe a school or workplace. -IL | -2.80 (.14) | 1.21 (2.2) | 2.56* (4.2)
12. I can describe a place I have visited. -IL | -2.87 (.15) | 0.98 (-0.2) | 0.75 (-0.9)
13. I can describe what my summer plans are. -IL | -3.05 (.15) | 1.09 (0.9) | 0.74 (-0.9)
14. I can report on a social event that I attended. -IM | -1.77 (.12) | 0.93 (-0.9) | 0.80 (-0.8)
15. I can bring a conversation to a close. -IM | -2.16 (.13) | 1.06 (0.7) | 0.79 (-0.9)
16. I can schedule an appointment. -IM | -0.54 (.11) | 0.98 (-0.2) | 1.77 (3.1)
17. I can talk about my family history. -IH | -0.30 (.11) | 1.23 (2.6) | 2.60* (5.5)
18. I can explain a series of steps needed to complete a task. -IH | -0.94 (.12) | 1.08 (0.9) | 1.20 (1.0)
19. I can explain why I was late to class and arrange to make up the lost time. -AL | -1.17 (.12) | 0.90 (-1.3) | 0.70 (-1.4)
20. I can tell a friend how I’m going to replace an item that I borrowed and broke/lost. -AL | 0.16 (.11) | 1.03 (0.4) | 2.45* (5.2)
21. I can give some information about activities I did. -IM | -2.08 (.47) | 0.94 (0.0) | 0.39 (-1.0)
22. I can talk about my favorite music, movies, and sports. -IM | -1.72 (.40) | 0.97 (0.0) | 0.54 (-0.6)
23. I can give a short presentation on a current event. -IM | 0.77 (.22) | 1.00 (0.0) | 1.77 (1.4)
24. I can ask for and follow directions to get from one place to another. -IH | -0.11 (.25) | 1.07 (0.5) | 1.58 (1.1)
25. I can return an item I have purchased to a store. -IH | 1.25 (.21) | 1.01 (0.2) | 2.64* (2.6)
26. I can arrange for a makeup exam or reschedule an appointment. -IH | 0.23 (.23) | 1.06 (0.5) | 0.73 (-0.4)
27. I can present an overview about my school, community, or workplace. -AL | -0.36 (.26) | 0.78 (-1.4) | 0.46 (-1.1)
28. I can compare different jobs and study programs in a conversation with a peer. -AL | 1.19 (.21) | 0.84 (-1.2) | 0.60 (-0.8)
29. I can discuss future plans, such as where I want to live and what I will be doing in the next few years. -AM | -0.30 (.26) | 1.02 (0.2) | 1.05 (0.3)
30. I can describe in detail a social event. -AM | 0.63 (.22) | 0.74 (-2.3) | 0.60 (-0.7)
31. I can present ideas about something I have learned, such as a historical event, a famous person, or a current environmental issue. -IH | 2.34 (.29) | 1.02 (0.2) | 2.96* (2.5)
32. I can make a presentation about an interesting person. -IH | 0.43 (.41) | 0.71 (-1.1) | 0.31 (-0.7)
33. I can explain to someone who was absent what took place in class. -IH | -0.16 (.49) | 1.10 (0.4) | 1.43 (0.7)
34. I can explain how life has changed since I was a child and respond to questions on the topic. -AL | 0.99 (.35) | 0.90 (-0.4) | 0.58 (-0.3)
36. I can provide a rationale for the importance of certain classes, subjects, or training programs. -AL | 2.09 (.29) | 1.03 (0.3) | 2.03 (1.5)
37. I can talk about present challenges in my school or work life, such as paying for classes or dealing with difficult colleagues. -AM | 1.11 (.34) | 0.86 (-0.6) | 0.91 (0.2)
38. I can recount the details of a historical event. -AM | 3.10 (.27) | 1.15 (0.8) | 1.89 (2.0)
39. I can give a presentation about cultural influences on society. -AH | 2.19 (.29) | 0.92 (-0.4) | 1.01 (0.2)
40. I can participate in conversations on social or cultural questions relevant to speakers of this language. -AH | 2.81 (.27) | 1.17 (0.9) | 1.47 (1.1)
41. I can interview for a job or service opportunity related to my field of expertise. -AL | 5.54 (.32) | 1.00 (0.1) | 1.12 (0.6)
42. I can present an explanation for a social or community project or policy. -AL | 4.89 (.34) | 0.75 (-1.1) | 0.68 (-1.2)
43. I can present reasons for or against a position on a political or social issue. -AL | 5.74 (.32) | 0.91 (-0.3) | 0.89 (-0.4)
44. I can give a clear and detailed story about childhood memories, such as what happened during vacations or memorable events, and answer questions about my story. -AM | 3.40 (.41) | 1.05 (0.3) | 0.70 (-0.4)
45. I can exchange general information about my community, such as demographic information and points of interest. -AM | 3.56 (.40) | 0.99 (0.1) | 0.76 (-0.3)
46. I can exchange factual information about social and environmental questions, such as retirement, recycling, or pollution. -AM | 5.54 (.32) | 0.91 (-0.3) | 0.85 (-0.6)
47. I can usually defend my views in a debate. -AH | 5.54 (.32) | 0.62 (-1.9) | 0.60 (-1.9)
48. I can exchange complex information about my academic studies, such as why I chose the field, course requirements, projects, internship opportunities, and new advances in my field. -AH | 5.01 (.34) | 1.14 (0.7) | 1.24 (0.9)
49. I can provide a balance of explanations and examples on a complex topic. -S | 5.54 (.32) | 0.78 (-1.0) | 0.75 (-1.1)
50. I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. -S | 5.22 (.33) | 0.63 (-1.8) | 0.55 (-2.0)

Replacement items. The fourteen items that were selected to replace the misfitting items identified in Spring 2015, and their fit statistics from the second round of testing, are presented in Table 17. These items were selected preferentially so that they were specific (avoiding statements that were vague to the extent possible), included a single spoken task (avoiding the description of multiple skills in the same item), and addressed language use that college-level test takers would have experience with.

Table 17: Fit statistics for the fourteen replacement items included in the revised Spring 2017 assessment

Item | Difficulty estimate (S.E.) | Infit MNSQ (z-std) | Outfit MNSQ (z-std)
1. I can say the date. -NL | -5.61 (.09) | 1.18 (2.8) | 1.30 (1.5)
2. I can say the day of the week. -NL | -7.26 (.13) | 1.29 (3.1) | 0.67 (-1.4)
5. I can say what someone looks like. -NM | -4.49 (.08) | 0.97 (-0.6) | 1.50 (2.8)
8. I can talk about what I do on the weekends. -NM | -4.93 (.08) | 0.74 (-5.0) | 0.58 (-2.7)
12. I can describe a place I have visited. -IL | -2.87 (.15) | 0.98 (-0.2) | 0.75 (-0.9)
13. I can describe what my summer plans are. -IL | -3.05 (.15) | 1.09 (0.9) | 0.74 (-0.9)
14. I can report on a social event that I attended. -IM | -1.77 (.12) | 0.93 (-0.9) | 0.80 (-0.8)
15. I can bring a conversation to a close. -IM | -2.16 (.13) | 1.06 (0.7) | 0.79 (-0.9)
18. I can explain a series of steps needed to complete a task. -IH | -0.94 (.12) | 1.08 (0.9) | 1.20 (1.0)
23. I can give a short presentation on a current event. -IM | 0.77 (.22) | 1.00 (0.0) | 1.77 (1.4)
30. I can describe in detail a social event. -AM | 0.63 (.22) | 0.74 (-2.3) | 0.60 (-0.7)
32. I can make a presentation about an interesting person. -IH | 0.43 (.41) | 0.71 (-1.1) | 0.31 (-0.7)
33. I can explain to someone who was absent what took place in class. -IH | -0.16 (.49) | 1.10 (0.4) | 1.43 (0.7)
38. I can recount the details of a historical event. -AM | 3.10 (.27) | 1.15 (0.8) | 1.89 (2.0)

Twelve of the fourteen replacement items had fit statistics that fell between 0.5 and 1.5, indicating that they are productive for measuring spoken language proficiency. Two items (23 and 38) had fit values that were between 1.5 and 2.0, indicating that these items are unproductive for the construction of measurement, but not degrading to the instrument.

Revised items. After the first round of testing, two items were revised and reassessed by test takers in the second round of testing. These items and their fit statistics are shown in Table 18.

Table 18: Fit statistics for original and revised items

Item | Difficulty estimate (S.E.) | Infit MNSQ (z-std) | Outfit MNSQ (z-std)
Original (2015): 1. I can say the date and day of the week. | -4.68 (.14) | 1.14 (1.4) | 1.94* (2.4)
Revision (2017): 1. I can say the date. | -5.61 (.09) | 1.18 (2.8) | 1.30 (1.5)
Revision (2017): 2. I can say the day of the week. | -7.26 (.13) | 1.29 (3.1) | 0.67 (-1.4)
Original (2015): 12. I can describe a place I have visited or want to visit. | -1.87 (.26) | 0.89 (-0.5) | 0.45* (-1.0)
Revision (2017): 12. I can describe a place I have visited. | -2.87 (.15) | 0.98 (-0.2) | 0.75 (-0.9)

The first and easiest item, I can say the date and day of the week, misfit the model (outfit MNSQ = 1.94) in the first round of testing. While this value does not indicate that the item would be degrading for measurement, it is likely not productive for the construction of a test measuring spoken proficiency (Wright & Linacre, 1994).
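The interpretive bands applied above to mean-square fit values (cited from Wright & Linacre, 1994) can be summarized in a small helper; the band labels follow the wording used in this section:

```python
def mnsq_band(mnsq):
    """Interpretation bands for infit/outfit mean-square values,
    following the ranges cited from Wright and Linacre (1994)."""
    if mnsq > 2.0:
        return "distorts or degrades the measurement system"
    if mnsq > 1.5:
        return "unproductive for construction of measurement, but not degrading"
    if mnsq >= 0.5:
        return "productive for measurement"
    return "less productive for measurement, but not degrading"

# Item 23 (outfit 1.77) and the original combined Item 1 (outfit 1.94)
# both fall in the unproductive-but-not-degrading band.
```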
The original item includes two skills, and thus it was separated and included as two items in the modified instrument that was administered in the second round of testing. The item responses in the Spring 2017 round of testing give some indication of why the original item may have misfit the model: there is a difference in difficulty of nearly two logits for the two skills, where saying the day of the week was rated as much easier (-7.26 logits) than saying the date (-5.61 logits). This is not surprising given that saying the day of the week requires simply memorizing seven vocabulary words, whereas saying the date includes month vocabulary, numbers, and the knowledge of how to combine this information. Test takers’ responses in the first round of testing resulted in a difficulty estimate of -4.68 for the item combining the two skills, which was similar to the estimate for saying the date in the second round of testing. Some test takers may have ignored the part of the item that referred to saying the day of the week, while others may have rated whether they could do both of the skills. These potential differences in interpretation may have caused the original item to misfit the model.

Item 12 originally misfit the model and contained two types of descriptions: places I have visited and places I want to visit. In the Spring 2017 round of testing, the item was modified to remove the description of a place someone would like to visit (describing a place one has never seen may not be possible). This change resulted in an item that was productive for measurement, with fit values between 0.5 and 1.5.

To summarize the findings for the second research question, which addressed fit to the Rasch model: when all items and all test takers were included in the model, there were seven misfitting items and evidence that the items in the instrument did not contribute to a unitary measurement of spoken proficiency.
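To give the two-logit gap between the split skills a concrete interpretation, a simple dichotomous Rasch sketch (a simplification of the rating scale model actually used; the learner's ability value is hypothetical) shows how much more likely endorsement becomes:

```python
import math

def p_endorse(theta, b):
    """Dichotomous Rasch probability that a person of ability theta
    endorses an item of difficulty b (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

theta = -5.61                      # hypothetical learner sitting exactly at Item 1's difficulty
p_date = p_endorse(theta, -5.61)   # 0.50 by construction
p_day = p_endorse(theta, -7.26)    # about 0.84: the same learner is far
                                   # more likely to endorse the easier skill
```

A gap of nearly two logits thus moves a borderline learner from a coin flip to a strong expectation of endorsement, which is why combining the two skills in one statement invited inconsistent interpretations.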
Deleting thirteen test takers with very high outfit values (people who either over-assessed their spoken proficiency and selected a test form that was too difficult, may not have been second language learners, or interacted with the self-assessment in an unexpected way), and Item 35, which appears to rely more on intercultural competence than on spoken language proficiency, or which may be a compound question asking (problematically) two very different questions, resulted in a final model that had fewer misfitting items and evidence that the items in the model contributed to a unitary measurement of spoken language proficiency. The items that were revised or selected as replacements for the 2017 round of data collection all fit the model.

Item difficulty estimates. The third research question addressed the extent to which the difficulty of the statements (estimated by the Rasch model) matched the ACTFL (2012) proficiency levels associated with the revised Can-Do Statements. The items are plotted in order of difficulty in the Wright map shown in Figure 7.

[Figure 7: Wright map of test taker ability and item difficulty. Item difficulties span roughly -7 logits (Item 2, Novice Low) at the bottom of the map to above +5 logits at the top, where the Advanced Low Items 41 and 43 sit alongside the Advanced High and Superior items.]
I calculated the mean logit scores from the Rasch analysis for the 50 statements of the revised instrument at each of the threshold levels (i.e., Novice, Intermediate, Advanced, Superior) and sublevels (e.g., Novice Low, Novice Mid, Novice High) to evaluate whether they ascended according to the hierarchy of the ACTFL (2012) scale. The mean difficulty estimates for each of the major threshold levels, presented in Table 19, ascended in the expected order: Novice statements were the easiest (M = -4.27, SD = 1.35), followed by Intermediate (M = -0.56, SD = 1.38) and Advanced (M = 2.08, SD = 1.66), with Superior statements being the most difficult (M = 4.31, SD = 0.11). The 95% CIs for the thresholds did not overlap (shown in Figure 8), which suggests that the differences in mean statement difficulty between each of the major ACTFL levels were statistically significant.

Table 19: Descriptive statistics for 2017 difficulty estimates of ACTFL threshold levels

ACTFL threshold | N | Mean logit score (SD) | SE | 95% CI
1 - Novice | 10 | -4.27 (1.35) | .43 | -5.25, -3.31
2 - Intermediate | 17 | -0.56 (1.38) | .33 | -1.27, 0.15
3 - Advanced | 21 | 2.08 (1.66) | .36 | 1.33, 2.84
4 - Superior | 2 | 4.31 (0.11) | .08 | 3.29, 5.33

Note. N = number of statements.

Figure 8: 95% confidence intervals of the mean threshold difficulty estimates. 1 = Novice; 2 = Intermediate; 3 = Advanced; 4 = Superior.

The mean difficulty estimates for the ACTFL sublevels, presented in Table 20, ascended as anticipated. Inspection of the 95% CIs, plotted in Figure 9, showed some statistical differences between item difficulties at the sublevels. The intervals at the Novice Mid, Intermediate Low and Intermediate Mid levels did not overlap. No interval could be calculated for the Novice High level since only one item was included from this level. The interval at the Novice Low level is very wide because this level only included two items; this interval overlaps with all other intervals.
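The confidence intervals in Tables 19 and 20 behave exactly as t-based intervals on small samples predict; a minimal sketch (with Student's t critical values hard-coded to keep it dependency-free) reproduces the very wide Novice Low interval:

```python
import math

# 97.5th-percentile Student's t critical values for selected df
T_CRIT = {1: 12.706, 2: 4.303, 6: 2.447, 9: 2.262}

def ci95(mean, sd, n):
    """Two-sided 95% confidence interval for a mean (t-based)."""
    half_width = T_CRIT[n - 1] * sd / math.sqrt(n)
    return (mean - half_width, mean + half_width)

def intervals_overlap(a, b):
    """True if two (low, high) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

# Novice Low (Table 20): only two items, so df = 1 and t = 12.706,
# which produces the enormous interval noted in the text.
nov_low = ci95(-5.80, 1.21, 2)   # roughly (-16.7, 5.1)
nov_mid = ci95(-4.15, 0.99, 7)   # roughly (-5.1, -3.2)
# The Novice Low interval swallows the Novice Mid one entirely:
assert intervals_overlap(nov_low, nov_mid)
```

With only two statements per sublevel, even a tiny SD yields an interval too wide to be informative, which is why the Novice Low and Superior comparisons carry little statistical weight.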
The 95% CIs for adjacent means at the Intermediate Mid level and above all overlapped. Thus, the differences between higher proficiency sublevels are not necessarily statistically significant. Also of note are the mean difficulty estimates for the Advanced Low (M = 1.84) and Advanced Mid (M = 1.89) levels. The means and 95% CIs are very similar, suggesting that the items at these two proficiency levels do not discriminate well at the sublevel.

Table 20: Descriptive statistics for 2017 difficulty estimates of ACTFL sublevels

ACTFL sublevel | N | Mean logit score (SD) | SE | 95% CI
1 - Nov-Low | 2 | -5.80 (1.21) | .86 | -16.66, 5.07
2 - Nov-Mid | 7 | -4.15 (0.99) | .37 | -5.06, -3.23
3 - Nov-High | 1 | -2.16 (Note 1) | - | -
4 - Int-Low | 3 | -2.44 (0.12) | .07 | -2.73, -2.14
5 - Int-Mid | 6 | -0.94 (1.10) | .45 | -2.09, 0.21
6 - Int-High | 8 | 0.43 (0.86) | .30 | -0.29, 1.15
7 - Adv-Low | 10 | 1.84 (1.91) | .60 | 0.47, 3.21
8 - Adv-Mid | 7 | 1.89 (1.50) | .56 | 0.51, 3.27
9 - Adv-High | 4 | 3.01 (1.21) | .60 | 1.08, 4.94
10 - Superior | 2 | 4.31 (0.11) | .08 | 3.29, 5.33

Notes: 1. Because there was only one item at the Novice High level in the instrument, the mean logit score was not included in the difficulty analysis.

Figure 9: 95% confidence intervals of the mean sublevel difficulty estimates. 1 = NL; 2 = NM; 3 = NH; 4 = IL; 5 = IM; 6 = IH; 7 = AL; 8 = AM; 9 = AH; 10 = S.

The items at both the Advanced Low and Advanced Mid levels had very wide ranges of difficulty (AL: -1.17 – 5.54; AM: -0.30 – 3.56). As in the analysis in phase one of the study, the two most difficult items (41 and 43) were from the Advanced Low level. These items shared similar difficulty levels with the two Superior-level items in the revised assessment. As noted in phase one, the likely cause for this unexpected item difficulty lies in the content of the four most difficult Advanced Low items. These items require speakers to provide policy explanations, reasons for a position, and rationales, and to perform an interview.
These are language functions that better describe Superior-level language use, as they require speakers to use argumentation and to hypothesize.

Discussion

In the second phase of this study I analyzed the revised computer-adaptive self-assessment constructed for the Language Flagship Proficiency Initiative. The instrument included 50 ACTFL (2015) Can-Do Statements targeting spoken language proficiency: 36 items from the original self-assessment and 14 items that were either new ACTFL (2015) Can-Do selections or revised items.

Dimensions of spoken proficiency. The first research question addressed the number of factors of spoken proficiency that are measured by the revised 50-item self-assessment. The hypothesis, based on the ACTFL Guidelines, was that speaking can be measured as a unitary construct. An alternative hypothesis, based on the NCSSFL-ACTFL (2015, 2017) Can-Do Statements, was that spoken proficiency includes multiple dimensions: a presentational speaking factor and an interpersonal communication factor. The most recent publication of the Can-Do Statements (ACTFL, 2017) also includes a possible third factor: intercultural communicative competence. To test these hypotheses, I performed an EFA and also considered the results of a PCA of the residuals of a Rasch model of test takers’ item responses. The EFA provided evidence for both a unidimensional model of spoken proficiency and a model with two factors. The unidimensional model had a very clear pattern of significant, strong factor loadings, which is in line with the theoretical model of speaking represented by the ACTFL pyramid.
A Rasch model that included 49 of the 50 self-assessment items (excluding Item 35, which appeared to rely more on intercultural competence than on spoken proficiency) also provided evidence for a unidimensional model: the PCA of the model residuals showed no separate clusters of items in the contrast plot, and items in the first and third contrasts did not form any obvious patterns of difficulty or speaking mode (i.e., presentational versus interpersonal communication). These contrasts were also strongly correlated (.85), suggesting that the items contributed to a unidimensional measurement model. If researchers want to keep the traditional theoretical model of speaking represented by ACTFL and by other language assessments, such as the speaking portion of the TOEFL (Sawaki et al., 2005), the clear pattern of the loadings in the one-factor model is preferable. The current analysis showed that the ability to speak about current events in other communities or countries may be measured differently than the other items we included as measures of spoken proficiency, since removal of these topics resulted in a unidimensional model. This finding lends support for NCSSFL-ACTFL’s (2017) decision to create a new category for intercultural communicative competence in the revised publication of the Can-Do Statements.

In the two-factor model, all of the Novice Low to Intermediate Low level items clearly loaded onto the first factor, while all of the Advanced Mid to Superior level items clearly loaded onto the second factor. Items at the proficiency levels in between (i.e., Intermediate Mid to Advanced Low) were less clear: some loaded onto both factors, while others loaded clearly onto factor one or factor two.
A similar pattern of factor loadings was also observed in the Rasch model that included all 50 Can-Do Statements: there was a cluster of five items (44, 46, 48, 49, 50) from the Advanced Mid, Advanced High, and Superior levels at the top of the contrast plot that were separate from the rest of the items, and that did not correlate strongly with items in the third contrast. This suggests that these upper-level speaking skills may be measured differently than the easier items in the analysis. These findings were unexpected, as they challenge the commonly accepted, ACTFL-based model of speaking proficiency and the two- and three-factor models detailed in the NCSSFL-ACTFL (2015, 2017) Can-Do Statements. Two factors, presentational speaking and interpersonal communication, are posited in the original Can-Do Statement publication. However, the items that had been designated to these two modes of speaking did not form factors in this way. This may be because presentation and interaction are not easily disentangled. For example, saying the date (Item 1) involves presenting information, but presumably to another interlocutor (i.e., in interaction).

The two-factor model in the current analysis suggests that speaking at the lower levels of proficiency may be qualitatively (and measurably) different (i.e., a different construct) than speaking at the upper levels of proficiency. This interpretation does not negate the single-factor ACTFL model of speaking, but it does suggest ways in which it could be enriched. At the Intermediate level, language learners can “create with the language when talking about familiar topics related to their daily life,” while Advanced-level speakers can speak about “topics of community, national, or international interest” and Superior-level speakers can handle all of these topics “in formal and informal settings from both concrete and abstract perspectives” (ACTFL, 2012, pp. 5-7).
According to these definitions, in order to achieve the upper levels of proficiency, it is also necessary to have knowledge about national and global topics (Advanced) and to be able to discuss these topics from abstract perspectives (Superior). This may require a level of education, experience, or intelligence that is acquired separately from the ability to speak a second language. Speaking, as a construct, may thus change with growth in proficiency: at lower levels of proficiency, anyone may be able to function regardless of level of education, whereas beyond a certain threshold, certain explicit knowledge or education may be necessary to add to speaking skills.

In the two-factor model, factor one may represent a core language proficiency (Hulstijn, 2007), or basic interpersonal communication skills (Cummins, 1979), as it includes the most basic conversational speaking tasks (e.g., I can describe what my summer plans are; I can report on a social event that I attended; I can bring a conversation to a close). Factor two may represent spoken proficiency beyond a general core, as it includes the most difficult speaking tasks, which are more academic or may require higher order cognition (e.g., I can recount the details of a historical event; I can present reasons for or against a position on a political or social issue; I can exchange factual information about social and environmental questions). Speakers at Advanced and Superior proficiency levels may use their core language skill set differently than Novice and Intermediate speakers, who are still learning to convey and interpret basic meaning. The threshold between the two, not surprisingly, is not entirely clear, but may lie somewhere between the Intermediate Mid and Advanced Low proficiency levels (i.e., the proficiency levels of items that loaded onto both factors in this analysis).
A similar threshold has been documented both in language learning research and in task analyses of language use in the workplace. For example, DeKeyser (2010) found that students in a Spanish study abroad program did not benefit very much from their time abroad if they had not acquired an adequate baseline proficiency level. This was the case for the majority of the students he observed, as they had only taken two years of college-level Spanish courses. As a result, they were effectively unable to interact with native speakers and gained very little in terms of linguistic accuracy and self-perceived language improvement. Those with better preparation (i.e., more automatized language skills) gained the most. For the workplace, ACTFL (2012) reports that, based on task analyses, at least Intermediate Mid proficiency is required to function in the most basic professions (e.g., Cashier, Tour Guide). Professions that require more specialized training (e.g., Teacher, Nurse, Translator, Lawyer) require at least Advanced Low proficiency. This implies that in practice, the ACTFL (2012) proficiency levels are used as a two-dimensional scale: one scale to reach basic, working proficiency and one scale to describe proficiency above this threshold.

Based on the findings of the EFA, two possible conclusions can be drawn: either it is possible to create a unidimensional measurement of spoken proficiency with the revised instrument, or the items separate into two different dimensions of speaking. In the case of the two-factor solution, a two-parameter item response model would be more appropriate for measurement. If we accept the one-factor solution, a single-parameter Rasch model is suitable for evaluating how well the people and items in the analysis fit the model of spoken language proficiency. This model serves as a proxy for the unitary and hierarchical model of proficiency represented by the ACTFL scale descriptors.

Fit to the Rasch model.
Analysis of the single-parameter Rasch model of the test takers’ responses to the revised self-assessment questionnaire revealed that some of the items and some of the people in the analysis did not behave as expected. Thirteen people had very high misfit values (outfit MNSQ = 9.90) and seven items had misfit values greater than 2.0, indicating that these items would distort the measurement. Following McNamara (1991), several interpretations can be made based on these misfit values. The person misfit suggests that either test performance did not reflect the misfitting participants’ true ability, or that they may not belong to the intended test-taking population. Inspection of these test takers’ item response patterns and proficiency ratings showed evidence for both of these interpretations. Several of the test takers in the analysis who had extreme misfit values were also test takers who selected an OPIc form that assessed proficiency levels above their level. This indicates that they over-assessed their ability using the self-assessment. The other misfitting test takers were people who had high ability levels and gave unexpected ratings to Can-Do Statements in the first set of items (i.e., they assessed that they could not do one of these items well). They may not have performed these lower-level Can-Do tasks recently, or they may have been overly modest about their abilities. One possibility for further exploring this misfit would be to use the CUTLO option in Winsteps, which would remove highly unexpected responses on items that are far from the ability level of the test takers. The person fit analysis also revealed one test taker who was not a language learner.

For the misfitting Can-Do Statements, since the items appeared to be well constructed, I assessed whether any items might be measuring a dimension other than spoken proficiency. Inspection of the content of the misfitting items revealed that Item 35 may be addressing another dimension, discussed above.
These item and person characteristics justified removing them from the Rasch model, which resulted in a final model that had evidence of unidimensionality and only five misfitting items.

After analyzing fifty items in the first phase of this study and fourteen additional items in the second phase, forty-four items were identified that fit the unitary model measuring spoken proficiency. These items are psychometrically productive for the self-assessment of L2 speaking for the college-level test-taking population. Thirty items fit the model in both phases of the study. The 14 items added to the revised questionnaire all fit the model in phase two. These items were selected preferentially to represent speaking tasks that were specific and that described language use for college language learners. In addition, they included a single speaking task per statement. Three of the items were revised versions of items included in the first round of testing that originally misfit the measurement model. The revised items were simplified versions of the original items, reducing them to a single statement. The revised versions resulted in items that fit the model. Further, when Items 1 and 2 were split, they showed large differences in difficulty estimates. These findings provide support for the suggestion that items are more productive for measurement when they address a single skill (Haladyna et al., 2002).

Of the thirty-six items that fit the original model in phase one, five did not fit the final model in phase two. One explanation for why these items fit the first time and then misfit may be that the samples of Spanish language learners were not identical. There may have been slight differences in the samples in terms of proficiency level, number of heritage speakers, and the way the test takers interacted with the instrument.
For example, fewer test takers over-assessed their ability in the first phase (4%) than in the second phase (7%), and one participant who was not a language learner was identified in the second round of testing. Another difference was that there were more than twice as many test takers in the second round of data collection. The revised self-assessment also had a slightly higher person reliability (.96) than the original instrument (.94). One might speculate, based on these findings, that if we were to control for a more homogeneous test-taking population (e.g., eliminate heritage learners from the sample), provide better self-assessment training (Sweet et al., in press), and eliminate test-taking fatigue, better model fit might be observed. One should also expect that using Can-Do Statements to self-assess spoken proficiency will never produce a perfect estimate of person ability and item difficulty. As Green (2014) highlighted,

No assessment task is entirely satisfactory. Each format has its own weaknesses. Rather than searching for one ideal task type, the assessment designer is better advised to include a reasonable variety in any test or classroom assessment system so that the failings of one format do not extend to the overall system. (p. 140)

Therefore, although the final model of L2 Spanish learners’ responses to the self-assessment items included in this study is not perfect, it provides a reasonable estimate of language proficiency that can be included in a broader assessment system.

Item difficulty. The third research question of the second phase of the study addressed the extent to which the item difficulties of the Can-Do Statements in the revised self-assessment instrument ascended according to ACTFL’s hierarchy of proficiency levels.
Since the ACTFL (2012) Guidelines are used to measure proficiency both at the major threshold levels (i.e., Novice, Intermediate, Advanced, Superior) and at the subdivisions of the first three major levels (i.e., Novice Low, Novice Mid, Novice High; Intermediate Low, Intermediate Mid, Intermediate High; Advanced Low, Advanced Mid, Advanced High), I evaluated whether the Can-Do Statements distinguished proficiency at both the threshold level and the sublevel. Comparison of the mean Rasch difficulty estimates for items at each of the major threshold levels of the ACTFL scale revealed that the mean logit scores ascended in the anticipated order and that the mean differences were statistically significant. This finding is in line with Brown et al. (2014) and replicated the results of the first phase of my dissertation. Brown et al. (2014) found that the ACTFL Can-Do Statements that they modeled ascended in the same order of difficulty as the threshold levels in the scale, although the differences were not statistically significant. I found that the statements included in the original self-assessment instrument also ascended in the order of the threshold levels, and that the differences in mean item difficulty between major proficiency levels were significant. Taken together, these findings suggest that the ACTFL (2015) Can-Do Statements can be useful at least for estimating second language proficiency in broad brush strokes (i.e., at the major thresholds of proficiency) when used for self-assessment.

I also considered the mean difficulty estimates of the revised Can-Do Statements for each of the proficiency sublevels. In the second round of testing, I found that each of the sublevels ascended in the predicted order. In phase one, the assessment items also ascended according to the expected sublevel difficulty, except at the Advanced Low and Advanced Mid levels.
This result was mirrored in the revised self-assessment in phase two: the mean difficulty estimates at the Advanced Low and Advanced Mid levels were nearly identical, and the most difficult items in the analysis were from the Advanced Low level. Analysis of the content of these items revealed a mismatch between their required language use and the ACTFL proficiency level descriptors, as these Advanced items would require speakers to perform Superior-level tasks.

In phase two I found that the difference between items written for the Novice Mid and Intermediate Low proficiency levels was significant, and that the difference in difficulty between Intermediate Low and Intermediate Mid items was also significant. These findings differ slightly from the findings of phase one of the study. In the original assessment, none of the differences in mean difficulty estimates were statistically significant at the sublevel, but the revised self-assessment did discriminate at some of the Novice and Intermediate sublevels. This suggests that the revised instrument may allow for more accurate self-assessments at lower proficiency sublevels. At the Intermediate High level and above, however, neither the original nor the revised self-assessment instrument distinguished well between proficiency sublevels. This is similar to Jones (2002), who had difficulty matching the CEFR (Council of Europe, 2001) can-do statements to the upper levels of proficiency on the CEFR scale. He noted, "[o]ne problem is that in the current analysis the highest level (C2) statements are not well distinguished from the level below (C1)" (p. 177). Weir (2005) suggested that in order to achieve better discrimination between language use at upper levels of proficiency, inclusion of the communicative context and the quality of the performance may be necessary.
Considering performance on Item 36 (I can provide a rationale for the importance of certain classes, subjects, or training programs, Advanced Low) provides a good illustration. A test taker might be able to provide a very simple rationale for the importance of a class to a peer: "It is important for me to take Spanish 202 so that I can complete my language requirement." A speaker at a higher proficiency level, on the other hand, might be able to give a more formal presentation of a rationale for a more abstract subject (e.g., the importance of freedom of speech) and elaborate in greater detail. These two performances would both match the content of the Can-Do Statement, but they represent different levels of spoken proficiency. In order for upper-level Can-Do Statements to be useful, then, it might be necessary to include the context (e.g., specification of the interlocutor) and the quality (e.g., length of text or amount of detail provided) of the language task.

Another consideration is that, of the total sample of 886 test takers, 157 participants responded to the testlets that included Advanced-level statements. Of these, 39 participants received official ACTFL OPIc ratings at the Advanced level. A sample including more participants at higher proficiency levels might improve the accuracy of the Advanced-level statements. Language researchers may lack accurate descriptions of Advanced-level proficiency because few university learners reach this level (Byrnes & Ortega, 2008; Soneson & Tarone, in press). The flip side of this coin, however, is that descriptions of functional language use at the Novice and Intermediate levels of proficiency are becoming more refined. As mentioned previously, given ACTFL's characterization of emerging ability at the Low sublevels and sustained ability at the High sublevels, the finding that the revised self-assessment instrument did not discriminate well between all of the proficiency sublevels is not unexpected.
These interpretations of Low, Mid, and High suggest that researchers should not expect to find unique differences in the difficulty of items at the sublevels. The finding that there were significant differences in item difficulty between some of the Novice and Intermediate sublevels is therefore in fact unexpected. This suggests that language proficiency on the ACTFL subscales may develop in more incremental steps at the initial levels, followed by the phenomenon of emerging and sustaining performance at the upper levels. In their current form, Can-Do Statements used for self-assessment may be more accurate at the Novice and Intermediate proficiency levels than at the upper levels of proficiency.

DISCUSSION AND CONCLUSION

The aim of this study was to assess the validity of a selection of NCSSFL-ACTFL (2015) Can-Do Statements for use with postsecondary Spanish language learners. These statements were assembled in a self-assessment instrument whose intended purpose was to guide test takers toward an appropriate OPIc form. Analysis of the Can-Do Statements included in the original self-assessment highlighted several items that could be improved in terms of content validity (i.e., required language use and contexts) and item construction. In the revised version of the instrument, these items were removed and replaced with items that were well constructed and that described language use relevant to college test takers' experience. Using the original instrument, 4% of test takers in the 2015 test administration and 9% in the 2016 administration over- or under-assessed their language ability. Using the revised instrument, 7% of the test takers under- or over-assessed their ability in the 2017 round of testing. These findings suggest that it may not be possible to refine the Can-Do Statements to improve self-assessment accuracy beyond a certain point.
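The over- and under-assessment rates reported above can be computed by comparing each test taker's self-assessed level with his or her official OPIc rating. The sketch below is a minimal illustration with hypothetical data; the integer level coding and the arrays are illustrative assumptions, not values from the study.

```python
import numpy as np

# Hypothetical data: ACTFL levels coded as ordered integers
# (e.g., 1 = Novice Low ... 9 = Advanced High).
self_assessed = np.array([3, 4, 2, 5, 4, 3, 6, 2])  # level chosen via Can-Do self-assessment
official = np.array([3, 3, 2, 5, 5, 3, 6, 2])       # official OPIc rating

over_rate = np.mean(self_assessed > official)      # proportion who over-assessed
under_rate = np.mean(self_assessed < official)     # proportion who under-assessed
accurate_rate = np.mean(self_assessed == official) # proportion whose levels matched

print(f"over: {over_rate:.1%}, under: {under_rate:.1%}, accurate: {accurate_rate:.1%}")
```

In practice the comparison would use the routing decision produced by the self-assessment and the rating assigned by certified OPIc raters, aggregated per test administration.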
When interpreting the results of this dissertation, it should be pointed out that in this study the NCSSFL-ACTFL (2015) Can-Do Statements were analyzed as a means of approximating college learners' L2 proficiency. However, this is not one of the intended uses of the statements. Rather, two purposes are articulated for the Can-Do Statements: "for programs, the statements provide learning targets for curriculum and unit design, serving as performance indicators; for language learners, the statements provide a way to chart their progress through incremental steps" (ACTFL, 2015, p. 1). Despite this slight difference in usage, the item response analysis of the self-assessment statements as a measure of foreign language proficiency in this study provides a picture of a) whether the skills articulated in the statements are important indicators or learning targets for foreign language proficiency, and b) whether the incremental steps documented in the Can-Do Statements match actual gains in self-assessed language proficiency.

The current study has implications for the self-assessment of Spanish learners' language proficiency at the college level. The self-assessment instrument under consideration was normed on postsecondary Spanish language learners over two rounds of testing. The results of the study showed that the instrument did not discriminate well between statements pegged at higher proficiency levels. The analysis also showed that the ability to speak at Advanced levels of proficiency may develop differently than language proficiency at lower levels, and may therefore be considered a construct separate from lower-level speaking proficiency. However, this may not be a serious problem for testing college-level test takers, since this population rarely reaches Advanced levels of proficiency (Byrnes & Ortega, 2008).
The median proficiency level of Spanish test takers in both 2015 and 2017 was Intermediate Low; only 3.4% (2015) and 3.3% (2017) of the learners in the study received Advanced-level proficiency ratings. Since the items in the current study were normed on Spanish language learners, the results may not be generalizable across languages. The DIF analysis of the test takers' responses in this study suggested that some of the items may not behave the same way for French and Spanish students.

Although it seems premature to make pedagogical recommendations, this study has implications for foreign language assessment. Can-Do Statements are ripe for use as a self-assessment tool, but teachers should be selective when choosing specific statements for classroom use. The data in this study suggest that some of the Can-Do Statements may not be relevant for use with all populations and may not be interpreted in the same way by all language learners. Teachers may want to work with students by conducting a needs analysis to identify what types of language use and performance the students anticipate needing, and then match those needs with Can-Do Statements that describe them. This study also suggests that the statements can be used with college language learners to estimate their language proficiency at the major ACTFL threshold levels and at the lower sublevels of L2 proficiency. The descriptors of proficiency at the Advanced level and above may be less accurate.

This study also has implications for the development and categorization of descriptors of language proficiency. First, this study provides further evidence that the content and construction of Can-Do Statements affect the way learners can evaluate their proficiency. Including language tasks that have differing degrees of difficulty in the same self-assessment item can interfere with language learners' ability to accurately evaluate how well they can accomplish the tasks.
Test users also stand a better chance of accurate self-assessment when the content of the items is in line with their experience using the language. The finding that some of the content of the Advanced-level statements required Superior-level language use implies that future development of performance indicators should be more carefully aligned with current descriptions of language proficiency. Descriptors of higher levels of L2 proficiency may also require specifications of the quality of language production required for upper-level performance that are not currently included in the Can-Do Statements.

Another consideration for the development and categorization of individual descriptors of language proficiency regards the factors of spoken proficiency that test designers and researchers seek to measure. ACTFL has created two measures of spoken language proficiency: the OPIc and the Can-Do Statements for speaking. The OPIc is described as an assessment of interpersonal communication (ACTFL, 2014), while the Can-Do Statements are designed to describe interpersonal communication, presentational speaking (ACTFL, 2015), and intercultural competence (ACTFL, 2017). The factor structure of spoken proficiency identified in the Can-Do Statements in the current study did not show a clear distinction between these modes of speaking. Instead, two possible factor structures emerged: a unidimensional model and a two-factor model of general/academic speaking. This finding makes the relationship between official ACTFL ratings of spoken proficiency (i.e., OPIc ratings) and the speaking tasks described in the Can-Do Statements unclear. One possibility, based on the unitary model, is that the OPIc and the Can-Do Statements both measure a single dimension of spoken proficiency.
Another possibility is that the OPIc is a measure of core spoken proficiency, and that some of the Can-Do tasks belong to this same core, while other statements rely on content knowledge (e.g., world knowledge, intercultural competence, travel experience, course or curriculum content) that goes beyond core language proficiency. These possibilities present a challenge for language assessment researchers to continue to refine construct definitions of language proficiency and to create stronger ties to theoretical models of L2 proficiency (e.g., Canale & Swain, 1980; Cummins, 1979; Bachman & Palmer, 1996) and development (e.g., Pienemann, 1998). The results of the factor analysis also have implications for the measurement of spoken language proficiency. ACTFL’s descriptors of functional language use may form a single factor, but it may also be possible to separate out multiple factors. If the performance indicators defined in the Can-Do Statements do form multiple dimensions of spoken proficiency, it may be possible to measure different profiles of language users. For example, a speaker who has highly automatized language ability but little world knowledge or education may be considered highly proficient in terms of core language ability, but lacking in the higher order cognition required to accomplish the language tasks that have been assigned to the higher levels of proficiency in the ACTFL model. Speakers who are highly educated in their first language may be able to tackle abstract topics of global interest, which represent Advanced- and Superior-level language use, while lacking in basic interpersonal communication skills. Thus, it may be more appropriate to measure these constructs on different scales. Other factors such as age and intercultural competence may further challenge the possibility of measuring spoken proficiency as a unitary and hierarchical construct. 
These challenges for spoken proficiency measurement models provide many avenues for future research. Because the current study was limited to language learners' self-assessments of their ability to perform spoken language tasks, and these performances can be unstable (Shin, 2013), future research should include outside ratings of learners' ability to perform each of these tasks. This would allow for further exploration of the factor structure of the tasks described in the Can-Do Statements and performance descriptors. Only a selection of the Can-Do Statements for spoken proficiency has been analyzed in this study and by Brown et al. (2014), so inclusion of more statements and the revised ACTFL (2017) performance indicators may provide a more complete picture. Although the ACTFL (2012) Guidelines "are intended to be used for global assessment in academic and workplace settings," this study showed that some items may not be well targeted for college-aged test takers, let alone test takers in secondary settings. Younger learners (such as students with K-12 experience in dual immersion settings), working adults, and college students may have different enough foreign language experiences and needs that they should be considered distinct populations, requiring test items that are normed accordingly. The differences between the item difficulties estimated from postsecondary Spanish students' responses and the predicted ACTFL difficulty levels merit further study so that empirical evidence can be provided for the difficulty of the descriptors for all age groups (i.e., young children, children, teens, college students, and working adults). The current research is limited in that the majority of the test takers in this study had Novice- and Intermediate-level speaking proficiency.
As Byrnes and Ortega (2008) highlighted, advanced language learners are under-researched, and future research on self-assessment of speaking abilities should include more focus on this population so that the descriptors of functional language for Advanced proficiency can be refined. A stratified sample with equal numbers of language learners from all proficiency levels would allow for more accurate descriptions of proficiency standards at all levels (Crocker & Algina, 1986). The current study was limited to Spanish learners' use of the statements, as there was some indication of differential item functioning when the Spanish learners' responses were compared to those of learners of French. Therefore, another avenue for future research is to explore further whether the statements can be used in the same way for learners of all languages.

Conclusions

In the first phase of my dissertation, I identified 14 misfitting items in a self-assessment of ACTFL (2015) Can-Do Statements. The suspected reason for the misfit lay in the construction of the items. Specifically, these items assessed multiple skills in a single Can-Do Statement, were not specific, or included experiences that were not relevant to college test takers' typical experiences. In the second phase of the study, I revised the self-assessment to include Can-Do Statements that were well constructed and that targeted language use for college language learners. The items that were selected preferentially to fit these criteria were found to be useful for measuring L2 Spanish spoken language proficiency. Not all of the statements fit the model, and therefore the first conclusion is that the revised self-assessment instrument is not perfect. It can be concluded, however, that the revised instrument is an improvement over the original, based on the results of the difficulty analysis.
The revised assessment discriminated well between the ACTFL threshold levels and the lower sublevels (up to Intermediate Low) of proficiency. While the original self-assessment did not show any significant differences between the difficulty of items at the ACTFL proficiency sublevels, the revised instrument showed some significant differences. In particular, the Novice Mid items were significantly easier than the Intermediate Low items, and the Intermediate Low items were significantly easier than the Intermediate Mid items. These proficiency levels match the official OPIc ratings of the majority of the test takers in the 2017 test administration. Therefore, the revised instrument appears to be useful for estimating major threshold proficiency levels and for discriminating language proficiency at the sublevel for college language learners.

The analysis of the revised self-assessment also resulted in two possible conclusions on the dimensionality of spoken proficiency. Either it is possible to create a unidimensional measurement of spoken proficiency using the ACTFL Can-Do Statements in the revised instrument, and the majority of these items are useful for measuring spoken proficiency, or the items separate into two different dimensions of speaking. In the latter case, a 2PL model would be more appropriate for the measurement of spoken proficiency using ACTFL's descriptors of spoken L2 proficiency.

ENDNOTES

1. OPIc raters use floor and ceiling scoring, meaning they need to hear speech that is sustained at one of the major levels and find evidence of linguistic breakdown at the next major level. Thus, if a Novice High examinee took a test that did not have a Novice floor, this examinee would be in constant breakdown and the rater would be unable to determine the examinee's major level.

2.
This research was funded by the National Security Education Program's Language Proficiency Flagship Initiative (Grant # 2340-MSU-7-PI-093-PO1) awarded to principal investigators Drs. Paula Winke and Susan Gass.

3. Differential item functioning (DIF) tests whether a test measures a latent trait in the same way for all subgroups. In a research report by Educational Testing Service (ETS; Zwick, 2012), DIF items are split into three categories. "No DIF" items have a Mantel-Haenszel (MH) chi-square statistic that is not significant at the .05 level and a DIF contrast of less than 0.43 logits; items with no DIF are considered to measure the construct in the same way for all groups. "Affirmative DIF" items have DIF contrasts greater than 0.64 logits and a significant (i.e., p < .05) MH chi-square statistic, and may measure the construct differentially. "Neutral DIF" items are those that meet the criteria of neither "no DIF" nor "affirmative DIF." Another common DIF cutoff for Rasch analysis is -0.5 < x < 0.5, with contrasts inside this range indicating neutral to no DIF. If an item shows (affirmative) DIF, researchers must make a determination about the source of the difference by examining the content of the item. It is possible that the DIF is not real (e.g., a Type I error has occurred), that the DIF might not be interpretable, or that the DIF is expected, especially if the subgroups are expected to perform differently due to known influencing differences (Zumbo, 1999, 2007). According to Davey and Wendler (2001), the minimum sample size for a DIF analysis is 200 for the smaller group and 500 in total for the construction of a test. In the Spring 2015 data, there were 220 French test takers and 382 Spanish test takers, making these two groups ideal for testing the hypothesis that the ACTFL (2012, 2015) performance indicators for speaking measure language proficiency in the same way irrespective of language.
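The ETS classification described in this note can be expressed as a small decision rule. The sketch below encodes the thresholds quoted above (p < .05 for the MH chi-square, and DIF contrasts of 0.43 and 0.64 logits); the function name and inputs are illustrative, not part of the ETS report.

```python
def classify_dif(mh_p_value: float, dif_contrast: float) -> str:
    """Classify an item into the ETS DIF categories (Zwick, 2012).

    mh_p_value:   p-value of the Mantel-Haenszel chi-square statistic
    dif_contrast: difference in item difficulty between groups, in logits
    """
    size = abs(dif_contrast)
    if mh_p_value >= 0.05 and size < 0.43:
        return "no DIF"           # measures the construct the same way for all groups
    if mh_p_value < 0.05 and size > 0.64:
        return "affirmative DIF"  # may measure the construct differentially
    return "neutral DIF"          # meets neither set of criteria

# Illustrative call: a significant MH statistic with a 0.69-logit contrast
print(classify_dif(0.005, 0.69))  # affirmative DIF
```

Any item flagged as affirmative DIF would still require the content review described below before concluding that the DIF is real and interpretable.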
In other words, are the Can-Do Statements equally useful for learners of Spanish and French, as evaluated using a DIF detection method? To answer this question, I performed a DIF analysis using the Mantel-Haenszel procedure based on the comparison of Spanish learners (reference group) and French learners (focal group). I considered items that had large DIF contrasts (greater than 0.5 logits in absolute value) that were statistically significant. Of the fifty items included in the self-assessment, three items, shown in Table 21, exhibited DIF across L2 language groups. Item 2 was easier for the Spanish learners than the French learners, while Items 3 and 4 were easier for the French learners. To determine whether this DIF is interpretable, I considered the linguistic and psycholinguistic features of each item. Listing months and seasons (Item 2) and free time activities (Item 4) requires learning vocabulary items (which are relatively similar across the two languages), and is therefore likely equally difficult in both languages. Talking about likes, on the other hand, requires the gustar structure in Spanish, which is considered to be linguistically, developmentally, and psycholinguistically complex for L1 English speakers (Cerezo, Caras, & Leow, 2016). To talk about likes in French requires a regular present tense verb construction, which can be considered a linguistically simple form, as it only requires a single transformation (Spada & Tomita, 2010). Thus, it might be possible to interpret the DIF exhibited in Item 3, which was significantly easier for French learners than Spanish learners.

Table 21: Items with large and significant DIF.

Item | Spanish DIF measure | French DIF measure | DIF contrast | t test | sig.
2. I can list the months and seasons. (NL) | -4.23 | -3.54 | 0.69 | t(465) = 3.60 | p = .005*
3. I can say which sports I like and don't like. (NM) | -4.32 | -4.91 | -0.59 | t(527) = -2.84 | p = .017*
4. I can list my favorite free time activities and those I do not like. (NM) | -4.05 | -4.59 | -0.54 | t(516) = -2.69 | p = .046*

In this analysis, I considered whether any of the 50 Can-Do Statements included in the self-assessment showed differential item functioning when comparing Spanish and French language learners. Three items showed statistically significant DIF. Of these, one item may include language use that is significantly more difficult for Spanish language learners than French language learners: talking about likes and dislikes. This finding is interesting because the great majority of the items appear to measure language proficiency in the same way for learners of Spanish and French. This is in line with the way the ACTFL (2012) Proficiency Guidelines were intended to be used: to describe and evaluate functional language use in any second language. However, the finding that Item 3 was more difficult for Spanish learners than French learners, and that this difficulty may be attributable to linguistic and acquisitional features of the language use required, may call ACTFL's global description of language proficiency into question. Talking about likes and dislikes has been assigned to the Novice Mid proficiency level. This may be appropriate for French language learners, since it requires a linguistically simple construction. A learner of Spanish, on the other hand, is not likely to acquire this construction until later in their language learning (Cerezo, Caras, & Leow, 2016). Thus, this descriptor of language proficiency may not be suitable for describing functional language ability in all languages.

4. Sufficient evidence that the rating scale worked as it was designed was observed. First, the average measure increased from category one to four: 1 (M = -1.39), 2 (M = -0.03), 3 (M = 2.63), 4 (M = 5.69). In addition, the distribution of the item responses was unimodal and peaked in category four (Linacre, 2002).
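The category check described in note 4, that the average person measure should increase across the rating categories, can be sketched as follows. The response matrix and person measures below are hypothetical, and the check is a simplified version of Linacre's (2002) guideline.

```python
import numpy as np

# Hypothetical Rasch person measures (logits) and their 1-4 item responses
# (rows = persons, columns = items).
person_measures = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
responses = np.array([
    [1, 1, 1],
    [1, 2, 2],
    [2, 2, 3],
    [3, 3, 4],
    [4, 4, 4],
])

# Average person measure over the observations in each rating category.
avg_measures = []
for category in (1, 2, 3, 4):
    rows, _ = np.where(responses == category)
    avg_measures.append(person_measures[rows].mean())

# The categories function as intended if the averages strictly increase.
monotonic = all(a < b for a, b in zip(avg_measures, avg_measures[1:]))
print(avg_measures, monotonic)
```

A full diagnosis would also examine the category frequency distribution and step (threshold) ordering, which this sketch omits.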
93 APPENDICES 94 APPENDIX A: ACTFL OPIc 1-5 levels and Can-Do Statements Table 22: ACTFL OPIc level 1 Can-Do Statements I can name basic objects, colors, days of the week, foods, clothing items, numbers, etc. I cannot always make a complete sentence or ask simple questions. Can-do statements ACTFL Levels Mode ❑ I can say the date and the day of the week. ❑ I can list the months and seasons. ❑ I can say which sports I like and don’t like. ❑ I can list my favorite free-time activities and those I don’t like. ❑ I can state my favorite foods and drinks and those I don’t like. ❑ I can talk about my school or where I work. ❑ I can talk about my room or office and what I have in it. ❑ I can list my classes and tell what time they start and end. ❑ I can answer questions about where I’m going or where I went. ❑ I can present information about something I learned in a class NL NL NM NM NM NM NM NM NM NH PS PS PS PS PS PS PS PS IC PS or at work. Table 23: ACTFL OPIc level 2 Can-Do Statements I can give some basic information about myself, work, familiar people and places, and daily routines speaking in simple sentences. I can ask some simple questions. Can-do statements ❑ I can describe a school or workplace. ❑ I can describe a place I have visited or want to visit. ❑ I can ask for help at school, work, or in the community. ❑ I can talk about my daily routine. ❑ I can talk about my interests and hobbies. ❑ I can schedule an appointment. ❑ I can talk about my family history. ❑ I can plan an outing with a group of friends. ❑ I can explain why I was late to class or absent from work and arrange to make up the lost time. ❑ I can tell a friend how I’m going to replace an item that I borrowed and broke/lost. ACTFL Levels Mode IL IL IL IM IM IM IH IH AL AL PS PS IC IC IC IC IC IC IC IC 95 Table 24: ACTFL OPIc level 3 Can-Do Statements I can participate in simple conversations about familiar topics and routines. 
I can talk about things that have happened, but sometimes my forms are incorrect. I can handle a range of everyday transactions to get what I need.

Can-do statements (ACTFL Level; Mode)
❑ I can give some information about activities I did. (IM; IC)
❑ I can talk about my favorite music, movies, and sports. (IM; IC)
❑ I can describe a childhood or past experience. (IM; PS)
❑ I can ask for and follow directions to get from one place to another. (IH; IC)
❑ I can return an item I have purchased to a store. (IH; IC)
❑ I can arrange for a make-up exam or reschedule an appointment. (IH; IC)
❑ I can present an overview about my school, community, or workplace. (AL; PS)
❑ I can compare different jobs and study programs in a conversation with a peer. (AL; IC)
❑ I can discuss future plans, such as where I want to live and what I will be doing in the next few years. (AM; IC)
❑ I can explain an injury or illness and manage to get help. (AM; IC)

Table 25: ACTFL OPIc level 4 Can-Do Statements

I can participate fully and confidently in all conversations about topics and activities related to home, work/school, personal and community interests. I can speak in connected discourse about things that have happened, are happening, and will happen. I can explain and elaborate when asked. I can handle routine situations, even when there may be an unexpected complication.

Can-do statements (ACTFL Level; Mode)
❑ I can present ideas about something I have learned, such as a historical event, a famous person, or a current environmental issue. (IH; PS)
❑ I can give a presentation about my interests, hobbies, lifestyle, or preferred activities. (IH; PS)
❑ I can ask for and provide descriptions of places I know and also places I would like to visit. (IH; IC)
❑ I can explain how life has changed since I was a child and respond to questions on the topic. (AL; IC)
❑ I can discuss what is currently going on in another community or country. (AL; IC)

Table 25 (cont'd)
❑ I can provide a rationale for the importance of certain classes, subjects, or training programs. (AL; PS)
❑ I can talk about present challenges in my school or work life, such as paying for classes or dealing with difficult colleagues. (AM; IC)
❑ I can exchange general information about leisure and travel, such as the world's most visited sites or most beautiful places to visit. (AM; IC)
❑ I can give a presentation about cultural influences on society. (AH; PS)
❑ I can participate in conversations on social or cultural questions relevant to speakers of this language. (AH; IC)

Table 26: ACTFL OPIc level 5 Can-Do Statements

I can engage in all informal and formal discussions on issues related to personal, general or professional interests. I can deal with these issues abstractly, support my opinion, and construct hypotheses to explore alternatives. I am able to elaborate at length and in detail on most topics with a high level of accuracy and a wide range of precise vocabulary.

Can-do statements (ACTFL Level; Mode)
❑ I can interview for a job or service opportunity related to my field of expertise. (AL; IC)
❑ I can present an explanation for a social or community project or policy. (AL; PS)
❑ I can present reasons for or against a position on a political or social issue. (AL; PS)
❑ I can give a clear and detailed story about childhood memories, such as what happened during vacations or memorable events, and answer questions about my story. (AM; IC)
❑ I can exchange general information about my community, such as demographic information and points of interest. (AM; IC)
❑ I can exchange factual information about social and environmental questions, such as retirement, recycling, or pollution. (AM; IC)
❑ I can usually defend my views in a debate. (AH; IC)
❑ I can exchange complex information about my academic studies, such as why I chose the field, course requirements, projects, internship opportunities, and new advances in my field. (AH; IC)
❑ I can provide a balance of explanations and examples on a complex topic. (S; PS)
❑ I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. (S; IC)

APPENDIX B: Phase I principal components analysis

Figure 10: Phase I standardized residual contrast plot.

Table 27: First contrast in the original Rasch model

Cluster 1
35. I can discuss what is currently going on in another community or country. (AL; IC)
36. I can provide a rationale for the importance of certain classes, subjects, or training programs. (AL; PS)
29. I can discuss future plans, such as where I want to live and what I will be doing in the next few years. (AM; IC)

Cluster 3
32. I can give a presentation about my interests, hobbies, lifestyle, or preferred activities. (IH; PS)
33. I can ask for and provide descriptions of places I know and also places I would like to visit. (IH; IC)
49. I can provide a balance of explanations and examples on a complex topic. (S; PS)

Table 28: First contrast in the final Rasch model

Cluster 3
20. I can tell a friend how I'm going to replace an item that I borrowed and broke/lost. (AL; IC)
35. I can discuss what is currently going on in another community or country. (AL; IC)
36. I can provide a rationale for the importance of certain classes, subjects, or training programs. (AL; PS)

Cluster 1
48. I can exchange complex information about my academic studies, such as why I chose the field, course requirements, projects, internship opportunities, and new advances in my field. (AH; IC)
49. I can provide a balance of explanations and examples on a complex topic. (S; PS)
50. I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements.
(S; IC)

APPENDIX C: ACTFL OPIc 1-5 levels and revised Can-Do Statements

Table 29: ACTFL OPIc level 1 Can-Do Statements

I can name basic objects, colors, days of the week, foods, clothing items, numbers, etc. I cannot always make a complete sentence or ask simple questions.

Can-do statements (ACTFL Level; Mode)
1. I can say the date. (NL; PS)
2. I can say the day of the week. (NL; PS)
3. I can say which sports I like and don't like. (NM; PS)
4. I can list my favorite free-time activities and those I don't like. (NM; PS)
5. I can say what someone looks like. (NM; PS)
6. I can talk about my school or where I work. (NM; PS)
7. I can talk about my room or office and what I have in it. (NM; PS)
8. I can talk about what I do on the weekends. (NM; PS)
9. I can answer questions about where I'm going or where I went. (NM; IC)
10. I can present information about something I learned in a class or at work. (NH; PS)

Table 30: ACTFL OPIc level 2 Can-Do Statements

I can give some basic information about myself, work, familiar people and places, and daily routines, speaking in simple sentences. I can ask some simple questions.

Can-do statements (ACTFL Level; Mode)
11. I can describe a school or workplace. (IL; PS)
12. I can describe a place I have visited. (IL; PS)
13. I can describe what my summer plans are. (IL; PS)
14. I can report on a social event that I attended. (IM; PS)
15. I can bring a conversation to a close. (IM; IC)
16. I can schedule an appointment. (IM; IC)
17. I can talk about my family history. (IH; IC)
18. I can explain a series of steps needed to complete a task. (IH; PS)
19. I can explain why I was late to class or absent from work and arrange to make up the lost time. (AL; IC)
20. I can tell a friend how I'm going to replace an item that I borrowed and broke/lost. (AL; IC)

Table 31: ACTFL OPIc level 3 Can-Do Statements

I can participate in simple conversations about familiar topics and routines. I can talk about things that have happened, but sometimes my forms are incorrect. I can handle a range of everyday transactions to get what I need.

Can-do statements (ACTFL Level; Mode)
21. I can give some information about activities I did. (IM; IC)
22. I can talk about my favorite music, movies, and sports. (IM; IC)
23. I can give a short presentation on a current event. (IM; PS)
24. I can ask for and follow directions to get from one place to another. (IH; IC)
25. I can return an item I have purchased to a store. (IH; IC)
26. I can arrange for a make-up exam or reschedule an appointment. (IH; IC)
27. I can present an overview about my school, community, or workplace. (AL; PS)
28. I can compare different jobs and study programs in a conversation with a peer. (AL; IC)
29. I can discuss future plans, such as where I want to live and what I will be doing in the next few years. (AM; IC)
30. I can describe in detail a social event. (AM; PS)

Table 32: ACTFL OPIc level 4 Can-Do Statements

I can participate fully and confidently in all conversations about topics and activities related to home, work/school, personal and community interests. I can speak in connected discourse about things that have happened, are happening, and will happen. I can explain and elaborate when asked. I can handle routine situations, even when there may be an unexpected complication.

Can-do statements (ACTFL Level; Mode)
31. I can present ideas about something I have learned, such as a historical event, a famous person, or a current environmental issue. (IH; PS)
32. I can make a presentation about an interesting person. (IH; PS)
33. I can explain to someone who was absent what took place in class. (IH; PS)
34. I can explain how life has changed since I was a child and respond to questions on the topic. (AL; IC)
35. I can discuss what is currently going on in another community or country. (AL; IC)
36. I can provide a rationale for the importance of certain classes, subjects, or training programs. (AL; PS)

Table 32 (cont'd)
37. I can talk about present challenges in my school or work life, such as paying for classes or dealing with difficult colleagues. (AM; IC)
38. I can recount the details of a historical event. (AM; PS)
39. I can give a presentation about cultural influences on society. (AH; PS)
40. I can participate in conversations on social or cultural questions relevant to speakers of this language. (AH; IC)

Table 33: ACTFL OPIc level 5 Can-Do Statements

I can engage in all informal and formal discussions on issues related to personal, general or professional interests. I can deal with these issues abstractly, support my opinion, and construct hypotheses to explore alternatives. I am able to elaborate at length and in detail on most topics with a high level of accuracy and a wide range of precise vocabulary.

Can-do statements (ACTFL Level; Mode)
41. I can interview for a job or service opportunity related to my field of expertise. (AL; IC)
42. I can present an explanation for a social or community project or policy. (AL; PS)
43. I can present reasons for or against a position on a political or social issue. (AL; PS)
44. I can give a clear and detailed story about childhood memories, such as what happened during vacations or memorable events, and answer questions about my story. (AM; IC)
45. I can exchange general information about my community, such as demographic information and points of interest. (AM; IC)
46. I can exchange factual information about social and environmental questions, such as retirement, recycling, or pollution. (AM; IC)
47. I can usually defend my views in a debate. (AH; IC)
48. I can exchange complex information about my academic studies, such as why I chose the field, course requirements, projects, internship opportunities, and new advances in my field. (AH; IC)
49. I can provide a balance of explanations and examples on a complex topic. (S; PS)
50. I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. (S; IC)

APPENDIX D: Phase II principal components analysis

Figure 11: Phase II standardized residual contrast plot for the original model.

Figure 12: Phase II standardized residual contrast plot for the final model.

Table 34: First contrast in the original Rasch model

Cluster 1
50. I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. (S; IC)
44. I can give a clear and detailed story about childhood memories, such as what happened during vacations or memorable events and answer questions about my story. (AM; IC)
49. I can provide a balance of explanations and examples on a complex topic. (S; PS)

Cluster 3
20. I can tell a friend how I'm going to replace an item that I borrowed and broke/lost. (AL; IC)
25. I can return an item I have purchased to a store. (IH; IC)
41. I can interview for a job or service opportunity related to my field of expertise. (AL; IC)

Table 35: First contrast in the final Rasch model

Cluster 1
48. I can exchange complex information about my academic studies, such as why I chose the field, course requirements, projects, internship opportunities, and new advances in my field. (AH; IC)
49. I can provide a balance of explanations and examples on a complex topic. (S; PS)
50. I can participate actively and react to others appropriately in academic debates, providing some facts and rationales to back up my statements. (S; IC)

Cluster 3
25. I can return an item I have purchased to a store. (IH; IC)
33. I can explain to someone who was absent what took place in class. (IH; PS)
41. I can interview for a job or service opportunity related to my field of expertise. (AL; IC)

REFERENCES

ACTFL. (2012). ACTFL proficiency guidelines - speaking. Retrieved December 12, 2016 from http://www.actfl.org

ACTFL. (2014). ACTFL OPIc familiarization manual.
Retrieved May 14, 2018 from https://www.languagetesting.com/pub/media/wysiwyg/manuals/actfl-fam-manual-opic.pdf

ACTFL. (2015). NCSSFL-ACTFL can-do statements. Retrieved December 12, 2016 from http://www.actfl.org/global_statements

ACTFL. (2017). NCSSFL-ACTFL can-do statements. Retrieved April 27, 2018 from https://www.actfl.org/sites/default/files/CanDos/Can-Do%20Introduction.pdf

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.

Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford: Oxford University Press.

Bailey, A. L. (2007). Introduction: Teaching and assessing students learning English in school. In A. L. Bailey (Ed.), The language demands of school: Putting academic English to the test (pp. 1–26). New Haven, CT: Yale University Press.

Bond, T., & Fox, C. (2015). Applying the Rasch model: Fundamental measurement in the human sciences. New York/London: Routledge.

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061–1071.

Brantmeier, C. (2006). Advanced L2 learners and reading placement: Self-assessment, CBT, and subsequent performance. System, 34, 15–35.

Brown, J. D. (2009). Choosing the right type of rotation in PCA and EFA. JALT Testing & Evaluation SIG Newsletter, 13(3), 20–25. Available from http://hosted.jalt.org/test/PDF/Brown31.pdf

Brown, N. A., Dewey, D. P., & Cox, T. L. (2014). Assessing the validity of can-do statements in retrospective (then-now) self-assessment. Foreign Language Annals, 47, 261.

Butler, Y. G. (2016). Self-assessment of and for young learners' foreign language learning. In M. Nikolov (Ed.), Assessing young learners of English: Global and local perspectives (pp. 291–315). New York, NY: Springer International Publishing.

Butler, Y. G., & Lee, J. (2006). On-task versus off-task self-assessments among Korean elementary school students studying English. The Modern Language Journal, 90, 506–518.

Byrnes, H., & Ortega, L. (2008). The longitudinal study of advanced L2 capacities. New York, NY: Routledge.

Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1, 1–47.

Celce-Murcia, M., Dörnyei, Z., & Thurrell, S. (1995). Communicative competence: A pedagogically motivated model with content specifications. Issues in Applied Linguistics, 6, 5–35.

Cerezo, L., Caras, A., & Leow, R. (2016). The effectiveness of guided induction versus deductive instruction on the development of complex Spanish gustar structures: An analysis of learning outcomes and processes. Studies in Second Language Acquisition, 38, 265–291.

Chalhoub-Deville, M., & Deville, C. (1999). Computer adaptive testing in second language contexts. Annual Review of Applied Linguistics, 19, 273–299.

Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge, UK: Cambridge University Press.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart and Winston.

Cumming, A. H., & Berwick, R. (1996). Validation in language testing. Clevedon, England: Multilingual Matters.

Cummins, J. (1979). Cognitive/academic language proficiency, linguistic interdependence, the optimum age question and some other matters. Working Papers on Bilingualism, 19, 121–129.

Cummins, J. (1980). Psychological assessment of immigrant children: Logic or intuition? Journal of Multilingual and Multicultural Development, 1, 97–111.

Cummins, J. (1981). Age on arrival and immigrant second language learning in Canada: A reassessment. Applied Linguistics, 1, 132–149.

Cummins, J. (2008). BICS and CALP: Empirical and theoretical status of the distinction. In B. Street & N. H. Hornberger (Eds.), Encyclopedia of language and education, Vol. 2: Literacy (2nd ed., pp. 71–83). New York: Springer Science + Business Media.

Davey, T., & Wendler, C. (2001). DIF best practices in statistical analysis [ETS internal memorandum]. Princeton, NJ: ETS.

DeKeyser, R. (2010). Monitoring processes in Spanish as a second language during a study abroad program. Foreign Language Annals, 43, 80–92.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.

Faez, F., Majhanovich, S., Taylor, S., Smith, M., & Crowley, K. (2011). The power of "can do" statements: Teachers' perceptions of CEFR-informed instruction in French as a second language classrooms in Ontario. Canadian Journal of Applied Linguistics/Revue canadienne de linguistique appliquée, 14, 1–19.

Fang, X., Yang, H., & Zhu, Z. (2011). The background of and approach to Can-Do description of language ability: Taking CEFR as an example. Shijie Hanyu Jiaoxue / Chinese Teaching in the World, 25, 246–257.

Garcia, P., & Asención, Y. (2001). Interlanguage development of Spanish learners: Comprehension, production, and interaction. Canadian Modern Language Review, 57, 377–401.

Green, A. (2014). Exploring language assessment and testing. New York, NY: Routledge.

Haladyna, T., Downing, S., & Rodriguez, M. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15, 309–334.

Heilenman, L. K. (1990). Self-assessment of second language ability: The role of response effects. Language Testing, 7, 174–201.

Holmes, S. E. (1982). Unidimensionality and vertical equating with the Rasch model. Journal of Educational Measurement, 19, 139–147.

Hulstijn, J. (2007). The shaky ground beneath the CEFR: Quantitative and qualitative dimensions of language proficiency. The Modern Language Journal, 91, 663–667.

Jones, N. (2002). Relating the ALTE framework to the Common European Framework of Reference. In Council of Europe (Ed.), Case studies on the use of the Common European Framework of Reference (pp. 167–183). Cambridge: Cambridge University Press.

Linacre, J. M. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3, 85–106.

Linacre, J. M. (2016a). WINSTEPS (Version 3.92) [Computer program]. Chicago: MESA Press.

Linacre, J. M. (2016b). A user's guide to WINSTEPS [Computer software manual]. Retrieved December 12, 2016 from http://www.winsteps.com/winman/principalcomponents.htm

Little, D. (2005). The Common European Framework and the European Language Portfolio: Involving learners and their judgments in the assessment process. Language Testing, 22, 321–336.

Malabonga, V. M., Kenyon, D. M., & Carpenter, H. (2005). Self-assessment, preparation and response time on a computerized oral proficiency test. Language Testing, 22, 59–92.

McNamara, T. F. (1991). Test dimensionality: IRT analysis of an ESP listening test. Language Testing, 8, 139–159.

McNamara, T. F. (1995). Modelling performance: Opening Pandora's box. Applied Linguistics, 16, 159–179.

Muthén, L. K., & Muthén, B. O. (2017). Mplus user's guide (8th ed.). Los Angeles, CA: Muthén & Muthén.

National Security Education Program. (n.d.). The Language Flagship. Retrieved December 12, 2016 from http://www.nsep.gov/content/language-flagship

Nikolov, M. (2016). A framework for young EFL learners' diagnostic assessment: 'Can do statements' and task types. In M. Nikolov (Ed.), Assessing young learners of English: Global and local perspectives (pp. 65–92). New York, NY: Springer International Publishing.

North, B. (2000). The development of a common framework scale of language proficiency. New York, NY: Peter Lang.

North, B. (2011). Describing language levels. In B. O'Sullivan (Ed.), Language testing: Theories and practices (pp. 33–59). London, England: Palgrave Macmillan.

North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15, 217–263.

Oscarson, M. (1997). Self-assessment of foreign and second language proficiency. In C. Clapham & D. Corson (Eds.), The encyclopedia of language and education, Vol. 7: Language testing and assessment (pp. 175–187). Dordrecht, The Netherlands: Kluwer Academic Publishers.

Pienemann, M. (1998). Language processing and second language development. Amsterdam/Philadelphia: John Benjamins.

Purpura, J. E., & Turner, C. E. (2014). A learning-oriented assessment approach to understanding the complexities of classroom-based language assessment. Paper presented at the Teachers College, Columbia University Roundtable in Second Language Studies: Roundtable on Learning-Oriented Assessment in Language Classrooms and Large-Scale Assessment Contexts, October 10, 2014, New York, NY. Retrieved from http://www.tc.columbia.edu/tccrisls/

Purpura, J. E., & Turner, C. E. (2015). Learning-oriented assessment in second and foreign language classrooms. In D. Tsagari & J. Banerjee (Eds.), Handbook of second language assessment. Boston, MA: De Gruyter Mouton.

Rasch, G. (1960/80). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.

Ross, S. (1998). Self-assessment in second language testing: A meta-analysis and analysis of experiential factors. Language Testing, 15, 1–20.

Schneider, G., & North, B. (2000). Fremdsprachen können – was heisst das? [Knowing foreign languages – what does that mean?]. Chur/Zürich: Rüegger.

Shin, S.-Y. (2013). Proficiency scales. In C. A. Chapelle (Ed.), The encyclopedia of applied linguistics (pp. 1–7). Oxford, UK: Wiley-Blackwell.

Soneson, D., & Tarone, E. (in press). Picking up the PACE: Proficiency assessment for curricular enhancement. In P. Winke & S. Gass (Eds.), Foreign language proficiency in higher education. New York: Springer.

Spada, N., & Tomita, Y. (2010). Interactions between type of instruction and type of language feature: A meta-analysis. Language Learning, 60, 263–308.

Stansfield, C. W., Gao, J., & Rivers, W. P. (2010). A concurrent validity study of self-assessment and the federal interagency roundtable Oral Proficiency Interview. Russian Language Journal, 60, 299–315. http://www.jstor.org/stable/43669189

Suzuki, Y. (2015). Self-assessment of Japanese as a second language: The role of experiences in the naturalistic acquisition. Language Testing, 32, 63–81.

Sweet, G., Mack, S., & Olivero-Agney, A. (in press). Where am I? Where am I going, and how do I get there?: Increasing learner agency through large-scale self-assessment in language learning. In P. Winke & S. Gass (Eds.), Foreign language proficiency in higher education. New York: Springer.

The National Standards Collaborative Board. (2015). World-readiness standards for learning languages (4th ed.). Alexandria, VA: Author.

Tigchelaar, M. (in press). Exploring the relationship between self-assessments and OPIc ratings of oral proficiency in French. In P. Winke & S. Gass (Eds.), Foreign language proficiency in higher education. New York: Springer.

Tigchelaar, M., Bowles, R. P., Winke, P., & Gass, S. (2017). Assessing the validity of ACTFL can-do statements for spoken proficiency: A Rasch analysis. Foreign Language Annals, 50, 584–600. DOI: 10.1111/flan.12286

Trofimovich, P., Isaacs, T., Kennedy, S., Saito, K., & Crowther, D. (2014). Flawed self-assessment: Investigating self- and other-perception of second language speech. Bilingualism: Language and Cognition, 19, 1–19.

Turner, C. F. (1984). Why do surveys disagree? Some preliminary hypotheses and some disagreeable examples. In C. F. Turner & E. Martin (Eds.), Surveying subjective phenomena, Vol. 2. New York: Russell Sage Foundation.

VanPatten, B., Trego, D., & Hopkins, W. (2015). In-class vs. online testing in university-level language courses: A research report. Foreign Language Annals, 48, 659–668.

Weir, C. (2005). Limitations of the Common European Framework for developing comparable examinations and tests. Language Testing, 22, 281–300.

WIDA. (2014). WIDA can do descriptors. Retrieved May 10, 2018 from http://www.wida.us/standards/CAN_DOs/

Wright, B. (1991). Scores, reliabilities and assumptions. Rasch Measurement Transactions, 5, 157–158.

Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago, IL: MESA Press.

Young, J., Cho, Y., Ling, G., Cline, F., Steinberg, J., & Stone, E. (2008). Validity and fairness of state standards-based assessments for English language learners. Educational Assessment, 13, 170–192. DOI: 10.1080/10627190802394388

Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense.

Zumbo, B. D. (2007). Three generations of DIF analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4, 223–233.

Zwick, R. (2012). A review of ETS differential item functioning assessment procedures: Flagging rules, minimum sample size requirements, and criterion refinement [ETS RR-12-08]. Princeton, NJ: ETS.