SELF-ASSESSMENT: A FEISTY OR RELIABLE TOOL TO ASSESS THE ORAL PROFICIENCY OF CHINESE LEARNERS?

By

Wenyue Ma

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Teaching English to Speakers of Other Languages – Master of Arts

2018

ABSTRACT

SELF-ASSESSMENT: A FEISTY OR RELIABLE TOOL TO ASSESS THE ORAL PROFICIENCY OF CHINESE LEARNERS?

By Wenyue Ma

In this study, I took a close look at the results of oral proficiency self-assessment tests and OPIc (Oral Proficiency Interview-computer) tests taken twice by the same group of students. I did this to explore the role of self-assessment in Chinese language programs. The data were collected as part of a Language Flagship Proficiency Assessment Project. I used data from 80 college students who were studying Chinese. During the spring of two consecutive years, the students took a self-assessment (with NCSSFL-ACTFL Can-Do Statements, 2015) as part of the project, and then immediately took an official ACTFL OPIc with a level of difficulty that was matched to their self-assessment outcome. I analyzed the self-assessment results at both the test and the item level. In general, I investigated whether self-assessment can reliably indicate students' language gains over time, with the benchmark of true gain being (in this study) their OPIc scores. The findings revealed that most students' language trajectories were reflected by the results of the self-assessment. In addition, the accuracy rate of self-assessment was positively correlated with students' proficiency levels. A close examination of the items whose difficulty levels the students misidentified showed that students tended to under-assess rather than over-assess their oral proficiency. The comparison of the scores of the repeated self-assessments and OPIc tests showed that there was no significant difference in how accurately students could self-assess themselves before and after an academic year in a language program.

Keywords: self-assessment, Chinese proficiency test, oral proficiency, validity

This master's thesis is dedicated to Mom and Dad. Thank you for always supporting me.

ACKNOWLEDGEMENTS

The data used in this MA thesis were collected as part of a larger grant project funded by the National Security Education Program's Language Proficiency Flagship Initiative (grant # 2340-MSU-7-PI-093-PO1) awarded to principal investigators Paula Winke and Susan Gass. I, Melody Wenyue Ma, was a Graduate Research Assistant on the project: I served as one of the proctors who administered the test to undergraduate students at Michigan State University, and I worked with the PIs on subsequent data analyses and other research tasks. I borrowed the data from the project, having received the data as "pre-existing" and without identifiers (the names and ID numbers were removed before I received the data). I would like to thank the Flagship Project and the PIs, Drs. Winke and Gass, along with the other proctors, research assistants, and key project personnel who helped me with various tasks, including Dr. Emily Heidrich, the Project Manager; Dr. Angelika Kraemer, the executive director of CeLTA, the unit that hosted the grant project; and Mr. Amaresh Joshi, the project's and CeLTA's data manager, who provided me with the anonymized data in a readable file format. In addition, I would like to express my appreciation to my parents, who have always supported and believed in me throughout my two years of master's studies.
Then I want to express my appreciation to my boyfriend, who keeps me company when I need it, and to my friends Xiaowan Zhang, Zhonghao Wang, Hao Wang, Shinhye Lee, Myoengeun Son, Ian Solheim, Rachel Lin, and Ziyue Deng, who have offered me different kinds of help in both my studies and my life.

TABLE OF CONTENTS

LIST OF TABLES .................................................................. vi
LIST OF FIGURES ................................................................. vii
INTRODUCTION .................................................................... 1
LITERATURE REVIEW ............................................................... 4
METHODOLOGY ..................................................................... 13
RESULTS ......................................................................... 20
DISCUSSION ...................................................................... 29
CONCLUSION ...................................................................... 35
APPENDIX ........................................................................ 37
REFERENCES ...................................................................... 40

LIST OF TABLES

Table 1. Criteria for Accurately and Inaccurately Self-assessing Oral Proficiency ..................... 17
Table 2. Number of Students Taking OPIc at Different Difficulty Levels ................................ 22
Table 3. Number of Students Using Self-Assessment Successfully or Unsuccessfully Tracking their Language Development Trajectories (Percentage Data in Parentheses) ..................................... 22
Table 4. Summary of the Number of Items Responded to by Students (Self-Assessing Accuracy and Error Rate in Parentheses) ............................................................................... 25
Table 5. Agreement Rate of Self-Assessment Responses and OPIc Scores .................................. 26
Table 6. Correlations between the Number of Accurately Self-assessing Items and OPIc Scores (Exact-agreement Approach) ............................................................................... 28
Table 7. Summary of the Contrast between Self-assessment Agreement in Two Years ....................... 28

LIST OF FIGURES

Figure 1. Descriptive statistics of the number of students and their OPIc scores in two years ......... 21
Figure 2. Two approaches presenting agreement rate of students at different oral proficiency levels ... 25

INTRODUCTION

To assess learners' language proficiency, two main approaches can be adopted: examinations, or self-assessment. The former is sometimes considered the only reliable way of assessing one's foreign language proficiency, whereas the latter, although drawing growing attention and having been applied in many language test administrations, is often criticized (e.g., Davidson & Henning, 1985) as an improper assessment method.
Most arguments against the use of self-assessment focus on the inherent unreliability of subjective evaluation. However, the subjectivity of the evaluation does not necessarily imply that self-assessment is invalid. In fact, self-assessment has been widely used in diverse language learning and teaching contexts, ranging from small-scale, classroom-based language learning assessments to high-stakes language proficiency tests. In the ACTFL (American Council on the Teaching of Foreign Languages) Oral Proficiency Interview–computer (OPIc) test (ACTFL, 2012a) and the ACTFL Writing Proficiency Test (WPT) (ACTFL, 2012b), test candidates are asked to complete a self-assessment survey at the beginning of the assessment. The test form, the test content, and to some extent even the test result depend on their responses to the questions in the survey. Similarly, the Avant PLACE test, which was developed by the Center for Applied Second Language Studies (CASLS), starts with a self-evaluation in which students need to assess their own language proficiency before answering the questions in the test. The results of these high-stakes tests can have a great impact on important decisions related to students, educators, and institutions, such as the conferral of a degree, admission to a program, and the fulfillment of job requirements. Self-assessment in the classroom context, by contrast, can simply be a questionnaire with a couple of questions in which students offer their own evaluation of how much they have learned from the class. This feedback can be crucial to their language learning, and may provide valuable implications for language teachers.

The sustained popularity of self-assessment has prompted researchers to conduct empirical studies on various research questions, such as examining the validity of self-assessment in different language teaching and learning contexts, investigating how the features of a self-assessment survey influence its effectiveness or accuracy, and teasing apart the test-taker characteristics that may affect self-assessment results. In the current study, which is an extension of these previous studies, I focus mainly on college learners of Chinese who completed an oral proficiency test and a self-assessment survey with Can-Do statements twice, before and after an academic year in a language program. The students took these measures as part of their Chinese program's participation in a Language Proficiency Flagship grant, a program that was implemented to give recommendations to the language programs at the university in regard to their established language program goals and how they were meeting them. In this study, which was carved out from the grant project, I examined whether the results of self-assessment could be used as a reliable tool to track Chinese program students' language trajectories, and I looked at the results on both the global and the item level. I also investigated how accurately these students could self-evaluate their oral proficiency on the item level and whether they could successfully perceive the difficulty level of each NCSSFL-ACTFL Can-Do statement. In addition, I tried to tease apart the factors that potentially influence the result of self-assessment, such as students' language proficiency levels and the different criteria for determining whether the difficulty level of a task was successfully identified by the student.
In the next section, I review both theory-based and empirical studies on self-assessment, including (a) the rationale for the use of self-assessment, (b) the examination of the validity of self-assessment, and (c) the factors that may have an influence on the results of self-assessment.

LITERATURE REVIEW

Why Self-Assessment?

In language studies, self-assessment has been widely used and serves a variety of purposes, including placement, program evaluation, judgement of attitudes and socio-psychological differences, measurement of course grades, learning diagnosis, and feedback to learners (Henning, 1987). Self-assessment has been criticized by some researchers, however, who consider the use of self-assessment as no more than a subjective and cursory self-grading done by learners themselves (Dickinson, 1987; Patri, 2002). Raymond and Gisèle (1985) held a different view: they began by making a distinction between the concepts of self-assessment and self-grading, although the former, they noted, might include the latter in some cases. In addition, they suggested that self-assessment was not "an informal exercise based loosely on the student's intuition" (p. 674). Furthermore, researchers have suggested that the accuracy of self-assessment can be improved in several ways (e.g., Taras, 2001; Delgado, Guerrero, Goggin, & Ellis, 1999). I will elaborate more on this in a later section. In this section, I will mainly focus on the advantages of the use of self-assessment.

Shared assessment burden. In 1985, Raymond and Gisèle conducted a study in which they carried out several experiments leading to the use of self-assessment as a placement test. Raymond and Gisèle (1985) wrote in their article that the use of a self-assessment questionnaire could alleviate test administration burdens. Recruiting proctors is not needed in self-assessment; after all, for students, it is meaningless and unnecessary to cheat in self-assessments. Apart from that, students can even take questionnaires home and give them back to their teachers after finishing them, thus exempting teachers or test administrators from routine examination procedures such as establishing testing schedules and finding appropriate testing rooms. This was in line with a claim made by Dickinson (1987), who wrote that learner participation in evaluation is beneficial because the students then share the burden of assessment with the teacher.

Beneficial effects on learning. Learners' joint efforts in the assessment process can be considered beneficial to their language development. This advantage of self-assessment was addressed by Oscarson (1989), who attached great importance to learners' ability to make reliable and valid autonomous judgements of their own oral proficiency. According to him, these judgements are a crucial part of the learning process. Similarly, Cardoso (2010) claimed that self-assessment provides learners with opportunities to gain control over their learning, thus prompting them to reflect on their learning and determine whether their gains are concomitant with their efforts and goals. Cavana and Luisa (2012) investigated the effects of self-assessment on learners' learning styles and learning strategies in a pilot project by asking 17 volunteer students of English to use the electronic European Language Portfolio (eELP), an electronic version of the language biography used to assess learners' language proficiency.
They found that the use of the eELP could affect the learners' learning positively by giving them insight into their learning process, increasing their self-confidence, and helping them to set learning goals. The findings of a number of empirical studies imply that the use of development-oriented self-assessment would ultimately lead to enhanced learning productivity and learner autonomy, greater motivation, less frustration, and higher learning retention rates (Peirce, Swain, & Hart, 1993; Rivers, 2001). For example, in Rivers's study, she investigated whether students could correctly assess their progress, learning styles, and strategy preferences. All learners in her study were found to have self-directed learning behaviors based on their self-assessments.

The Validity of Self-Assessment

In recent years, a growing number of researchers have investigated the validity of self-assessment (Butler & Lee, 2006; Cardoso, 2010; Delgado et al., 1999; Dolosic, Brantmeier, Strube, & Hogrebe, 2016; Kaderavek, Gillam, Ukrainetz, Justice, & Eisenberg, 2004; Malabonga, Kenyon, & Carpenter, 2005). In second language studies, examining the correlation between the results of self-assessment and performance in specific skill areas is the most commonly used approach to evaluating whether a self-assessment is valid. However, the findings are mixed. Below I review these studies and what they have found.

Positive correlation between SA and performance. In many cases the authors of research studies correlating test scores with self-assessment scores have indicated that the two measures of the learners' performance were highly correlated (Stansfield, Gao, & Rivers, 2010; Dolosic, Brantmeier, Strube, & Hogrebe, 2016; Malabonga, Kenyon, & Carpenter, 2005). In other words, the validity of self-assessment was examined and found to be high. For example, Stansfield et al. (2010) investigated whether self-assessment scores from 323 learners of eight different languages could be utilized to provide information for the National Language Service Corps (NLSC), thus enabling the Corps to make important decisions in terms of screening applicants into the program. Specifically, the authors and the Corps wanted to examine whether the score of the self-assessment could be used to accurately identify whether applicants' target language proficiency was adequate for them to perform their jobs. In the study, each applicant completed a two-part self-assessment composed of a series of Can-Do statements and a simplified set of ILR (Interagency Language Roundtable: http://www.govtilr.org/Skills/ILRscale1.htm) skill level descriptions, and the score of the self-assessment was a composite score of the two parts. The researchers found that the Oral Proficiency Interview (OPI) scores received by the applicants were highly correlated with the oral self-assessment scores at a statistically significant level.

In Dolosic et al.'s (2016) study, the authors examined the relationship between self-assessment and oral production in French. They included 24 students who were enrolled in a French language summer camp. Although the students were not able to accurately self-assess their French proficiency upon arrival at the intensive language-learning summer camp (pre-test), they demonstrated great improvement in the accuracy of self-assessment at the end of the program (post-test). In another study, Malabonga et al.
(2005) investigated whether self-assessment was a suitable tool to help examinees choose an appropriate starting level on the Computerized Oral Proficiency Instrument (COPI: http://www.cal.org/resource-center/publications/copi). They had 55 learners of Arabic, Chinese, or Spanish come into a laboratory setting to take two exams, the Simulated Oral Proficiency Instrument (SOPI) and then the COPI, with the order of the tests randomized. Their findings revealed that self-assessment was a reliable tool for assigning examinees to test tasks at appropriate difficulty levels.

Negative or no significant correlation between SA and performance. However, not all studies have found positive correlations between the results of self-assessment and objective measures of language proficiency. For example, Peirce et al. (1993) examined whether self-assessment is a valid and reliable indicator of French proficiency. They had approximately 500 learners in French immersion programs take a self-assessment and a French proficiency test successively. Results showed only weak correlations between self-assessments of language proficiency and learners' later tested performance.

Another study was conducted by Lim (2007) with learners of English. She compared the results of self-assessment of learners' oral proficiency with ratings by their tutors. She found that although self-assessment could be a potentially new way for learners to assess their own language proficiency, some learners, especially those at a lower level of language proficiency, lacked the objectivity and confidence to self-assess correctly, and found it difficult to identify the weaknesses of both their own and others' language skills.

The results of the abovementioned studies are in line with the study by Brantmeier (2006), in which 71 Advanced L2 learners of Spanish participated. She investigated whether the scores of self-assessments could be utilized to accurately predict learners' reading performance and subsequent reading achievement. However, the findings showed that self-assessment was not a reliable indicator for either placement purposes or subsequent performance. In another study, Brantmeier and Vanderplank (2008) investigated whether pre-test self-assessment ratings of reading, as measured via both descriptive and criterion-referenced instruments, could reliably predict achievement on a computer-based test. They had 359 learners of Spanish self-assess their L2 reading abilities and then complete several tests for placement. Based on their findings, they concluded that self-assessment could provide useful, albeit fairly limited, reliability for reading placement purposes. In more detail, when learners' comprehension was measured via sentence completion and multiple-choice items, a descriptive and criterion-referenced self-assessment could be an appropriate indicator of both reading scores and subsequent classroom performance. When it came to a measure of reading comprehension with a writing task for recalling short stories, the criterion-referenced questionnaire was not a reliable predictor.

The mixed results of the studies I have reviewed above seem to align with suggestions concerning the "high-stakes" issue of placement, namely that self-assessment might not be best for placement testing: Both Bachman and Palmer (1981) and Brown and Gerhardt (2002) advocated for the use of self-assessment in non-high-stakes testing situations, such as for formative assessment, self-monitoring, or lower-stakes assessment purposes.
But a general trend in the research seems to be that self-assessment might work as an economical way to help students choose their starting point in a high-stakes computer-adaptive test (Malabonga et al., 2005).

Factors Influencing the Accuracy of Self-Assessment

Apart from the issue of validity, some researchers have been specifically interested in the factors that may have an influence on the validity of self-assessment, on which I would like to elaborate in this section.

Feedback. Some researchers emphasize the importance of providing feedback to learners to promote more accurate self-assessment. In general, self-assessments of language skills or abilities were found to be more accurate and reliable when learners received feedback regarding their performance on objective measures of the targeted skills or abilities. For example, Delgado et al. (1999) examined the following two areas: first, how accurately bilingual students (80 bilingual Spanish-English college students) judged their language competence; and second, whether providing feedback to students could influence the results of self-assessment. The findings of their research showed that feedback from the objective test improved self-assessment accuracy in both languages, but more markedly in Spanish.

The positive influence of feedback on the accuracy of self-assessment was also verified in another study, by Taras (2001). Her study was not on language learning, but rather on self-assessment of skills in higher education in general. The students were asked to prepare a translation text and a translation commentary, which were subsequently returned with their tutors' feedback. The tutors would withhold the students' grades until the students had worked through the feedback and completed a self-assessment. The findings of the study showed that students who received tutor feedback prior to self-assessment were better able to identify their own weaknesses and errors. She went further in another study (Taras, 2003), in which 17 final-year undergraduate students carried out two types of self-assessment: self-assessment prior to peer and tutor feedback, and self-assessment incorporating feedback as an integrated part. The results revealed that students were overwhelmingly in favor of the latter and did better in the latter context.

Age. While the participants in most studies on self-assessment are university students, there is some research involving children. In these studies, researchers looked particularly at the role age plays in influencing the results of self-assessment. One of the studies (Kaderavek, Gillam, Ukrainetz, Justice, & Eisenberg, 2004) was conducted with 401 children whose ages ranged from 5 to 12 years old. The researchers mainly focused on learners' metacognitive ability and oral narrative production. They had the children take the Test of Narrative Language (TNL) and had them self-evaluate their narration ability. The results of their study demonstrated that younger children could not self-assess their narrative performance as accurately as older children could. In addition, there was a statistically significant difference in narrative production performance between children who evaluated themselves as less competent speakers and those who evaluated themselves as more skilled speakers.
Their findings corresponded well to the results of another study, by Butler and Lee (2006), who suggested that students younger than the fourth grade level were not good at self-assessment, so administering self-assessments to them may not be a good choice.

Gender. Another important issue related to the accuracy of self-assessment that researchers have been concerned about is gender differences. The results of a study by Pallier (2003) showed that, compared with women, men tended to consistently rate themselves higher, which implied an overestimation of their performance, and this tendency for men to express higher levels of confidence than women in self-assessment appeared to remain consistent across the age ranges. In the abovementioned study (Kaderavek et al., 2004), the researchers also examined how the accuracy of self-assessment varied in relation to gender. They found that male students were more likely to overestimate their narrative skill than female students were.

Language proficiency. Learners' language proficiency is another factor that may play a role in the results of self-assessment, and its influence has been investigated by several researchers. For example, Brantmeier, Vanderplank, and Strube (2012) provided details demonstrating that, with the use of self-assessment, students at the Advanced stages of proficiency were better than students of lower proficiency at identifying the skills in which they were relatively stronger or weaker. Other researchers (Kaderavek et al., 2004) drew a similar conclusion: in comparison with children who had more advanced speaking skills, children with poorer narrative skills were more likely to overestimate their narrative production performance.

Research Questions

To my knowledge, based on the findings from the prior literature, only a few researchers (e.g., Dolosic et al., 2016) have investigated whether students who self-assess multiple times actually assess themselves as getting better in the targeted skills. In addition, research on the items in self-assessments has been narrowly focused, so a fine-grained look at the items in self-assessment is needed. Accordingly, the following research questions were established for the present study:

1. Can the results of self-assessment reflect students' language gain or attrition over an academic year?
2. Can students perceive the difficulty level of the questions in the self-assessment? Does this perception vary among students at different proficiency levels?
3. Can students better self-assess their oral proficiency after they spend an academic year learning?

METHODOLOGY

Participants

The participants in the current study were students in a Chinese language program at a large Midwestern U.S. university. The data were collected as part of a larger grant-funded project from 42 students who took the 50-statement self-assessment questionnaire and the Oral Proficiency Interview – computer (OPIc) (Language Testing International, 2012) in spring 2015 and spring 2016, and from 40 students who took the 50-statement self-assessment questionnaire and the OPIc in spring 2016 and spring 2017. Three things need to be noted here: First, students who did not have complete sets of OPIc scores were not included in the study. Second, in my dataset, there were 20 students who took the test three times in three consecutive years.
To make full use of the data, I randomly and evenly divided these 20 students into two groups and used, for each of these 20, only two of their three test results, as aligned with their group assignment: either their 2015-2016 data or their 2016-2017 data. This allowed me to keep the 20 students in the participant pool. Third, data that met either of the two following conditions were eliminated for analytic purposes: missing data (students did not take the test) and the data of those who received a BR (below range) or a UR (unratable). Accordingly, the data of four students were excluded due to a BR (N = 3) or a UR (N = 1).

Materials

The materials used in the study include two components: a computer-adaptive self-assessment questionnaire and an official ACTFL OPIc. The questionnaire was developed by the PI (Winke) and her research assistants in consultation with the ACTFL assessment team; it included five sets of ten Can-Do statements (50 in total) that were selected from the fuller list of NCSSFL-ACTFL Can-Do Statements (ACTFL, 2015). Each set of statements covered a range of ACTFL levels, with each statement targeting a certain level of proficiency (e.g., the first set of statements covered ACTFL levels from Novice Low (NL) to Novice High (NH), and item 3, "I can say which sports I like and don't like," targeted the level Novice Mid (NM)). Likert scales were used in the questionnaire: Participants were asked to rate how well they could perform the task described in each statement on a scale ranging from one to four: 1 ("I cannot do this yet"), 2 ("I can do this with much help"), 3 ("I can do this with some help"), and 4 ("Yes, I can do this well"). The items of the questionnaire administered in spring 2015 and spring 2016 were the same. However, a revised version, with 15 items taken off and another 15 items added, was administered in spring 2017; this was done after an examination of the validity of the statements, in which the 15 items that were taken off were identified as misfitting (Tigchelaar, Bowles, Winke, & Gass, 2017), so their continued use would have been problematic. Thus, my analyses in this paper are based on the 35 common items used across the questionnaire administrations (see Appendix).

Procedure

The Chinese language learners were told by their instructors to take the OPIc as a course requirement, although their grades were not influenced by their performance. The self-assessment and OPIc, which together lasted roughly 50 minutes, were administered by proctors within the university's language learning computer lab, which is maintained by the language programs' center on language learning. The test takers needed to complete a background questionnaire, the self-assessment, and then an official ACTFL OPIc with a level of difficulty that was matched to their self-assessment outcome (there were five OPIc forms, as will be described below). Learners first took the computer-adaptive self-assessment test with five levels and, based on the outcome of the self-assessment, were recommended by the self-assessment algorithm output to take one of the five levels of the OPIc.

On the self-assessment, the five levels were computer-adaptive, and here I explain this more. The learners indicated on each level (which had 10 self-assessment questions) the extent to which they could do well on the Can-Do statements within that set, and if they scored high enough (80% or higher), they moved on to the next set of 10 statements.
In the last set, the learners would be recommended to take the Level 5 OPIc if they indicated that they could do at least 8 tasks very well, or the Level 4 OPIc if they indicated that they could not do 8 of the last 10 tasks very well. The items included in each level and the cut-off score were determined by the PI and her research assistants in consultation with ACTFL assessment experts to ensure that the test would work as well as possible. Student performances on the OPIc were rated by official certified ACTFL raters (as hired by Language Testing International), and students were informed of their proficiency level approximately two weeks after testing. Table 1 is a summary of the number of participants completing each set of statements and the number of participants who took each level of the OPIc test.

Data Analysis

Before further steps were taken, I recoded some data for analytical purposes. First, I assigned a numeric value to each ACTFL proficiency level on a scale from 1 for Novice Low (NL) to 10 for Superior (S). Second, I recoded the responses to the items in the survey to meet the needs of this study, even though in the original survey a Likert scale ranging from one to four was used, as designed by the original creators of the survey (see Tigchelaar, Bowles, Winke, & Gass, 2017): 1 ("I cannot do this yet"), 2 ("I can do this with much help"), and 3 ("I can do this with some help") were recoded as 0, while 4 ("Yes, I can do this well") was recoded as 1. The rationale for this recoding was that a student should only move on to the next difficulty level if he or she could do most tasks in the level well ("Yes, I can do this well").

Research question one

To answer the first research question, on whether the self-assessment can reflect students' language gain or attrition over an academic year, I present some descriptive data. In this study, students' language gain or attrition was measured by comparing the two OPIc scores they received before and after an academic year. I report the results of the self-assessment in two ways: the difficulty level of the OPIc test the students took, and their responses to each individual item. My rationale for presenting item responses as well as the difficulty level of the test is that, as mentioned earlier, the items included in the data analysis are the same 35 items that were used consistently across the three academic years, which means that there can be some discrepancies between the inferences drawn from these 35 items and those drawn from the full 50 items. The difficulty level suggested for each student can serve as supplementary evidence apart from the item responses. With the data at hand, I answer the first research question by tallying and comparing the number of items in the survey assumed to be attainable by students in the two years with their OPIc test scores. For example, suppose a student received Intermediate Mid (IM; 5) and Advanced Low (AL; 7) on the OPIc tests before and after an academic year, took the third and fourth difficulty levels of the OPIc test, respectively, and indicated among the items completed that 22 and 31 items, respectively, aligned with what he or she could do each year. In this case, the result of the self-assessment can be considered to have successfully tracked the student's language gain over the year. However, if the student had received Advanced Low (AL; 7) in the first year and Intermediate Mid (IM; 5) in the second year, everything else being the same, the result of the self-assessment could not be considered successful.
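To make the tallying concrete, the following is a minimal sketch of the binary recoding and of the success criterion applied to the worked example above (and summarized in Table 1 below). The function names, data layout, and exact handling of boundary cases are my own assumptions, not the project's actual code.

```python
# A minimal sketch of the binary recoding and of the criterion for whether a
# student's self-assessment tracked his or her OPIc change. Names and boundary
# handling are assumptions, not the project's code.

def recode(response: int) -> int:
    """Binary recoding: only 'Yes, I can do this well' (4) counts as attainable."""
    return 1 if response == 4 else 0

def attainable_count(responses) -> int:
    """Number of tasks a student marked as attainable (rated 4)."""
    return sum(recode(r) for r in responses)

def tracked_successfully(s1, s2, t1, t2, l1, l2) -> bool:
    """Did the self-assessment track the OPIc change from year 1 to year 2?

    s1, s2: OPIc scores (1 = NL ... 10 = S); t1, t2: attainable-item counts;
    l1, l2: difficulty levels of the OPIc forms taken in the two years.
    """
    if s1 < s2:                       # language gain
        return t1 < t2 or l1 < l2
    if s1 > s2:                       # language attrition
        return t1 > t2 or l1 > l2
    return t1 == t2 and l1 == l2      # no change in OPIc score

# The worked example: IM (5) -> AL (7), levels 3 -> 4, 22 -> 31 attainable items.
print(tracked_successfully(5, 7, 22, 31, 3, 4))  # True
```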
Table 1 illustrates the situations in which the result of the self-assessment is considered successful or not.

Table 1. Criteria for Accurately and Inaccurately Self-assessing Oral Proficiency

Accurate              Inaccurate
S1 > S2; T1 > T2      S1 > S2; T1 ≤ T2; L1 ≥ L2
S1 > S2; L1 > L2      S1 < S2; T1 ≥ T2; L1 ≤ L2
S1 < S2; T1 < T2      S1 = S2; T1 ≠ T2; L1 ≠ L2
S1 < S2; L1 < L2
S1 = S2; T1 = T2
S1 = S2; L1 = L2

Note. S1: a student's first-year OPIc score; S2: a student's second-year OPIc score; T1: the number of tasks assumed to be attainable in the first year; T2: the number of tasks assumed to be attainable in the second year; L1: the difficulty level of the OPIc test taken in the first year; L2: the difficulty level of the OPIc test taken in the second year.

Research question two

The second research question mainly dealt with the extent to which students could perceive the difficulty level that the items in the survey targeted. To gain a good understanding of how well the students assessed their own language proficiency, I divided the items that they responded to into three different groups: the accurately-assessing group, the over-assessing group, and the under-assessing group. In the accurately-assessing group, students were able to correctly identify the items targeting proficiency levels lower or higher than their own oral proficiency level (items with a difficulty level the same as the learner's proficiency level were all included in this group). The target difficulty level of each individual item is listed in the NCSSFL-ACTFL Can-Do Statements (ACTFL, 2015), whereas a student's proficiency level is shown by his or her OPIc test score. In the over-assessing group, students claimed to be able to perform the task well (rated the item "4") when their oral proficiency did not reach the target difficulty level of the task. The under-assessing group is the opposite: students rated the task 1, 2, or 3 when their OPIc score was higher than the target difficulty level.

To address the criticism of the use of self-assessment from researchers who claim that the accuracy of self-assessment can be influenced by learners' experience and proficiency (Ehrlinger, Johnson, Banner, Dunning, & Kruger, 2008; Caputo & Dunning, 2005), and the argument that a hierarchy of item difficulty levels in the Can-Do Statements might not be perceived successfully by students (N. A. Brown, Dewey, & Cox, 2014), I adopted two ways of defining the accurately-assessing group. First, I calculated the rate of exact agreement, where a student had to precisely identify the difficulty level of the items at, below, or above his or her oral proficiency level. Second, I calculated adjacent agreement, in which a student who misidentifies items by only one level, that is, one level below or above his or her oral proficiency level, still counts as accurate. Both of these methods are described in full by Carr (2011). By using these two different approaches to tallying the number of items and the students in each group, I gain a better understanding of the extent to which the students could perceive the difficulty levels that the questions in the self-assessment target.
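To make the classification concrete, here is a minimal sketch of how a single item response could be labeled under the exact-agreement and adjacent-agreement criteria just described. The function name, the numeric level coding, and the handling of ties are my own assumptions, not the project's code.

```python
# A minimal sketch of the item-level classification described above.
# item_level and student_level are numeric ACTFL codes (1 = Novice Low ... 10 = Superior);
# the function name and boundary handling are assumptions, not the project's code.

def classify_item(response: int, item_level: int, student_level: int,
                  tolerance: int = 0) -> str:
    """Label one response as 'accurate', 'over', or 'under' assessment.

    tolerance = 0 implements the exact-agreement criterion; tolerance = 1
    implements the adjacent-agreement criterion, in which a mismatch of one
    level still counts as accurate.
    """
    if response == 4 and item_level > student_level + tolerance:
        return "over"      # claims to do well a task above the student's level
    if response < 4 and item_level < student_level - tolerance:
        return "under"     # does not claim a task below the student's level
    return "accurate"

# Example: an Intermediate Mid student (5) rates an Intermediate High item (6) with a 4.
print(classify_item(4, item_level=6, student_level=5))               # over
print(classify_item(4, item_level=6, student_level=5, tolerance=1))  # accurate
```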
In addition, to examine the assumption that students' oral language proficiency played a role in how accurately they assessed themselves, I took a closer look at the relationship between the agreement rate of the self-assessment results and the students' OPIc scores by calculating the accuracy rate of the students' responses to the items in the survey against their OPIc scores. I present the data in different proficiency-level categories, which enables me to tease apart this proficiency-level effect on the agreement or disagreement between the difficulty level of an item and students' OPIc scores. To achieve this, for each student, I tallied the total number of items he or she answered and the numbers that he or she accurately, over-, and under-self-assessed, respectively, in the two years. Accordingly, I calculated six percentages from the above-noted values and the total number of items the student answered in the two surveys.

Research question three

As for the last research question, which concerns the comparison of the accuracy of self-assessment across learners' language development, I compared their responses to the items in the surveys with the corresponding OPIc test scores received in either the academic years 2015 and 2016 or the years 2016 and 2017. Specifically, I calculated Spearman's rank correlation coefficient between the number of items they accurately self-assessed and their OPIc scores in the two years, to see whether one year of study in the language program, or their language development, influenced the way they assessed themselves. As with the second research question, I incorporated exact agreement and adjacent agreement into the analysis. Because the same group of students took the test twice, I used a paired t test to measure whether the differences between the percentages of accurately assessed, under-assessed, and over-assessed items were significant.

RESULTS

Figure 1 displays the two-year data on the number of students at each OPIc score level on the ACTFL scale. As shown in the chart, the highest proficiency level that the students reached in both years is Advanced Mid (AM), and only a very small proportion of students (N = 8, 5.0%) reached the Advanced level. The oral proficiency of most students (N = 123, 76.9%) clustered between Novice High (NH) and Intermediate Mid (IM). In addition, more students received a higher OPIc score in the second year, which is reflected by the increased number of students at the levels between NH and IM combined with fewer students at the first two Novice levels in the second year.

Table 2 presents the difficulty levels at which students took their OPIc tests, based on their responses to the items in the self-assessment survey. It can be seen from the table that most of the students took the first two difficulty levels of the OPIc test. In other words, only a few students responded to the first 10 or 20 questions in a way that allowed them to cross the threshold of the third difficulty level of the test. Corresponding to the OPIc test results shown in Figure 1, more students took the higher levels of the test in the second year. This is clearly reflected by the fact that the majority of the students (N = 61, 76.3%) took the first difficulty level of the OPIc test in the first year.
[Figure 1 appears here: a bar chart of the number of students (y-axis) at each OPIc score level on the ACTFL scale (x-axis), with separate bars for Year 1 and Year 2.]

Figure 1. Descriptive statistics of the number of students and their OPIc scores in two years
Note. Novice Low (NL), Novice Mid (NM), Novice High (NH), Intermediate Low (IL), Intermediate Mid (IM), Intermediate High (IH), Advanced Low (AL), Advanced Mid (AM)

Table 2. Number of Students Taking OPIc at Different Difficulty Levels

Difficulty level   Year 1   Year 2
1                  61       55
2                  14       17
3                  2        4
4                  2        1
5                  1        3

Table 3. Number of Students Using Self-Assessment Successfully or Unsuccessfully Tracking their Language Development Trajectories (Percentage Data in Parentheses)

Group                n    Successful   Not Successful
Language gain        35   18 (.51)     17 (.49)
Language attrition   17   10 (.59)     7 (.41)
No difference        28   21 (.75)     7 (.25)
Total                80   49 (.61)     31 (.39)

The overall descriptive statistics on how well the results of the self-assessment can be used to predict students' language proficiency trajectories are presented in Table 3. Specifically, the results concern how many tasks a student indicated he or she could complete with confidence in the survey and which difficulty level of the OPIc test he or she took. More details about the criteria for the results successfully or unsuccessfully exhibiting students' language gain or attrition can be found in Table 1. Both the number of students and the percentage of students in each group are shown. The data in Table 3 show that 35 out of 80 students received a higher OPIc score the second time they took the test, and among this group of students, about half responded to the self-assessment in line with the improvement in their oral proficiency reflected in the increased OPIc scores they received in the second year. As for the language attrition group, after an academic year, 17 students' OPIc scores declined by at least one level on the ACTFL scale. For 10 of these 17 students, the results of the self-assessment are considered to have correctly predicted their language loss. In addition, 28 students received exactly the same scores on their OPIc tests before and after an academic year, and this lack of change in scores was accurately reflected in 21 students' self-assessment results. The data show that the overall success rate of the results of the self-assessment is moderately satisfactory (.61). Among these three groups, the success rate of the no-difference group is the highest (.75). Specifically, the students whose language proficiency levels remained the same across the two years tended to respond to the items consistently in the two years. By contrast, the students who received higher OPIc scores did not respond to the items as accurately as the other two groups in a way that reflected their language proficiency improvement.

With respect to the second research question, which concerns students' perception of the difficulty level of each item in the self-assessment survey, Table 4 displays the numbers and rates of items that students accurately, under-, and over-assessed. The results suggest that although in the exact-agreement approach only about half of the items (53%) in the self-assessment survey were accurately identified by the students as above, below, or at their oral proficiency levels, the accuracy rate increased to a great extent when the adjacent-agreement approach was adopted, under which for 74% of the items the item difficulty level and the students' responses corresponded accurately to their oral proficiency level.
Besides that, the data indicate that among the items whose target difficulty levels mismatched students' responses regarding their oral proficiency, the difficulty levels of most of the items were below rather than above the students' oral proficiency, and this contrast is even sharper in the exact-agreement group (43% vs. 4%). In other words, students were more likely to under-assess rather than over-assess themselves in terms of how well they could complete the tasks. To examine whether this tendency of students to assess themselves accurately or inaccurately is related to their language proficiency levels, two scatter plots were drawn, based respectively on the exact-agreement approach (left) and the adjacent-agreement approach (right), which present the relationship between students' oral language proficiency levels and the agreement rate of students' responses to the items with regard to their OPIc scores. It can be seen from these two figures that the agreement rate is consistently higher for the students at the Novice level (OPIc scores: 1-3) and the Advanced level (OPIc scores: 7-8) than for the students at the Intermediate level, which is especially clearly presented by the data analyzed using the adjacent-agreement approach. However, the contrast between the way the dots are scattered in these two figures is not conspicuous in the Intermediate-level score band, where the agreement rate among this group of students spreads out from 0 to 1. This proficiency-level-related agreement rate pattern can also be illustrated by the descriptive statistics in Table 5, where the mean agreement rate of the Novice group (.99) and the Advanced group (.80) is much higher than that of the Intermediate group (.51). As shown in the second half of the table, the contrast is especially notable when the adjacent-agreement approach was adopted.

Table 4. Summary of the Number of Items Responded to by Students (Self-Assessing Accuracy and Error Rate in Parentheses)

Exact agreement
Group                   n      M       SD      95% CI
Accurately-assessing    816    5.10    5.32    [4.28, 5.92]
                               (.53)   (.34)   ([.48, .58])
Under-assessing         447    2.79    2.18    [2.46, 3.13]
                               (.43)   (.37)   ([.37, .48])
Over-assessing          104    .64     2.03    [.16, .53]
                               (.04)   (.10)   ([.01, .03])
Total                   1366   8.54    5.79    [7.64, 9.43]

Adjacent agreement
Group                   n      M       SD      95% CI
Accurately-assessing    1062   6.64    5.75    [5.75, 7.53]
                               (.74)   (.34)   ([.69, .79])
Under-assessing         249    1.56    2.04    [1.24, 1.87]
                               (.24)   (.34)   ([.19, .29])
Over-assessing          55     .34     1.21    [.16, .53]
                               (.02)   (.06)   ([.01, .03])
Total                   1366   8.54    5.79    [7.64, 9.43]
Note. CI = confidence interval.

[Figure 2 appears here: two scatter plots of agreement rate (exact agreement, left; adjacent agreement, right) against OPIc test scores on the ACTFL scale.]

Figure 2. Two approaches presenting agreement rate of students at different oral proficiency levels. Novice Low (NL) = 1, Novice Mid (NM) = 2, Novice High (NH) = 3, Intermediate Low (IL) = 4, Intermediate Mid (IM) = 5, Intermediate High (IH) = 6, Advanced Low (AL) = 7, Advanced Mid (AM) = 8.

Table 5. Agreement Rate of Self-Assessment Responses and OPIc Scores

Agreement Rate       Oral Proficiency Level   n     M (SD)      95% CI
Exact agreement      Novice                   72    .56 (.37)   [.48, .65]
                     Intermediate             80    .49 (.31)   [.42, .56]
                     Advanced                 8     .69 (.29)   [.49, .89]
                     Total                    160   .53 (.34)   [.48, .58]
Adjacent agreement   Novice                   72    .99 (.06)   [.98, 1]
                     Intermediate             80    .51 (.33)   [.44, .59]
                     Advanced                 8     .80 (.33)   [.58, 1]
                     Total                    160   .74 (.34)   [.69, .78]
Note. CI = confidence interval.
In addition, the data in Table 6 display the Spearman's rank correlation coefficients between the students' OPIc scores and the number of items that students accurately identified based on their own oral proficiency levels in the two years. The rationale for presenting these data is to tease apart, based on what is shown in Table 5 and Figure 2, the proficiency-level effect on the extent to which students could accurately assess themselves. The agreement rates shown above in Table 5 and Figure 2 are determined by the total number of items students responded to in the survey, which is already highly related to their proficiency levels, whereas the number of items that students accurately identified in the survey is less influenced by other variables. To examine whether students' language proficiency levels played a role in the correlation, or in the lack of a correlation, the data in Table 6 are presented so that they correspond to the three major language proficiency levels: Novice, Intermediate, and Advanced. Table 6 shows that the coefficient of the correlation between the number of items that students correctly identified regarding their oral proficiency levels and their OPIc scores differs for students at the different oral proficiency levels. While there seemingly exists no, or at best a very weak, correlation between the OPIc scores and the number of items accurately assessed by students at the Intermediate level, the correlation is moderate to strong among the students at the Novice and Advanced levels, and this finding is consistent before and after an academic year. For Advanced students, the number of items that they correctly identified based on their oral proficiency levels tended to be positively correlated with their OPIc scores, although this result is not statistically significant, probably due to the small sample size. However, a statistically significant negative correlation (r = -.52, p < .001 in Year 1; r = -.66, p < .001 in Year 2) was found among the Novice students, which indicates that the students at Novice Low (NL) tended to accurately identify more items in the self-assessment survey regarding their oral proficiency than those at Novice High (NH).

With respect to the third research question, which relates to whether students could better assess themselves after spending an academic year studying the target language, Table 7 displays the two-year data on self-assessment agreement in two ways: both the agreement rate and the number of accurately self-assessed items are presented. As in Table 5, the results are shown for both the exact-agreement and the adjacent-agreement approaches. Interestingly, it can be seen that the two-year data present different results depending on which standard is used to measure agreement. Although the results are not statistically significant, the agreement rate drawn from students' responses to the items when they first took the test is slightly higher than that in the second year, whereas students tended to respond accurately to fewer items in the first year. This finding was consistent no matter which agreement approach was adopted.
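For reference, the two statistics behind Tables 6 and 7 (Spearman's rank correlation and the paired t test) can be computed with standard SciPy routines. The sketch below uses randomly generated stand-in arrays (one value per student) rather than the study's actual data, and the variable names are mine.

```python
# A minimal sketch of the statistics reported in Tables 6 and 7, using SciPy.
# The arrays are randomly generated stand-ins (one value per student), not the
# study's data; variable names are my own.
import numpy as np
from scipy.stats import spearmanr, ttest_rel

rng = np.random.default_rng(0)
opic_scores = rng.integers(1, 9, size=80)       # OPIc levels coded 1 (NL) to 8 (AM)
accurate_items = rng.integers(0, 20, size=80)   # accurately self-assessed items per student

# Spearman's rank correlation between OPIc scores and the number of accurately
# self-assessed items (computed per proficiency group for Table 6).
rho, p_rho = spearmanr(opic_scores, accurate_items)
print(f"Spearman's rho = {rho:.2f}, p = {p_rho:.3f}")

# Paired t test comparing the same students' agreement rates across the two years
# (the Year 1 vs. Year 2 contrasts reported in Table 7).
agreement_y1 = rng.uniform(0, 1, size=80)
agreement_y2 = rng.uniform(0, 1, size=80)
t, p_t = ttest_rel(agreement_y1, agreement_y2)
print(f"paired t = {t:.2f}, p = {p_t:.3f}")
```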
Table 6. Correlations between the Number of Accurately Self-assessing Items and OPIc Scores (Exact-agreement Approach)

                     Year 1                   Year 2
Proficiency Level    n     Spearman's rho     n     Spearman's rho
Novice               40    -.52***            32    -.66***
Intermediate         36    .11                44    .17
Advanced             4     .54                4     .89
Total                160   .06                160   .19
Note. *p < .05; **p < .01; ***p < .001.

Table 7. Summary of the Contrast between Self-assessment Agreement in Two Years

Agreement Rate
Approach                      n    M (SD)      95% CI       p
Exact agreement     Year 1    80   .56 (.33)   [.49, .63]   .13
                    Year 2    80   .51 (.34)   [.43, .58]
Adjacent agreement  Year 1    80   .77 (.32)   [.70, .84]   .18
                    Year 2    80   .72 (.35)   [.64, .79]

No. of Accurately Self-assessing Items
Approach                      n    M (SD)        95% CI         p
Exact agreement     Year 1    80   4.93 (4.80)   [3.87, 5.98]   .33
                    Year 2    80   5.28 (5.82)   [4.00, 6.55]
Adjacent agreement  Year 1    80   6.40 (5.12)   [5.27, 7.52]   .30
                    Year 2    80   6.88 (6.34)   [5.48, 8.27]

DISCUSSION

One of the major findings of this study is that the accuracy rate of the self-assessment is related to students' language proficiency levels, which is not surprising given the findings of previous research studies (e.g., Brantmeier et al., 2012; Kaderavek et al., 2004). However, different patterns of accuracy seemed to exist at the three major proficiency levels. Among the Advanced-level students, the more proficient the students were, the higher the self-assessment accuracy rate was. Among the Novice-level students, by contrast, the lower the students' language proficiency level was, the higher the self-assessment accuracy rate was. As for the students at the Intermediate level, no clear correlation could be found. This proficiency-level-related pattern in the accuracy rate was even more notable when the adjacent-agreement approach was adopted. The overall pattern, taking all three levels of proficiency into consideration, is that the relationship between the two variables (self-assessment accuracy and proficiency level) is not linear as proficiency increases. Rather, in this dataset it appears to be a parabolic relationship: the accuracy rate is relatively high at the lowest levels of proficiency, and the correlation coefficient becomes larger as students move into the Intermediate level of proficiency and is even higher when they are at a high proficiency level. This parabolic relationship calls for methods that can test for a non-linear relationship between two variables. However, so far, apart from a few studies, researchers in most studies on self-assessment have used a simple correlation, which tests only a linear relationship between two variables, whereas I see a parabolic pattern between the self-assessment results and students' proficiency levels: Scatter plots are needed in this type of research to better illustrate such patterns. Based on what I could find, only the studies conducted by Brown et al. (2014) and Dolosic et al. (2016) used scatter plots to present their findings. Most of the researchers of the studies on self-assessment, such as Brantmeier et al. (2012), Lim (2007), Roever and Powers (2005), and Delgado et al. (1999), used only correlations to examine the relationship between self-assessment accuracy and the focal variables. Multivariate regression or other statistical methods that can test a parabolic relationship are needed for further research.
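One way to test for such a curvilinear pattern, rather than assuming a straight line, is to add a squared proficiency term to the model. The sketch below uses illustrative stand-in values, not the study's data, and simply fits a quadratic curve with NumPy; a regression package could be used instead to obtain significance tests for the quadratic term.

```python
# A minimal sketch of testing for a curvilinear (parabolic) relationship between
# proficiency and self-assessment accuracy. The values are illustrative stand-ins,
# not the study's data.
import numpy as np

proficiency = np.arange(1, 9, dtype=float)                       # NL (1) ... AM (8)
agreement = np.array([.99, .95, .80, .55, .50, .52, .75, .85])   # hypothetical agreement rates

# np.polyfit with deg=2 returns the coefficients from the highest power down;
# a clearly nonzero quadratic coefficient indicates that a straight line does
# not capture the pattern.
c2, c1, c0 = np.polyfit(proficiency, agreement, deg=2)
print(f"agreement ~ {c0:.2f} + {c1:.2f} * prof + {c2:.2f} * prof^2")
```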
The different accuracy patterns that exist across these proficiency levels may be related to several factors, according to the findings of previous research studies. The accuracy of self-assessment can be influenced by learners' language proficiency levels, and the results of the current study align with the findings of Brantmeier et al. (2012) and Kaderavek et al. (2004), who found that Advanced language learners were better at identifying the tasks that were beyond or within their capabilities. The sample size for the Advanced group in this study is far from satisfactory (n = 8), so a larger sample of Advanced learners is needed to draw a more conclusive result. So far, most of the authors of studies on self-assessment have focused mainly on learners enrolled in language programs, who are mostly at the Novice or Intermediate levels. Not many researchers have conducted studies investigating Advanced learners' performance in self-assessment, and this work has to be done to present more evidence for a more generalizable finding. In this study, I attempted to do so but did not succeed, because it turned out that most of the students included in the current study were still Novice- and Intermediate-level learners. Therefore, most of them only had the opportunity to respond to the first set of self-assessment items (only 6 common items that were used in the survey across the three years were incorporated in this study). Accordingly, the analysis fell predominantly on the items targeting lower proficiency levels. A fine-grained investigation of more item responses that target higher proficiency levels would be valuable for future research.

To try to answer the questions of why language proficiency may influence the accuracy of self-assessment, and why the accuracy rate seemed so polarized among Intermediate students in the current study, previous studies can help. The first question may be explained by the fact that the accuracy of self-assessment can be affected by whether students have past experience with the task being asked about. In the study done by Tigchelaar et al. (2017), among the 50 Can-Do Statements that were used in the self-assessment survey, 5 out of the 15 misfitting items that did not fit the Rasch model were found to be experience-dependent. In other words, students who did not have experience similar to that described in the task were not likely to correctly evaluate how well they could complete the task. How much experience students have with the language is highly related to their proficiency; reasonably, the more advanced the students, the more likely they are to have had the experiences specified in the tasks. As for the second question, regarding why the accuracy rate appeared to be scattered chaotically, the answer may have something to do with the self-assessment itself. Among the items in the self-assessment survey, some are specified with concrete and detailed descriptions ("I can say which sports I like and I don't like" or "I can list my favorite free-time activities"), whereas some are described in a more abstract and general way ("I can schedule an appointment" or "I can talk about my favorite music"). For students at a low proficiency level (NL or NM), it might be easier to correctly identify the tasks that are within or beyond their capabilities, because their language proficiency may only allow them to complete very basic, simple, and concrete tasks, such as listing or naming a few words.
However, for students at higher proficiency levels, when they encounter tasks that are described without many specific details, different students may adopt different interpretations of the same task description, and the difficulty level of the same item may therefore vary among students at the same proficiency level. It is likely that students' responses to an item are related not only to their proficiency levels, but also to the difficulty level of the task as they imagined it. The findings of Butler and Lee (2006) may offer some insight into how to control this variable: compared with an off-task self-assessment, in which students were asked to evaluate their performance in a general and somewhat decontextualized manner, an on-task self-assessment, in which students evaluated their performance on specific tasks, was shown to generate more accurate responses regarding their language proficiency levels.

Another main focus of this study is the extent to which the students could perceive the difficulty level of each individual item in the self-assessment survey. The results show that students tended to under-assess rather than over-assess their oral language proficiency, which was not in line with the findings of most previous studies on self-assessment (Kaderavek et al., 2004; Stansfield et al., 2010; Dolosic et al., 2016). The results of those studies revealed that learners, especially those at lower proficiency levels, tended to over-assess their language proficiency. The discrepancy between the findings of the current study and previous research might be explained by the different formats of the self-assessments, the population differences, and the different methods of data analysis adopted in these studies.

In the study conducted by Kaderavek and her colleagues (2004), the learners were children between 5 and 12 years of age. Instead of having the learners self-assess their language proficiency based on concrete task descriptions, the researchers asked them to evaluate how well they could tell a good story on a Likert scale ranging from 1 to 5, with five faces representing very sad (1), somewhat sad (2), neutral (3), somewhat happy (4), and very happy (5). Unlike the current study, in which a criterion-referenced instrument was used for the self-assessment, the questionnaire adopted in their study was more general, and it is possible that, without a clear benchmark for each point on the scale, students could not differentiate one point from another well. Pinpointing this issue with this type of self-assessment, Brantmeier (2006) suggested that a more contextualized, descriptive, and criterion-referenced instrument might be more appropriate and beneficial for self-assessment purposes. What is more, in the same study, Kaderavek and her colleagues noted that the age of the learners may have an impact on how accurately they could self-evaluate their language performance: the younger the learners, the less accurately they could evaluate their proficiency. Thus, to some extent, the designs of these two studies are not comparable, considering the different target populations and the different formats employed in the self-assessment surveys. In another two studies, even though the researchers used Can-Do Statements as in the current study, the methods of scoring the self-assessments are not identical.
Specifically, in the current study, although a Likert scale ranging from 1 to 4 was used in the self-assessment survey, a binary scoring system was adopted by the administrators of the test to decide which OPIc test level a student should take: students would move on to the next set of ten questions only if they responded to eight out of the ten questions with 4 ("Yes, I can do this well"). I kept this scoring system in my study for the same reason. In comparison, Dolosic et al. (2016) similarly used a Likert scale, ranging from 1 to 5 and accompanied by detailed task descriptions, but they analyzed the data using all five scale points, which differs from the current study. In the study by Stansfield et al. (2010), when the learners responded to the items in the Can-Do Statements, they only needed to accept or reject an affirmative statement. Based on the information above, the methodologies of these three studies share some features, but we need to be cautious when we evaluate whether the results are comparable to one another.

It is possible that the way the data were re-coded in the current study exaggerated the extent to which the students under-assessed themselves. For example, one student's 3 ("I can do it with little help") might not be different from another student's 4 ("I can do it very well"), but these two responses were interpreted as a difference by the scoring method used in this study. Similarly, one student may have differentiated the extent to which he or she could complete a task by responding to one item with 1 ("I cannot do this yet") and to another item with 2 ("I can do this with much help"), but this discrepancy was not captured by the data analysis. In short, the students' responses to each task were not completely reflected in the results. Even though this coding was used in this study, the underlying raw scores are still available for future analyses. What factors might result in the inconsistency exhibited in the item responses of students at the same language proficiency level, and are there effective ways to alleviate this inconsistency? These might be future research questions for empirical studies on self-assessment.
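To make the scoring decisions described above concrete, the sketch below re-codes hypothetical Likert responses into binary can-do scores and applies the eight-out-of-ten advancement rule. The function and variable names are my own inventions; this is an illustration of the re-coding logic discussed above, not the project's actual scoring script.

# Minimal sketch of the binary re-coding and advancement rule described above.
# Responses use the survey's 1-4 Likert scale; only a 4 ("Yes, I can do this
# well") counts as a "can do" under the binary scoring. Data and names are
# hypothetical.
from typing import List

def to_binary(responses: List[int]) -> List[int]:
    """Re-code Likert responses (1-4) to binary: 1 if the response is 4, else 0."""
    return [1 if r == 4 else 0 for r in responses]

def advances(responses: List[int], threshold: int = 8) -> bool:
    """A student moves on to the next set of ten items only if at least
    `threshold` of the ten responses were 4."""
    return sum(to_binary(responses)) >= threshold

# Example: two students each rate ten items on the 1-4 scale.
student_set_1 = [4, 4, 3, 4, 4, 4, 2, 4, 4, 4]   # eight 4s -> advances
student_set_2 = [4, 3, 3, 4, 2, 4, 1, 4, 3, 4]   # five 4s  -> stops here

print(advances(student_set_1))  # True
print(advances(student_set_2))  # False

# Note how the binary re-coding collapses a 3 ("I can do it with little help")
# and all lower ratings into the same category, which is the source of the
# potential under-assessment bias discussed above.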
CONCLUSION

This study investigated the extent to which college students learning Chinese could accurately self-assess their oral language proficiency based on the descriptions specified in the Can-Do Statements, and whether their ability to self-assess was related to their language proficiency levels. The results revealed that Advanced-level students were more likely to successfully identify whether the difficulty level of a task was within or beyond their capabilities when compared with their lower-proficiency counterparts, although the sample size was too small to draw a conclusive finding. In addition, among the Novice-level students, those whose language proficiency was closer to that of beginners (Novice Low) tended to do a better job of self-assessing their proficiency than those whose proficiency was at Novice Mid or Novice High. Great inconsistency in the accuracy rate was shown among the students at the Intermediate level, with no clear pattern as to which students could more accurately evaluate their oral proficiency.

Despite the various patterns displayed among students at different proficiency levels, the overall tendency was that the accuracy rate of self-assessment increased as students' proficiency increased. In addition, regarding whether the students' responses to the items in the self-assessment survey could successfully predict their language trajectories, it was found that the accuracy rate was moderately higher than chance. In other words, the language gain or loss that students experienced after an academic year in a language program was somewhat reflected in their responses to the statements in the self-assessment. Apart from that, students in this study were found to be more likely to under-estimate rather than over-estimate their oral proficiency, which was not in line with the findings of previous research. After a close examination and a careful comparison of these studies, it was found that the results of self-assessment could be influenced by the different formats that the self-assessment surveys employed and the different scoring systems adopted. The results of both the current study and previous research revealed that, compared with decontextualized or general descriptions that give learners little information as a reference, a more informative and contextualized instrument was shown to generate more accurate responses from learners.

APPENDIX

Can-Do Statements: 35 Common Items in the Can-Do Statements Used from Spring 2015 to Spring 2017

ACTFL OPIc Level   ACTFL Level   Can-Do Statement
1                  NM            I can say which sports I like and don't like.
1                  NM            I can list my favorite free-time activities and those I don't like.
1                  NM            I can talk about my school or where I work.
1                  NM            I can talk about my room or office and what I have in it.
1                  NM            I can answer questions about where I'm going or where I went.
1                  NM            I can present information about something I learned in a class or at work.
2                  IL            I can describe a school or workplace.
2                  IM            I can schedule an appointment.
2                  IH            I can talk about my family history.
2                  AL            I can explain why I was late to class or absent from work and arrange to make up the lost time.
2                  AL            I can tell a friend how I'm going to replace an item that I borrowed and broke/lost.
3                  IM            I can give some information about activities I did.
3                  IM            I can talk about my favorite music, movies, and sports.
3                  IM            I can arrange for a make-up exam or reschedule an appointment.
3                  IH            I can ask for and follow directions to get from one place to another.
3                  IH            I can return an item I have purchased to a store.
3                  AL            I can present an overview about my school, community, or workplace.
3                  AL            I can compare different jobs and study programs in a conversation with a peer.
3                  AM            I can discuss future plans, such as where I want to live and what I will be doing in the next few years.
4                  IH            I can present ideas about something I have learned, such as a historical event, a famous person, or a current environmental issue.
4                  AL            I can explain how life has changed since I was a child and respond to questions on the topic.
4                  AL            I can discuss what is currently going on in another community or country.
4                  AL            I can provide a rationale for the importance of certain classes, subjects, or training programs.
4                  AL            I can talk about present challenges in my school or work life, such as paying for classes or dealing with difficult colleagues.
4                  AM            I can give a presentation about cultural influences on society.
4                  AM            I can participate in conversations on social or cultural questions relevant to speakers of this language.
5                  AH            I can interview for a job or service opportunity related to my field of expertise.
5                  AM            I can present an explanation for a social or community project or policy.
5                  AH            I can present reasons for or against a position on a political or social issue.
5                  AH            I can give a clear and detailed story about childhood memories, such as what happened during vacations or memorable events, and answer questions about my story.
5                  AL            I can exchange general information about my community, such as demographic information and points of interest.
5                  AL            I can exchange factual information about social and environmental questions, such as retirement, recycling, or pollution.
5                  AM            I can exchange complex information about my academic studies, such as why I chose the field, course requirements, projects, internship opportunities, and new advances in my field.
5                  S             I can provide a balance of explanations and examples on a complex topic.
5                  S             I can explain, participate actively, and react to others appropriately in academic debates, providing some facts and rationales to back up my statements.

REFERENCES

ACTFL. (2012a). ACTFL OPIc familiarization manual. Retrieved March 15, 2018, from https://www.languagetesting.com/pub/media/wysiwyg/manuals/actfl-fam-manual-opic.pdf
ACTFL. (2012b). Writing Proficiency Test familiarization manual. Retrieved March 15, 2018, from https://www.languagetesting.com/pub/media/wysiwyg/ACTFL-Writing-Proficiency-Test-WPT-Familiarization-Manual-.pdf
ACTFL. (2015). NCSSFL-ACTFL can-do statements. Retrieved December 11, 2017, from http://www.actfl.org/global_statements
Bachman, L. F., & Palmer, A. S. (1981). The construct validation of the FSI oral interview. Language Learning, 31(1), 67–86. https://doi.org/10.1111/j.1467-1770.1981.tb01373.x
Brantmeier, C. (2006). Advanced L2 learners and reading placement: Self-assessment, CBT, and subsequent performance. System, 34(1), 15–35. https://doi.org/10.1016/j.system.2005.08.004
Brantmeier, C., & Vanderplank, R. (2008). Descriptive and criterion-referenced self-assessment with L2 readers. System, 36(3), 456–477. https://doi.org/10.1016/j.system.2008.03.001
Brantmeier, C., Vanderplank, R., & Strube, M. (2012). What about me? Individual self-assessment by skill and level of language instruction. System, 40(1), 144–160. https://doi.org/10.1016/j.system.2012.01.003
Brown, K. G., & Gerhardt, M. W. (2002). Formative evaluation: An integrative practice model and case study. Personnel Psychology, 55, 951–983. https://doi.org/10.1111/j.1744-6570.2002.tb00137.x
Brown, N. A., Dewey, D. P., & Cox, T. L. (2014). Assessing the validity of can-do statements in retrospective (Then-Now) self-assessment. Foreign Language Annals, 47(2), 261–285. https://doi.org/10.1111/flan.12082
Butler, Y. G., & Lee, J. (2006). On-task versus off-task self-assessment among Korean elementary school students studying English. The Modern Language Journal, 90(4), 506–518. https://doi.org/10.1111/j.1540-4781.2006.00463.x
Caputo, D., & Dunning, D. (2005). What you don't know: The role played by errors of omission in imperfect self-assessments. Journal of Experimental Social Psychology, 41(5), 488–505. https://doi.org/10.1016/j.jesp.2004.09.006
Cardoso, C. W. (2010). Self-assessment: Indispensable tools for successful learning. New Routes, 42, 24–26.
Carr, N. (2011). Designing and analyzing language tests. Oxford: Oxford University Press.
Pérez Cavana, M. L. (2012). Autonomy and self-assessment of individual learning styles using the European Language Portfolio (ELP). Language Learning in Higher Education, 1(1), 211–228. https://doi.org/10.1515/cercles-2011-0014
Davidson, F., & Henning, G. (1985). A self-rating scale of English difficulty: Rasch scalar analysis of items and rating categories. Language Testing, 2(2), 164–179. https://doi.org/10.1177/026553228500200205
Delgado, P., Guerrero, G., Goggin, J. P., & Ellis, B. B. (1999). Self-assessment of linguistic skills by bilingual Hispanics. Hispanic Journal of Behavioral Sciences, 21(1), 31–46. https://doi.org/10.1177/0739986399211003
Dolosic, H. N., Brantmeier, C., Strube, M., & Hogrebe, M. C. (2016). Living language: Self-assessment, oral production, and domestic immersion. Foreign Language Annals, 49(2), 302–316. https://doi.org/10.1111/flan.12191
Ehrlinger, J., Johnson, K., Banner, M., Dunning, D., & Kruger, J. (2008). Why the unskilled are unaware: Further explorations of (absent) self-insight among the incompetent. Organizational Behavior and Human Decision Processes, 105(1), 98–121. https://doi.org/10.1016/j.obhdp.2007.05.002
Kaderavek, J. N., Gillam, R. B., Ukrainetz, T. A., Justice, L. M., & Eisenberg, S. N. (2004). School-age children's self-assessment of oral narrative production. Communication Disorders Quarterly, 26(1), 37–48. https://doi.org/10.1177/15257401040260010401
Language Testing International. (2012). ACTFL speaking assessment: The oral proficiency interview - computer® (OPIc). Retrieved December 11, 2017, from https://www.languagetesting.com/oral-proficiency-interview-by-computer-opic
Lim, H. (2007). A study of self- and peer-assessment of learners' oral proficiency. CamLing, 169–176.
Malabonga, V., Kenyon, D. M., & Carpenter, H. (2005). Self-assessment, preparation and response time on a computerized oral proficiency test. Language Testing, 22. https://doi.org/10.1191/0265532205lt297oa
Oscarson, M. (1989). Self-assessment of language proficiency: Rationale and applications. Language Testing, 6, 1–13. https://doi.org/10.1177/026553228900600103
Pallier, G. (2003). Gender differences in the self-assessment of accuracy on cognitive tasks. Sex Roles, 48(5–6), 265–276. https://doi.org/10.1023/A:1022877405718
Patri, M. (2002). The influence of peer feedback on self- and peer-assessment of oral skills. Language Testing, 19(2), 109–131.
Peirce, B. N., Swain, M., & Hart, D. (1993). Self-assessment, French immersion, and locus of control. Applied Linguistics, 14(1), 25–42.
LeBlanc, R., & Painchaud, G. (1985). Self-assessment as a second language placement instrument. TESOL Quarterly, 19(4), 673–687.
Rivers, W. P. (2001). Autonomy at all costs: An ethnography of metacognitive self-assessment and self-management among experienced language learners. The Modern Language Journal, 85(2), 279–290.
Roever, C., & Powers, D. E. (2005). Effects of language of administration on a self-assessment of language skills. Monograph Series.
Stansfield, C. W., Gao, J., & Rivers, W. P. (2010). A concurrent validity study of self-assessments and the Federal Interagency Language Roundtable oral proficiency interview. Russian Language Journal/Russkii Yazyk, 60, 301–317.
Taras, M. (2001). The use of tutor feedback and student self-assessment in summative assessment tasks: Towards transparency for students and for tutors. Assessment & Evaluation in Higher Education, 26(6). https://doi.org/10.1080/0260293012009392
Taras, M. (2003). To feedback or not to feedback in student self-assessment. Assessment & Evaluation in Higher Education, 28(5). https://doi.org/10.1080/02602930301678
Tigchelaar, M., Bowles, R. P., Winke, P., & Gass, S. (2017). Assessing the validity of ACTFL Can-Do Statements for spoken proficiency: A Rasch analysis. Foreign Language Annals, 50, 584–600. https://doi.org/10.1111/flan.12286