AN INVESTIGATION INTO A CHINESE PLACEMENT TEST’S SCORE INTERPRETATIONS AND USES

By

Wenyue Ma

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Second Language Studies – Doctor of Philosophy

2023

ABSTRACT

Foreign language placement testing, an important component in university foreign language programs, has received considerable, but not copious, attention over the years in second language (L2) testing research (Norris, 2004), and that attention has been concentrated mostly on L2 English. In contrast to validation research on L2 English placement testing, the discussion of tests in languages other than English is limited (e.g., Mozgalina & Ryshina-Pankova, 2015). Additionally, these studies have been constrained by two main methodological limitations. First, the importance of item-level data analysis is largely overlooked. While researchers have highlighted the value of examining total test scores in validation research, defensible score interpretations and uses should not be assumed without further evidence showing that all test items function as intended by test developers. Second, the validity evidence reported in these studies falls into a narrow range: the evidence has mainly focused on generalization (e.g., reporting test reliability), explanation (group performance comparisons), and extrapolation (correlational studies on the relationship between test scores and other criteria) inferences, and validation needs more than that (Chapelle, 2021). In contrast, the documentation of empirical results supporting domain description (content representation and relevance), evaluation (examination of item quality), and utilization (stakeholders’ perception of score usefulness) has been limited. The primary goal of my dissertation is to provide a comprehensive examination and evaluation of the test score uses and interpretations for the listening and reading sections of an in-house, college-level Chinese placement test. For my dissertation, I collect and evaluate quantitative (placement test scores, item responses, ACTFL proficiency test scores) and qualitative (interviews, focus groups, questionnaires) validity evidence in an argument-based validation framework that was conceptualized by Kane (2006) and was further expanded by Chapelle et al. (2008): domain description, evaluation, generalization, explanation, extrapolation, and utilization (see Chapelle, 2021, for a review). Employing mixed methods, I aim to (1) study the functioning of test items by identifying and revising psychometrically problematic items, if any; (2) utilize the empirical results to inform test revisions; (3) demonstrate how the collected quantitative and qualitative results serve as strong or weak evidence or counterevidence for the claims within the validity argument; and (4) provide an overall evaluation of the intended interpretation and use of the placement test scores. With this study I hope to contribute to the larger discussion of the practices of foreign language assessment and argument-based test validation, and at the same time, offer insight into the ongoing development of validity research.

Copyright by
WENYUE MA
2023

ACKNOWLEDGEMENTS

As I reflect on the journey of my doctoral studies, I am humbly grateful for the wealth of guidance, support, and encouragement I have received from numerous individuals who have had profound impacts on my life and work. First and foremost, I would like to extend my heartfelt gratitude to my advisor, Dr. Paula Winke.
Our journey began seven years ago when I first embarked on my graduate studies at Michigan State University (MSU), and her initial course was my introduction to this fascinating academic world. Her steadfast guidance, patience, and contagious enthusiasm for language testing have been an illuminating beacon throughout my pursuit. I remain forever indebted to her. My appreciation extends to several faculty members whose guidance and insights have been instrumental in shaping my research and academic growth. Dr. Dan Reed provided patient and meticulous review for my Qualifying Research Paper (QRP) twice. His invaluable input has left a profound impact on the development of my research. I was privileged to work with Dr. Koen Van Gorp as a Graduate Assistant at the Center for Language Teaching Advancement (CeLTA). This opportunity presented a significant milestone in my academic journey, granting me hands-on experience in data analysis and invaluable guidance that has deeply influenced my doctoral studies. Equally deserving of special mention is Dr. Ryan Bowles, whose Applied Measurement course became the foundation upon which my dissertation was built. The knowledge and skills gleaned from his course have been instrumental throughout my study, for which I hold enduring gratitude. Furthermore, I wish to extend my gratitude to Dr. Steven Pierce from the Center for Statistical Training and Consulting (CSTAT). His mentorship has equipped v me with a wealth of skills and knowledge that will undoubtedly prove invaluable in my future endeavors. My gratitude also extends to the faculty of the Chinese program at MSU, particularly, Ho-Hsin Huang, Xuefei Hao, and Wenying Zhou. Their assistance and insightful contributions have been critical to the successful completion of my dissertation project. The value of their input to my research cannot be overstated. Special recognition goes to my cherished peers from the SLS program who have been my support system throughout this journey. Dylan Burton, Yingzhao Chen, Bronson Hui, Matt Kessler, Jongbong Lee, Shinhye Lee, Myeongeun Son, Michael Wang, Monique Yoder, and Xiaowan Zhang, thank you for your steadfast camaraderie and academic companionship. Your friendship has made my life at MSU both enjoyable and rewarding. In addition, my Seattle friends, Anqi Chen, Shjjia Chen, Corie Weijia Dai, Bixi Zhang, Hou Wang, and Liwei Wang, have enriched my remote working experience during the pandemic with excitement and fun. Lastly, but most certainly not least, my heartfelt thanks extend to my parents, Feng Ma and Rebecca Wei Wang. Their boundless love, patience, and understanding during my prolonged academic journey far exceed what words can adequately express. I eagerly anticipate the joy of our imminent reunion. I would also like to acknowledge my boyfriend, Kevin Zhai Zihao. Your support throughout my doctoral journey, coupled with your remarkable efforts in helping me strike a work-life balance during these challenging times, has been nothing short of extraordinary. Thank you. This dissertation is a culmination of the efforts and contributions of all those mentioned and many more not mentioned. I am deeply thankful for each of you. vi TABLE OF CONTENTS CHAPTER 1: INTRODUCTION ................................................................................................... 1 CHAPTER 2: LITERATURE REVIEW ........................................................................................ 
3 CHAPTER 3: METHODOLOGY ................................................................................................ 21 CHAPTER 4: RESULTS .............................................................................................................. 41 CHAPTER 5: DISCUSSION...................................................................................................... 100 CHAPTER 6: CONCLUSIONS ................................................................................................. 123 REFERENCES ........................................................................................................................... 124 APPENDIX 1: INSTRUCTOR INTERVIEW QUESTIONS .................................................... 130 APPENDIX 2: STUDENT QUESTIONNAIRE ........................................................................ 131 APPENDIX 3: STUDENT INTERVIEW QUESTIONS ........................................................... 132 APPENDIX 4: ITEMS LOADED ON THE SAME DIMENSION ........................................... 133 APPENDIX 5: ITEM RELEVANCE AND DIFFICULTIES .................................................... 134 APPENDIX 6: RESULTS OF DIF ............................................................................................. 136 APPENDIX 7: MISFITTING ITEMS AND PROPOSED REVISIONS ................................... 138 APPENDIX 8: POST-HOC TEST RESULTS (TOTAL) .......................................................... 141 APPENDIX 9: POST-HOC TEST RESULTS (LISTENING) ................................................... 142 APPENDIX 10: POST-HOC TEST RESULTS (READING).................................................... 143 vii CHAPTER 1: INTRODUCTION Foreign language placement testing, an important component in university foreign language programs, has received considerable, but not copious, attention over the years in second language (L2) testing research (Norris, 2004). Unlike many ESL (English as a Second Language) programs, where standardized language proficiency test scores are often available for admission processes and thus can be used to inform placement decisions, foreign language programs at many universities in the United States often internally develop their own local placement tests (e.g., Georgetown University, the University of Wisconsin, Michigan State University). These tests aim to create groups of newly enrolled foreign language learners with homogeneous language abilities. The goal is to use the test scores to place students into courses which are at the appropriate levels for the students, which maximizes effective instruction. While these local placement tests are typically designed to be aligned with the local curriculum and language needs, validity evidence needs to be collected by the programs and test developers to ensure the alignment, both at or during test creation, and over the lifetime of the test’s usage. The validity evidence can be used to justify and confirm the appropriateness of decisions and interpretations that are based on the test scores. However, against the wealth of existing validation research focusing on English placement testing, there has been, until recently, comparatively little discussion regarding the validity evaluations of foreign language placement tests. 
In addition, most validation research studies on foreign language placement tests conducted so far, as will be shown in the literature review section, have attended to only a narrow range of validity evidence, such as test reliability and comparisons of group-level performance, whereas documentation of validity evidence concerning test content relevance, test stakeholders’ perceptions of the test, and item functioning is comparatively limited.

The primary goal of my dissertation, therefore, is to provide a comprehensive examination and evaluation of the test score uses and interpretations for the listening and reading sections of an in-house, college-level Chinese placement test at a large Midwestern university in the U.S. In this dissertation, I collected and evaluated the validity evidence in an argument-based validation framework for the placement test. I proposed a set of warrants and their underlying assumptions following a sequence of seven inferential steps that were conceptualized by Kane (1992, 2006, 2013), and were later expanded by Chapelle, Enright, and Jamieson (2008) and Chapelle and Voss (2021): (1) domain description, (2) evaluation, (3) generalization, (4) explanation, (5) extrapolation, (6) utilization, and (7) consequence implication. As Chapelle (2020) argued, “validation research needs to address a variety of different types of claims about scores encompassing such meanings as their real-world relevance, substantive sense, functional role, and stability. Such diverse meanings require research undertaken using a variety of methodologies including both qualitative and quantitative research” (p. 114). Therefore, I undertook a mixed-methods approach to data collection and analysis and integrated the two forms of data and their results to address research questions motivated by the warrants and their underlying assumptions specified in the validity argument, which together aim to provide an overall evaluation of the intended interpretation and use of the placement test scores. With this study I hope to contribute to the larger discussion of the practices of foreign language assessment and argument-based test validation, and at the same time, offer insight into the ongoing development of validity research.

CHAPTER 2: LITERATURE REVIEW

Argument-based validation in testing and assessment

The Standards for Educational and Psychological Testing (henceforth called the Standards) defines validity as “the degree to which evidence and theory support the interpretations of test scores for proposed uses of test scores” (AERA et al., 2014, p. 1). The definition is different from the previous notion held by some researchers that validity is a characteristic of tests and that tests can, therefore, be either valid or invalid. The arguments in the Standards (p. 1) are that “statements about validity should refer to particular interpretations for specified uses” and “it is incorrect to use the unqualified phrase of ‘the validity of the test.’” These notions are in line with Kane’s argument-based validity framework (1992, 2006, 2013), which conceptualizes validation as a process of building and evaluating a validity argument within the context of the test’s score uses. Kane’s approach to validation provides a means for defining intended score-based interpretations and uses, so that the specified interpretations and uses, rather than the test itself, are validated.
This thus entails that if a test’s score uses are changed (for example, if a test designed for college students is additionally applied for the use of assessing high school students), the validity argument will not apply to the new context, and a new validity argument structure must be formed and evaluated.

Technical terms and explanations

Referring to Toulmin’s argument structure (2001, [1958] 2012), Kane’s argument-based validity framework employs two kinds of argument. An interpretive argument makes claims about the proposed interpretation and uses of test scores by specifying relevant inferences with their supporting warrants and underlying assumptions that are necessary to make such claims. A validity argument evaluates the interpretive argument based on the backing to determine whether the proposed score interpretations and uses are justified. According to Kane (2006), inferences are steps denoting the reasoning process that bridges examinees’ observed performance to the claims based on that performance. To identify the types of evidence to support the intended score interpretation and uses, more detail about the inferences is needed. The detail is expressed in warrants and assumptions. A warrant is a general rule or an established procedure for inferring claims from observed performance, and assumptions underlying the warrant clarify what theoretical and empirical evidence, namely the backing, is needed.

The seven inferences

Kane’s argument-based validity framework originally classified score interpretations and uses into scoring, generalization, extrapolation, and decision inferences. Building on Kane’s work, Chapelle, Enright, and Jamieson (2008) gathered validity evidence and formulated it into a validity argument to support score inferences for the 2005 revision of the Test of English as a Foreign Language Internet-based Test (TOEFL iBT; https://www.ets.org/toefl.html) through a sequence of six steps: domain description, evaluation, generalization, explanation, extrapolation, and utilization. It is important to note that the consequence implication step was subsequently introduced by Chapelle and Voss (2021), further refining the framework. A major contribution of argument-based validity is the explicit logic that connects the claims about test score interpretations and uses to the score inferences. The logical progression of the validity argument, showing how each inference serves to connect these claims, is illustrated in Figure 1.

A domain description inference is made to examine whether the quality of the test development process for obtaining the observed test performance is appropriate for the proposed test score interpretations and uses. The warrants and assumptions of the domain description inference make direct reference to Sireci’s (1998) research on content validity. The four elements of content validity described in Sireci, domain definition, domain representation, domain relevance, and appropriateness of the test development process, provide detail that is useful for evaluating test content quality. The backing to support the inference may be survey or interview data from content experts about the importance, representation, and relevance of prospective test content in relation to the target domain.

Figure 1. An outline of the overall structure of a validity argument. Note: Revised from Chapelle et al., 2008, p. 18

An evaluation inference is made to assess the extent to which the test scores accurately summarize relevant performance on test tasks.
The quality of test scores can be evaluated from three perspectives, the test administration conditions, the task scoring procedures, and the observed test item quality (Chapelle, 2020). To investigate the evaluation inference, researchers may conduct item analysis to inspect statistical item characteristics, including appropriate difficulty and discrimination, and the existence of items bias; they may examine examinees’ test-taking processes to study their cognitive engagement during the test; they could also conduct observation study at test centers to examine whether the required equipment, troubleshooting procedures, and accommodations to certain disadvantaged examinees are in place. A generalization inference addresses an important issue in educational measurement, which is the degree to which score properties and inferences are generalizable to various measurement contexts (Cook & Campbell, 1979; Messick, 1995). More specifically, the inference is concerned with the extent to which the ratings for an examinee are consistent across multiple measurement settings, test forms, test tasks, and raters. The supporting evidence for the generalization inference is normally gathered in generalizability and reliability studies, but there are cases where appropriate scaling and equating procedures may be needed to ensure intended score interpretations and uses. An explanation inference links test performance to the intended construct. More specifically, the inference leads to the question as to whether the observed scores can be attributed to the construct. Qualitative and quantitative research methods can both be used to investigate the inference. Support for the explanation inference is evidenced when (1) observed scores support the theorized position of the construct in relation to other constructs; (2) observed 7 scores support the internal structure of components of the construct; (3) examinees’ test performance varies according to the amount and quality of the measured ability. An extrapolation inference in the argument-based validity framework moves the argument from the intended construct to examinees’ expected scores in the target domain, which is defined as “the full range of performances included in the [test score] interpretation” (Kane, Crooks, & Cohen, 2005, p. 7). The inference can be evaluated using two kinds of evidence, (1) the evidence collected through criterion-related studies supporting the relationship between examinees’ performance on the test and other indicators in the target domain, (2) the evidence showing the quality of test performance, if it can be examined qualitatively (e.g., linguistic features), is comparable to other target domain performance. A utilization inference is used to examine whether the test produces results that are useful for making appropriate decisions and can be well communicated to stakeholders. The inference can be evaluated from two perspectives, the intended uses for the test (i.e., utility) and the actual decision rules adopted by test users (i.e., decision). For the utility aspect of the inference, researchers need to provide evidence showing that the test scores are judged to be useful for the intended educational purposes (e.g., admission, placement, performance prediction, instruction effectiveness evaluation) by test stakeholders. 
As for the decision aspect, empirical evidence needs to be presented showing that the cut-off scores used for making decisions or the score bands used to describe either an individual examinee or groups of examinees are set appropriately. A consequence implication inference connects the test use with its impact on stakeholders. Specifically, the inference examines whether the test uses have a positive influence on language teaching and learning. Empirical evidence supporting the inference can be backed 8 by examining diverse stakeholders’ perspectives of the impacts of the test results on examinees’ enhancement of language skills and curriculum development of language courses. The evidence can be collected through individual interviews and focus groups. Advantages of the argument-based framework of validity Chapelle, Enright, and Jamieson (2010) identified four advantages for the adoption of an argument-based approach to test validation over alternative approaches. First, the argument- based approach has shifted the prominent role of construct in validation research. In contrast to positioning construct as the foundation for test score interpretation (e.g., Messick’s unitary validity framework, 1989, 1995), the argument-based validity is considered a more practical and efficient approach to test validation. This process, as Chapelle (2012) put it, “downplays, but does not eliminate, the need to define the construct” (p. 19), which has proven to be a daunting and difficult task in language assessment. Second, the interpretive argument and validity argument are linked by the research questions and their supporting evidence that are prompted by the particular warrants and assumptions. Therefore, the way in which the interpretive argument and validity argument is specified makes clear what research needs to be conducted and what types of validity evidence is required. Third, the internal logic among the argument is made apparent by showing how test performance and test score uses and interpretations are connected through a series of inferences. This allows validation to be completed through a systematic process of examining inferential steps rather than reviewing a list of types of potential validity evidence, some of which may not be directly relevant in the testing context of interest. Fourth, clear warrants and their underlying assumptions provide a place for counterevidence. The way the validity argument is specified creates an opportunity to challenge and question the proposed interpretation by presenting evidence for rival hypotheses. Given the advantages noted above, the 9 argument-based validity framework has thus been increasingly used in the field of second language testing and assessment in the recent ten years (e.g., Becker, 2018; Chapelle, Cotos, & Lee, 2015; Knoch & Chapelle, 2018; LaFlair & Staples, 2017; Winke et al., 2022; Yan & Staples, 2020; Youn, 2015) and has provided language testing researchers with useful guidance on how to collect and organize validity evidence following a logical structure to justify and support score-based interpretations and uses. Validation research on foreign language placement tests Over the past several decades, there has been a gradual increase in research focusing on placement testing (Long, Shin, Geeslin, & Willis, 2018). Within this body of work, the topic of placement test validity has garnered significant interest from researchers, as validity is a fundamental consideration in test development and test evaluation. 
However, in contrast to the richness of validation research on English placement tests, the discussion of tests in languages other than English is relatively limited. Of the limited validation research on foreign language placement tests, most studies so far have mainly focused on gathering validity evidence by investigating examinees’ performance at the test level or the group level and/or have often relied on a single source of data, test scores, to claim validity for a given test (see Bernhardt et al., 2004; Eda et al., 2008; Heilenman, 1983; Long et al., 2018; Mozgalina & Ryshina-Pankova, 2015; Norris, 2004). Specifically, in addition to presenting evidence of satisfactory test reliability, researchers claimed that a validity argument for the use of test scores was supported when (a) the mean total test scores were found to increase from examinees enrolled in lower-level courses to those in higher-level courses; (b) the total scores of the placement test were found to be strongly correlated with examinees’ performance assessed by another measurement instrument assessing similar skills (e.g., a reading proficiency test, the oral proficiency interview test); and (c) examinees’ performance on the placement test was found to improve after they had received instruction and practice. Table 1 provides a brief summary of the validity evidence provided in these validation studies. Below I describe these studies in more detail.

Eda and her colleagues (2008) assessed the reliability and construct validity of the Japanese Skill Test (JSKIT), which comprises various single-skill tests along with a grammar test. This evaluation was conducted to determine the test’s effectiveness as a placement tool for a nine-week summer intensive program. Out of the 250 students enrolled in the summer Japanese program, 136 participated in the study, taking not only the JSKIT, but also an internal placement test and an oral proficiency interview test at the beginning and end of the program. After comparing the results from the three tests and assessing both the reliability of the JSKIT and its effectiveness in differentiating learners at various proficiency levels, the researchers concluded that the JSKIT served as a reliable and effective placement tool for students with lower levels of proficiency, specifically those with first- and second-year language abilities.

The JSKIT evaluation also illustrates that validation is not a one-time undertaking. In this vein, Cronbach (1971) described test validation as an “ever-extending inquiry” (p. 452), which necessitates an ongoing research program rather than a single empirical study. This concept is exemplified in the research conducted by Norris (2004) and Mozgalina and Ryshina-Pankova (2015), who documented the assessment development, validation process, test revisions, and validity evidence evaluations for the placement test used in the Georgetown University German program. The placement test comprised three sections: a cloze test (C-test) with five progressively more difficult texts, a reading comprehension test, and a listening comprehension test. In Norris’s study, a total of 193 students enrolled in the German program completed all three parts of the placement test, including previously placed and non-placed students.
The placed students are those who entered the course based on their placement test results, while the non-placed students are those who progressed through the lower-level courses without taking the placement test. Additionally, 124 of the previously placed students also took both a semester-beginning and a semester-end test administration (of the same test). Through the analysis of multiple data sources, including students' placement test scores, course grades, scorers' marking sheets, and instructor interviews, Norris determined that the C-test produced more reliable test scores and assessed a broader range of student abilities compared to the listening and reading comprehension tests, making the C-test more suitable for placement purposes. Building on Norris's initial validation efforts, Mozgalina and Ryshina-Pankova (2015) conducted a validity evaluation of a revised C-test, which was part of the placement test in Georgetown University's German program. The test revisions were implemented to better align with the updated German curriculum following Norris (2004). Administered at the beginning and end of the semester, the researchers reported results from a total of 222 examinees across various course levels, with 66 of them taking the test at both administrations. The findings indicated that the test effectively distinguished between examinees of varying abilities and successfully tracked progress for upper-level students. The importance of test validation in the context of Georgetown University's German program highlights the significance of selecting the appropriate assessment methods. Heilenman (1983) conducted a study with a larger sample size, examining the C-test scores of 388 students enrolled in French at Northwestern University to determine whether the test was a valid measure of language proficiency and could effectively differentiate students at various instructional 12 levels. In contrast to Norris (2004) and Mozgalina and Ryshina-Pankova (2015), where the C- test was supported as a placement measure, Heilenman concluded that the C-test should be used cautiously as an alternative or supplement to other placement measures. This caution stems from the considerable overlap in scores obtained by students at different instructional levels, resulting in significant discrepancies between students' actual course assignments and those predicted by the C-test scores. While the studies discussed so far focused on paper-based tests, web-based language testing, also known as computer-based testing, has gained considerable attention over the past 30 years in second language assessment research. This is due to its potential to greatly enhance the flexibility and logistical efficiency of test delivery and scoring processes (Long et al., 2018; Ockey, 2006). Two empirical validation research studies have specifically examined the practicality and efficiency of web-based language placement tests. Bernhardt et al. (2004) assessed the utility and validity of two web-based language tests as placement tools for college- level German and Spanish programs. The test score reliability and validity of the two placement tests were evaluated using data from 78 students in the German program and 679 students in the Spanish program, with 14 German learners and 41 Spanish learners retaking the placement test after three quarters of target language instruction. 
The results suggested that the test scores were reliable, and evidence of validity was supported by the trend of students' improved performance in the second administration of the tests. In a similar vein, Long et al. (2018) investigated the reliability and validity of a newly developed web-based Spanish placement test. Building upon and expanding Bernhardt et al.'s research, the study analyzed testing data from 2,111 students enrolled in a college-level Spanish program, with 1,622 of them also taking the paper-based test. Besides providing evidence of high 13 test reliability, the researchers evaluated the functionality and use of the test scores by examining content relevance (the alignment between the test content and course materials) and score invariance across different modes of test delivery (the alignment between the test results obtained from the web-based test and the original paper-based test). The results suggested that the test was valid in terms of content relevance and placement decision appropriateness. The need for the current study The studies discussed so far have undoubtedly contributed valuable insights into the evaluation of measurement validation and the appropriateness of placement testing practices, advancing researchers’ understanding of factors that contribute to placement test effectiveness. However, these studies are subject to two main methodological limitations. Firstly, most validation studies on foreign language placement tests tend to overlook the importance of item-level data analysis. While the research endeavors mentioned previously emphasize the significance of examining total test scores in validation research, defensible score interpretations and uses should not be assumed without further evidence demonstrating that all test items function as intended when eliciting examinees' responses (e.g., items are free of bias; examinees at lower ability levels are less likely to correctly respond to difficult items compared to their peers with higher abilities). This point is evident in the following example: evidence suggesting that a test as a whole demonstrates good discriminating power and high reliability does not necessarily guarantee that all test items are problem-free and equally effective and appropriate in assessing and discriminating examinees' target abilities. Secondly, the validity evidence reported in these studies is somewhat narrow in scope, as it can be observed that evidence supporting specific aspects of score interpretations and uses is often missing in building and supporting validity arguments for foreign language placement tests 14 (see Table 1). A closer examination of the validity evidence reported in these validation research studies reveals that, primarily for practical reasons, the validity evidence mainly focuses on generalization (reporting test reliability), explanation (group performance comparisons), and extrapolation (correlational studies on the relationship between test scores and other criteria) inferences. In contrast, the documentation of empirical results supporting domain description (content representation and relevance), evaluation (examination of item quality), utilization (stakeholders' perception of score usefulness), and consequence (washback effect) is comparatively limited. Research addressing this gap is necessary, as the seven inferences together form a complete, logical structure for validity evaluation. 15 Table 1. 
Summary of validity evidence in previous foreign language placement test validation studies

Bernhardt, Rivera, & Kamil (2004)
Domain description inference:
● Interviews with instructors about perceptions of the placement testing in relation to their teaching would reveal that the content assessed in the test is critical in successful course completion;
Generalization inference:
● The test would yield scores with high reliability;
Explanation inference:
● Students would perform significantly better on the second administration of the test.

Eda, Itomitsu, & Noda (2008)
Generalization inference:
● The test would yield scores with high reliability;
Explanation inference:
● The test would effectively differentiate students at different course levels;
Extrapolation inference:
● The scores on the test would be positively correlated with the scores on other tests (an in-house placement test and the OPI);
● Placement decisions made based on scores of the test would be in agreement with those based on scores of the in-house placement test and the OPI.

Heilenman (1983)
Explanation inference:
● Students who are enrolled in the progressively higher course levels would perform with higher scores on the cloze test than students at the preceding curricular levels;
Extrapolation inference:
● The scores on the cloze test would be positively correlated with the scores on the Reading and Writing parts of the placement test.

Long, Shin, Geeslin, & Willis (2018)
Domain description inference:
● Assessment items would be matched with corresponding course content;
Generalization inference:
● The test would yield scores with high reliability;
Explanation inference:
● There would be a strong relationship between the scores on the web-based test and the paper-based test;
● Placement decisions made based on scores of the web-based test would be in agreement with those based on scores of the paper-based test.

Mozgalina & Ryshina-Pankova (2015)
Generalization inferences:
● The C-test would yield scores with high reliability;
Explanation inferences:
● The C-test would elicit a wide distribution of scores from examinees of differing abilities;
● The scores on the new C-test would be positively correlated with scores on the old C-test;
● Average C-test scores would increase between the beginning and the end of the semester;
● Students who are enrolled in the progressively higher curricular levels would perform with higher scores on all five texts than students at the preceding curricular levels;
Extrapolation inferences:
● The new C-test scores would be positively correlated with the scores on the Reading and Listening comprehension parts of the placement test.

Norris (2004)
Generalization inferences:
● The C-test would yield scores with high reliability;
Explanation inferences:
● The C-test would elicit a wide distribution of scores from examinees of differing abilities;
● Average C-test scores would increase between the beginning and the end of the semester;
● Students who are enrolled in the progressively higher curricular levels would perform with higher scores on all five texts than students at the preceding curricular levels;
● There would be positive relationships between the three placement exam sub-tests;
Utilization inference:
● The errors associated with specific cut-scores on the tests would be small enough for the scores to be useful for making placement decisions;
● Teachers would perceive the test as a useful and effective tool for making accurate placement decisions.

Note: Validity evidence reported in these studies was organized and categorized by inference in the argument-based validity framework.

Recognizing the need to address the methodological limitations identified in previous studies, with the current research I aim to provide a more comprehensive evaluation of foreign language placement tests. Therefore, my goal with my present study is to expand upon the existing validation research on foreign language placement testing by gathering and presenting validity evidence (backing) in an argument-based validation framework (following the seven inferences described above) that can be utilized to comprehensively evaluate the intended interpretation and use of test scores in the context of Chinese placement testing for a college-level language program. In addition, the study provides insight into how the validity evidence collected through the validation process can inform test revisions. Guided by these main purposes, I formulated the research questions shown below, focusing on obtaining backing for the domain description, evaluation, generalization, explanation, extrapolation, utilization, and consequence implication inferences:

RQ1: Do observations of performance on the MSU Chinese placement test reveal relevant Chinese knowledge, skills, and abilities required for the successful completion of language courses offered by the MSU Chinese program (warrant 1, see more information in Table 4)?
RQ2: Do tasks on the MSU Chinese placement test exhibit desired statistical characteristics (warrant 2)?
● Do test items yield item difficulty estimates that are appropriate for making placement decisions?
● Do test items show no evidence of item bias?
● Are correct options unambiguous and accurately keyed?
RQ3: Are score-based results generalizable to various measurement contexts (warrant 3)?
● Does the MSU Chinese placement test produce scores that are internally consistent?
● Are there adequate items to reliably differentiate students’ abilities into three levels as intended?
RQ4: Can students’ test scores be attributed to the construct of interest (warrant 4)?
● Does students’ test performance vary according to the amount and quality of prior Chinese learning experience?
● Do students’ test scores support the internal structure of the intended construct?
RQ5: Do students’ test scores support the relationship between their performance on the test and other indicators of Chinese language proficiency (warrant 5)?
RQ6: Does the test produce results that are useful for test users (warrant 6)?
● From the perspective of course instructors, are students placed into appropriate course levels?
● From the perspective of students, are they placed into appropriate course levels? ● Are cut-off scores set appropriately? RQ7: Does the test have positive effects on Chinese teaching and learning (warrant 7)? 20 CHAPTER 3: METHODOLOGY Participants Examinees of the MSU Chinese placement test (before test revisions) The testing data for this study are pre-existing and come from 305 examinees (152 females, 153 males) who took the Chinese placement test and were planning to take Chinese language courses at Michigan State University (MSU). The examinees took the test between 2016 and 2020, and they were between the ages of 14 and 65 (Median = 18, Mean = 19.1, SD = 4.3) when they were taking the test. The anonymized data were provided to me as a loan after obtaining IRB approval. The data were collected over multiple years (2016 to 2020). Examinees of the ACTFL language proficiency tests (before test revisions) Among the 305 examinees whose placement test data were included in the analysis, 55 examinees’ Chinese language skills were also measured using the ACTFL language proficiency tests from Language Testing International (LTI, https://www.languagetesting.com/) in speaking (the computerized oral proficiency test, or OPIc), reading (Reading Proficiency Test, or RPT), and listening (Listening Proficiency Test, or LPT). Students in the Chinese program at MSU completed the ACTFL tests during 2014-2018 as a curriculum requirement in conjunction with the federal grant, known as the Language Proficiency Flagship Initiative (see Winke et al., 2020). I borrowed the ACTFL test data as well, anonymized, but with codes to match them with the placement test data. Each student was offered three tests, but not all students took all three tests. Some students took the same test more than once as they were studying Chinese for more than one academic year at MSU. In such a case, I only included one test score for analysis purposes. The score considered was taken closest in time to the placement test for students who had taken the same test multiple times. 21 Examinees of the MSU Chinese placement test (after test revisions) As will be explained in the procedures section, I collected new data from a separate cohort of students to gather data on students’ performance in the revised placement test as well as their perceptions of the placement test post-item-analysis and test revisions. In the first week of Spring 2022, I sent out an email and invited all students who were enrolled in the Chinese language courses at MSU to participate in a three-phase research project (see more information in the procedure section). Thirty-seven students completed the first phase; thirty-two students completed the first two phases, and twenty-eight students completed all three phases. Table 2 presents the demographic information and Chinese learning background of the 28 students who completed all three phases of the study. After students completed the third phase, I reached out to 7 students (100-level: n = 2; 200-level: n = 3; 300-level: n = 2) and invited them to participate in semi-structured interviews. Table 2. 
Demographic information of examinees of the placement test post revision

Course level    n     Gender (n)                Age           Years of Chinese instruction before college
                      Female / Male / Other     Mean (SD)     Mean (SD)
100            13     9 / 3 / 1                 18.8 (0.7)    4.15 (4.4)
200             9     4 / 4 / 1                 19.9 (1.5)    6.5 (3.9)
300             6     2 / 4 / –                 20.3 (0.8)    4.67 (1.8)
Total          28     15 / 11 / 2               19.4 (1.2)    5 (3.8)

Chinese course instructors

The research project involved the participation of all three instructors from the Chinese program at MSU. These instructors were selected based on their qualifications, which included having at least 10 years of Chinese teaching experience and a minimum of 5 years teaching Chinese at MSU. As the entirety of the program’s faculty, they provided a complete representation of the instructors involved. According to the results of the instructor questionnaire, each of these instructors has taught Chinese language courses at the 100-, 200-, and 300-levels at MSU.

Instruments

Michigan State University Chinese placement test

The MSU Chinese placement test is designed to help students register for the appropriate level of Chinese course by determining their starting level for college language study at MSU, based on their proficiency in Chinese. The test begins with a background questionnaire related to examinees’ Chinese learning experiences, consisting of eight questions about their language learning history, such as the number of years spent studying Chinese, any standardized Chinese test scores, family connections to the Chinese language, and time spent in a Chinese-speaking country. The test comprises four language assessment sections: listening, reading, speaking, and writing. The test was implemented in Qualtrics in 2016 (see the link to the test: https://msu.co1.qualtrics.com/jfe/form/SV_a2C5uBOWKlTCdoN). The placement test is not timed, but examinees usually finish the test within 25 minutes to an hour. The current study only includes students’ responses to the questions in the listening and reading sections, as only those who score above a certain level on these sections have their speaking and writing sections scored by instructors in the Chinese program. The listening section features 14 multiple-choice questions, while the reading section contains 18 multiple-choice questions, each offering three or four choices. The test questions align with the course curriculum, as they were initially drafted by the program’s instructors and based on materials or content provided to students during the various semester-level courses. Consequently, the placement test is a compilation of the language program’s content, organized from start to finish according to instructional levels.

Examinees’ total scores on the receptive part of the test are the sum of their scores on the 32 multiple-choice questions in the listening and reading sections. Students who receive scores of 19 or below are recommended for placement in 100-level (CHS101 or CHS102) courses, while those scoring between 20 and 29 are recommended for 200-level courses. Students with scores of 30 and above are tentatively approved to enroll in a 300-level course (CHS301) after an evaluation of their writing and speaking skills. They are also required to complete an in-person language assessment interview with an instructor during the first week of class to verify their language-level placement. The cut-off scores were determined in a pilot study prior to the test’s official launch in 2016.
During the piloting stage, the test was administered to students enrolled in 100-, 200-, and 300-level Chinese courses offered by the program. The cut-off for each level was set to the score one standard deviation above the mean score obtained by students enrolled in the corresponding course level. The rationale behind this decision was to place students in the highest level course in which they have a good chance of success. While students' placement decisions are largely determined by their total scores from the listening and reading sections, their responses in the speaking and writing sections assist teachers in evaluating the accuracy of upper-level placement decisions. Questionnaire for instructors I developed a questionnaire to gather content experts' (i.e., Chinese language course instructors) perceptions of item difficulty, content representation, and relevance, in order to assess the extent to which the test captures the target domain. The questionnaire, implemented in Qualtrics (see the link: https://msu.co1.qualtrics.com/jfe/form/SV_9zVfsONQfsgPnp4), consists of three parts. The first part focuses on basic information about the course the instructor is 25 teaching, such as course level, class size, and evaluation criteria. This section concludes with a question that probes instructors' perceptions of the essential skills and knowledge required for successful completion of Chinese language courses at each level. The second part features a survey with the 32 items from the placement test. Instructors are asked to rate these items on a 6- point Likert scale in terms of overall item difficulty (1: very easy; 6: very difficult). Furthermore, instructors are presented with a series of checkboxes for each item to assess its relevance and appropriateness to the content of 100-level, 200-level, and 300-level courses. A separate checkbox is provided for instructors to indicate if the item is not relevant to any of the three course levels. Instructors are instructed to mark the appropriate checkbox(es) and leave the others unchecked. The final section collects feedback on the placement test and solicits suggestions for improvement. Interviews with instructors I conducted semi-structured interviews with all three Chinese course instructors. Each interview lasted approximately an hour and followed a set of six predetermined questions, which can be found in Appendix 1. Subsequently, the instructors were presented with their responses from the questionnaire during the interview and asked to provide further elaboration or clarification on their answers. I chose semi-structured interviews because they allowed for flexibility in exploring the instructors' experiences and perspectives, while still maintaining a consistent framework for comparing their responses. This approach facilitated rapport building with the instructors (DiCicco-Bloom & Crabtree, 2006) and enabled the collection of richer, more nuanced data to better understand their teaching strategies and challenges in mixed- proficiency classrooms (Galletta, 2013). While the primary focus was on the pre-drafted questions, I also explored new areas of discussion as they emerged during the interview. The 26 interviews were conducted in Mandarin Chinese and translated into English in two stages. The initial translation was done using machine translation software, Xunfeitingjian (https://www.iflyrec.com/zhuanwenzi.html). I then invited another researcher, who is a highly proficient L2 Mandarin speaker, to review the translation with me. 
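To make the cut-off procedure and the score bands described earlier in this section concrete, the sketch below shows one minimal way such a computation could be carried out. It is an illustration only, not the program’s actual procedure or code: the pilot numbers are invented, the rounding step is an assumption, and pandas is assumed to be available. The operational score bands (19 and below, 20–29, 30 and above) are taken from the test description above.

```python
# Hedged illustration of (a) deriving cut-offs as mean + 1 SD of pilot scores per
# course level, and (b) applying the operational score bands. Toy data only.
import pandas as pd

# Invented pilot data: course level of enrolled pilot examinees and their total
# score on the 32 listening/reading items (NOT the actual pilot results).
pilot = pd.DataFrame({
    "course_level": [100] * 4 + [200] * 4 + [300] * 4,
    "total_score":  [12, 15, 17, 20, 21, 24, 26, 28, 29, 30, 31, 32],
})

# Rule described in the text: one standard deviation above the mean score of
# students enrolled in the corresponding course level (rounding is an assumption).
stats = pilot.groupby("course_level")["total_score"].agg(["mean", "std"])
cutoffs = (stats["mean"] + stats["std"]).round()
print(cutoffs)

def recommend_level(total_score: int) -> str:
    """Operational score bands reported for the current test (out of 32 points)."""
    if total_score <= 19:
        return "100-level (CHS101 or CHS102)"
    elif total_score <= 29:
        return "200-level"
    return "300-level (CHS301), pending speaking/writing review and interview"

print(recommend_level(24))  # -> "200-level"
```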
We identified and discussed any translation errors or ambiguities, and made the necessary adjustments. Questionnaire for students I created a questionnaire to assess students’ perceptions of item difficulty, content representation, and relevance. The questionnaire is implemented in Qualtrics (see the link: https://msu.co1.qualtrics.com/jfe/form/SV_734ClDDR4uChQIm) and consists of three parts. The first part gathers students’ personal information and inquires whether they took the MSU placement test prior to their enrollment to the first Chinese language course at MSU. If so, they were asked about a few questions related to their perception of the accuracy of the placement test. The first part concludes with a question that taps into students’ perceptions of the essential skills and knowledge that are required for successful completion of Chinese language courses that they were placed into. The second part is a survey with 32 items in the placement test. Students were asked to rate these items on a 6-point Likert scale in terms of the overall item difficulty (1:very easy; 6: very difficult). In addition, students were asked to rate on the relevance and appropriateness to the content of the course they were taking (e.g., 1 = the item is NOT relevant to the course that I am taking; 6 = the item is highly relevant to the course that I am taking). Note that the reason for using different ways to assess the relevance and appropriateness of test items to course content for students and instructors is that students are enrolled in different levels of Chinese language courses, and therefore some test items may be more appropriate and relevant to higher-level courses, while others may be more suitable for lower-level courses. 27 Instructors are experts in their field and are better equipped to evaluate the difficulty level and appropriateness of test items across different levels of courses. On the other hand, students are the ones better able to judge the relevance of the test items to the specific course material they are studying. By using different instruments, I can obtain more accurate and informative data on the difficulty level, appropriateness, and relevance of test items for different levels of Chinese language courses. The final section gathers feedback on the placement test and suggestions for improvement. The complete questionnaire is available in Appendix 2. Interview with students As noted earlier, I reached out to seven students and invited them for 45-minute semi- structured interviews. The interviews were guided by six pre-determined questions (see Appendix 3) and were conducted in English. Similar to the interviews with the instructors, I relied on the pre-drafted questions, but I went off-script and pursued other lines of inquiry when necessary. Subsequently, the students were presented with the responses they provided in the questionnaire and were asked to expand upon their responses with clarifying comments. Procedures Obtaining the pre-existing data The pre-existing data for this study consists of (1) the testing data for the MSU Chinese placement test collected from 2016 to 2020 (hereinafter referred to as the placement test) and (2) the proficiency data from the American Council on the Teaching of Foreign Languages (ACTFL) Chinese language proficiency tests. I directly obtained the anonymized, pre-existing placement test data from the professor who designed and maintains the test for the MSU Chinese program. 
The professor downloaded the data from the MSU Qualtrics site where the test data is stored without names, and with codes instead. Additionally, I obtained the anonymized, pre- 28 existing ACTFL Chinese language proficiency test data from the Principal Investigator of the Language Proficiency Flagship Initiative, also with codes, not names. It is important to note that the ACTFL testing data was included only for those who also had available placement test data. Revising the MSU Chinese placement test Utilizing the pre-existing placement testing data and the employment of Rasch analysis and item-level analysis, I identified issues with a number of items in terms of their item characteristics. These psychometrically problematic items were flagged and revised according to the literature on Chinese grammar rules as well as the feedback from two L1 Chinese speakers with PhD in applied linguistics, a former Chinese course instructor at MSU and a language testing researcher. More information about the test revisions is provided in the results section. Collecting data from instructors In the Spring of 2022, I approached three instructors who were teaching Chinese language courses at the 100-level, 200-level, and 300-level. All three instructors agreed to participate in the research project, and were asked to complete a questionnaire evaluating their perceptions of the difficulty of the items in the placement test, content representation, relevance, and the accuracy of the placement test results. Upon completing the questionnaire, the instructors were invited to participate in a one-hour one-on-one, semi-structured interview. Collecting testing data using the revised placement test from students In Spring 2022, I obtained the consent of Chinese language course instructors to invite their enrolled students to participate in my research project. I emailed invitations to all eligible students, and out of the 37 students who expressed interest, 28 successfully completed all three phases of the study (100-level: n = 13; 200-level: n = 9; 300-level: n = 6). In the first phase, students were asked to take the revised MSU Chinese placement test during the first week of 29 Spring 2022. Similar to the original test, the revised version began with a background questionnaire about the examinee's Chinese language learning experiences, followed by 32 multiple-choice items (14 listening and 18 reading). In the second phase, students took the revised placement test again in the final week of Spring 2022. After completing the test, I sent a questionnaire to students to assess their perceptions of item difficulty, content representation, and relevance. Participants were compensated with either $30 or extra credit for their participation in the research project. As mentioned earlier, seven students were invited to participate in 45- minute one-on-one semi-structured interviews. These seven students received an additional $10 for their participation in the interviews. Data analysis Table 3 presents the methods for analysis that are used to answer each research question. I employed both quantitative and qualitative approaches to analyzing the data. 
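Before turning to the individual analyses, a minimal sketch may help readers picture the item-level statistics referred to throughout this chapter. The sketch below computes classical difficulty (proportion correct) and corrected item-total discrimination on a binary-scored response matrix; it is a simplified classical-test-theory analogue, not the Rasch or DIF analyses actually conducted in this dissertation, and the response matrix and flagging thresholds are invented for demonstration.

```python
# Hedged illustration: classical item analysis on a toy 0/1 response matrix.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 50 hypothetical examinees x 32 dichotomously scored items (1 = correct, 0 = incorrect)
responses = pd.DataFrame(rng.integers(0, 2, size=(50, 32)),
                         columns=[f"item_{i + 1}" for i in range(32)])

total = responses.sum(axis=1)

item_stats = pd.DataFrame({
    # Classical difficulty: proportion of examinees answering the item correctly
    "difficulty_p": responses.mean(),
    # Corrected discrimination: correlation of each item with the total score
    # excluding that item (point-biserial on binary data)
    "discrimination_r": [
        responses[col].corr(total - responses[col]) for col in responses.columns
    ],
})

# Flag items that might warrant review; these thresholds are common rules of thumb,
# not the criteria used in the dissertation.
flagged = item_stats.query("difficulty_p < 0.2 or difficulty_p > 0.9 or discrimination_r < 0.2")
print(item_stats.round(2).head())
print(flagged.index.tolist())
```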
For the interview data, I used the iterative qualitative data coding procedures (open coding, theme development, and coding for patterns) described in Baralt (2012) to examine instructors' and students' perceptions of the accuracy of the placement test results (utilization inference), the effects of the test on teaching and learning (consequence implication inference), as well as the test content relevance and representativeness (domain description inference). For the quantitative testing data, I conducted Rasch analysis, differential item functioning (DIF) analysis, and item analysis to address the evaluation inference. I reported Rasch-based reliability estimates to evaluate the generalization inference. To investigate the explanation inference, I compared students' placement test performance at the beginning and the end of the semester and across different course levels. In addition, I conducted an exploratory factor analysis to examine whether the results support the hypothesized internal structure of the construct measured by the placement test. To examine the extrapolation inference, I calculated polyserial correlation coefficients to assess the relationships between students' placement test performance and their scores on the ACTFL language proficiency tests. The polyserial correlation coefficient is better suited for calculating the correlation between a continuous variable and an ordinal variable than other correlation coefficients commonly used by applied linguists, including Pearson's r, Spearman's rho, and Kendall's tau (Winke, Zhang, & Pierce, 2022). To evaluate whether the cut-off scores for the placement test are set appropriately (utilization inference), I analyzed teacher ratings on item relevance and assessed item distribution across course levels. To examine what consequence implications the test has for Chinese teaching and learning, I reported results from the questionnaire and interview data collected from teachers and students. For quantitative data preparation, I binary-scored students' placement test responses as 1 (correct) or 0 (incorrect) for multiple-choice items, with unanswered questions coded as missing data (X). Test scores on the ACTFL language proficiency tests are linked directly to the ACTFL Proficiency Guidelines, a framework that describes language proficiency in terms of "functional ability" (ACTFL, 2012, p. 3), that is, what individuals can do with the target language in each skill (i.e., speaking, reading, listening, writing). For each skill, the guidelines feature five major levels of proficiency: Distinguished, Superior, Advanced, Intermediate, and Novice. The major levels Advanced, Intermediate, and Novice are subdivided into High, Mid, and Low sublevels to distinguish the language learners at these levels more clearly. For the ACTFL language proficiency data, I assigned a numeric value to each ACTFL proficiency level on a scale of 1 (Novice Low) to 10 (Superior), a practice following prior research (e.g., Isbell et al., 2018; Kenyon & Malabonga, 2001; Ma & Winke, 2019; Tigchelaar et al., 2017; Zhang et al., 2020).
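A minimal sketch of this data-preparation and extrapolation step is shown below. It assumes a data frame named scores with one row per student, a numeric placement score (placement_total), and an ACTFL sublevel stored as text (actfl_level); the polyserial coefficient is computed here with the polyserial() function from the polycor package, which is one way (not necessarily the one used in this study) to obtain the statistic.

# Minimal sketch (R): map ACTFL sublevels to the 1-10 scale and correlate them with placement scores.
# 'scores' is an assumed data frame with columns placement_total (numeric) and actfl_level (character).
library(polycor)

actfl_scale <- c("Novice Low" = 1, "Novice Mid" = 2, "Novice High" = 3,
                 "Intermediate Low" = 4, "Intermediate Mid" = 5, "Intermediate High" = 6,
                 "Advanced Low" = 7, "Advanced Mid" = 8, "Advanced High" = 9,
                 "Superior" = 10)

scores$actfl_numeric <- actfl_scale[scores$actfl_level]

# Polyserial correlation between the continuous placement score and the ordinal ACTFL level
polyserial(scores$placement_total,
           ordered(scores$actfl_numeric, levels = 1:10))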
Table 3. Summary of the warrants, assumptions, and associated backing in the MSU Chinese placement test interpretive argument

Domain description (Warrant 1)
Assumptions underlying warrant: The relevance of the test items and test criteria to the instructional domain and the appropriateness of the item difficulties are supported by test stakeholders.
Sources for backing: Questionnaire and interview data about test content relevance from course instructors*; questionnaire and interview data about test content relevance from students*

Evaluation (Warrant 2)
Assumptions underlying warrant: Item difficulty estimates are appropriate for making placement decisions; test items exhibit no evidence of item bias; correct responses are unambiguous and accurately keyed.
Sources for backing: Rasch analysis; teachers' perceptions of item difficulties; students' perceptions of item difficulties; item difficulties computed from students' actual test performance; DIF analysis; item-level analysis

Generalization (Warrant 3)
Assumptions underlying warrant: The test produces scores that are internally consistent; the test yields satisfactory item reliability; the test yields satisfactory person reliability.
Sources for backing: Cronbach's alpha; Rasch-based item reliability estimates; Rasch-based person reliability estimates

Explanation (Warrant 4)
Assumptions underlying warrant: Test performance varies according to the amount and quality of experience in learning Chinese; test scores support the internal structure of the construct.
Sources for backing: Comparison of test performance in the first and second administrations*; comparison of test performance between students at different course levels*; exploratory factor analysis

Extrapolation (Warrant 5)
Assumptions underlying warrant: Scores on the MSU Chinese placement test are positively correlated with scores on the RPT, LPT, and OPIc.
Sources for backing: Polyserial correlation analysis

Utilization (Warrant 6)
Assumptions underlying warrant: Test users (i.e., instructors) judge the scores to be useful; cut-off scores are set appropriately.
Sources for backing: Questionnaire and interview data from course instructors about the accuracy of placement decisions; interview data from students about the accuracy of placement decisions; analysis of instructor ratings on item relevance and assessment of item distribution across course levels

Consequence implication (Warrant 7)
Assumptions underlying warrant: The test has positive effects on Chinese instruction and learning.
Sources for backing: Questionnaire and interview data from teachers and students

Note: *Analysis conducted using the testing data from the revised MSU Chinese placement test

Criteria to determine strong, weak, or counter-evidence for the proposed validity argument
The empirical results from the analyses were evaluated against the following criteria to determine whether they provide strong, weak, or counter-evidence for the proposed validity argument. Establishing these criteria is vital within the argument-based validation framework, which emphasizes the systematic organization of validity evidence and urges researchers to articulate the explicit inferences and assumptions that underlie the validity argument. Offering clear criteria for assessing the strength of the evidence ensures that the study's conclusions are robust, reliable, and well-founded. Additionally, this approach improves the research's transparency and comprehensibility, enabling readers and stakeholders to better grasp the implications of the results for the placement test's validity.
1.
Domain description The relevance of the test items and test criteria to the instructional domain and the appropriateness of the item difficulties are supported by test stakeholders. ● Strong evidence: 5% or less of the items are considered not relevant to class content/evaluations; the appropriateness of the item difficulties is supported by test stakeholders ● Weak evidence: 10% or less of the items are considered not relevant to class content/evaluations. ● Counter evidence: More than 10% of the items are considered not relevant to class content/evaluations; the appropriateness of the item difficulties is not supported by test stakeholders 34 2. Evaluation Item difficulty estimates are appropriate for making placement decisions. ● Strong evidence: o Based on the Wright map, the item difficulty estimates target the students given their ability estimates and thus are appropriate for assessing and differentiating the students. o The item difficulty estimates that are computed from students’ actual test performance are consistent with the expected difficulty level for the intended audience (students’ and/or teachers’ perceptions of item difficulties). ● Counter evidence: o Based on the Wright map, the items are too difficult or easy for the students, and thus are not appropriate for assessing and differentiating the students. o The item difficulty estimates derived from students' actual test performance do not align with the expected difficulty level for the intended audience (students’ and/or teachers' perceptions of item difficulties). Test items exhibit no evidence of item bias ● Strong evidence: 5% or less of the items exhibited DIF across gender ● Weak evidence: 10% or less of the items exhibited DIF across gender ● Counter evidence: More than 10% of the items exhibited DIF across gender Correct responses are unambiguous and accurately keyed. ● Strong evidence: 95% or more of the items are shown to be unambiguous and accurately keyed 35 ● Weak evidence: 90% or more of the items are shown to be unambiguous and accurately keyed ● Counter evidence: Less than 90% of the items are shown to be unambiguous and accurately keyed 3. Generalization The test produces scores that are internally consistent ● Strong evidence: Cronbach’s alpha is above .8 ● Weak evidence: Cronbach’s alpha is above .7 ● Counter evidence: Cronbach’s alpha is below .7 There are adequate items to reliably differentiate students’ abilities into three levels as intended ● Strong evidence: Person reliability is above .9; person separation index is above 2 ● Weak evidence: Person reliability is above .8; person separation index is above 1.5 ● Counter evidence: Person reliability is below .8; person separation index is below 1.5 4. Explanation Test performance varies according to the amount and quality of experience in learning Chinese ● Strong evidence: o There is a significant and meaningful change (medium to large effect) in students' test performance from the beginning of the semester to the end of the semester. o There is a significant and meaningful (medium to large effect) difference in test performance among students at different course levels. 36 ● Weak evidence: o There is a statistically significant but less meaningful change (small effect) in students' test performance from the beginning of the semester to the end of the semester. o There is a statistically significant but less meaningful (small effect) difference in test performance among students at different course levels. 
● Counter evidence: o There is no statistically significant change in students' test performance from the beginning of the semester to the end of the semester. o There is no statistically significant difference in test performance among students at different course levels. Test scores support the internal structure of the construct ● Strong evidence: Compared to alternative models, there is strong evidence supporting a single factor (overall Chinese language ability) or a two-factor (reading and listening Chinese ability) model. The listening and reading items are loaded onto the corresponding factor. ● Weak evidence: Compared to alternative models, there is NO strong evidence against a single factor (overall Chinese language ability) or a two-factor (reading and listening Chinese ability) model. The listening and reading items are loaded onto the corresponding factor. ● Counter evidence: Compared to alternative models, there is strong evidence against a single-factor (overall Chinese language ability) or a two-factor (reading and listening 37 Chinese ability) model. The listening and reading items are not loaded onto the corresponding factor. 5. Extrapolation Scores on the MSU Chinese placement test are positively correlated with scores on ACTFL proficiency tests. ● Strong evidence o There is a moderate to strong correlation (r ≥ 0.40) between students' scores on the Chinese placement test and on ACTFL proficiency tests for corresponding skills (e.g., listening and listening). o There is a positive correlation between students' scores on the Chinese placement test and on ACTFL proficiency tests for non-corresponding skills (e.g., speaking and listening); however, the strength of this correlation is expected to be weaker compared to the correlation between corresponding skills. ● Weak evidence: o There is a weak positive correlation (0.20 ≤ r < 0.40) between students' scores on the Chinese placement test and on ACTFL proficiency tests for corresponding skills (e.g., listening and listening). o The correlation between students' scores on the Chinese placement test and on ACTFL proficiency tests for non-corresponding skills (e.g., speaking and listening) is stronger, which is not consistent with the expectation of a weaker correlation compared to corresponding skills. 38 ● Counter evidence: o There is no or a negative correlation (-1.00 ≤ r < 0.20) between students' scores on the Chinese placement test and on ACTFL proficiency tests for corresponding skills (e.g., listening and listening). o The correlation between students' scores on the Chinese placement test and on ACTFL proficiency tests for non-corresponding skills (e.g., speaking and listening) is stronger, which is not consistent with the expectation of a weaker correlation compared to corresponding skills. 6. Utilization Test users (i.e., instructors and students) judge the scores to be useful. ● Strong evidence: From test users’ perspective, the test scores place most students, given their language ability levels, in appropriate Chinese language classes. ● Counter evidence: From test users’ perspective, the test scores do not place most students, given their language ability levels, in appropriate Chinese language classes. Cut-off scores are set appropriately ● Strong evidence: o There is an even distribution of items across all course levels. o The cut-off scores align well with the instructors' perceptions and the distribution of items, resulting in accurate placement of students in appropriate course levels. 
● Weak evidence:
o There is an imbalanced distribution of items across the course levels, which may lead to less accurate measurement of students' language proficiency at certain levels.
o The cut-off scores partially align with the instructors' perceptions and the distribution of items, but there is room for improvement in the accuracy of student placement.
● Counter evidence:
o There is a highly imbalanced distribution of items across the course levels, leading to an inaccurate measurement of students' language proficiency at certain levels.
o The cut-off scores do not align with the instructors' perceptions or the distribution of items, resulting in inaccurate placement of students in appropriate course levels.
7. Consequence implication
The test has positive effects on Chinese instruction and learning
● Strong evidence: Stakeholders indicate positive effects of the test on what teachers teach and how students learn.
● Counter evidence: Stakeholders indicate negative effects of the test on what teachers teach and how students learn.
CHAPTER 4: RESULTS
The descriptive statistics for the placement test scores of the 305 test takers who took the test between 2016 and 2020 (collapsed across years to further preserve anonymity) are shown in Table 4. The table indicates that most students were advised to take 100- or 200-level courses, while only a small group of students with scores above 30 were recommended to take the 300-level course. As for the score comparison between female and male examinees, the descriptive statistics suggest that female and male examinees performed comparably on the placement test. In addition, examinees who self-identified as heritage speakers (speaking the language at home) performed better than their peers who did not have family connections to the language. Not surprisingly, students who self-identified as L1 speakers of Chinese (born and raised in a Chinese-speaking country, having graduated from a Chinese-speaking high school in that country, and at MSU as international students to obtain a degree) performed exceptionally well on the test, with scores in close proximity to the maximum achievable score. As described in the section on data analysis, I utilized Rasch measurement methods to identify potential problems and determine whether revisions to the placement test should be made in order to support appropriate uses and interpretations of test scores. The Rasch analysis used the item responses collected from the 305 examinees who took the placement test between 2016 and 2020. The analysis was conducted using the computer program WINSTEPS Version 4.7.1 (Linacre, 2016).

Table 4. Descriptive statistics for test scores
                      N     Mean (SD)     95% CI
Total score           305   21.4 (6.5)    [20.7, 22.1]
  0 - 19              126   14.8 (2.90)   [14.3, 15.3]
  20 - 29             137   24.5 (3.01)   [24.0, 25.1]
  30 - 32             42    30.9 (.68)    [30.7, 31.1]
Gender
  Female              152   21.7 (6.8)    [20.6, 22.8]
  Male                153   21.2 (6.3)    [20.2, 22.2]
Learner type
  Native speakers     15    30.2 (2.4)    [28.9, 31.5]
  Heritage speakers   73    25.1 (5.91)   [23.7, 26.4]
  Others              217   19.6 (5.87)   [18.8, 20.3]

Prior to conducting the main analysis, I first performed a principal components analysis (PCA) of the residuals, a standard Rasch approach to examining whether all items in the test can be considered unidimensional within the Rasch framework (Linacre, 1998). This is important because evidence of unidimensionality is required for Rasch analysis to yield accurate and reliable measurements of the underlying construct being assessed (Eckes, 2015).
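The study ran this check in WINSTEPS; the sketch below illustrates the same idea in R under stated assumptions: a dichotomous response matrix resp (305 examinees by 32 items), a Rasch model fit with the TAM package, standardized residuals computed from the model-implied probabilities, and a PCA of those residuals whose first eigenvalue serves as a rough analog of the WINSTEPS first-contrast eigenvalue, to be compared against the benchmark described in the next paragraph.

# Minimal sketch (R): PCA of standardized Rasch residuals as a unidimensionality check.
# 'resp' is an assumed 0/1 matrix of item responses (rows = examinees, columns = items).
library(TAM)

resp  <- as.matrix(resp)
mod   <- tam.mml(resp)               # Rasch (1PL) model
theta <- tam.wle(mod)$theta          # person ability estimates (WLE)
b     <- mod$xsi$xsi                 # item difficulty estimates

# Model-implied probabilities and standardized residuals
p    <- plogis(outer(theta, b, "-"))           # P(correct) for each person-item pair
zres <- (resp - p) / sqrt(p * (1 - p))

# First eigenvalue of the residual correlation matrix; a large value (e.g., above 2 in item
# units) would suggest a secondary dimension, as discussed in the text.
ev <- eigen(cor(zres, use = "pairwise.complete.obs"))$values
ev[1]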
Evidence of multidimensionality is indicated by an eigenvalue greater than 2.0 for the first factor in the PCA and by a disattenuated correlation less than 1. The PCA of the total 32 items revealed evidence of multidimensionality which was indicated by an eigenvalue greater than 2 for the unexplained variance in the first factor. A close examination of the test showed that five items (see Appendix 4 for detailed information about these items) that were related to a common stimulus loaded heavily on the same dimension (see Table 5). In Rasch modeling, the group of items or the questions related to the same topic or prompt in a test is known as a testlet (Wang et al., 2005), which often results in locally dependent items (a form of violation of the assumption of 42 unidimensionality). To address the issue, I bundled the five items into a polytomous super-item and re-conducted the PCA. After the revision, the results of the PCA provided no evidence of multidimensionality as indicated by an eigenvalue of 1.92 and by a disattenuated correlation equal to 1.00 between the theta measures (i.e., test-takers’ ability levels) on items clusters in contrasts. Table 5. PCA results for the five items that loaded on the same dimension Item No. Loading Measure Infit MNSQ Outfit MNSQ Reading 10 .59 -.99 .84 .71 Reading 7 .49 -1.99 .87 .43 Reading 9 .49 -1.55 .91 .93 Reading 8 .45 -1.50 .89 .74 Reading 11 .28 -.95 1.00 .90 I then considered item fit and evaluated whether all items in the test measured the construct, Chinese language proficiency, as intended. An item is considered to have a good fit when it generates item responses that align with what the model predicts. Fit to the model associated with each item is assessed by two Rasch-based statistics, mean square outfit and mean square infit. Infit and outfit mean squares have an expected value of 1.0, which suggests that the item generates responses that align with what the model predicts (e.g., a more difficult item would likely elicit a correct response from a proficient test-taker but would likely elicit an incorrect response from a less proficient test-taker). Outfit mean squares are more sensitive to unexpected responses by persons on items that are very easy or very hard for them, whereas infit mean squares weigh the observations by their statistical information and are consequently more sensitive to unexpected responses by persons on items roughly targeted on them (Linacre, 2016). I employed the cut-off values of 0.6 and 1.4 as acceptable infit and outfit mean squares, as 43 suggested by Wright and Linacre (1994). A higher or lower value of the fit statistics indicates that the responses generated by the associated item were either too predictable (less than 0.6, overfit the model) or too unexpected (larger than 1.4, underfit the model). Finally, the items with the extreme outfit or infit mean square values were flagged as misfitting items and were closely examined by me to inspect reasons for the misfit. The infit and outfit statistics suggested that out of 32 items, three items (see Table 6 for fit statistics), reading items #2, #3, and #12 displayed outfit mean square values outside the cut- off range. These misfitting items all had large outfit values, suggesting that the items elicited a few unexpected responses from test takers given their ability levels and the item difficulties. 
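To make the flagging rule concrete, the sketch below shows one way to extract infit and outfit mean squares in R and flag items outside the 0.6-1.4 range described above; it assumes the same TAM model object mod from the earlier sketch, and the column names follow TAM's tam.fit() item-fit table (the fit values reported in the study came from WINSTEPS, so this is only an illustration of the procedure).

# Minimal sketch (R): flag items whose infit or outfit mean squares fall outside 0.6-1.4.
# 'mod' is the Rasch model fitted with TAM::tam.mml() in the earlier sketch; column names
# (parameter, Infit, Outfit) follow TAM's tam.fit() output and should be checked locally.
library(TAM)

fit <- tam.fit(mod)$itemfit

misfit <- subset(fit,
                 Outfit < 0.6 | Outfit > 1.4 | Infit < 0.6 | Infit > 1.4,
                 select = c(parameter, Infit, Outfit))
misfit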
More specifically, the responses were considered unexpected when a high-ability test-taker failed to answer an easy item correctly or a low-ability test-taker answered a difficult item correctly. The finding was further confirmed by the low item discriminations for reading items #2 and #12. In other words, these items had a limited capacity to discriminate between test-takers with higher proficiency and those with lower proficiency. I return to these misfitting items and propose strategies for addressing them in the section on [RQ 2c: Evaluation inference].

Table 6. Misfitting items from the MSU placement test
Item          Difficulty estimate (SE)   Outfit MNSQ (z-std)   Infit MNSQ (z-std)   Estimated discrimination
Reading #2    -.21 (.14)                 1.52 (3.03)           1.29 (4.59)          .32
Reading #3    -1.50 (.18)                2.51 (3.67)           1.09 (.86)           .80
Reading #12   1.16 (.14)                 1.51 (4.69)           1.31 (4.37)          .38

[RQ 1: Domain description inference]: Are the relevance of the test items and test criteria to the instructional domain and the appropriateness of the item difficulties supported by test stakeholders?
The research question related to the domain description inference was addressed by examining instructors' and students' questionnaire data about test content relevance and item difficulties. As noted earlier, to evaluate the relevance and appropriateness of test items to the course materials for the 100-level, 200-level, and 300-level courses, instructors were provided with a set of checkboxes for each item. Furthermore, instructors were given an extra checkbox to indicate that the item was not relevant to the content of any of the three courses. The number of test items relevant to course content by instructor and course is presented in Table 7. The data in Table 7 indicate that the instructors demonstrated some divergence of opinion with respect to the course level to which an item was relevant. Nevertheless, a clear trend emerged, showing that most of the test items were considered to be more relevant to lower-level courses than to higher-level courses. More importantly, none of the items were judged by any of the instructors as irrelevant to the course material across all three levels of instruction.

Table 7. Number of test items relevant to course content by instructor and course level
               100-level course   200-level course   300-level course   Irrelevant to any course
Instructor 1   20                 7                  5                  0
Instructor 2   19                 8                  5                  0
Instructor 3   21                 9                  2                  0

As for the students' questionnaire, students were asked to rate test items on two 6-point Likert scales in terms of the overall item difficulty (1 = very easy; 6 = very difficult) as well as the relevance and appropriateness to the course material they were studying (e.g., 1 = the item is NOT relevant to the course that I am taking; 6 = the item is highly relevant to the course that I am taking). Table 8 provides descriptive statistics that illustrate how students perceived the relevance and difficulty of the test items, reported by course level. Further details regarding individual items can be found in Appendix 5. Data are presented by course level to account for the potential influence of the level of the course in which students were enrolled on their perceptions of item difficulties and relevance. Table 8 and Appendix 5 further support this notion, showing that students' perceptions of item relevance to course content and item difficulties vary by course level.
Specifically, in terms of item difficulty, as expected, students in higher-level courses found the test items easier than did those in lower-level courses. As for item relevance to course content, students in 100- and 200-level courses gave higher mean relevance scores than those in 300-level courses. This observation aligns with the instructor ratings, which considered the majority of items to be relevant to lower-level courses. Despite variations in students' perceptions of item relevance across course levels and test items, there is a general pattern indicating that students view the test items as relevant to their course, as evidenced by the relatively high mean relevance scores.

Table 8. Descriptive statistics of students' perceptions of test item relevance and difficulties
             Relevance                    Difficulties
             Mean   SD    95% CI          Mean   SD    95% CI
100-level    4.7    0.8   [4.4, 5]        3      0.9   [2.7, 3.3]
200-level    5.1    0.5   [4.9, 5.2]      2.7    0.8   [2.4, 3]
300-level    4.1    0.4   [3.9, 4.2]      2.2    1     [1.9, 2.6]
Total        4.7    0.5   [4.5, 4.8]      2.7    0.8   [2.4, 3]

[RQ 2a: Evaluation inference]: Do test items yield item difficulty estimates that are appropriate for making placement decisions?
I examined the research question related to the evaluation inference using two distinct methods. First, I applied a Rasch-based approach. Rasch modeling shares an important feature with other item response theory-based models: items and examinees are estimated and compared on a single common scale that is interval-level. This allows for an accurate comparison of examinees' abilities based on equal distances on the scale. Rasch measurement aims to achieve the highest precision in estimating an examinee's ability when the ability estimate aligns with the item difficulty, a concept known as targeting (Bond & Fox, 2015, p. 69). The relationship between a person's ability and item difficulty is of significant interest to researchers, and a Wright map is commonly used to visualize this relationship. To determine whether the item difficulties accurately depict students' abilities and whether the test is sensitive to variations in the measured construct, I considered three key aspects outlined by Beglar (2010): (a) the adequacy of the number of items included in the test; (b) the presence of targeting for the sampled examinees; and (c) any potential gaps in the empirical item hierarchy. Following Beglar's guidelines, I used the Wright map to investigate these factors and evaluate the appropriateness of the item difficulty estimates for making placement decisions. Figure 2 displays the Wright map, where the ruler on the left (MEASR) indicates the logit values corresponding to test takers' ability levels and item difficulties, both measured on the same scale. The item difficulties in Figure 2 range from -2.34 to +1.78 logits, while the test-takers' ability measures cover a broader range (from -1.78 to +4.96). A thorough examination of the Wright map reveals that 75% of the examinees (N = 229) fall within the overlapping range, indicating reasonable targeting of the items to their ability levels. However, the Wright map also reveals a noticeable ceiling effect, with approximately one-fourth of the examinees (N = 76) having ability measures above all item difficulties. This finding aligns with the descriptive statistics in Table 4, which show that 42 examinees scored 30 or higher, suggesting a need for additional, more challenging items to accurately assess these high-ability examinees.
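As a rough numerical complement to the Wright map, the sketch below counts how many examinees fall within, above, or below the range spanned by the item difficulties, using the person and item estimates (theta, b) from the earlier TAM sketch; the counts reported in the study came from the WINSTEPS output, so this is only an illustration of the targeting check.

# Minimal sketch (R): quantify targeting and the ceiling effect visible in the Wright map.
# 'theta' (person abilities) and 'b' (item difficulties) come from the earlier TAM sketch.
item_range <- range(b)

targeting <- cut(theta,
                 breaks = c(-Inf, item_range[1], item_range[2], Inf),
                 labels = c("below all items", "within item range", "above all items"))

table(targeting)                  # counts of examinees in each region
prop.table(table(targeting))      # proportions (cf. the 75% within / roughly 25% above reported above)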
Notably, the background questionnaire results reveal that among these 42 high-ability examinees, 15 are L1 Chinese speakers and 18 are heritage Chinese speakers. Heritage speakers either were born in the United States with at least one parent from China who spoke Chinese at home or were immigrants who moved to the United States from China at a young age. Given the language backgrounds of these 33 examinees (15 L1 and 18 heritage speakers), their high test performance is not surprising. Their linguistic exposure and experiences may have provided them with an advantage on the test, resulting in higher scores. 1 Eight test takers received a perfect score on the test, and their ability measures (+ 4.96) were not plotted on the Wright map. 48 Figure 2. Wright map of the MSU Chinese placement test items. 49 Figure 3. Relationship between students’ perceived item difficulties and item difficulties computed from students’ actual test performance. The second approach I employed to address this research question involved using correlation analysis with Pearson's correlation coefficients. This analysis aimed to examine the agreement between students' and teachers' difficulty ratings, comparing these perceived difficulties with the empirical item difficulties obtained from the quantitative item analysis. Gaining insights from these different perspectives can be valuable in determining the test's suitability for making placement decisions (Embretson & Reise, 2000; Downing & Haladyna, 2006). Discrepancies between perceived and empirical item difficulties may point to issues with 50 the test items. For instance, if students or teachers perceive certain items as more challenging than the empirical analysis suggests, this could indicate that the items contain ambiguous or unclear wording, causing confusion among examinees (Haladyna, Downing, & Rodriguez, 2002). Conversely, a high degree of agreement between students' and teachers' ratings and the empirical difficulties would support the test's appropriateness for placement purposes (DeMars, 2010). If perceived and empirical item difficulties align closely, this implies that the test items function as intended, accurately measuring the targeted construct and differentiating examinees based on their abilities (Hambleton, Swaminathan, & Rogers, 1991). Such close alignment would bolster confidence in using the test for placement decisions, as the items would provide a valid and reliable representation of examinees' abilities in the target domain (AERA, APA, & NCME, 2014). To calculate Pearson's correlation coefficients, I computed the average ratings of item difficulties across students for each of the 32 items in the placement test, representing their perceived difficulties. Likewise, I calculated the average rating across teachers for each item. Figure 3 illustrates the relationship between students' perceived item difficulties and the empirical item difficulties derived from their actual test performance. The figure comprises four plots, each representing a different course level (100-level, 200-level, and 300-level) and one for the aggregated data. Each point in the scatterplots corresponds to an individual item in the test. As evident from the plots, the correlation between students' perceived item difficulties and empirical item difficulties varies across course levels. For the 100-level courses, the correlation coefficient is quite high at .841, suggesting a strong agreement between perceived and empirical item difficulties. 
For the 200-level courses, the correlation coefficient is lower at .415, indicating a weaker relationship between the two sets of item difficulties. In contrast, the correlation 51 coefficient for the 300-level courses reveals a moderate to strong relationship at .713. Examining the aggregated data, the overall correlation coefficient is .781, demonstrating a robust relationship between students' perceived item difficulties and the empirical item difficulties. These findings imply that the relationship between students' perceptions of item difficulties and the empirical item difficulties depends on the course level. However, the strong overall correlation in the aggregated data suggests that the test items generally align well with students' perceptions, supporting the test's appropriateness for making placement decisions. Figure 4. Relationship between teachers’ perceived item difficulties and item difficulties computed from students’ actual test performance. 52 Following the analysis of students' perceptions, Figure 4 presents a scatterplot depicting the relationship between teachers' perceived item difficulties and the empirical item difficulties computed from students' actual test performance. The correlation coefficient for this comparison is .507. While most items are situated along the regression line in the scatterplot, signifying agreement between teachers' perceptions and empirical item difficulties, a few items deviate from this trend. For instance, reading items #12, #13, and #14 were perceived as very easy by the teachers (with a mean value around 1.3 on the Likert scale of 1 to 6), but the empirical item difficulty is around .6, suggesting that these items are among the most challenging in the test. Conversely, listening item 6 was regarded as an easy item by the teachers, while the empirical item difficulty is .1, indicating that most students did not encounter difficulty with this item. These findings reveal some discrepancies between teachers' perceptions of item difficulties and the actual item difficulties experienced by students. As a result, the relationship between teachers' perceptions and empirical item difficulties is not as strong as one might anticipate. 53 Figure 5. Relationship between teachers’ perceived and students’ perceived item difficulties. Figure 5 presents a scatterplot illustrating the relationship between teachers' perceived and students' perceived item difficulties, with a correlation coefficient of .714. While most items display a strong alignment between students' and teachers' perceptions of item difficulty, one item notably deviates from the regression line. Listening item #6, perceived as a relatively easy item by students (with a mean value of 2.2 on the Likert scale of 1 to 6), received higher difficulty ratings from teachers (with a mean value of 3.7 on the Likert scale of 1 to 6). Generally, the findings reveal a strong relationship between teachers' and students' perceptions of item difficulty, suggesting that both groups have similar views on the test items' difficulty. This alignment supports the notion that the test items are generally suitable for assessing students' 54 abilities. However, the observed discrepancy for listening item #6 implies that there might be a difference in how teachers and students interpret or understand this particular item. Factors such as differences in instructional focus, students' familiarity with the content, or other influences could impact their respective judgments. 
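One way to compute such correlations between perceived and empirical item difficulties is sketched below. It assumes a long-format data frame ratings of student difficulty ratings (columns item, course_level, difficulty_rating) and the named vector b of empirical item difficulties from the earlier Rasch sketch, with item identifiers matched across the two sources; none of these object names come from the study itself.

# Minimal sketch (R): correlate perceived item difficulties with empirical Rasch difficulties.
# 'ratings' is an assumed long-format data frame (item, course_level, difficulty_rating);
# 'b' is a named vector of empirical item difficulties, indexed by the same item labels.
library(dplyr)

perceived <- ratings %>%
  group_by(course_level, item) %>%
  summarise(mean_rating = mean(difficulty_rating, na.rm = TRUE), .groups = "drop")

# Per-level Pearson correlations (cf. the coefficients reported for each course level)
perceived %>%
  group_by(course_level) %>%
  summarise(r = cor(mean_rating, b[item]), .groups = "drop")

# Aggregated across course levels
aggregated <- perceived %>%
  group_by(item) %>%
  summarise(mean_rating = mean(mean_rating), .groups = "drop")
cor(aggregated$mean_rating, b[aggregated$item])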
In summary, the results suggest that the test items generally provide item difficulty estimates appropriate for making placement decisions, particularly for the 100-level and 300- level courses. However, some discrepancies and weaker relationships have been observed, especially for 200-level courses, necessitating further investigation and refinement of the test items to ensure their suitability across all course levels. [RQ 2b: Evaluation inference]: Do test items exhibit no evidence of item bias? Building on the previous discussion, one key aspect to consider when examining the validity evidence for evaluation inference is the principle of invariance. According to this principle, item measures should remain invariant across different measurement contexts, meaning that item estimates (i.e., item difficulty estimates) should not depend on the subgroups of examinees responding to the item (Baker & Kim, 2017; Rasch, 1960). In this study, I investigated the extent to which item estimates are invariant across two examinee groups: female versus male examinees. Ensuring item invariance between female and male examinees is crucial to guarantee that the test items do not favor one gender over the other, thus providing evidence of fairness and impartiality in the assessment (Kunnan, 2000). I analyzed the group invariance of item measures by examining differential item functioning (DIF). Specifically, DIF is detected for an item when two groups of examinees, matched on measures of the construct (Chinese language ability in this case), have different 55 probabilities of answering the item correctly (Ferne & Rupp, 2007; Harding, 2011). If DIF is found, it suggests that the item may be biased, challenging the validity evidence for the evaluation inference. I established two criteria for detecting potential DIF relative to the item difficulty estimates based on the responses of the two groups of examinees: a) statistical significance of the Mantel-Haenszel test at the .05 level after the Benjamini-Hochberg adjustment to correct for the inflation of Type I error due to multiple comparisons; b) a difference in item difficulty of at least .5 logit, considered large enough to impact ability estimates (Linacre, 2016). Items meeting both criteria were considered to show evidence of DIF. The results revealed that while four items (listening item #9, reading items #9, #12, and #13) exhibited a difference in item difficulty larger than .5 logit (see Figure 6), none of them were significant at the predetermined alpha level, suggesting no evidence of DIF across these two examinee subgroups (For more detailed information on the DIF analysis results, please refer to Appendix 6). 56 Figure 6. Bar plot of item difference (results of DIF). 57 [RQ 2c: Evaluation inference]: Are correct options unambiguous and accurately keyed? Continuing from the previous section, it is important to emphasize that a well-constructed multiple-choice test should include effective distractors that challenge examinees, requiring them to demonstrate their language abilities to select the correct response among plausible alternatives. Therefore, examining distractors is an essential aspect of investigating the validity evidence for the evaluation inference, as it helps to determine whether the test items are functioning as intended and whether the correct options are unambiguous and accurately keyed. 
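Returning briefly to the DIF screen described above, the sketch below illustrates the two-criterion logic (a per-item Mantel-Haenszel test with a Benjamini-Hochberg correction plus a 0.5-logit contrast in item difficulty) in base R. It assumes the response matrix resp, a two-level factor gender, total scores as a coarse stratifying ability measure, and group-specific difficulty estimates b_female and b_male obtained from separate Rasch calibrations (not shown); the study's DIF results themselves came from the Rasch software, so this is an illustration only.

# Minimal sketch (R): Mantel-Haenszel DIF screen with Benjamini-Hochberg correction,
# combined with a 0.5-logit flag on group-specific item difficulties.
# Assumptions: 'resp' is the 0/1 response matrix, 'gender' a factor with two levels,
# 'b_female'/'b_male' are item difficulty vectors from separate Rasch calibrations.
total  <- rowSums(resp, na.rm = TRUE)
strata <- cut(total, breaks = quantile(total, probs = seq(0, 1, 0.25)),
              include.lowest = TRUE)                   # coarse ability strata

mh_p <- sapply(seq_len(ncol(resp)), function(j) {
  tab <- table(resp[, j], gender, strata)              # 2 x 2 x K table for item j
  mantelhaen.test(tab)$p.value
})

p_bh    <- p.adjust(mh_p, method = "BH")               # Benjamini-Hochberg adjustment
d_logit <- abs(b_female - b_male)                      # difficulty contrast in logits

dif_flag <- p_bh < .05 & d_logit > 0.5                 # both criteria, as described above
which(dif_flag)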
To address the research question, I conducted an analysis of distractors as an item quality indicator for all test items, aiming to assess the extent to which distractors for each item discriminated between examinees with different ability levels. I compared the average ability estimates of examinees who selected the distractors and the keyed option. Theoretically, distractors should attract examinees with lower ability estimates on average, compared to those who select the keyed option (Wolfe & Smith, 2007; Osterlind, 1998). Figure 7 presents the mean ability estimates of examinees who chose the keyed options and those who did not for each item. As shown, the keyed options generally attracted higher- ability examinees compared to the distractors, as demonstrated by the upward lines. However, four items exhibited lower discriminating power, as indicated by their less steep lines. These items had mean ability estimate differences between the two groups of examinees of less than 1. Given that examinee ability estimates ranged from -1.78 to 4.96, these differences may not be practically or meaningfully significant in discriminating examinees' language abilities (Downing & Haladyna, 2006). 58 Figure 7. Items flagged as having potentially problematic distractors. The functioning of the distractors in the four flagged items was further investigated. Figure 8 displays the mean ability estimates for each response option for these items. As illustrated in the graph, the mean estimates of examinees who selected the keyed option did not significantly differ from those who chose one of the distractors. This suggests that these distractors may be too similar or equally appealing to the keyed option, leading to examinees with similar ability levels choosing either option. The overlapping 95% confidence intervals of the mean ability estimates for the keyed option and the potentially problematic distractor (distractor A for Reading item #2; distractor C for Reading item #3; distractor C for Reading item #6; distractor A for Reading item #12) further confirm this observation. These findings indicate the need for a more detailed review of these items to ensure that the keyed options are clear and unambiguous and that the distractors are not too close in plausibility to the keyed 59 option, which could lead to inaccurate measurement of examinees' abilities. This point will be revisited later in this section to discuss possible improvements to the test items. Figure 8. Mean ability estimates for distractor analysis. Note: a) Error bars were 95% confidence intervals generated using non-parametric bootstrap; b) NA indicates missing data; c) there was only one participant that had missing data for Reading item #3 and no participant that had missing data for Reading item #6 60 Following the previous analysis, a closer examination of the four misfitting items revealed three potential reasons for the item misfit. These include: a) ambiguous phrasing in the item prompts, which may result in multiple correct responses; b) poorly selected distractors, leading to more than one response being justifiable as correct answers; and c) incongruent information provided in the prompt compared to the intended answer (Downing & Haladyna, 2006). These issues may have caused highly proficient test-takers to provide incorrect responses, negatively affecting the measurement of their Chinese language abilities. 
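The option-level comparison just described can be sketched as follows, assuming a data frame raw of unscored option choices (one column per item, values such as A-D), a named answer-key vector key, and the person ability estimates theta from the Rasch sketch; bootstrap confidence intervals comparable to those in Figure 8 could be added with the boot package but are omitted here for brevity.

# Minimal sketch (R): mean ability of examinees choosing each response option per item.
# 'raw' holds unscored option choices (e.g., "A"-"D"), 'key' is a named vector of keyed options,
# and 'theta' holds the person ability estimates from the earlier Rasch sketch.
library(dplyr)
library(tidyr)

raw$theta <- theta

option_means <- raw %>%
  pivot_longer(-theta, names_to = "item", values_to = "option") %>%
  filter(!is.na(option)) %>%
  group_by(item, option) %>%
  summarise(mean_theta = mean(theta, na.rm = TRUE),
            n = n(), .groups = "drop") %>%
  left_join(tibble(item = names(key), keyed = key), by = "item") %>%
  mutate(is_key = option == keyed)

# Items where a distractor attracts examinees nearly as able as those choosing the key
option_means %>%
  group_by(item) %>%
  summarise(gap = mean_theta[is_key] - max(mean_theta[!is_key], na.rm = TRUE),
            .groups = "drop") %>%
  arrange(gap)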
Given these concerns, it is essential to revise or remove these misleading items from the test, as they do not reliably assess test-takers' Chinese language abilities. Failing to address these issues could hinder meaningful score interpretations and compromise the test's appropriateness for making placement decisions. Figure 9. Reading #3. If you want to order soup, how many choices do you have? 61 Reading item #3 (refer to Appendix 7) serves as an example of an ambiguous item. In this item, examinees were asked to count the number of soup dishes on a Chinese menu. The answer key indicated four dishes, which could be deduced by counting the dishes under the soup category (tāng lèi 汤类). However, several high-scoring examinees provided a different response, suggesting there were five choices if one were to order soup (see Figure 9). Upon closely examining each dish on the menu, I discovered one dish, seafood rice noodle soup (hǎixiān mǐfěn tāng), listed under the porridge and noodle soup category (zhōu, tāngmiàn lèi 粥,汤面类). This dish could arguably be considered a soup dish. The item's ambiguity led some high-performing examinees to include a dish that was not intended as part of the correct answer. However, one might argue that for these examinees, the reading processes involved (i.e., the ability to read and comprehend each dish on the menu) are indicative of higher ability. In light of these findings, it is recommended to either remove the item or revise it as suggested in Appendix 7 to ensure clarity and accurate assessment of examinees' abilities. Figure 10. Reading item #6. Here is a message that Xiao Li sent to Lao Wang. Please answer the following questions after reading the note: At what time, should they meet? The issue of item ambiguity arose in another instance, specifically with reading item #6. The stem provided a note—an invitation to a friend—which was translated into English as follows: "I would like to invite you to come to my home for dinner at 7:00 PM the evening after tomorrow. I will be waiting for you at 6:45 PM at the Route 3 bus stop. See you the day after tomorrow." The item asked examinees to determine the time when the two individuals should 62 meet. Although the incorrect responses from several high-scoring examinees might be attributed to carelessness, varying interpretations of the invitation could also stem from cultural differences. In some cultures, guests may be expected to arrive at the scheduled dinner time. To address the ambiguity, I have provided suggested revisions in Appendix 7. Figure 11. Reading item #2. Is this a sign for? Regarding reading item #2, a thorough examination revealed that the information in the prompt did not align perfectly with the designated answer. The item assessed whether test takers could deduce the purpose of a sign based on accompanying pictures and Chinese descriptions. The answer key, 'shopping mall hours,' corresponded with the Chinese descriptions (yíngyè shíjiān, zǎo 6:00 - wǎn 10:00; 营业时间, 早 6:00-晚 11:00; operating hours, 6AM–10 PM). However, the sub-signs below (jìnzhǐ wàishí; 禁止外食; no outside food allowed) indicated that 63 the sign was more likely intended for a restaurant or movie theater, where outside food or beverages might be prohibited for health reasons, or to ensure the owner profits from food and beverage sales. In my proposed revision, I have amended the answer key to resolve this inconsistency. The problem related to problematic item distractors occurred with another misfitting item, reading item #12. 
This item tested test takers’ knowledge about the appropriate classifier used for chairs (yǐzi,椅子) in Chinese. The correction option is 把 (bǎ, classifier), while a few high-scoring test takers selected 张 (zhāng) as their response (see Figure 3). Research on Chinese classifiers indicates that multiple correct answers might exist for this question. For instance, Tai's (1994, p. 9) description of the classifier, zhāng, aligns with the responses of these high-scoring test takers: ‘[i]t is a well-known fact that Mandarin Chinese use the classifier zhāng (张, classifier) to categorize zhǐ ‘paper’, zhuōzi ‘table’, and chuáng ‘bed’. For many native speakers of Mandarin, the category of zhāng extends to cover yǐzi ‘chair’ and dèngzi ‘bench’, since they all have a flat surface like tables, the central member among the class of furniture categorized by zhāng.’ Furthermore, I investigated the association between yǐzi (chair) and its corresponding classifier in the Modern Chinese Corpus (现代汉语语料库, see the link: http://corpus.zhonghuayuwen.org). The findings indicated that the co-occurrence frequency of bǎ (classifier) and yǐzi was 45, while the frequency for zhāng (classifier) and yǐzi was 17. This suggests that both classifiers are actively and frequently used by native Chinese speakers, albeit to different extents. Consequently, the inclusion of zhāng as a distractor in the item was deemed inappropriate for placement purposes, as it does not effectively differentiate between test takers 64 with higher and lower proficiency levels. As a result, I recommend replacing this distractor with a new classifier, as illustrated in Appendix 7. [RQ 3a: Generalization inference]: Does the MSU Chinese placement test produce scores that are internally consistent? Examining the internal consistency of test scores is crucial for supporting the generalizability inference, as it demonstrates the test items' reliable measurement of the intended construct—in this case, Chinese language proficiency. High internal consistency establishes confidence in the stability and accuracy of test scores across diverse settings and student groups. The MSU Chinese placement test's internal consistency was analyzed by calculating Cronbach's α, yielding a value of 0.88 (95% CI: [0.86, 0.90]), indicating strong internal consistency. An item analysis was also conducted to assess the impact of individual items on overall internal consistency (see Table 9). The results showed that removing certain items would lead to a slight decrease in Cronbach's α from 0.88 to 0.87, while for others, the α value would remain unchanged at 0.88. These findings suggest that the test items collectively measure the same underlying construct, with no individual item significantly affecting overall internal consistency. In summary, the results of this study provide strong evidence supporting the internal consistency of the MSU Chinese placement test scores. 65 Table 9. Cronbach's α if item dropped Item Cronbach’s α Item Cronbach’s α Item Cronbach’s α Item Cronbach’s α L01 0.87 L09 0.88 R03 0.88 R11 0.88 L02 0.88 L10 0.87 R04 0.88 R12 0.88 L03 0.87 L11 0.87 R05 0.88 R13 0.88 L04 0.88 L12 0.88 R06 0.88 R14 0.87 L05 0.88 L13 0.87 R07 0.88 R15 0.87 L06 0.88 L14 0.88 R08 0.88 R16 0.87 L07 0.88 R01 0.88 R09 0.88 R17 0.88 L08 0.88 R02 0.88 R10 0.88 R18 0.88 [RQ 3b: Generalization inference]: Are there adequate items to reliably differentiate students’ abilities into three levels as intended? 
Another key aspect of evaluating the generalization inference of a test involves assessing the test's ability to differentiate between students' abilities at different levels. Specifically, if a test is intended to place students into multiple ability levels (such as beginner, intermediate, and advanced), it is crucial to ensure that the test contains enough items that are appropriately calibrated to accurately differentiate between students' abilities and assign them to the correct level (Bachman & Palmer, 2010). Without a sufficient number of items that reliably differentiate students' abilities, there may be a risk that students are placed in incorrect levels, leading to inaccurate or inconsistent results. Such misplacement can significantly impact students' language learning experiences, as they may be placed in courses that are either too easy or too challenging for their actual abilities (Alderson, 2005). To investigate this inference, the Rasch measurement model is utilized, providing person reliability and person separation indices as reliability estimates. These indices determine if the test sufficiently discriminates the sample into the intended levels. Low person reliability or separation suggests that the instrument may not 66 effectively distinguish between high and low performers. Linacre (2012) offered guidelines for interpreting person reliability: 0.9 corresponds to 3 or 4 levels, 0.8 to 2 or 3 levels, and 0.5 to 1 or 2 levels. Regarding person separation indices, 1.50 represents an acceptable level of separation, 2.00 a good level, and 3.00 an excellent level (Wright & Masters, 1982; Fisher, 1992, as cited in Duncan et al., 2003). The MSU Chinese placement test's reliability indices were as follows: person reliability = .84; person separation index = 2.32. These results indicate that there were enough examinees and items to precisely locate examinees' abilities on the underlying trait continuum (i.e., Chinese language proficiency) and confirm the hierarchy of examinees' abilities. The placement test effectively discriminated examinees into three levels, as intended by the test developers, given the person reliability value. [RQ 4a: Explanation inference]: Does students’ test performance vary according to the amount and quality of prior Chinese learning experience? To address the research question, I analyzed students' performance at the beginning (Time 1) and end (Time 2) of the semester, as well as across different course levels. This analysis enables a deeper understanding of the Chinese placement test's sensitivity to students' prior learning experiences, thereby providing crucial validity evidence for the test. Descriptive statistics were calculated for students' total, listening, and reading scores on the Chinese placement test at both time points during the Spring semester 2022. These results are presented in Table 10. 67 Table 10. Descriptive statistics of students' placement test scores Time 1 Time 2 Section Mean SD 95% CI Mean SD 95% CI Listening 8.5 3 [7.3, 9.7] 9.9 2.3 [9, 10.8] Reading 11.8 3.2 [10.5, 13] 13.8 2.4 [12.8, 14.7] Total 20.2 5.7 [18, 22.5] 23.6 4 [22.1, 25.2] Table 11. 
Descriptive statistics of students' placement test scores by course level Time 1 Time 2 100-level Mean SD 95% CI Mean SD 95% CI Listening 6.9 2.7 [5.3, 8.4] 9.3 2.2 [8, 10.5] Reading 9.4 2.1 [8.2, 10.6] 12.9 1.3 [12.1, 13.6] Total 16.3 4.2 [13.9, 18.7] 22.1 2.7 [20.6, 23.7] 200-level Mean SD 95% CI Mean SD 95% CI Listening 10 3 [7.5, 12.5] 10.6 2.4 [8.6, 12.6] Reading 13.4 2.9 [11, 15.8] 14.1 3.6 [11.1, 17.2] Total 23.4 5 [19.2, 27.6] 24.8 5.5 [20.2, 29.3] 300-level Mean SD 95% CI Mean SD 95% CI Listening 10.3 2 [8.3, 12.4] 10.3 2.3 [7.9, 12.8] Reading 15 0.9 [14.1, 15.9] 15.3 1.2 [14.1, 16.6] Total 25.3 2.7 [22.5, 28.1] 25.7 3.2 [22.3, 29] The data reveals that, on average, students' listening, reading, and total scores on the Chinese placement test improved from the beginning to the end of the semester. To further investigate the impact of course level on test scores, descriptive statistics for students' total, listening, and reading scores were calculated by course level and presented in Table 11. Additionally, boxplots and summary statistics were utilized to display the distribution of test 68 scores, mean scores, and 95% confidence intervals for the mean in error bars (Figure 12: total scores; Figure 13: listening scores; Figure 14: reading scores). The results demonstrate that students at different course levels exhibited varying degrees of improvement in their listening, reading, and total scores on the Chinese placement test throughout the semester. Higher-level students generally achieved better scores than lower-level students. The most significant gains were observed among the 100-level students, followed by the 200-level students, while the least change in scores occurred for the 300-level students. This pattern was evident at both the beginning and the end of the semester. Figure 12. Boxplots for total test scores by course level and test time. 69 Figure 13. Boxplots for listening test scores by course level and test time. Figure 14. Boxplots reading test scores by course level and test time. To determine if the observed total score changes across the semester and course levels were statistically significant, I conducted a repeated-measures 2 x 3 Analysis of Variance 70 (ANOVA). Students' total scores served as the dependent variable, time as the within-subject independent variable, and course level as the between-subject variable. I reported the Wald-type test statistics (WTS) and ANOVA-type statistics (ATS) calculated by the R package MANOVA.RM (Friedrich, Konietschke, & Pauly, 2019a) to address the small sample issue and the violation of the homogeneity of variance. These two statistics were determined using nonparametric methods that employ resampling techniques for approximating the sampling distribution, allowing for their application even in small sample sizes. These methods are suitable in the Behrens-Fisher situation, where equal covariance matrices across groups are not assumed (Friedrich, Konietschke, & Pauly, 2019b). Table 12 summarizes the results of the ANOVA. The results show significant effects of course level and time, as well as a significant interaction between course level and time, on students' test scores. I, therefore, performed a series of post-hoc tests to identify where the significant score differences lie among the different course levels. I used the Bonferroni test to adjust for the Type I error rate. 
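The omnibus analysis just described can be set up roughly as below. It assumes a long-format data frame long with one row per student per test time (columns id, course_level, time, total); the call follows the RM() interface documented for the MANOVA.RM package, but the exact argument names and defaults should be checked against the installed package version, and the resampling settings shown are illustrative rather than the settings used in the study.

# Minimal sketch (R): WTS and ATS for a 2 (time) x 3 (course level) repeated-measures design
# with the MANOVA.RM package named in the text. 'long' is an assumed long-format data frame
# with columns id (subject), course_level (between-subjects), time (within-subjects), total.
library(MANOVA.RM)

fit_rm <- RM(total ~ course_level * time,
             data = long,
             subject = "id",
             within  = "time",          # marks the within-subjects factor
             iter = 10000,              # number of resampling iterations (illustrative)
             resampling = "WildBS")     # wild bootstrap; one of the documented options

summary(fit_rm)                         # reports Wald-type (WTS) and ANOVA-type (ATS) statistics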
Table 13 presents the significant results of pairwise post-hoc comparisons between different time points and course levels, along with their corresponding Cohen's d effect sizes (please see Appendix 8 for the results of all comparisons). Here's a brief summary of the key findings: 1. For 100-level students, there was a significant increase in the total test scores from Time 1 to Time 2 (t(25) = 5.9, p < .001), with a large effect size (Cohen's d = 1.67). 2. At Time 1, both 200-level and 300-level students had significantly higher scores compared to 100-level students, with large effect sizes (200-level: t(35.7) = 7.1, p = .005, Cohen's d = 1.68; 300-level: t(35.7) = 9, p < .001, Cohen's d = 2.13). 71 3. At Time 2, both 200-level and 300-level students also had significantly higher scores compared to 100-level students, with large effect sizes (200-level: t(35.7) = 8.5, p < .001, Cohen's d = 2.01; 300-level: t(35.7) = 9.4, p < .001, Cohen's d = 2.22). 4. There were no significant differences between the other comparisons, as indicated by p- values of 1. However, it is worth noting that the effect sizes for these non-significant comparisons were generally small to medium, ranging from 0.08 to 0.62. From these results, I can conclude that there was a significant improvement in scores for 100-level students from Time 1 to Time 2, with a large effect size. Additionally, 200-level and 300-level students consistently demonstrated higher scores compared to 100-level students at both time points, with large effect sizes. However, there were no significant differences in scores between 200-level and 300-level students or within these course levels over time, and the effect sizes for these comparisons were generally small to medium. Table 12. Summary of the results of ANOVA (WTS and ATS reported) Wald-type test statistics (WTS) χ2 value df p-value resampling Course level 28.04 2 .002 Time 12.42 1 .004 Course level x Time 14.49 2 .01 ANOVA-type statistics (ATS) F value df1 df2 p-value Course level 7.02 1.57 19.52 .008 Time 12.42 1 INF <.001 Course level x Time 5.61 1.76 INF .005 72 Table 13. Significant post-hoc pair-wise comparisons results (total scores) Contrast t-value SE df p-value Cohen’s d Time 2 100-level - Time 1 100-level 5.9 0.92 25 <.001 1.67 Time 1 200-level - Time 1 100-level 7.1 1.77 35.7 .005 1.68 Time 2 200-level - Time 1 100-level 8.5 1.77 35.7 <.001 2.01 Time 1 300-level - Time 1 100-level 9 1.95 35.7 <.001 2.13 Time 2 300-level - Time 1 100-level 9.4 1.95 35.7 <.001 2.22 Next, I performed the repeated-measures multivariate analysis of variance (MANOVA) to investigate the effects of time and course level on students' listening and reading scores. A repeated-measures MANOVA was chosen for this analysis because it allows for the examination of the within-subject effects of time on multiple dependent variables (listening and reading scores) while accounting for the correlation between repeated measurements on the same student. Additionally, by including course level as a between-subjects factor, this method can assess the influence of different course levels on the listening and reading scores and identify any interactions between time and course level that may exist. I reported the Wald-type test statistics (WTS) and modified ANOVA-type statistics (MATS) calculated by the R package MANOVA.RM for similar reasons noted above. Table 14 summarizes the results of the MANOVA. 
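For the pairwise contrasts reported above, one way to compute a single comparison and its effect size is sketched below for the Time 1 versus Time 2 total scores of the 100-level group, assuming the same long-format data frame long with time coded as "Time1"/"Time2"; Cohen's d is computed here as the mean paired difference divided by its standard deviation, one common convention, so values may differ slightly from other d formulations, and the full set of contrasts would be Bonferroni-adjusted (e.g., via p.adjust with method = "bonferroni").

# Minimal sketch (R): one paired post-hoc contrast (100-level, Time 1 vs. Time 2) with Cohen's d.
# 'long' is the assumed long-format data frame from the previous sketch, with time coded
# "Time1"/"Time2"; column names are illustrative, not taken from the study's data files.
lvl100 <- subset(long, course_level == "100-level")
wide   <- reshape(lvl100[, c("id", "time", "total")],
                  idvar = "id", timevar = "time", direction = "wide")

diff_scores <- wide$total.Time2 - wide$total.Time1     # paired differences

t.test(wide$total.Time2, wide$total.Time1, paired = TRUE)
mean(diff_scores) / sd(diff_scores)                    # Cohen's d for paired data (one convention)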
The results showed significant main effects of time and course level, indicating that students' scores improved from the beginning to the end of the semester and that higher-level students generally performed better. Additionally, a significant interaction effect between time and course level was found, suggesting varying degrees of improvement among students from different course levels. To further explore these effects, a series of post-hoc tests were conducted to identify significant listening and reading score differences among the course levels, using the Bonferroni adjustment to control the Type I error rate.

Table 15 presents the significant results of pairwise post-hoc comparisons for listening scores between different time points and course levels, along with their corresponding Cohen's d effect sizes (please see Appendix 9 for the results of all comparisons). The key findings can be summarized as follows:

1. A significant improvement in scores (p = .03) from Time 1 to Time 2 was observed for 100-level students, with a moderate effect size (Cohen's d = 0.68).
2. A significant difference in scores (p = .02) was found between Time 2 for 200-level students and Time 1 for 100-level students, with a large effect size (Cohen's d = 0.83).
3. The remaining pairwise comparisons in the table were not significant and showed small effect sizes, indicating that the differences between those specific groups and time points were neither statistically significant nor practically meaningful.

As for the reading scores, the post-hoc results are presented in Table 16 (see Appendix 10 for the results of all comparisons). The main results are summarized below:

1. Improvement over time: 100-level students demonstrated a significant improvement in scores from Time 1 to Time 2 (t(25) = 3.4, p < .001), with a large effect size (Cohen's d = 0.96).
2. Comparisons between course levels at Time 1 and Time 2: Both 200-level and 300-level students consistently scored significantly higher than 100-level students across Time 1 and Time 2, with large effect sizes (Cohen's d ranging from 0.83 to 1.25).
3. The remaining comparisons in the table were not statistically significant (p > .05).

Table 14. Summary of the results of MANOVA (WTS and MATS reported)

Wald-type test statistics (WTS)
                        χ2 value    df     p-value (resampling)
Course level            104         4      < .001
Time                    13.58       2      .011
Course level x Time     20.86       4      .01

Modified ANOVA-type statistics (MATS)
                        F value     df     p-value (resampling)
Course level            93.16       –      < .001
Time                    8.02        –      .008
Course level x Time     15.17       –      .004

Note: For the MATS, degrees of freedom are not reported because inference is based only on resampling (Friedrich et al., 2019b, p. 391).

Table 15. Significant post-hoc, pair-wise comparisons results (listening scores)

Contrast                                   t-value   SE     df   p-value   Cohen's d
Time 2 100-level - Time 1 100-level        2.4       0.7    25   .03       0.68
Time 2 200-level - Time 1 100-level        3.8       1.09   42   .02       0.83
Table 16. Significant post-hoc, pair-wise comparisons results (reading scores)

Contrast                                   t-value   SE     df     p-value   Cohen's d
Time 2 100-level - Time 1 100-level        3.4       0.66   25     <.001     0.96
Time 1 200-level - Time 1 100-level        3.9       0.97   44.4   .003      0.83
Time 2 200-level - Time 1 100-level        4.7       0.97   44.4   <.001     1
Time 1 300-level - Time 1 100-level        5.6       1.07   44.4   <.001     1.19
Time 2 300-level - Time 1 100-level        5.9       1.07   44.4   <.001     1.25

In addressing the research question, which seeks to determine whether students' test performance varies according to the amount and quality of prior Chinese learning experience, the analysis of students' performance (total scores, listening scores, and reading scores) at the beginning (Time 1) and end (Time 2) of the semester and across different course levels yielded several key findings. The results provide evidence supporting the explanation inference of the validity argument. First, there was a significant improvement in test scores for 100-level students from Time 1 to Time 2, indicating that students' test performance improved as they gained more experience in learning Chinese. However, it is important to note that for certain course levels, no significant difference was observed between Time 1 and Time 2 performance. Second, at both Time 1 and Time 2, 200-level and 300-level students consistently scored significantly higher than 100-level students, with large effect sizes. This suggests that students with more advanced learning experience in Chinese demonstrated better test performance than those with less experience. Interestingly, there was no significant difference observed between 200-level and 300-level students, implying that their test performance was relatively similar. Overall, the findings support the notion that students' test performance varies according to the amount and quality of their prior Chinese learning experience. The results lend validity evidence to the explanation inference, as the observed differences in test performance can be attributed to the variation in students' prior learning experiences.

[RQ 4b: Explanation inference]: Do students' test scores support the internal structure of the intended construct?

Investigating the internal structure of a language placement test is critical for establishing the test's effectiveness and identifying areas for improvement. By analyzing the factor structure of the test items, I can determine if the test scores align with the intended construct, in this case, Chinese language proficiency. A strong alignment between the test items and the underlying construct offers evidence that the test is a valid measure of language proficiency (In'nami & Koizumi, 2016). Conversely, if the internal structure reveals inconsistencies or an inadequate representation of the construct, it can highlight areas that require revision to better assess the intended language skills. I conducted an exploratory factor analysis (EFA) to examine the factor structure characterizing the 32 items in the placement test. The purpose of the analysis was to examine whether the internal structure of the scores collected via the placement test is consistent with a theoretical view of language proficiency. As noted earlier, I binary-scored students' responses as 1 (correct) or 0 (incorrect) for the multiple-choice items and treated the item responses as categorical. The pattern of eigenvalues supported a two-factor model, as shown in the scree plot in Figure 15.
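The general analytic steps can be illustrated in R with the psych package; the data below are a simulated, hypothetical stand-in for the 32 binary-scored items, and the original analysis may have been carried out in different software.

```r
# Minimal sketch, not the original analysis. 'responses' simulates 32 binary-
# scored items (1 = correct, 0 = incorrect) for 200 hypothetical examinees.
library(psych)

set.seed(123)
responses <- as.data.frame(matrix(rbinom(200 * 32, size = 1, prob = 0.6), ncol = 32))

# Tetrachoric correlations are appropriate for dichotomous item responses.
tet <- tetrachoric(responses)$rho

# Eigenvalues / scree plot to judge the number of factors (cf. Figures 15-16).
scree(tet, factors = TRUE, pc = FALSE)

# One- and two-factor solutions; the two-factor model uses a Promax rotation.
# Fit indices such as the TLI and RMSEA appear in the fa() output.
efa_1f <- fa(tet, nfactors = 1, n.obs = 200)
efa_2f <- fa(tet, nfactors = 2, n.obs = 200, rotate = "Promax")
print(efa_2f$loadings, cutoff = 0.3)
```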
The fit statistics further supported the two-factor solution (TLI = .89, root-mean-square error of approximation [RMSEA] = .05), which was found to be superior to the one-factor solution (TLI = .73, RMSEA = .07). A close examination of the Promax-rotated factor loadings indicated that the second factor was almost entirely driven by seven relatively easy items (at least 85% of students answered these items correctly). Additionally, five of the items were related to the same prompt (reading items #7-11), indicating that they may be measuring similar language skills. I bundled the five items into a polytomous super-item and re-conducted the EFA. Collapsing these five items led to a substantially smaller second eigenvalue (from 2.46 to 1.86, see Figure 16) and to a better fit for both the one-factor (TLI = .87, RMSEA = .05) and the two-factor solution (TLI = .92, RMSEA = .04). Although the fit statistics suggest stronger support for the two-factor model, it is important to note that the second factor seems to be mainly driven by relatively easy items and may not necessarily represent a separate dimension of language proficiency. Overall, the results of the exploratory factor analysis suggest that the 32-item placement test does appear to measure language proficiency as intended.

Figure 15. Scree plot for exploratory factor analysis (all 32 items).

Figure 16. Scree plot for exploratory factor analysis (after collapsing the five items related to the same prompt).

[RQ 5: Extrapolation inference]: Do students' test scores support the relationship between their performance on the test and other indicators of Chinese language proficiency?

Establishing the extrapolation inference is a critical aspect of the validity argument, as it contributes to understanding the degree to which the Chinese placement test scores can be applied to other indicators of Chinese language proficiency. The investigation of the extrapolation inference seeks to demonstrate that the test scores not only reflect the students' performance on this particular test but also accurately represent their overall language proficiency.

Correlational analyses were employed to address the research question related to the extrapolation inference. This analytical approach facilitates the assessment of the strength and direction of the relationship between two variables. In this study, these variables encompass the students' placement test scores and their performance on the ACTFL tests. By scrutinizing these relationships, the research seeks to examine the extent to which the placement test serves as a valid measure of Chinese language proficiency, consistent with other well-established indicators such as the ACTFL tests. The rationale for expecting performances on similar skills (e.g., listening) to exhibit a strong correlation with one another, while performances on different skills (e.g., listening versus speaking) display a weaker correlation, lies in the assumption that proficiency in one language skill should be closely related to proficiency in a similar skill. Consequently, if the placement test accurately reflects Chinese language proficiency, it should demonstrate a stronger relationship with corresponding skills (listening and reading) on the ACTFL tests, while a weaker relationship should be observed with non-corresponding skills (e.g., speaking). In this case, the observed relationships can be considered supportive evidence for the validity of the Chinese placement test.
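As a small illustration of the correlational step, a polyserial coefficient between a continuous placement score and an ordinal ACTFL rating can be computed in R with the polycor package; the variables below are simulated, hypothetical stand-ins for the actual scores.

```r
# Illustrative sketch with simulated, hypothetical data: a continuous placement
# listening score and an ordinal ACTFL listening rating for 60 examinees.
library(polycor)

set.seed(123)
placement_listening <- rnorm(60, mean = 9, sd = 3)
actfl_listening     <- ordered(cut(placement_listening + rnorm(60, sd = 2), breaks = 5))

# Polyserial correlation between the continuous score and the ordinal rating.
polyserial(placement_listening, actfl_listening)
```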
Table 17 displays the number of students with available ACTFL scores for each test, as well as the descriptive statistics for these ACTFL ratings and their corresponding placement test scores. It can be observed from the table that more students took the OPIc test than the RPT and LPT tests. This was expected, as the speaking tests were conducted during class time with their teacher in attendance, making participation more convenient. In contrast, students had to visit the language lab independently to take the RPT and LPT tests, resulting in lower participation despite the incentive of extra credit (Winke & Ma, 2019).

Table 17. ACTFL scores and placement test results for students

With available OPIc scores (N = 54)
                        M (SD)           95% CI
  ACTFL score           4.11 (1.53)      [3.69, 4.53]
  Placement total       21.74 (5.84)     [20.15, 23.34]
  Placement listening   8.96 (3.25)      [8.08, 9.85]
  Placement reading     12.78 (3.42)     [11.84, 13.71]

With available LPT scores (N = 46)
                        M (SD)           95% CI
  ACTFL score           2.85 (1.70)      [2.34, 3.35]
  Placement total       22.41 (5.96)     [20.64, 24.18]
  Placement listening   9.33 (3.24)      [8.36, 10.29]
  Placement reading     13.09 (3.51)     [12.04, 14.13]

With available RPT scores (N = 43)
                        M (SD)           95% CI
  ACTFL score           2.93 (1.82)      [2.37, 3.49]
  Placement total       22.23 (6.08)     [20.36, 24.10]
  Placement listening   9.19 (3.28)      [8.18, 10.19]
  Placement reading     13.05 (3.57)     [11.95, 14.14]

Note: OPIc = the computerized oral proficiency test; LPT = Listening Proficiency Test; RPT = Reading Proficiency Test.

Figures 17-19 are scatterplots displaying the relationship between students' ACTFL scores (listening, reading, and speaking) and their performance on the placement test, with separate plots for listening, reading, and total scores. The polyserial correlation coefficients indicate the strength of the relationship between each ACTFL score and each of the three placement test score components. Upon reviewing the figures and the polyserial correlation coefficients, the following observations can be made:

1. ACTFL Listening: ACTFL listening scores correlated strongly with placement test total scores (0.783) and listening scores (0.769), and more weakly with reading scores (0.623). This suggests that students with higher listening proficiency on the ACTFL test perform better on the listening and overall components of the placement test, with a relatively weaker relationship observed for reading scores.
2. ACTFL Reading: The correlation coefficients for ACTFL reading scores reveal moderately strong relationships with placement test reading scores (0.706), total scores (0.756), and listening scores (0.637). This indicates that students with higher reading proficiency on the ACTFL test perform better on the reading and overall components of the placement test, with a relatively weaker relationship observed for listening scores.
3. ACTFL Speaking: The correlation coefficients for ACTFL speaking scores show moderately strong relationships with placement test listening scores (0.63) and total scores (0.598), and a weaker relationship with reading scores (0.47). This suggests that students with higher speaking proficiency on the ACTFL test perform better on the listening and overall components of the placement test, with the weakest relationship observed for reading scores.

The varying correlation coefficients between ACTFL scores and placement test scores may be attributed to the distinct nature of the language skills being compared. Generally, stronger relationships are observed between corresponding skills (e.g., ACTFL listening vs. placement test listening), as these skills share many underlying linguistic competencies.
In contrast, weaker relationships are observed, as expected, between non-corresponding skills (e.g., ACTFL speaking vs. placement test reading), which might be due to differences in the specific linguistic and cognitive processes involved in each skill. These findings contribute to establishing the extrapolation inference by demonstrating that the Chinese placement test scores not only represent students' performance on the test itself but also show meaningful relationships with other indicators of Chinese language proficiency. This evidence supports the argument that placement test scores can be used to make inferences about students' broader language skills and proficiencies beyond the test context.

Figure 17. Scatterplots of ACTFL LPT scores and placement test scores.

Figure 18. Scatterplots of ACTFL RPT scores and placement test scores.

Figure 19. Scatterplots of ACTFL OPIc scores and placement test scores.

[RQ 6a: Utilization inference]: From the perspective of course instructors, are students placed into appropriate course levels?

An important aspect of the validity argument is to examine the utilization inference, which provides insight into whether students are placed in course levels that align with the expectations of test stakeholders. This is crucial because appropriate placement ensures that students have the best learning experience and can achieve their full potential in Chinese language courses. By analyzing instructors' perspectives, I can gain valuable insights into the effectiveness of the placement test in reflecting the students' language proficiency levels. To address this research question, both questionnaires and interviews were employed to collect data from three Chinese course instructors. The questionnaire consisted of several questions that required instructors to rate various aspects of the placement process on a 6-point scale (1 = never true; 2 = usually not true; 3 = rarely true; 4 = occasionally true; 5 = usually true; 6 = always true). The questions pertained to the accuracy of student placement and whether the course was too easy or too difficult for some students. The results from the questionnaire, summarized in Table 18, indicate that the instructors generally believed that students were accurately placed in the courses according to their prior Chinese knowledge and language proficiency, as they all responded with a rating of 5 for all the courses (CHS101/102, CHS201/202, and CHS301/302). However, when it came to the questions about the difficulty of the course for some students, there were mixed responses, with some instructors rating the course as being too easy or too difficult for certain students. These findings suggest that, overall, the Chinese placement test is effective in placing students into appropriate course levels, as reflected by the instructors' perspectives. However, there may still be room for improvement in terms of better tailoring the course difficulty to the needs of individual students.
Table 18. Instructor ratings on student placement and course difficulty

Course level   Question                                                              A   B   C
CHS101/102     Accurate placement given prior knowledge and language proficiency     5   5   5
               Class too easy for some students                                      6   4   6
               Class too difficult for some students                                 6   4   6
CHS201/202     Accurate placement given prior knowledge and language proficiency     5   5   5
               Class too easy for some students                                      6   4   6
               Class too difficult for some students                                 6   4   6
CHS301/302     Accurate placement given prior knowledge and language proficiency     5   5   5
               Class too easy for some students                                      6   2   6
               Class too difficult for some students                                 6   4   6

Note: A, B, and C denote the three instructors. The questions were rated using a Likert scale ranging from 1 to 6, where 1 = never true, 2 = usually not true, 3 = rarely true, 4 = occasionally true, 5 = usually true, and 6 = always true.

In the interviews, instructors further discussed the issue of mismatched course difficulty for some students. They explained that this problem is more prevalent in higher-level courses, primarily because of the diverse range of students' abilities. One instructor mentioned that students might reach higher-level courses without possessing the necessary skills, not solely due to the placement test, but also because of how students progress through the course levels:

This issue [that the class may pose a challenge for certain students] is common, particularly in recent years, due to the diverse range of students' abilities. In the first grade, it's less of a problem as most students have a relatively low skill level. However, by the third grade, there's a wide range of abilities. In the fourth-grade classes I've seen, some students' skills are at the second or third-grade level, while others are at the fourth-grade level. It may seem surprising that students reach higher levels without having the necessary skills. This issue extends beyond the placement test. For instance, a student may have completed first, second, and third grades at our school, just barely passing with 60% marks, and then advanced to the next grade. Unfortunately, their language proficiency and actual understanding of the material remain quite unsatisfactory (instructor #3).

Another instructor provided similar comments, emphasizing that the challenges faced by certain students in Chinese classes are not necessarily due to errors in placement tests resulting in misplacement:

"Often, the issue lies with the students themselves. For instance, a student may begin a course like 101, which should be relatively easy to master, but only achieve a 70% passing grade, indicating poor mastery. Despite this, MSU does not prevent students from advancing to 102 based solely on their 70% grade. This reveals the problem: students haven't mastered the content of 101, yet they are allowed to progress. Similarly, if they achieve only 60% or 65% in 102, even worse than their performance in 101, they can still advance to 201. Consequently, students' performance declines with each grade level, leaving the weakest students consistently lagging behind. These struggling students can be observed in courses 101, 102, 201, 202, and 301. Additionally, given the high number of credits associated with Chinese language courses (101 to 202 are five credits each, and 301 and 302 are four credits each) and the expensive tuition fees, instructors may feel compassionate towards their students. They attempt to provide as much help as possible, such as awarding extra points to students who revise for each exam.
Without these additional support measures, many students might be at risk of failing or barely passing (instructor #1).” When asked about the effectiveness of the placement test, all instructors agreed that, in most cases, the test successfully placed students into appropriate levels, as evidenced by the following excerpts from the interview: Researcher: Is there a difference in the language abilities of students who are placed into 201 through the placement test compared to those who progress from 101 and 102? Or is the placement test effective in identifying their levels, resulting in both groups of students being quite similar? Instructor #3: In my opinion, both groups of students are quite similar. Interestingly, regardless of whether they have reached the 201 level at MSU or through other means before enrolling in the Chinese language courses at MSU, they tend to make the same mistakes. We need to constantly remind them of areas where they are prone to making errors, such as the use of "be" verbs in English and Chinese, which are actually different. Even in the fourth year, some students still make this mistake. Overall, I think the placement test generally has a good accuracy rate, particularly at the beginning and intermediate levels. However, there were some cases where the placement test was considered less effective: 88 “The placement test may not be as reliable for specific student groups, such as those who have learned through family connections or cultural exposure and therefore have not followed a traditional textbook-based curriculum. Heritage learners acquire language skills differently, which can lead to the placement test inaccurately reflecting their proficiency level. (instructor #2)” When the placement test scores are not accurate enough, teachers may need to resort to in-person interviews as an additional means of assessment for the next step of evaluation. In other words, interviews can be used as an alternative method to evaluate the proficiency level of students if the placement test results are not sufficient. Taken together, the findings offer support for the utilization inference of the validity argument, as most instructors agreed that the placement test effectively placed students into appropriate course levels. However, there are instances where the test may not have been as effective for specific student groups (heritage learners) or when considering the course difficulty for individual students. This point will be further explored and addressed in the discussion section. [RQ 6b: Utilization inference]: From the perspective of students, are they placed into appropriate course levels? In addition to the insights provided by course instructors, the inclusion of students' voices allows for a more nuanced understanding of the placement process's effectiveness. Students may offer unique perspectives on the placement process, highlighting aspects that instructors may overlook. By sharing their experiences and providing feedback, students can offer valuable information on whether the course levels align with their language proficiency and learning needs. To address the research question, I analyzed students' interview responses and 89 questionnaire data. In the questionnaire, all students were asked about whether they took the Michigan State University Chinese placement test before enrolling in their first Chinese language course. However, only students who responded affirmatively were asked the following questions (see the questions in Appendix 2): 1. 
The advised Chinese course based on the test result.
2. Their GPA in the assigned course.
3. A rating of the student's preparedness for the course on a 6-point scale (1 = unsatisfactory; 2 = needs improvement; 3 = slightly below expectations; 4 = meets expectations; 5 = exceeds expectations; 6 = outstanding).
4. A rating of the student's overall course performance on a 6-point scale (1 = unsatisfactory; 2 = needs improvement; 3 = slightly below expectations; 4 = meets expectations; 5 = exceeds expectations; 6 = outstanding).

These questions are crucial for answering the research question as they provide insights into various aspects of students' experiences with the placement process and their subsequent performance in the assigned courses. The first question helps determine the alignment of students' placement test scores with the assigned course levels. The second question focuses on students' academic performance in the assigned courses, serving as an indicator of the appropriateness of the course levels. The third question captures students' perceptions of their preparedness for the course, revealing any potential gaps or mismatches between their prior knowledge and course expectations. Lastly, the fourth question allows students to evaluate their overall performance in the course, reflecting their engagement, effort, and success in the learning process. By analyzing the data collected from these four questions, the study can assess the effectiveness of the placement test in assigning students to appropriate course levels, taking into account their language proficiency, preparedness, and academic performance.

Table 19 summarizes students' responses to the four questions, revealing several key findings. A majority of the students achieved high GPAs in their assigned courses, with 89% in the 100-level courses and 75% in the 200-level courses having a GPA of 3.5 or higher, indicating academic success and appropriate course placement. Additionally, students generally felt well-prepared for their courses, as shown by the high mean preparedness ratings across all course levels. Lastly, students also reported positive overall course performance, with high mean performance ratings in each course level.

Table 19. Summary of student placement outcomes and perceptions in Chinese language courses

                       GPA - N (%)                 Preparedness               Performance
Course level    N      3.5+        3.0+            Mean (SD)    Range         Mean (SD)    Range
100             9      8 (89%)     1 (11%)         4.89 (.93)   [4, 6]        5.44 (.88)   [4, 6]
200             8      6 (75%)     2 (25%)         4.25 (.89)   [3, 6]        4.62 (.92)   [3, 6]

Upon reviewing Table 19, it was evident that although most students felt well-prepared and performed well in their assigned courses, the minimum rating of 3 in Table 19 for both preparedness and overall performance indicated that some students rated these aspects as slightly below expectations. To gain a deeper understanding of the challenges faced by these students and to identify any potential limitations of the placement test, examining the interview data of a student who provided a lower rating was essential. This particular student's experience highlights some issues that may not have been captured by the average ratings. Initially having learned traditional Chinese, she faced difficulties with the placement test due to the use of simplified Chinese, which led to stress and uncertainty about her placement in Chinese 201. After attending a few classes, she found the workload and the transition to simplified Chinese challenging.
This prompted her to consult the Chinese supervisor, who assessed her skills and recommended Chinese 102. However, since the class was only offered in the spring, she opted for Chinese 101, which she found more comfortable and better suited to her needs: "I first learned traditional Chinese... So coming in and taking the placement exam was a little bit difficult for me... I actually changed to Chinese 101 after attending two or three classes... I think it kind of like made me a little bit nervous because of like the amount of classwork there was and I still wasn't really comfortable with simplified Chinese... I met with the Chinese program supervisor... she's like, I think 102 would probably be the best, but because they only offer that in the spring. I decided to just like let's just do 101... I feel way more comfortable in Chinese 101 compared to 201." (student #1, CHS101) This student's experience underlines the importance of considering individual learner backgrounds and the possible discrepancies between traditional and simplified Chinese when evaluating the placement test's effectiveness. It also emphasizes the value of communication and collaboration between students and program supervisors to ensure appropriate course placement, especially when students encounter challenges that the placement test may not fully capture. In summary, the study results indicate that the Chinese placement test generally assigns students to suitable course levels, as evidenced by the high GPAs, preparedness ratings, and performance ratings. However, some students may find their placement below expectations. A student with a traditional Chinese background encountered difficulties due to the test's focus on simplified Chinese and its inability to fully capture individual learning needs. Although she found a more appropriate course after consulting the Chinese supervisor, her experience 92 highlights the need to address these limitations. In summary, from the students' perspective, the Chinese placement test effectively places them in appropriate course levels, but addressing potential limitations and considering individual learner backgrounds can further enhance the test's accuracy and its ability to meet students' needs. [RQ 6c: Utilization inference]: Are cut-off scores set appropriately? The appropriateness of test cut-off scores plays a critical role in valid score utilization and interpretation, as it directly impacts the precision and effectiveness of the exam in assigning students to suitable course levels. To answer the research question, as noted earlier, I collected teacher ratings on their perceptions of the relevance of the items targeting each level of the course. Specifically, instructors assessed all 32 items in the placement test concerning their relevance and appropriateness to the course content of 100-level, 200-level, and 300-level courses. A series of checkboxes were provided for each item to enable instructors to make their assessments. For ease of analysis and interpretation, when instructors' ratings differed regarding the course an item was targeting, the course with the most instructors selecting it was determined as the item's target level. Upon examining the results, it was found that for 7 out of 32 items, instructors' perceptions of item relevance and appropriateness to course content differed by one-course level. 
However, none of the items had instructor ratings that differed by two levels, indicating that the items were generally well-aligned with the intended course levels, albeit with some variation. This consistency in instructors' perceptions of item relevance is an important factor to consider when evaluating the appropriateness of the cut-off scores. For a summary of the number of items perceived as relevant to each course level, please see Table 20.

Table 20. Summary of items perceived by instructors as relevant to each course level

Course level    N of items    Cut-offs for placement decisions
100-level       20            < 20
200-level       9             < 30
300-level       3             >= 30

Table 20 shows that of the 32 items in the test, 20 items were perceived as relevant to the 100-level course, 9 items to the 200-level course, and 3 items to the 300-level course. These numbers correspond to the placement cut-off scores of less than 20 for 100-level courses, less than 30 for 200-level courses, and 30 or greater for 300-level courses. Although the number of items matched to each course level seems to correspond with the cut-off scores, the distribution of items across the levels appears to be imbalanced. The 100-level courses have significantly more items (20) than the 200-level (9) and 300-level (3) courses. This imbalance may lead to a less accurate measurement of students' language proficiency in the upper-level courses, as there are fewer items to gauge their abilities. These findings suggest that the Chinese placement test might benefit from a more balanced distribution of items across the various course levels to better assess students' language proficiency at each level. By adjusting the number of items targeting the 200-level and 300-level courses and ensuring a more even distribution across all levels, the test's accuracy in placing students in appropriate course levels can be improved.

In summary, the analysis of the teacher ratings and the distribution of items across course levels provides mixed evidence regarding the appropriateness of the cut-off scores for the Chinese placement test. On one hand, the fact that the instructor ratings for item relevance did not differ by more than one course level for any item suggests that the items are generally well-aligned with the intended course levels. This consistency in instructors' perceptions of item relevance supports the validity evidence of the cut-off scores. On the other hand, the uneven distribution of items across the course levels, with the 100-level courses having significantly more items than the 200-level and 300-level courses, raises concerns about the accuracy of the test in measuring students' language proficiency in the upper-level courses. This finding suggests that the test might benefit from a more balanced distribution of items across the various course levels to better assess students' language proficiency at each level. Therefore, while the evidence does not outright reject the appropriateness of the cut-off scores, it does indicate that improvements can be made to enhance the test's accuracy in placing students in appropriate course levels. By adjusting the number of items targeting the 200-level and 300-level courses and ensuring a more even distribution across all levels, the validity evidence for the utilization inference of the Chinese placement test can be further strengthened.
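For transparency, the majority-vote mapping of items to course levels described above can be sketched as follows; the ratings matrix is a simulated, hypothetical stand-in for the three instructors' judgments.

```r
# Illustrative sketch with simulated, hypothetical data: one row per item,
# one column per instructor, with the course level each instructor selected.
set.seed(123)
ratings <- matrix(sample(c("100-level", "200-level", "300-level"), 32 * 3, replace = TRUE),
                  nrow = 32, ncol = 3,
                  dimnames = list(paste0("item", 1:32), c("A", "B", "C")))

# Each item's target level is the level chosen by the most instructors
# (ties, if any, default to the first level encountered).
target_level <- apply(ratings, 1, function(r) names(which.max(table(r))))

# Number of items mapped to each course level (cf. Table 20).
table(target_level)
```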
[RQ 7: Consequence implication inference]: Does the test have positive effects on Chinese teaching and learning?

The consequence implication inference of the Chinese placement test delves into the real-world effects of the test on teaching and learning, specifically investigating its impact on Chinese language instruction and student learning experiences. Gaining insights into the implications of the Chinese placement test is crucial for understanding its validity, effectiveness in promoting learning experiences, and areas for potential enhancement. This knowledge is valuable for educators, administrators, and other stakeholders, enabling them to make well-informed decisions about adopting and implementing the test to fulfill the needs of both instructors and students. To thoroughly address the research question and gain a comprehensive understanding of the Chinese placement test's consequence implication inference, I employed qualitative analysis of the feedback from students and instructors collected from the interviews. The analysis enables a deep exploration of students' and instructors' subjective experiences and perceptions regarding the test's impact, capturing the intricacies of how the test influences teaching and learning processes.

The collected data from student interviews suggest that the Chinese placement test has positive effects on Chinese teaching and learning. Students generally reported that the placement test results were helpful in guiding their decisions regarding course selection. For example, one student stated, "I think [the placement test results] were pretty accurate. (Student #2, CHS202)." However, it should be noted that some students viewed the test results as a suggestion rather than a strict guideline, using it as a basis to make their own decisions about their level of comfort and willingness to engage with the course material. This student further elaborated on this point, saying, "I know some people who were placed higher, but they chose to go back a step or start over completely. I don't know too many people who pushed ahead, which is interesting. (Student #2, CHS202)." Another student, who was initially placed in Chinese 201, decided to take Chinese 101 instead due to concerns about the time commitment involved in a five-credit course during her first semester. She explained, "I feel like I could do it (take 201) if I put in the time for it, but I didn't really want to do it for my first semester. I just felt with it being a five-credit course, I didn't want it to take up the majority of my time compared to my other classes. (Student #7, CHS102)." Similarly, a student who was advised to take Chinese 201 based on the test results opted for Chinese 101 because of concerns that his Chinese language skills had become rusty. Upon reflection, the student admitted, "I think the placement test results were more accurate in describing what I would be able to do, given that Chinese 101 was very easy. I definitely would have taken 201 if I had gone back in time and said to myself, look, this is what it's gonna be. (Student #3, CHS102)." In light of these individual decisions, another student drew attention to the fact that not all students in a given Chinese class were equally proficient. This observation underscores that factors other than the placement test, such as prior language learning experience, could contribute to differences in proficiency levels. The student observed, "I don't think that's necessarily the case [that all students were equally proficient in my class].
I believe there is a significant role played by how much Chinese has been taken before high school or before college. Some students (who are placed in) had 4 or 5 years of prior Chinese learning experience and are noticeably more proficient. (Student #4, CHS302)." This point further emphasizes the importance of considering individual factors and preferences when making course placement decisions. Building on the insights gained from student interviews, the analysis of instructors' interview data further explores the impact of the Chinese placement test on teaching and learning. Instructors provided valuable perspectives on the positive effects of the test, as well as the challenges they face in accommodating students with varying levels of proficiency. One instructor mentioned the advantages of the placement test, stating, 97 "Students who enter our class through the placement test are generally easier to manage, as they are usually placed at the appropriate level. (instructor #1)" This demonstrates the positive effects of the placement test in accurately assessing students' proficiency and ensuring they are enrolled in suitable courses. However, the instructor also noted that there is a range of proficiency levels among students who have advanced from the 101 and 102 classes, rather than being placed by the test. This variation in proficiency creates a stratified learning environment, where instructors must tailor course difficulty to accommodate the majority of students and follow their learning pace. For high-achieving students, the instructor encourages them to take on extra work, such as writing more in-depth essays or improving their presentations. On the other hand, supporting struggling students who have advanced from lower-level courses can be quite challenging for teachers. As one instructor noted, "We used to have a teaching assistant (TA) or a Chinese language helper in our department. However, due to the pandemic, we haven't had such support for the past couple of years. Foreign language teaching assistants (FLTAs) can provide some help, but it is often insufficient. (instructor #3)” The instructor also highlighted the difficulties in dealing with students who took a break from their studies and experienced a decline in their language proficiency. She shared, "For example, I have encountered one or two students who took a year or two off, forgot what they learned, but still earned credits. This creates a difficult situation. Despite their enthusiasm, their proficiency has dropped to the 200 level, but they have already earned 300-level credits and need to graduate. (instructor #3)" 98 As a result, these students enroll in 400-level courses, posing a challenge for both the instructor and the students themselves. To mitigate this issue, the instructor allows these students to audit lower-level classes if their schedule permits, offering them additional support to catch up with their peers. In conclusion, the data gathered from student interviews demonstrate that the Chinese placement test has positive effects on both Chinese language instruction and student learning experiences. Students reported that the test results were accurate and helpful for course selection, while instructors found it easier to manage students placed at appropriate levels. However, it is essential to consider individual factors and preferences when making course placement decisions to ensure the best possible learning outcomes for all students. 
99 CHAPTER 5: DISCUSSION In the current study, I aimed to provide a comprehensive examination and evaluation of the test score uses and interpretations for the listening and reading sections of an in-house, college-level Chinese placement test. Filling a gap in the literature on foreign language placement testing, the study focused on a language other than English and addressed methodological limitations commonly found in existing research. Using an argument-based validation framework conceptualized by Kane (2006) and expanded by Chapelle et al. (2008), for the study I collected and evaluated quantitative and qualitative validity evidence across seven inferences: domain description, evaluation, generalization, explanation, extrapolation, utilization, and consequence implication. The primary goals of the study were to (1) investigate the functioning of test items by identifying and revising psychometrically problematic items, if any; (2) utilize the empirical results to inform test revisions, if needed; (3) demonstrate how the collected quantitative and qualitative results serve as strong or weak evidence or counter-evidence for the validity argument; and (4) provide an overall evaluation of the intended interpretation and use of the placement test scores. By employing mixed-methods, the study aimed to contribute to the larger discussion of foreign language assessment practices and argument-based test validation, while also offering insight into the ongoing development of validity research. In this discussion section, I will summarize and evaluate the validity evidence for each research question using the criteria determined earlier. I will then discuss the results in relation to previous SLA literature when applicable. Then, the chapter closes with a brief discussion of some of the limitations of the current study. 100 [RQ 1: Domain description inference]: Are the relevance of the test items and test criteria to the instructional domain and the appropriateness of the item difficulties supported by test stakeholders? The results of the first research question, concerning the domain description inference, were evaluated by examining the instructors' and students' perceptions of the test items' relevance, appropriateness, and difficulty. Based on the results, strong evidence supports the relevance and appropriateness of the test items and criteria to the instructional domain. None of the items were considered irrelevant to the course material across all three levels of instruction by the instructors. In addition, students in all course levels generally perceived the test items as relevant to their course, as evidenced by their relatively high mean relevance scores. Furthermore, the appropriateness of the item difficulties is supported by the findings that students in higher-level courses found test items easier compared to those in lower-level courses. This trend is expected, as students in advanced courses should have a higher proficiency level in the language, allowing them to find the items less challenging. The results align with the existing literature that emphasizes the importance of test content relevance and appropriateness for ensuring the validity of language assessments (Xi, 2010; Chalhoub-Deville & Deville, 2018). The alignment between the test items and the instructional domain contributes to the test's ability to accurately measure students' language proficiency and place them in suitable course levels. 
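Such perception data lend themselves to simple summaries. As a minimal sketch (with a simulated, hypothetical data frame rather than the actual questionnaire data), mean perceived relevance by course level could be computed as follows.

```r
# Illustrative sketch with simulated, hypothetical data: one row per student
# judgment, with the student's course level and a 1-6 relevance rating.
set.seed(123)
relevance <- data.frame(
  course_level     = sample(c("100-level", "200-level", "300-level"), 120, replace = TRUE),
  relevance_rating = sample(1:6, 120, replace = TRUE)
)

# Mean perceived relevance by course level (cf. the relevance results for RQ 1).
aggregate(relevance_rating ~ course_level, data = relevance, FUN = mean)
```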
[RQ 2a: Evaluation inference]: Do test items yield item difficulty estimates that are appropriate for making placement decisions? The research question focused on the evaluation inference, investigating whether the test items produced item difficulty estimates suitable for making placement decisions. To address 101 this question, two distinct methods were employed: a Rasch-based approach using a Wright map and correlation analysis with Pearson's correlation coefficients comparing students' and teachers' perceived difficulties with empirical item difficulties. The Wright map analysis indicated that 75% of the examinees fall within the overlapping range of item difficulties and examinees' ability measures, suggesting reasonable item targeting along their ability level. However, a noticeable ceiling effect was observed, with approximately one-fourth of the examinees having ability measures above all item difficulties. This finding underscores the necessity of including more challenging items to address the ceiling effect for high-ability examinees, allowing for a more accurate assessment and facilitating appropriate placement decisions. The correlation analysis revealed varying degrees of agreement between students' and teachers' perceived difficulties and empirical item difficulties. The overall correlation coefficient of .781 for the aggregated data, as well as data from 100- and 300-level students, demonstrates a robust relationship between students' perceived item difficulties and empirical item difficulties. This suggests that the test items generally function as intended and accurately measure the targeted construct. However, the findings also uncovered some discrepancies and weaker relationships, particularly for the 200-level courses, where the correlation between students' perceptions of item difficulties and empirical item difficulties was lower (r = .415). One possible explanation for this observation is the increased heterogeneity in students' abilities and prior exposure to the content. The 200-level courses may consist of a more diverse group of students in terms of their abilities and prior exposure to the content, leading to greater variability in students' perceptions of item difficulties and a lower correlation with empirical item difficulties. This explanation 102 aligns with Ma and Winke's (2019) study, which found that intermediate students' self-assessed skills tend to be less accurate compared to those of beginner or advanced students. Another observation from the results is the moderate correlation between teachers' perceptions of item difficulties and the empirical difficulties computed from students' performance (r = .507). Existing literature has revealed that there may be a mismatch between teachers' perception of item difficulties and students' actual performance on these items (e.g., (van de Watering & van der Rijt, 2006). This observation could be attributed to several factors. For instance, teachers might not be fully aware of the specific strategies students use when taking the test. Interviews with students and teachers revealed that students sometimes employed test- taking strategies when responding to items (e.g., listening for keywords instead of trying to understand every sentence in the listening items), whereas teachers often evaluated item difficulties based on the inclusion of difficult vocabulary or phrases. 
Another possible factor that may contribute to the moderate correlation between teachers' perceptions and empirical item difficulties is teachers' cognitive biases, such as overestimating the difficulty of items they themselves find challenging or underestimating the difficulty of items they consider easy (van de Watering & van der Rijt, 2006). Providing teachers with more insights into students' test-taking strategies and refining their understanding of item difficulty can help improve the alignment between teachers' perceptions and empirical item difficulties, ultimately leading to better test development and more accurate placement decisions.

[RQ 2b: Evaluation inference]: Do test items exhibit no evidence of item bias?

The research question centered on determining whether the test items exhibit no evidence of item bias, specifically in terms of invariance across gender. Ensuring item invariance between female and male examinees is vital for maintaining a fair and unbiased assessment (Kunnan, 2000). To address this question, the study analyzed the group invariance of item measures by examining differential item functioning (DIF) between female and male examinees. Based on the DIF analysis, the results revealed no evidence of DIF across the two examinee subgroups, as none of the items met both DIF criteria (i.e., statistical significance of the Mantel-Haenszel test at the .05 level after the Benjamini-Hochberg adjustment and a difference in item difficulty of at least .5 logit). This finding contributes to the strong validity evidence for the evaluation inference and highlights the test's capacity to provide unbiased measures of language ability for both genders.
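The screening just described can be sketched in base R; the data below are simulated, hypothetical stand-ins, and the original analysis may have relied on dedicated DIF software. Only the Mantel-Haenszel step is shown; the second criterion (a between-group difficulty difference of at least .5 logit) would come from the Rasch analyses.

```r
# Illustrative sketch with simulated, hypothetical data: 0/1 responses to 32
# items and a gender variable for 200 examinees.
set.seed(123)
responses <- as.data.frame(matrix(rbinom(200 * 32, size = 1, prob = 0.6), ncol = 32))
gender    <- factor(sample(c("female", "male"), 200, replace = TRUE))

# Examinees are matched on ability by stratifying on the total score.
stratum <- cut(rowSums(responses), breaks = 5)

# Mantel-Haenszel test per item on the 2 (gender) x 2 (incorrect/correct) x K
# (stratum) table; p-values are then adjusted with the Benjamini-Hochberg method.
mh_p <- sapply(responses, function(item) {
  tab <- table(gender, factor(item, levels = 0:1), stratum)
  mantelhaen.test(tab)$p.value
})
mh_bh <- p.adjust(mh_p, method = "BH")
which(mh_bh < .05)   # items flagged by the first (statistical) criterion, if any
```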
[RQ 2c: Evaluation inference]: Are correct options unambiguous and accurately keyed?

For this research question, the investigation centered around the clarity and accuracy of the correct options in the test items. The aim was to determine whether the correct options were unambiguous and accurately keyed, as well-crafted multiple-choice items should include effective distractors that challenge examinees and require them to demonstrate their language abilities to select the correct response among plausible alternatives. To address this question, an analysis of distractors was conducted as an item quality indicator, aiming to assess the extent to which distractors for each item discriminated between examinees with different ability levels. The analysis revealed that the keyed options generally attracted higher-ability examinees compared to the distractors. However, four items exhibited lower discriminating power, with mean ability estimate differences between the two groups of examinees of less than 1. Further investigation of these items revealed issues such as ambiguous phrasing in the item prompts, poorly selected distractors, and incongruent information in the prompt compared to the intended answer (Downing & Haladyna, 2006). These findings indicate the need for a more detailed review of these items to ensure that the keyed options are clear and unambiguous, and the distractors are not too close in plausibility to the keyed option, which could lead to inaccurate measurement of examinees' abilities. Considering the criteria provided, the findings show weak evidence for the evaluation inference, as approximately 90% of the items are shown to be unambiguous and accurately keyed. This suggests that while the test generally provides accurate information on examinees' abilities, there is room for improvement, particularly in addressing the issues found in the four problematic items.

The importance of test revisions in test development cannot be overstated, as it is crucial to ensure that test items are reliable and valid measures of examinees' language abilities (Downing & Haladyna, 2006). Additionally, content experts can provide valuable feedback on items, identifying potential issues and suggesting improvements (Haladyna, Downing, & Rodriguez, 2002). Several guidelines can be followed to develop good items, such as ensuring that items are clear and concise, avoiding misleading or ambiguous language, and selecting distractors that are plausible but clearly incorrect (Haladyna et al., 2002). Examining item-level statistics and the functioning of distractors is an effective approach to evaluating the quality of items and their distractors (Wolfe & Smith, 2007; Osterlind, 1998). By understanding how examinees with different abilities respond to various distractors, test developers can make necessary revisions to ensure that the test items accurately assess language proficiency.

[RQ 3a: Generalization inference]: Does the MSU Chinese placement test produce scores that are internally consistent?

The research question aimed to determine if the MSU Chinese placement test produces scores with internal consistency, which is critical for assessing the reliable measurement of Chinese language proficiency across various contexts and student populations. High internal consistency allows for greater trust in the stability and precision of test scores. To analyze the internal consistency of the MSU Chinese placement test, Cronbach's α was computed, resulting in a value of 0.88 (95% CI: [0.86, 0.90]), demonstrating strong internal consistency. Furthermore, an item analysis was carried out to evaluate the influence of each item on the overall internal consistency (refer to Table 8). The rationale behind examining Cronbach's α after removing each item is to identify any problematic items that could potentially lower the internal consistency of the test. Some factors that may contribute to a noticeable decrease in Cronbach's α include poor item quality, item difficulty (an item is significantly more difficult or easier compared to the rest of the test items), lack of content coverage (measuring a different aspect of the construct), item redundancy, and low item discrimination. In this study, the analysis revealed that the removal of specific items would cause a minor decrease in Cronbach's α from 0.88 to 0.87, while for others, the α value would remain stable at 0.88. These outcomes indicate that the test items are consistently measuring the same underlying construct, and no single item significantly impacts the overall internal consistency. This item-level examination provides valuable information for test developers, enabling them to refine and improve the test's quality by identifying and addressing any problematic items. In conclusion, the findings offer robust evidence in support of the MSU Chinese placement test scores' internal consistency, thus reinforcing the generalization inference in the context of language assessment.
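The reliability analysis can be illustrated briefly with the psych package; the responses object below is a simulated, hypothetical stand-in for the scored item data.

```r
# Minimal sketch, not the original analysis: Cronbach's alpha and the
# alpha-if-item-dropped check on simulated, hypothetical 0/1 item scores.
library(psych)

set.seed(123)
responses <- as.data.frame(matrix(rbinom(200 * 32, size = 1, prob = 0.6), ncol = 32))

rel <- alpha(responses)
rel$total        # overall alpha and related summary statistics
rel$alpha.drop   # alpha if each item is removed (cf. the item-level examination)
```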
[RQ 3b: Generalization inference]: Are there adequate items to reliably differentiate students' abilities into three levels as intended?

In addressing the research question, the findings reveal that the MSU Chinese placement test exhibits a person reliability of .84 and a person separation index of 2.32. Based on the provided criteria, these results present weak evidence for the test's ability to reliably differentiate students' abilities into three levels. Despite not reaching the strong evidence threshold, the test still demonstrates satisfactory performance in assigning students to appropriate proficiency levels, as intended by the test developers. Person reliability and the person separation index have been commonly used in educational and psychological research to examine the psychometric properties of measurement instruments (e.g., Fan et al., 2021; Hu et al., 2022; Jefferies et al., 2021). For instance, Jefferies et al. (2021) applied the Rasch model to explore the psychometric properties of PLAYself, a tool designed for self-description of physical literacy in children and youth. They reported person reliability values ranging from .7 to .82, which indicates that PLAYself has good internal consistency and can reliably distinguish between different levels of physical literacy. In the study by Fan et al. (2021), the researchers examined the psychometric properties of the Norwegian Self-Efficacy for Therapeutic Use of Self questionnaire using Rasch analysis; excellent item and person separation were observed across all three parts (N-SETMU, N-SERIC, and N-SEMIE). The person separation index ranged from 2.8 to 4.6, indicating the successful differentiation of subjects into three to five distinct levels of self-efficacy. Although these indices are not as commonly reported in SLA research, the results highlight their utility in examining the validity evidence for placement tests, as the purpose of these tests matches well with the objective of the two measures: to effectively distinguish between different levels of language proficiency.

[RQ 4a: Explanation inference]: Does students' test performance vary according to the amount and quality of prior Chinese learning experience?

In addressing the research question regarding whether students' test performance varies according to the amount and quality of prior Chinese learning experience, the study found significant improvements in listening, reading, and total scores on the Chinese placement test from the beginning to the end of the semester. The level of improvement varied across different course levels, with the most significant gains observed among the 100-level students, followed by the 200-level students. The least change in scores occurred for the 300-level students. The study's findings provide strong evidence for the validity of the Chinese placement test and contribute to the understanding of the relationship between prior learning experience and language test performance. However, the observed differences in score improvements might be influenced by factors such as the ceiling effect and practice effect. The ceiling effect could be a potential explanation for the smaller improvements among higher-level students, as they already performed well in the first test administration, leaving less room for improvement. This observation suggests that adding more challenging items to the test might better differentiate the proficiency levels of higher-level students.
[RQ 4a: Explanation inference]: Does students' test performance vary according to the amount and quality of prior Chinese learning experience?

In addressing this research question, the study found significant improvements in listening, reading, and total scores on the Chinese placement test from the beginning to the end of the semester. The degree of improvement varied across course levels, with the largest gains observed among the 100-level students, followed by the 200-level students; the least change in scores occurred for the 300-level students. The study's findings provide strong evidence for the validity of the Chinese placement test and contribute to the understanding of the relationship between prior learning experience and language test performance. However, the observed differences in score improvements might be influenced by factors such as the ceiling effect and the practice effect. The ceiling effect is a potential explanation for the smaller improvements among higher-level students, as they already performed well in the first test administration, leaving less room for improvement. This observation suggests that adding more challenging items to the test might better differentiate the proficiency levels of higher-level students. However, this explanation remains speculative, and further research is needed to confirm the presence of the ceiling effect and its impact on the results. The practice effect, resulting from students taking the same test twice, could potentially inflate the observed improvements in test performance (Calamia et al., 2013). If the practice effect is substantial, it might lead to an overestimation of the actual gains in language proficiency. Although the current study cannot definitively establish the extent to which the practice effect affected the findings, it is essential to consider this potential limitation when interpreting the results and evaluating the validity argument. It is also worth noting that the sample size for the 300-level students in the study was small (n = 6). Small sample sizes can lead to low statistical power and increase the likelihood of Type II errors (Cohen, 1992). In the context of this study, the small 300-level sample limits the ability to draw robust conclusions about this group, which may not accurately represent the broader population of 300-level students.
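The within-level gains summarized here (and the full post-hoc contrasts in Appendices 8 through 10) can be illustrated with a simple paired-comparison sketch in R. It assumes a hypothetical long-format data frame named scores with placeholder columns id, level, time, and total, and it uses paired t-tests rather than the resampling-based repeated-measures analysis reported in the appendices.

```r
# Minimal sketch: time 1 vs. time 2 gains within each course level.
# `scores` is a hypothetical long-format data frame with placeholder columns:
# id (student), level ("100"/"200"/"300"), time (1 or 2), and total (test score).
wide <- reshape(scores, idvar = c("id", "level"), timevar = "time",
                direction = "wide")                 # creates total.1 and total.2

gain_summary <- lapply(split(wide, wide$level), function(d) {
  gain <- d$total.2 - d$total.1
  tt   <- t.test(d$total.2, d$total.1, paired = TRUE)
  c(n = nrow(d),
    mean_gain = mean(gain),
    t = unname(tt$statistic),
    p = tt$p.value,
    d_z = mean(gain) / sd(gain))                    # standardized mean gain
})

round(do.call(rbind, gain_summary), 3)
# p-values across the three levels would still need a multiplicity adjustment,
# e.g., p.adjust(..., method = "holm").
```

Under the ceiling-effect account discussed above, the 300-level row of such a summary would be expected to show the smallest mean gain, as was observed in the study.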
[RQ 4b: Explanation inference]: Do students' test scores support the internal structure of the intended construct?

To answer this research question, an exploratory factor analysis (EFA) was conducted to examine the factor structure of the 32 items in the placement test. The purpose of the analysis was to determine whether the scores collected via the placement test reflect a theoretical view of language proficiency. If the test items align strongly with the underlying construct, this provides evidence that the test is a valid measure of language proficiency (In'nami & Koizumi, 2016). Conversely, inconsistencies or an inadequate representation of the construct in the internal structure can highlight areas that need improvement to better assess the intended language skills. In the initial EFA, the second factor was found to be mainly driven by relatively easy items, five of which were related to the same prompt. This finding prompted further investigation into the structure of the test items to better understand their impact on construct validity. The five items were collapsed into a polytomous super-item, and the EFA was re-conducted. After this adjustment, both the one-factor and two-factor solutions showed improved fit. For the adjusted models, the fit statistics suggest stronger support for the two-factor solution. However, it is important to note that the second factor still appears to be mainly driven by relatively easy items and may not necessarily represent a separate dimension of language proficiency. These findings suggest that the placement test does appear to measure language proficiency as intended, but they also highlight areas that require improvement. One approach to addressing the issue is to consider removing the relatively easy items from the test. This solution could potentially increase the test's ability to differentiate between proficiency levels and reduce the influence of the identified issues on test validity. However, removing these items might also result in a loss of content coverage and may not fully address the underlying construct representation. Another approach is to revise the identified items or to add items that better represent the intended construct. This could involve introducing more challenging items or diversifying the prompts, which may help mitigate the impact of the relatively easy items, and of the clustering of items related to the same prompt, on the internal structure of the test. When implementing this approach, it is important to ensure that the revised or added items align with the theoretical view of language proficiency and maintain content coverage (In'nami & Koizumi, 2016).
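The super-item adjustment can be expressed compactly in R. The sketch below is a hypothetical illustration: the item names (R07 through R11, cf. Appendix 4), the use of the psych package, and the choice of maximum-likelihood factoring on polychoric correlations are assumptions, not the exact specification used in the dissertation.

```r
# Minimal sketch: collapse five items that share one prompt into a polytomous
# super-item, then re-run one- and two-factor EFAs.
# `responses` and the item names are placeholders, not the study's variables.
library(psych)

prompt_items <- c("R07", "R08", "R09", "R10", "R11")
super_item   <- rowSums(responses[, prompt_items])         # scored 0-5

adjusted <- cbind(responses[, setdiff(names(responses), prompt_items)],
                  super = super_item)

efa_1f <- fa(adjusted, nfactors = 1, fm = "ml", cor = "poly")
efa_2f <- fa(adjusted, nfactors = 2, fm = "ml", cor = "poly")

efa_1f$RMSEA; efa_1f$TLI        # fit of the one-factor solution
efa_2f$RMSEA; efa_2f$TLI        # fit of the two-factor solution

print(efa_2f$loadings, cutoff = 0.30)   # inspect what drives the second factor
```

Inspecting the two-factor loadings in this way is what would reveal whether the second factor remains dominated by the easier, prompt-linked items, as reported above.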
[RQ 5: Extrapolation inference]: Do students' test scores support the relationship between their performance on the test and other indicators of Chinese language proficiency?

The extrapolation inference plays a pivotal role in the validity argument by determining whether Chinese placement test scores can be effectively generalized to other indicators of Chinese language proficiency. This study explored the extrapolation inference by analyzing the relationship between students' placement test scores and their performance on the ACTFL tests. Strong evidence was found for the relationship between students' scores on the Chinese placement test and on the ACTFL proficiency tests for corresponding skills (e.g., listening and listening). The correlation coefficients for corresponding skills were generally moderate to strong (r ≥ 0.40), which supports the extrapolation inference. Furthermore, correlations between students' scores on the Chinese placement test and on the ACTFL proficiency tests for non-corresponding skills (e.g., speaking and listening) were positive but weaker than those for corresponding skills, providing additional evidence for the extrapolation inference. The findings of this study not only contribute to the body of knowledge on validity evidence for the extrapolation inference but also highlight the importance of considering correlations between corresponding and non-corresponding skills in language assessment research. In this context, a study by Eda, Itomitsu, and Noda (2008) serves as a valuable example. They investigated validity evidence for the JSKIT, a Japanese skills test used as a placement tool in a summer intensive language program, and examined the correlations between the subcomponents of the JSKIT and the corresponding subcomponents of the in-house placement test. Their findings indicated that the structure section of the JSKIT was most strongly correlated with the corresponding grammar section of the placement test (r = .806), and the reading section of the JSKIT was most strongly correlated with the corresponding reading section of the placement test (r = .766). Comparing correlations between corresponding and non-corresponding skills is an essential aspect of examining validity evidence for language assessments. Although this aspect has been addressed in other fields, such as psychological and educational measurement research, it has often been overlooked in foreign language placement testing: many studies in this area have focused primarily on establishing positive correlations for corresponding skills without comparing these correlations with those for non-corresponding skills.
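This convergent-versus-discriminant pattern can be checked directly with a correlation matrix. The R sketch below assumes a hypothetical data frame named ext whose column names (placement subscores and ACTFL results) are placeholders for the study's variables.

```r
# Minimal sketch: corresponding vs. non-corresponding skill correlations.
# `ext` is a hypothetical data frame with placeholder columns:
#   place_listen, place_read                 (placement subscores)
#   actfl_listen, actfl_read, actfl_speak    (ACTFL proficiency results)
corr <- cor(ext[, c("place_listen", "place_read")],
            ext[, c("actfl_listen", "actfl_read", "actfl_speak")],
            use = "pairwise.complete.obs")
round(corr, 2)

# Extrapolation evidence: corresponding-skill correlations (e.g., place_listen
# with actfl_listen) should be moderate to strong and should exceed the
# non-corresponding ones (e.g., place_listen with actfl_speak).
cor.test(ext$place_listen, ext$actfl_listen)   # adds a confidence interval

# If the ACTFL results are treated as ordinal ratings, Spearman's rho
# (method = "spearman") may be the more defensible choice.
```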
[RQ 6a: Utilization inference]: From the perspective of course instructors, are students placed into appropriate course levels?

To answer this research question, data were collected from three Chinese course instructors using questionnaires and interviews. The findings offer strong evidence in support of the utilization inference, as instructors generally agreed that the placement test effectively placed students into appropriate course levels, aligning with the expectations of test stakeholders. However, the current study also revealed challenges and difficulties in using placement tests to place heritage learners, who typically have unique language learning experiences due to exposure to the target language through family or cultural experiences (Li & Duff, 2008). As a result, traditional placement tests may not adequately capture the abilities of heritage learners, leading to potential inaccuracies in their placement and, ultimately, to a course level that does not align with their needs. This issue is particularly acute because heritage learners often possess a strong foundation in oral and listening skills but may struggle with more formal aspects of the language, such as grammar and writing (Campbell, 2000; Kondo-Brown, 2005). Consequently, placement tests that weight formal language skills heavily may not provide an accurate representation of heritage learners' true proficiency levels. Recognizing this limitation, teachers may need to resort to alternative assessment methods when the placement test scores are insufficient. Oral interviews and background questionnaires, for instance, are the most widely used alternatives, given the lack of standardized placement tests specifically designed for heritage learners (Li & Duff, 2008). These methods can offer valuable insights into a learner's proficiency, as the amount of schooling received in the target language is considered the most reliable indicator of heritage language proficiency. Nevertheless, individual differences may still exist even among students from the same class, which presents a significant challenge in placing Chinese heritage language learners from diverse educational systems into appropriate classes. One reason why heritage learners may struggle with placement tests is their often uneven grasp of the heritage language. As noted earlier, some learners possess strong receptive or conversational skills while lacking literacy, grammar, and vocabulary (Sohn, 2004). Additionally, their sociolinguistic and pragmatic competence may be limited, further complicating the placement process. To address these issues, some college and university Chinese language programs have adopted separate tracks for language learners, including heritage tracks for students who have had previous exposure to the target language, regardless of their ethnicity (Li & Duff, 2008). However, not all Chinese programs can afford to implement these dual tracks due to low enrollment, especially in recent years. This limitation may result in mixed classrooms with varying degrees of proficiency among heritage and non-heritage learners, leading to challenges in meeting the diverse needs of all students. In response to these challenges, the instructors in the current study have developed several strategies to cope with mixed proficiency levels in the classroom. For struggling students, teaching assistants (TAs), Chinese language helpers in the department, and foreign language teaching assistants (FLTAs) provided after-class assistance; high-achieving students were encouraged to take on extra work, such as writing more in-depth essays or improving their presentations. These strategies aim to better support learners with diverse proficiency levels and help them make the most of their language learning experience. While these approaches have proven beneficial, further enhancement could be achieved through the implementation of a heritage track at MSU, specifically designed to address the unique needs of these students. To open a heritage track at MSU, it would likely be most effective to focus on the upper levels, particularly if that is where most heritage learners are placed. Scheduling these classes at a time that accommodates the majority of heritage learners would be crucial to ensure adequate enrollment. However, enrollment size may still be an issue, as MSU requires at least 15 students in an undergraduate course for it to run. In some cases, this requirement can be overridden, as has been done in certain departments for language classes, but careful consideration of class scheduling and enrollment size will be essential to successfully implement a heritage track at MSU. This addition could complement the strategies already employed by instructors in mixed-proficiency classrooms, ultimately providing a more tailored and effective learning experience for heritage learners.
[RQ 6b: Utilization inference]: From the perspective of students, are they placed into appropriate course levels?

Building upon the insights gained from the instructors' perspective on the effectiveness of the Chinese placement test, this section focuses on the students' perspective to provide a more comprehensive understanding of the placement process. While instructors provided valuable information on the challenges and strategies associated with placing heritage learners into appropriate courses, it is essential to consider students' experiences and feedback to fully assess the test's effectiveness. By doing so, I can gain a deeper understanding of how the test outcomes affect students' learning experiences and identify any potential gaps or mismatches between their prior knowledge, course expectations, and assigned course levels. To address the research question, I analyzed students' interview responses and questionnaire data, revealing several key findings that constitute strong evidence in support of the utilization inference. Students generally achieved high GPAs in their assigned courses and reported feeling well prepared and performing well in these courses, indicating appropriate course placement. However, similar to the instructors' perspective, there were instances in which students faced difficulties in their assigned courses. In one particular case, a student with a background in traditional Chinese experienced challenges with the placement test due to its focus on simplified Chinese. It is important to note that traditional and simplified Chinese are different orthographic systems, with traditional characters being more complex and used in regions such as Taiwan and Hong Kong, while simplified characters are used in Mainland China. Additionally, different phonetic systems are used in these regions, with Pinyin used in Mainland China and Zhuyin in Taiwan. These differences highlight the importance of considering individual learner backgrounds when evaluating the test's effectiveness. This example emphasizes the need for open communication and collaboration between students and program supervisors to ensure appropriate course placement, especially when students encounter challenges that the placement test may not fully capture. In examining students' experiences with the Chinese placement test, this study also highlights a gap in the literature regarding how learning simplified or traditional Chinese is affected by prior experience with the other orthographic system. There has been limited research investigating the challenges and difficulties students may face when transitioning between these orthographic and phonetic systems. This lack of research could be partly attributed to the growing preference for Hanyu Pinyin and simplified characters in recent years. The majority of teachers and students prefer Hanyu Pinyin, and even Chinese heritage schools that traditionally teach Zhuyin have started to teach Hanyu Pinyin in the higher grades (Kwoh, 2007). The College Board's decision to use a computer-based AP Chinese test has further driven the adoption of Hanyu Pinyin, as it simplifies the input and typing process for students. Additionally, more schools have begun teaching simplified characters and Pinyin due to the increasing political and economic influence of Mainland China's official language, Putonghua (Wei & Hua, 2010). While this trend has led to a shift in focus away from traditional characters and Zhuyin, it is important to consider students' diverse backgrounds and experiences when evaluating the effectiveness of language placement tests. In light of these factors, it is crucial to further explore the challenges and difficulties faced by students with different backgrounds in traditional or simplified Chinese when transitioning between these orthographic and phonetic systems. By doing so, researchers and educators can better understand the unique needs of these students and develop more effective placement tests and language programs that cater to their diverse learning experiences.
[RQ 6c: Utilization inference]: Are cut-off scores set appropriately?

This research question focuses on the appropriateness of the cut-off scores used with the Chinese placement test. Cut-off scores are critical for valid score interpretations and uses, as they directly affect the test's precision and effectiveness in assigning students to suitable course levels. Accurate cut-off scores ensure that students are placed in appropriate courses, leading to better learning outcomes, engagement, and satisfaction with the language program. To address the research question, I examined instructors' ratings of item relevance and appropriateness relative to the course content of the 100-level, 200-level, and 300-level courses. The results revealed that the number of items matched to each course level corresponds with the cut-off scores, providing supportive validity evidence. However, there were variations in instructors' perceptions of item relevance and appropriateness for 7 of the 32 items, with these items' ratings differing by one course level. Despite the overall consistency between cut-off scores and item relevance, the distribution of items across the course levels appears to be imbalanced, with the 100-level courses having considerably more items than the 200-level and 300-level courses. This imbalance may lead to less accurate measurement of students' language proficiency in the upper-level courses, as there are fewer items to gauge their abilities. Considering the analysis of the instructor ratings and the distribution of items across course levels, this study provides weak evidence concerning the appropriateness of the cut-off scores for the Chinese placement test. Although the instructors' perceptions of item relevance align with the established cut-off scores for the intended course levels, the imbalanced distribution of items across course levels raises concerns regarding the test's accuracy in assessing students' language proficiency, particularly for upper-level courses. In light of the findings, several suggestions can be made to improve the Chinese placement test and enhance the evidence for appropriate test score uses and interpretations:
1. Reevaluate and revise the test items to ensure a more balanced distribution across all course levels. A more even distribution of items will help in accurately assessing students' language proficiency for upper-level courses and lead to more precise course placements.
2. Conduct a comprehensive review of the test items, taking into account the variations in instructors' perceptions of item relevance and appropriateness. This process may involve revising, removing, or adding items to better align with the course levels and reduce variation in instructors' perceptions.
3. Regularly update and review the test content to ensure that it reflects evolving course content and student profiles, which will help maintain the accuracy and relevance of the test over time.
By implementing these suggestions, the Chinese placement test can be improved, leading to a better assessment of students' language proficiency and more accurate course placements. Furthermore, these improvements can contribute to the overall quality of Chinese language programs, as accurate placement of students promotes more effective teaching and learning experiences.
[RQ 7: Consequence implication inference]: Does the test have positive effects on Chinese teaching and learning?

This research question explores the consequence implication inference for the Chinese placement test, focusing on the test's effects on Chinese teaching and learning. To address it, a qualitative analysis of feedback from students and instructors, collected through interviews, was conducted. The findings provide strong evidence that the test has positive effects on Chinese language instruction and on students' learning experiences. Students generally reported that the test results were accurate and helpful for course selection, while instructors found it easier to manage students placed at appropriate levels. However, the results also revealed that some students viewed the test results as a suggestion rather than a strict guideline and made their own decisions about their level of comfort and willingness to engage with the course material. It is crucial to acknowledge that instances of students not being placed at the most suitable level are not always due to misplacement, but rather to the choices these students make. Program administrators and course instructors can provide recommendations to students regarding placement decisions based on the match between course difficulty and students' current proficiency level. However, students weigh other considerations, such as whether the course credits will be recognized by their major or program, as this can have a direct impact on their graduation timeline. For instance, in the interviews, an instructor shared that an engineering student was advised to take a 200-level class based on the placement test results and the Chinese program coordinator's recommendation, but he insisted on taking a 300-level course because the 200-level course credits would not be useful for his major, and he needed to graduate. In the end, he did not take the Chinese class. Moreover, students' motivation to learn a foreign language may also influence their final course selection, which can indirectly affect the effectiveness of the Chinese placement test. Some students are not highly motivated to learn a foreign language and take the course only to fulfill the university's foreign language requirement. As a result, even if their proficiency is higher, they may choose to enroll in lower-level courses to minimize effort and secure easy credits. Consequently, classes may consist of students with mixed proficiency levels, which presents challenges for course instructors. These challenges, revealed through the interviews, underscore the limitations of the Chinese placement test in accommodating students with varying levels of proficiency. Instructors reported having to tailor course difficulty to accommodate the majority of students and to follow their learning pace, which can be demanding. Moreover, the lack of teaching-assistant support due to the pandemic has further exacerbated these challenges, highlighting the need for additional resources and strategies to support both instructors and students in diverse classroom settings. Building on the identified challenges, several recommendations can be considered to enhance the effectiveness of the Chinese placement test and better support instructors and students. While increasing the number of teaching assistants may not always be feasible, exploring alternative support resources, such as peer tutoring or online materials, could help manage classes with diverse proficiency levels more effectively. For students who have taken a break from their studies or experienced a decline in their language proficiency, alternative support strategies could be offered. These may include providing access to supplementary learning resources, creating customized study plans, or allowing students to audit lower-level classes alongside their current courses to bridge proficiency gaps. Ultimately, fostering clear communication among students, program coordinators, and instructors, for example through in-person interviews, is essential to ensure the best possible learning outcomes for all students. By maintaining open dialogue and considering individual factors, a more nuanced and effective approach to course placement can be achieved, maximizing the benefits of the Chinese placement test for both teaching and learning.
Limitations and future research directions

The current study has several limitations and points to areas for future research. First, validation should be an ongoing endeavor: this study highlighted specific psychometric issues within the Chinese placement test, which might stem from factors such as ambiguous item phrasing, poorly selected distractors, and information in the prompts that is incongruent with the intended answers. Although revisions were proposed, it is not clear to what extent they will resolve the issues. Ideally, the study would have collected more data using a revised version of the test that incorporated the suggested revisions and re-run the analyses to investigate whether the revisions effectively addressed the identified issues. However, this limitation is closely related to the small sample size and the tight timeline, which hinder the robustness of the Rasch analysis and underscore the need for future studies to address a range of factors, including sample size and time constraints, to strengthen the validation process. Second, the relatively small sample size of 28 students also limits the generalizability of the results. The participants in the current study were drawn from three course levels and three instructors at Michigan State University. This specific context may limit the generalizability of the identified issues and of the results yielded in the study to Chinese language programs at other institutions. Thus, when interpreting the results, readers should consider the specific context in which the study was conducted, such as the unique characteristics of the Chinese language program at Michigan State University, the specific course materials used, and the teaching approaches employed by the course instructors. Future research should not only aim to include larger sample sizes but also address other factors, such as diversity in participants' backgrounds and institutional contexts, to enhance the generalizability and reliability of the findings. Third, a related limitation is the inability to carry out the originally planned analysis related to the utilization inference, which was intended to examine whether the cut-off scores are set appropriately. This analysis involved comparing the class performance of students placed by the placement test with that of students who did not take the test; due to the small sample size, it could not be carried out. Future research should examine the performance of students placed by the test in comparison with those who did not take the test, as this would provide additional insights into the effectiveness of the placement test and the appropriateness of the cut-off scores. A fourth limitation is the potential impact of the COVID-19 pandemic on data collection and student experiences. While data collection took place in Spring 2022, when most classes were in person, it remains unclear how the pandemic-induced shift to online instruction may have affected students' learning motivation, course enrollment, and language performance. Participants in this study might have experienced different language learning trajectories compared to students who did not face the challenges posed by the pandemic. Some students indicated in interviews that they felt foreign language courses were significantly affected by the shift to online instruction, as it disrupted the interactive nature of in-person classes. Future research should therefore investigate the potential impact of the COVID-19 pandemic on language learning, placement test accuracy, and students' language learning experiences. In light of these limitations, future research should also expand the scope of the study to include different contexts and a broader range of students, which would enhance the generalizability of the findings. By addressing these points in future research, a clearer understanding of the validity evidence for the score uses and interpretations of the Chinese placement test can be achieved, ultimately benefiting students and educators in the field of Chinese language learning.
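To make the sample-size limitation concrete, the short R sketch below estimates what samples of this size can and cannot detect. It uses the pwr package as an assumed convenience; the numbers are illustrative and are not part of the dissertation's reported analyses.

```r
# Minimal sketch: sensitivity of the study's sample sizes.
# Illustrative only; not part of the reported analyses.
library(pwr)

# Smallest correlation detectable with 80% power at alpha = .05 when n = 28
pwr.r.test(n = 28, sig.level = 0.05, power = 0.80)

# Power of a paired comparison with n = 6 (the 300-level group),
# assuming a large standardized effect (d = 0.8)
pwr.t.test(n = 6, d = 0.8, sig.level = 0.05, type = "paired")
```

Calculations of this kind could guide the sample sizes targeted in the follow-up validation work proposed above.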
CHAPTER 6: CONCLUSIONS

In conclusion, with this study I aimed to provide a comprehensive examination and evaluation of the test score uses and interpretations for the listening and reading sections of an in-house, college-level Chinese placement test. The primary goal was to address the existing gaps in the literature, particularly the limited discussion on tests in languages other than English and the methodological constraints of previous research. To achieve this, I utilized an argument-based validation framework, collecting and evaluating both quantitative and qualitative validity evidence. By employing mixed methods, I sought to thoroughly assess the functioning of the test items, identifying any problematic items and proposing revisions as necessary. Additionally, I aimed to use the empirical findings to inform improvements to the placement test, to evaluate the strength of the validity argument based on the collected evidence, and to offer a comprehensive analysis of the test score interpretations and uses. The findings of this study contribute to the larger discussion on the practices of foreign language assessment and argument-based test validation. Furthermore, the research offers valuable insights into the ongoing development of validity research in the field of second language testing. By providing a comprehensive examination of the Chinese placement test, this study helps to enhance the understanding of test score uses and interpretations, supporting more effective and reliable language placement decisions for students in higher education settings.

REFERENCES

ACTFL. (2012). ACTFL proficiency guidelines. ACTFL. https://www.actfl.org/uploads/files/general/ACTFLProficiencyGuidelines2012.pdf AERA, APA, & NCME. (2014). Standards for educational and psychological testing. American Educational Research Association. Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning and assessment. Continuum. Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford University Press. Baralt, M. (2012). Coding qualitative data. In A. Mackey & S. M. Gass (Eds.), Research methods in second language acquisition (pp. 222-244). Wiley-Blackwell. Becker, A. (2018). Not to scale? An argument-based inquiry into the validity of an L2 writing rating scale. Assessing Writing, 37, 1–12. https://doi.org/10.1016/j.asw.2018.01.001 Bernhardt, E. B., Rivera, R. J., & Kamil, M. L. (2004). The practicality and efficiency of web-based placement testing for college-level language programs. Foreign Language Annals, 37(3), 356–365. https://doi.org/10.1111/j.1944-9720.2004.tb02694.x Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Lawrence Erlbaum Associates. Calamia, M., Markon, K., & Tranel, D. (2013). The robust reliability of neuropsychological measures: Meta-analyses of test–retest correlations. The Clinical Neuropsychologist, 27(7), 1077-1105. https://doi.org/10.1080/13854046.2013.809795 Campbell, R. (2000). Heritage language. In J. W. Rosenthal (Ed.), Handbook of undergraduate second language education (pp. 165–184). Lawrence Erlbaum Associates. Chalhoub-Deville, M., & Deville, C. (2018). Revisiting language testing validation: Empirical, analytical, and theoretical considerations. In E. Shohamy, I. Or, & S. May (Eds.), Language testing and assessment (pp. 33-48). Springer. Chapelle, C. A. (2012).
Validity argument for language assessment: The framework is simple…. Language Testing, 29(1), 19–27. https://doi.org/10.1177/0265532211417211 Chapelle, C. A. (2020). Argument-based validation in testing and assessment. SAGE Publications. Chapelle, C. A., Cotos, E., & Lee, J. (2015). Validity arguments for diagnostic assessment using automated writing evaluation. Language Testing, 32(3), 385–405. https://doi.org/10.1177/0265532214565386 124 Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2008). Building a validity argument for the test of English as a foreign language. Routledge. Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2010). Does an argument-based approach to validity make a difference? Educational Measurement: Issues and Practice, 29(1), 3–13. https://doi.org/10.1111/j.1745-3992.2009.00165.x Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159. https://doi.org/10.1037/0033-2909.112.1.155 Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Rand McNally. Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443–507). American Council on Education. DeMars, C. (2010). Item response theory. Oxford University Press. DiCicco‐Bloom, B., & Crabtree, B. F. (2006). The qualitative research interview. Medical education, 40(4), 314-321. https://doi.org/10.1111/j.1365-2929.2006.02418.x Downing, S. M. (2006). Twelve steps for effective test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 3-25). Lawrence Erlbaum Associates. Downing, S. M., & Haladyna, T. M. (Eds.). (2006). Handbook of test development. Lawrence Erlbaum Associates. Duncan, P. W., Bode, R., Lai, S. M., & Perera, S. (2003). Rasch analysis of a new stroke-specific outcome scale: The stroke impact scale. Archives of Physical Medicine and Rehabilitation, 84(7), 953. https://doi.org/10.1016/S0003-9993(03)00035-2 Eckes, T. (2015). Introduction to many facet Rasch measurement: Analyzing and evaluating rater mediated assessment (2nd ed.). Peter Lang. Eda, S., Itomitsu, M., & Noda, M. (2008). The Japanese skills test as an on-demand placement test: Validity comparisons and reliability. Foreign Language Annals, 41(2), 218–236. https://doi.org/10.1111/j.1944-9720.2008.tb03290.x Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Lawrence Erlbaum Associates. Fan, C. W., Yazdani, F., Carstensen, T., & Bonsaksen, T. (2021). Rasch analysis of the self- efficacy for therapeutic use of self questionnaire in Norwegian occupational therapy students. Scandinavian Journal of Occupational Therapy, 28(4), 274-284. https://doi.org/10.1080/11038128.2020.1726453 125 Ferne T., Rupp A. A. (2007). A synthesis of 15 years of research on DIF in language testing: Methodological advances, challenges and recommendations. Language Assessment Quarterly, 4(2), 113–148. https://doi.org/10.1080/15434300701375923 Fisher, W. P. (1992). Reliability statistics. Rasch Measurement Transactions, 6(3), 238. Friedrich, F., Konietschke, F., & Pauly, M. (2019a). MANOVA.RM: Resampling-Based Analysis of Multivariate Data and Repeated Measures Designs [Computer software]. Version 0.4.1. http://github.com/smn74/MANOVA.RM Friedrich, S., Konietschke, F. & Pauly, M. (2019b). Resampling-based analysis of multivariate data and repeated measures designs with the R package MANOVA.RM. The R Journal, 11(2), 380–400. https://doi.org/10.32614/RJ-2019-051 Galletta, A. (2013). 
Mastering the semi-structured interview and beyond: From research design to analysis and publication. New York University Press. https://doi.org/10.18574/nyu/9780814732939.001.0001 Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334. https://doi.org/10.1207/S15324818AME1503_5 Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. SAGE Publications. Heilenman, L. (1983). The use of a cloze procedure in foreign language placement. The Modern Language Journal, 67(2), 121–126. https://doi.org/10.1111/j.1540-4781.1983.tb01482.x Hu, X., Jiang, Y., & Bi, H. (2022). Measuring science self-efficacy with a focus on the perceived competence dimension: using mixed methods to develop an instrument and explore changes through cross-sectional and longitudinal analyses in high school. International Journal of STEM Education, 9(1), 47. https://doi.org/10.1186/s40594-022-00363-x In'nami, Y., & Koizumi, R. (2016). Factor structure of the revised TOEIC test: A multiple- sample analysis. Language Testing, 33(1), 99-119. https://doi.org/10.1177/0265532211413444 Isbell, D. R., Winke, P. M., & Gass, S. M. (2019). Using the ACTFL OPIc to assess and monitor progress in a tertiary foreign languages program. Language Testing, 36(3), 439–465. https://doi.org/10.1177/0265532218798139 Jefferies, P., Bremer, E., Kozera, T., Cairney, J., & Kriellaars, D. (2020). Psychometric properties and construct validity of PLAYself: a self-reported measure of physical literacy for children and youth. Applied Physiology, Nutrition, and Metabolism, 46(6), 579-588. https://doi.org/10.1139/apnm-2020-0410 Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527–535. https://doi.org/10.1037/0033-2909.112.3.527 126 Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 17–64). Greenwood. Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 74–83. https://doi.org/10.1111/jedm.12001 Kane, M., Crooks, T., & Cohen, A. (2005). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5-17. https://doi.org/10.1111/j.1745- 3992.1999.tb00010.x Kenyon, D. M., & Malabonga, V. (2001). Comparing examinee attitudes toward computer- assisted and other oral proficiency assessments. Language Learning and Technology, 5(2), 60–83. https://www.lltjournal.org/item/2357 Knoch, U., & Chapelle, C. A. (2018). Validation of rating processes within an argument-based framework. Language Testing, 35(4), 477–499. https://doi.org/10.1177/0265532217710049 Kondo–Brown, K. (2005). Differences in language skills: Heritage language learner subgroups and foreign language learners. The Modern Language Journal, 89(4), 563-581. https://doi.org/10.1111/j.1540-4781.2005.00330.x Kunnan, A. J. (2000). Fairness and validation in language assessment. In A. J. Kunnan (Ed.), Fairness and validation in language assessment: Selected papers from the 19th Language Testing Research Colloquium (pp. 1-16). Cambridge University Press. Kwoh, S. (2007). Mainstreaming and professionalizing Chinese-language education: A new mission for a new century. Chinese America: History and Perspectives, 261-265. LaFlair, G. T., & Staples, S. (2017). 
Using corpus linguistics to examine the extrapolation inference in the validity argument for a high-stakes speaking assessment. Language Testing, 34(4), 451–475. https://doi.org/10.1177/0265532217713951 Li, D., & Duff, P. (2008). Issues in Chinese heritage language education and research at the postsecondary level. In He, A. W., & Xiao. Y. (Eds.), Chinese as a heritage language: Fostering rooted world citizenry (pp.13–32). National Foreign Language Resource Center. Linacre, J. M. (1998). Structure in Rasch residuals: Why principal components analysis (PCA)? Rasch Measurement Transactions, 12, 636. https://www.rasch.org/rmt/rmt122m.htm Linacre, J. M. (2012). A user’s guide to Winsteps Ministeps Rasch-model computer programs [version 4.7.1]. Retrieved from http://www.winsteps.com/index.htm Linacre, J. M. (2016). WINSTEPS (Version 4.7.1) [Computer program]. Chicago: MESA Press. 127 Long, A. Y., Shin, S.-Y., Geeslin, K., & Willis, E. W. (2018). Does the test work? Evaluating a web-based language placement test. Language Learning & Technology, 22(1), 137–156. https://dx.doi.org/10125/44585 Ma, W., & Winke, P. (2019). Self-assessment: How reliable is it in assessing oral proficiency overtime? Foreign Language Annals, 52(1), 66-86. https://doi.org/10.1111/flan.12379 Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13- 103). Macmillan. Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741 Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3– 62. https://doi.org/10.1207/S15366359MEA0101_02 Mozgalina, A. & Ryshina–Pankova, M. (2015). Meeting the challenges of curriculum construction and change: Revision and validity evaluation of a placement test. The Modern Language Journal, 99(2), 346-370. https://doi.org/10.1111/modl.12217 Norris, J. M. (2004). Validity evaluation in foreign language assessment [Unpublished doctoral dissertation]. Georgetown University. Ockey, G. J. (2009). Developments and challenges in the use of computer‐based testing for assessing second language ability. The Modern Language Journal, 93, 836-847. https://doi.org/10.1111/j.1540-4781.2009.00976.x Sireci, S.G. (1998). The construct of content validity. Social Indicators Research, 45, 83-117. https://doi.org/10.1023/A:1006985528729 Sohn, S. (2004). Placement of Korean heritage speakers: Challenges and strategies. Unpublished invited lecture, Center for Korean Research, University of British Columbia. Tai, J. H. (1994). Chinese classifier systems and human categorization. In M. Chen (Ed), In honor of William S. Y. Wang: Interdisciplinary studies on language and language change (pp. 479-494). Taipei: Pyramid Press. Tigchelaar, M., Bowles, R., Winke, P., & Gass, S. (2017). Assessing the validity of ACTFL can- do statements for spoken proficiency. Foreign Language Annals, 50(3), 584–600. https://doi.org/10.1111/flan.12286 Toulmin, S. ([1958] 2012). The uses of argument. Cambridge University Press. https://doi.org/10.1017/CBO9780511840005 Toulmin, S. (2001). Return to reason. Harvard University Press. 128 van de Watering, G., & van der Rijt, J. (2006). 
Teachers’ and students’ perceptions of assessments: A review and a study into the ability and accuracy of estimating the difficulty levels of assessment items. Educational Research Review, 1(2), 133-147. https://doi.org/10.1016/j.edurev.2006.05.001 Wang, W. C., & Wilson, M. (2005). The Rasch testlet model. Applied Psychological Measurement, 29(2), 126–149. https://doi.org/10.1177/0146621604271053 Wei, L., & Hua, Z. (2010). Voices from the diaspora: Changing hierarchies and dynamics of Chinese multilingualism. International Journal of the Sociology of Language, 205, 155–171. https://doi.org/10.1515/ijsl.2010.043 Winke, P., Zhang, X., & Pierce, S. J. (2022). A closer look at a marginalized test method: Self-assessment as a measure of speaking proficiency. Studies in Second Language Acquisition. Advance online publication. https://doi.org/10.1017/S0272263122000079 Winke, P., Zhang, X., Rubio, F., Gass, S., Soneson, D., & Hacking, J. (2018). The proficiency profile of language students: Implications for programs. Second Language Research & Practice, 1(1). https://doi.org/10125/69840 Wolfe, E. W., & Smith, E. V. (2007). Instrument development tools and activities for measure validation using Rasch models: Part II–Validation activities. Journal of Applied Measurement, 8, 204–234. Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370. Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Mesa Press. Xi, X. (2010). How do we go about investigating test fairness? Language Testing, 27(2), 147-170. https://doi.org/10.1177/0265532209349465 Yan, X., & Staples, S. (2020). Fitting MD analysis in an argument-based validity framework for writing assessment: Explanation and generalization inferences for the ECPE. Language Testing, 37(2), 189–214. https://doi.org/10.1177/0265532219876226 Youn, S. J. (2015). Validity argument for assessing L2 pragmatics in interaction using mixed methods. Language Testing, 32(2), 199–225. https://doi.org/10.1177/0265532214557113 Zhang, X., Winke, P., & Clark, S. (2020). Background characteristics and oral proficiency development over time in lower-division college foreign language programs. Language Learning, 70(3), 807-847. https://doi.org/10.1111/lang.1239

APPENDIX 1: INSTRUCTOR INTERVIEW QUESTIONS
1. In your opinion, what were the essential skills that students need to have for successful performance in the courses you teach?
2. Do you feel the difficulty of the course was appropriate given the language proficiency levels of all students?
3. Do you feel some students were misplaced to your class? What did you do to accommodate these students?
4. Are you aware of the placement procedures for the Chinese language courses?
5. After reviewing the placement test, do you agree that what is assessed in the test is targeting and representative of the content covered in your class?
6. Is there anything you feel is important in your language class but is missing from the test?

APPENDIX 2: STUDENT QUESTIONNAIRE
Personal information survey
1. Your first name: __________; Your last name: ________________
2. Your MSU PID (the number found under your name on your student ID, starting with ‘A’ and then 8 digits): ________________________
3. Your email address: __________________
4. Your age: ___________________________
5. Your gender: ___________________________
6. What is your major: ___________________________
7. Year of graduation: _________________
Questions on students’ perception of the placement test results
1. Have you taken the Michigan State University Chinese placement test (https://msu.co1.qualtrics.com/jfe/form/SV_a2C5uBOWKlTCdoN)? Yes/No
2. Which Chinese course are you taking this semester? CHS101/CHS102/CHS201/CHS202/CHS301/CHS302/CHS401/CHS402
3. According to the test result, which Chinese course were you placed into? CHS101/CHS102/CHS201/CHS202/CHS301/CHS302
4. Your GPA in the course to which you were placed?
5. For the course to which you were placed, please rate your performance in the following categories on a scale of 5 (1 - unacceptable; 2 - needs improvement; 3 - meets expectations; 4 - exceeds expectations; 5 - outstanding):
● Your overall preparedness for the course before the course started
● Your overall course performance
● Your overall Chinese proficiency
● Please provide any comments if you have any.
6. For the course to which you were placed, please rate the overall difficulty of the course on a scale of 5 (1 - very easy; 2 - easy; 3 - medium; 4 - difficult; 5 - very difficult). Could you please elaborate on your selection? Which aspects of the course did you find easy, medium, or difficult?
7. For the course to which you were placed, in your opinion what were the essential skills for successful performance?

APPENDIX 3: STUDENT INTERVIEW QUESTIONS
1. Have you taken the Michigan State University Chinese placement test (https://msu.co1.qualtrics.com/jfe/form/SV_a2C5uBOWKlTCdoN)?
2. According to the test result, which Chinese course(s) are you taking right now?
3. If you have completed one or more Chinese language courses, what are your GPAs for the course(s)?
4. For the course to which you were placed, could you please comment on:
● Your overall course performance
● Your overall Chinese proficiency
● Your overall preparedness for the course before the course started
5. For the course to which you were placed, do you think the overall difficulty of the course was appropriate given your Chinese proficiency level? In other words, do you feel the course was too easy or too difficult given your Chinese proficiency level?
6. For the course to which you were placed, in your opinion what were the essential skills for successful performance?

APPENDIX 4: ITEMS LOADED ON THE SAME DIMENSION
Figure 20. Read the email from Li Ming to Ma Ke. Then answer Questions 7 to 11 below.
Reading #7. On what date, does the letter written? a. September 8, 2015 b. February 28, 2014 c. August 27, 2015
Reading #8. How long has it been since their last correspondence? a. Three weeks b. Two months c. Four months
Reading #9. What kinds of movies has Li Ming seen recently? a. French, American and Chinese b. American, British and Chinese c. Italian, Russian and Chinese
Reading #10. What kinds of sports has the writer done lately? a. Soccer and jogging b. Basketball and swimming c. Football and swimming
Reading #11. Based on this letter, how well do you think the two know each other? a. They are good friends who see each other very often. b. They have never met each other, but they are relatives. c. They are pen pals who are not familiar with each other.

APPENDIX 5: ITEM RELEVANCE AND DIFFICULTIES
Table 21.
Descriptive statistics for students’ ratings for item relevance and difficulties Relevance Difficulties Item Total 100 200 300 Total 100 200 300 ID L01 4.64 4.23 5.33 4.5 3.11 3.54 2.78 2.67 L02 4.79 4.38 5.33 4.83 2.89 3.08 3 2.33 L03 4.46 4.15 5 4.33 3.11 3.54 2.78 2.67 L04 4.39 3.92 4.89 4.67 2.71 3.15 2.67 1.83 L05 4.5 5.15 4.22 3.5 1.93 2.15 2 1.33 L06 4.39 4.38 4.78 3.83 2.11 2.38 2 1.67 L07 4.79 4.77 4.67 5 2.54 2.38 2.89 2.33 L08 4.71 4.92 5 3.83 2.93 2.69 3.22 3 L09 4.68 4.85 5 3.83 3.14 3 3.44 3 L10 4.71 4.92 5 3.83 3.14 2.92 3.56 3 L11 4.5 4.54 4.78 4 3.79 4 3.56 3.67 L12 4.04 3.69 4.44 4.17 4.39 4.77 4.11 4 L13 4.11 3.92 4.22 4.33 4.21 4.15 4.11 4.5 L14 3.79 3.15 4.56 4 4.5 4.38 4.44 4.83 R01 3.25 2.38 4 4 3.29 4.31 3 1.5 R02 3.86 3.38 4.67 3.67 3.86 4.46 3.67 2.83 R03 4.93 5.46 5.33 3.17 2.46 2.54 2.67 2 R04 4.61 5.15 4.78 3.17 3.21 3.46 3 3 R05 5.18 5.54 5.56 3.83 1.68 1.85 1.78 1.17 R06 5.07 5.46 5.33 3.83 1.64 1.85 1.67 1.17 R07 5.04 5.54 5 4 1.61 1.69 1.89 1 R08 5.11 5.54 5.22 4 1.82 2 2.11 1 R09 5.07 5.54 5.11 4 1.79 2.08 1.78 1.17 R10 5.11 5.46 5.33 4 1.75 2 1.89 1 R11 4.79 5 5 4 2.46 2.46 2.89 1.83 R12 4.5 4 5.44 4.17 3.18 3.85 2.67 2.5 R13 4.64 4.23 5.44 4.33 3.07 3.46 3 2.33 134 Table 21 (cont’d) R14 4.93 4.77 5.56 4.33 2.61 3.08 2.22 2.17 R15 5.11 4.92 5.78 4.5 2.14 2.62 1.89 1.5 R16 5.25 5.38 5.67 4.33 2.32 2.85 2.11 1.5 R17 5.29 5.46 5.56 4.5 1.82 2 1.89 1.33 R18 5.29 5.38 5.67 4.5 2.07 2.31 2.22 1.33 135 APPENDIX 6: RESULTS OF DIF Table 22. Results of DIF: The Mantel-Haenszel test results by item Item ID Mantel-Haenszel χ2 p-value Adj. p-value L01 0.71 .4 .9 L02 0.39 .53 .9 L03 4.55 .03 .66 L04 0.02 .9 .95 L05 2.56 .11 .66 L06 0.29 .59 .9 L07 0.04 .84 .95 L08 0.25 .62 .9 L09 3.27 .07 .66 L10 0.01 .93 .95 L11 0.08 .77 .95 L12 0.25 .62 .9 L13 3.07 .08 .66 L14 0.52 .47 .9 R01 2.38 .12 .66 R02 < .01 .94 .95 R03 0.21 .65 .9 R04 < .01 .95 .95 R05 0.03 .86 .95 R06 0.97 .32 .9 R07 0.69 .4 .9 R08 0.27 .6 .9 R09 0.18 .67 .9 R10 0.02 .89 .95 R11 1.62 .2 .84 R12 1.41 .24 .84 R13 2.55 .11 .66 R14 1.52 .22 .84 R15 0.42 .52 .9 136 Table 22 (cont’d) R16 1.23 .27 .86 R17 0.55 .46 .9 R18 0.41 .52 .9 Note: Multiple comparisons made with Benjamini-Hochberg adjustment of p-values 137 APPENDIX 7: MISFITTING ITEMS AND PROPOSED REVISIONS Figure 21. Reading item #3. If you want to order soup, how many choices do you have? a. 3 b. 4 c. 5 Suggested revisions: Possible revision 1: Replacing soup with fried rice in the stem: If you want to order fried rice, how many choices do you have? a. 3 b. 4 c. 5 138 Possible revision 2: Revise the dish on the menu that causes ambiguity and confusion Figure 22. Reading item #3. If you want to order soup, how many choices do you have? Figure 23. Reading #6. Here is a message that Xiao Li sent to Lao Wang. Please answer the following questions after reading the note: At what time, should they meet? a. 5:30 PM b. 6:45 PM c. 7:00 PM Suggested revisions: a. 6:03 PM b. 6:30 PM c. 6:45 PM 139 Figure 24. Reading #2. Is this a sign for? a. 公车时刻表 b.商场上班时间 c.飞行时间 a. a bus schedule b. shopping mall hours c. Flight hours Suggested revisions: a. 公车时刻表 b.餐厅开放时间 c.飞行时间 a. a bus schedule b. restaurant hours c. Flight hours Reading #12. 教室有几 ( ) 椅子? How many (classifier needed) chairs are there in the classroom? a. 张 b. 条 c. 把 a. zhāng b. tiáo c. bǎ Suggested revisions: a. 只 b. 条 c. 把 a. zhī b. tiáo c. bǎ 140 APPENDIX 8: POST-HOC TEST RESULTS (TOTAL) Table 23. 
All post-hoc pair-wise comparisons results (total scores) Contrast t-value SE df p-value Cohen’s d Time 2 100-level - Time 1 100-level 5.9 0.92 25 <.001 1.67 Time 1 200-level - Time 1 100-level 7.1 1.77 35.7 .005 1.68 Time 2 200-level - Time 1 100-level 8.5 1.77 35.7 <.001 2.01 Time 1 300-level - Time 1 100-level 9 1.95 35.7 <.001 2.13 Time 2 300-level - Time 1 100-level 9.4 1.95 35.7 <.001 2.22 Time 1 200-level - Time 2 100-level 1.2 1.77 35.7 1 0.28 Time 2 200-level - Time 2 100-level 2.6 1.77 35.7 1 0.62 Time 1 300-level - Time 2 100-level 3.2 1.95 35.7 1 0.76 Time 2 300-level - Time 2 100-level 3.5 1.95 35.7 1 0.83 Time 2 200-level - Time 1 200-level 1.4 1.21 25 1 0.4 Time 1 300-level - Time 1 200-level 2 2.16 35.7 1 0.47 Time 2 300-level - Time 1 200-level 2.3 2.16 35.7 1 0.54 Time 1 300-level - Time 2 200-level 0.6 2.16 35.7 1 0.14 Time 2 300-level - Time 2 200-level 0.9 2.16 35.7 1 0.21 Time 2 300-level - Time 1 300-level 0.3 1.4 25 1 0.08 141 APPENDIX 9: POST-HOC TEST RESULTS (LISTENING) Table 24. All post-hoc pair-wise comparisons results (listening scores) Contrast t-value SE df p-value Cohen’s d Time 2 100-level - Time 1 100-level 2.4 0.7 25 .03 0.68 Time 1 200-level - Time 1 100-level 3.1 1.09 42 .09 0.68 Time 2 200-level - Time 1 100-level 3.8 1.09 42 .02 0.83 Time 1 300-level - Time 1 100-level 3.5 1.2 42 .09 0.76 Time 2 300-level - Time 1 100-level 3.5 1.2 42 .09 0.76 Time 1 200-level - Time 2 100-level 0.7 1.09 42 1 0.15 Time 2 200-level - Time 2 100-level 1.3 1.09 42 1 0.28 Time 1 300-level - Time 2 100-level 1 1.2 42 1 0.22 Time 2 300-level - Time 2 100-level 1 1.2 42 1 0.22 Time 2 200-level - Time 1 200-level 0.6 0.93 25 1 0.17 Time 1 300-level - Time 1 200-level 0.3 1.33 42 1 0.07 Time 2 300-level - Time 1 200-level 0.3 1.33 42 1 0.07 Time 1 300-level - Time 2 200-level -0.3 1.33 42 1 -0.07 Time 2 300-level - Time 2 200-level -0.3 1.33 42 1 -0.07 Time 2 300-level - Time 1 300-level 0 1.07 25 1 0 142 APPENDIX 10: POST-HOC TEST RESULTS (READING) Table 25. All post-hoc pair-wise comparisons results (readings scores) Contrast t-value SE df p-value Cohen’s d Time 2 100-level - Time 1 100-level 3.4 0.66 25 <.001 0.96 Time 1 200-level - Time 1 100-level 3.9 0.97 44.4 .003 0.83 Time 2 200-level - Time 1 100-level 4.7 0.97 44.4 <.001 1 Time 1 300-level - Time 1 100-level 5.6 1.07 44.4 <.001 1.19 Time 2 300-level - Time 1 100-level 5.9 1.07 44.4 <.001 1.25 Time 1 200-level - Time 2 100-level 0.5 0.97 44.4 1 0.11 Time 2 200-level - Time 2 100-level 1.3 0.97 44.4 1 0.28 Time 1 300-level - Time 2 100-level 2.1 1.07 44.4 .76 0.45 Time 2 300-level - Time 2 100-level 2.5 1.07 44.4 .38 0.53 Time 2 200-level - Time 1 200-level 0.8 0.88 25 1 0.23 Time 1 300-level - Time 1 200-level 1.6 1.18 44.4 1 0.34 Time 2 300-level - Time 1 200-level 2 1.18 44.4 1 0.42 Time 1 300-level - Time 2 200-level 0.9 1.18 44.4 1 0.19 Time 2 300-level - Time 2 200-level 1.2 1.18 44.4 1 0.25 Time 2 300-level - Time 1 300-level 0.3 1.01 25 1 0.08 143