THE ROLE OF NONVERBAL BEHAVIOR AND AFFECT ON RATINGS OF SECOND LANGUAGE PROFICIENCY

By

John Dylan Burton

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Second Language Studies — Doctor of Philosophy

2023

ABSTRACT

A long-standing problem in applied linguistics is how to account for nonverbal behavior in models of second language (L2) communicative ability (Canale, 1983; Canale & Swain, 1980; Celce-Murcia, 2007; Galaczi & Taylor, 2018; Hymes, 1972). Attempts have been made to incorporate nonverbal behavior into some of these models, but they generally account only for strategic and interactional competences rather than the full range of information these behaviors can convey. It is well established, however, that nonverbal behavior is fundamental to spoken, face-to-face communication (Hall et al., 2019; Hall & Knapp, 2013; Matsumoto et al., 2016), conveying semantic, cognitive, affective, and social-interactional information. Affect is one of the most important types of information signaled by nonverbal behavior, especially through facial movements (Knapp et al., 2013), conveying a range of emotive, orientational, and stance-related information. Nonetheless, language tests rarely account for this vital visual realm of information in their constructs or rating scales, despite research showing that it is meaningful to raters when formulating impressions of language proficiency (Choi, 2022; Ducasse & Brown, 2009; Jenkins & Parra, 2003; May, 2009, 2011; Nakatsuhara et al., 2021a; Nambiar & Goon, 1993; Neu, 1990; Orr, 2002; Sato & McNamara, 2019; Thompson, 2016). To date, few studies have observed measurable effects of nonverbal behavior or affect on impressions of language proficiency (Chong & Aryadoust, 2022; Kim et al., 2023; Nagle, 2022; Trofimovich et al., 2021; Tsunemoto et al., 2022). To address this research gap, I designed a research study that triangulated ratings of affect, measurements of nonverbal behavior, and cognitive interviews to determine the role of nonverbal behavior and its affective information as listeners formulate impressions of language proficiency while observing L2 speakers’ performances in a language test. I recruited 100 naïve, untrained raters to listen to and watch short video recordings of 30 test takers interacting with an examiner during an oral proficiency interview in a high-stakes speaking test. While listening and watching, the raters scored each speech sample on four categories of language—fluency, vocabulary, grammar, and comprehensibility—and ten categories of affect, covering dimensions of assuredness, involvement, and positivity. Following the rating activity, 20 of the raters took part in stimulated verbal recall sessions, which captured the raters’ thought processes while formulating their evaluations of L2 proficiency. In addition, I used iMotions, an automated, machine-learning-based software package, to extract measurements of nonverbal behavior from the speaking test samples in the form of engagement, attention, and valence. I also manually extracted speech and nonverbal behavior using multimodal annotations in ELAN. The study broadly found that nonverbal behaviors and affect can impact proficiency outcomes in different ways. Desirable, communicatively oriented behaviors such as mutual gaze, nodding, a forward-leaning posture, and representational gestures can convey confidence, engagement, and positive affect, which lead to differential outcomes in fluency, vocabulary, grammar, and comprehensibility.
Comprehensibility, for example, was most impacted by raters’ impressions of test taker engagement and by behaviors that conveyed approachability, such as smiling and nodding. Fluency, vocabulary, and grammar were most impacted by impressions of confidence and low anxiety, as well as more target-like attentional focus. The raters were especially attuned to detecting listening comprehension, and the negative impact of comprehension breakdowns could be moderated when test takers took an adaptable stance that showed a desire to communicate. Overall, nonverbal behavior was found to affect perceptions of language proficiency in complex, dynamic ways that were mediated by interactions with the social context, requiring holistic interpretations of its impact. By using naïve, untrained raters, this study has offered a glimpse into how non-linguists perceive language in real-world settings. It thus has implications for language testing practice, as L2 speaking tests generally ask raters to ignore what they see and award scores based on what they hear. Speaking tests need to account for the nonverbal and affective repertoires of test takers in their constructs, rating scales, and rater training in order to capture test takers’ full range of language ability. The raters’ focus on listening comprehension also has implications for a broader adoption of integrated speaking assessments, where listening and speaking are assessed together. Finally, the study has important implications for how applied linguists conceptualize L2 communicative competence, as nonverbal behavior plays a much more important role than is currently ascribed to it.

Copyright by JOHN DYLAN BURTON 2023

Two languages cancel each other out, suggests Barthes, beckoning a third. Sometimes our words are few and far between, or simply ghosted. In which case the hand, although limited by the borders of skin and cartilage, can be that third language that animates where the tongue falters.
—Ocean Vuong

ACKNOWLEDGEMENTS

This dissertation wraps up a brilliant four-year period at Michigan State University. I can safely say that I am walking away from this program with a new skillset and appreciation for knowledge that I did not have before. I have been continuously humbled by my mentors and teachers in this program, from whom I have learned so much. Even though my academic progress was rocked by COVID-19 in the second semester of my first year, the process of adapting and moving forward helped me pivot to this very project, which I may not have done otherwise. First, I would like to recognize the various agencies that have so kindly funded or aided in this dissertation project. Without financial assistance, this project would have been much more taxing to carry forward. I would like to thank the International English Language Testing System (IELTS) and Mina Patel for so kindly agreeing to provide the speech samples used in this study. These samples were an invaluable resource, and the inferences from this study are much stronger because of them. I would also like to thank the British Council Assessment Research Group, Duolingo, and The International Research Foundation (TIRF) for providing grants that allowed me to pay for participants, two research assistants, travel, and software for this project. I cannot say how thankful I am to my advisor, Professor Paula Winke. It really goes beyond words. From the day I started at Michigan State in May of 2019, Paula set out to socialize me into the world of academia.
She taught me to become a better writer, a better researcher, a more critical thinker, and a steward for the field through editorial practices. Paula is absolutely a role model; she has shown unwavering positivity and dedication in her role as a mentor. I have learned about academic writing, grant and funding applications, departmental budgets, applying for jobs, and so much more. Perhaps most important, though, she taught me how to be a mentor for others. I will take Paula’s lessons with me as I move forward, and I hope I can provide her level of mentorship to others in the future. I would also like to express my gratitude to my committee members. I feel extremely lucky to have been able to assemble such a stellar combination of experts: Dr. India Plough (language testing and nonverbal behavior), Dr. Aline Godfroid (SLA and psycholinguistics), Dr. Koen Van Gorp (language testing and task-based language teaching and assessment), and Dr. Ryan Bowles (psychological assessment and statistics). The advice my committee provided at my dissertation proposal defense was indispensable, and this project is the result of those modifications. I owe a huge debt of gratitude to Dr. India Plough. As a result of interviewing India as part of a class project in LLT-861, I was motivated to study nonverbal behavior in the realm of language assessment. She kindly lent me her time and wisdom in an independent study on the topic during the worst part of the summer lockdown in 2020. I admire her curiosity, passion, and expertise in this topic area. I would also like to thank so many others in the SLS program and at MSU. Dr. Shawn Loewen, Dr. Charlene Polio, and Dr. Peter De Costa have been outstanding teachers and colleagues to work with during these years. I am incredibly thankful for the kindness of Professor Monique Turner and her team at the CASE lab in the Department of Communications for allowing me to use iMotions in their lab. I would also like to thank SLS students and alumni for their ongoing camaraderie throughout the program. In particular, I would like to thank Robert Randez and Curtis Green-Eneix for allowing me to pilot my dissertation instruments with their students. Immense gratitude also goes to my research assistants Elena Gorshkova and Bethany Zulick, who were MA TESOL students when they assisted with my project. Finally, I would like to thank a few others from my life outside of the program. I am grateful to the team at Lancaster University, notably my MA supervisor Professor Luke Harding and Professor Tineke Brunfaut, for their kindness and dedication, and for setting me on this path towards a PhD back in 2016. I am thankful for my colleagues at the British Council who have supported me over the years and taught me that work can also be about fun: Dr. Victoria Clark, Professor Barry O’Sullivan, Dr. Jamie Dunlea, Sheryl Cooke, and Mina Patel. I would like to thank my close friends Jeffery Walker, Amber Davis, Christopher Schaechtel, Matthew Brizzi, Jeremy Dickerson, Ryan Moltz, Katherine Macnair, and Matteo Cavazos for being supportive throughout this dissertation and listening to me talk about this project endlessly. Last but not least, thank you, Hollis Griffin, for your love, support, and understanding throughout the most difficult months of writing up and completing this project.

TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION
CHAPTER 2: LITERATURE
CHAPTER 3: RESEARCH QUESTIONS
CHAPTER 4: METHOD
CHAPTER 5: AFFECT AND LANGUAGE PROFICIENCY
CHAPTER 6: NONVERBAL BEHAVIOR AND LANGUAGE PROFICIENCY
CHAPTER 7: NONVERBAL BEHAVIOR AND RATER COGNITION
CHAPTER 8: DISCUSSION
CHAPTER 9: CONCLUSION
REFERENCES
APPENDIX A: INFORMATION, CONSENT, AND NON-DISCLOSURE AGREEMENT
APPENDIX B: RATING STUDY SIGN-UP, INSTRUCTIONS, AND PRACTICE
APPENDIX C: FOLLOW-UP SURVEY
APPENDIX D: STIMULATED RECALL MATERIALS
APPENDIX E: COMMUNICATIONS TO PARTICIPANTS
APPENDIX F: ELAN TIER DESCRIPTIONS (ADAPTED FROM BURTON, 2021)
APPENDIX G: STUDY VARIABLES
APPENDIX H: CATEGORY STATISTICS
APPENDIX I: DESCRIPTIVE DATA FOR RATERS
APPENDIX J: INTERSECTIONS OF NONVERBAL BEHAVIOR

CHAPTER 1: INTRODUCTION

All life on earth dedicates sensory resources to perceiving the surrounding world. Sense enables life to preserve its existence by finding food, detecting threats, and reproducing. Life, however, does not exist in a solitary world. All beings, be they animals or even plants and fungi, interact with their world and each other by receiving and transmitting information in varying forms of communication. Communication may exist as words, sounds, signs, chemicals, smells, and possibly even electrical signals. Sense allows life to detect and decode information critical for survival. For humans and many animals, the primary modes of communication are visual and auditory, and meaning is conveyed and interpreted by hearing voices, sounds, barks, squeaks, and shrieks, and by seeing the direction of gaze, changes in posture, mouth movements, and other bodily actions. Humans are unique in their ability to communicate very specific meaning by using highly complex, structured language. Language, either in the form of sounds or signs, communicates meaning that is generally symbolic, intentional, propositional, and purpose-driven (Buck, 1984).
The use of language and choice of words can be automatized, but ultimately language choices are made by speakers with a certain degree of awareness. The body, however, is the existential foundation of human culture and sense perception (Bourdieu, 1977). Individuals pick up on stances, feelings, and orientations by seeing the body language of others. This body language, or nonverbal behavior, may convey a wide range of information about traits and states, and may also align with language to strengthen certain meanings or emphasize information. Communication occurring through nonverbal behavior is thought to occur largely spontaneously, automatically, and without attention, and, unlike linguistic forms, is also considered to be largely non-propositional and unbound from form-meaning relationships (Buck & VanLear, 2002). That is to say, a person’s hand making a rising motion may indicate a rising action, but it could also indicate any number of other relationships to other words or ideas. Interactant listeners receive these multiple channels of input and decode a wealth of information from speakers, ranging from desires and needs to intuitions about underlying feelings and goals. Nonverbal cues may enhance comprehension of these various lines of information in these interactants, thus serving as the “co-text” for the communicative event (Rost, 2016, p. 42). One defining feature of nonverbal behavior is its usefulness in communicating information about speakers’ personalities and affective stances (Burgoon et al., 2016; Knapp et al., 2014). These affective stances may include emotions, feelings, moods, attitudes, and orientations towards others, all of which are critical for communicators when establishing and maintaining interpersonal relationships. These affective stances furthermore combine with pragmatic acts to maintain (or disrupt) social harmony (Brown & Levinson, 1987; Roever, 2021). Nonverbal behavior allows us to monitor conversations and the impact of our words on others. We can often see that someone is happy by observing their smiles. We can also see that they are upset when they look away and frown. Interpreting the affective stances of others allows people to predict what someone might say or do, and it can clear up ambiguities when the verbal message might be interpreted in multiple ways (e.g., a smirk when the verbal message is intentionally ironic). The communication of affect may also be “contagious,” and emotions may be spread through nonverbal channels (Elfenbein, 2014; Hatfield et al., 1994). One only needs to think of a room of children in which one child begins laughing and suddenly the room erupts in laughter. These nonverbal channels may also synchronize between interactants (Hess & Fischer, 2013), as evidenced by people adopting similar gestural patterns together, or the co-occurrence of smiles in conversation. Where the broad interpretation of language is through semantics, the interpretation of nonverbal behavior is largely affective-interactional. If communication is largely holistic, comprising interwoven threads of verbal and nonverbal communication (Burgoon et al., 2016), it is unknown whether speaking skills can be assessed independently of what a person sees during speech. This is an important question for individuals involved in the assessment of speech, such as those who teach presentation skills, speech pathologists, and those who work with second language development.
Workers in these fields are interested in the abilities of people to speak in their first or second language, and assessors may need to produce scores that provide meaningful inferences about these abilities. It is possible, for example, that affective stances observed through nonverbal behavior subtly alter interpretations of language ability. Alternatively, the most accurate interpretations of language ability may only be possible when both verbal and nonverbal channels of communication are considered together. Understanding how the visual world interacts with speech is thus critical for the field of speech assessment.

Second language assessment

The communicative turn in applied linguistics (Firth & Wagner, 1997; Halliday, 1985) involved a paradigmatic shift towards assessing productive skills, that is, speaking and writing, as previously these skills were rarely assessed or were assessed only indirectly (Fulcher, 2003). Nowadays, performance assessment has become the standard in most language testing programs. These assessments take a range of formats, but they all share a common defining characteristic:

[A]ctual performances of relevant tasks are required of candidates, rather than more abstract demonstration of knowledge, often by means of pencil-and-paper tests… The format of a performance-based assessment is distinguished from the traditional assessment by the presence of two factors: a performance by the candidate which is observed and judged using an agreed judging process. (McNamara, 1996, pp. 7–10, emphases original)

This judging process has generally involved the use of human raters who award scores based on a set of predetermined and empirically validated criteria, which are often used operationally in the form of a rating rubric. Despite decades of research and notable improvements in the reliability of human raters, the scoring of performance assessments is still an active area of research (Knoch et al., 2021) due to its centrality in arguments about fairness in language testing (Kunnan, 2018; McNamara et al., 2019). Within the domain of performance assessment rating, a great number of researchers have conducted studies investigating characteristics that may have an impact on test scores, such as test-task characteristics (e.g., the question type or question difficulty level), test taker characteristics (e.g., gender, first language, age), and rater characteristics (e.g., nationality, professional/educational background). The rationale for these studies is often to uncover potential sources of bias. “Bias occurs … when large numbers of items systematically and demonstrably (dis)advantage specific populations on construct-irrelevant grounds, such as gender, educational background, or home language” (Deygers, 2019, p. 10). Not all impact sources, however, qualify as score bias. When construct-relevant aspects of tests affect scores—such as grammatical accuracy and complexity—the discussion instead surrounds the relative impact of particular performance features. These factors should influence scores. In assessment contexts, it is critical to document variance to ensure that scores meaningfully align with the intended construct. If variance is found due to factors unaccounted for by rating scales, this evidence can support the revision of scales or rater training programs (or both) where needed, as it can be evidence of construct underrepresentation (Messick, 1989).
Desirable variance can lead to elements being added to scales, and undesirable variance (construct-irrelevant variance; Messick, 1989) can lead to elements being removed or to changes in rater training programs to address those problematic features. For many types of speaking tests, behavior elicited through tasks is multimodal; that is, spoken language is accompanied by the test takers’ visible nonverbal behavior. A rater may conduct a test with one or more test takers, listen to their language, and produce a score about their second language ability. Nonetheless, most rating criteria only include descriptions of performances that are purely language-based, even when the purported test construct is based on communicative ability rather than linguistic accuracy or complexity. Raters may then be expected to narrow their focus and ignore any salient visible information—often implicitly—and award scores only from speech. Research has suggested, however, that raters notice the nonverbal behavior and affective stances of test takers (Ducasse & Brown, 2009; May, 2009, 2011; Sato & McNamara, 2019), which may ultimately impact scores beyond differences in language ability. Humans may not be able to ignore salient aspects of communication to focus only on the verbal mode of communication. This could also partially explain why raters can be idiosyncratic, formulating their own internal rating processes (e.g., Lumley, 2002). Despite repeated calls for research in this area (Kellerman, 1992; Pennycook, 1985; Plough, 2021; Plough et al., 2018; Young, 2002), there is still limited information about which types of performances are most impacted by visual information, the direction and size of this impact, and whether nonverbal behavior may impact all rating criteria equally. Understanding the nature of this impact will shed light on how nonverbal behavior fits into models of language proficiency that generally only tacitly acknowledge nonverbal communication.

Aims of this dissertation

This dissertation contributes to the ongoing discussion of nonverbal behavior in the second language literature. By studying nonverbal communication in the context of language assessment, it is possible to consider speech perception from the rater’s perspective, identifying elements of the visual realm that raters take into account when forming an impression of language proficiency. This dissertation is thus a study of the following key topics:

1. The relationship between measures of nonverbal behavior and the more linguistic constructs of fluency, vocabulary, grammar, and comprehensibility
2. The relationship between judgements of affect—one channel of information from nonverbal behavior—and the more linguistic constructs of fluency, vocabulary, grammar, and comprehensibility
3. The salience of particular nonverbal behaviors during the rating process
4. The interpretation of nonverbal behavior when arriving at judgements of language proficiency

The dissertation is divided into nine chapters, including this introduction. Chapter 2 presents background literature on this topic, specifically detailing popular models of second language (L2) communication and how they incorporate extra-linguistic elements, the roles and relationships of nonverbal behavior with language, and the origins and impact of interpersonal affect. Chapter 3 presents the research questions drawn from the literature review, followed by Chapter 4, which covers the methods and organization of this research project.
Chapters 5, 6, and 7 will present separate analyses based on the analytical methods used: the first two quantitative, and the last a qualitative analysis. Chapter 8 will discuss and synthesize findings, and Chapter 9 will conclude with the overall contributions of this study, limitations, and paths forward.

CHAPTER 2: LITERATURE

Researchers of L2 speech assessment have sought to identify measurable features of spoken language that provide inferences about the overall communicative language ability of learners. Crucial, then, is an understanding of the structure of communicative language ability in order to select features that can be reliably measured, yet give a full, nuanced picture of what someone can do with language. Once decisions are made about the dimensions upon which speech will be measured, validation research must then provide evidence of the integrity of those ratings and their meaningfulness. Studies of this type, however, have repeatedly shown that raters attend to far more than just the linguistic features of speech. They attend to a wide range of phenomena, including the nonverbal behavior produced during the test, as well as the affective stances interpreted from these behaviors. The literature to date has painted a picture of behavior and affect influencing speaking test scores, but there is still a dearth of research indicating the size and extent of this impact, and which constituent features matter most to raters. This literature review will be organized into three key sections. First, I review literature related to the construct of L2 proficiency. This will include a discussion of the models on which proficiency is broadly based, such as communicative competence, communicative language ability, and interactional competence. I will illustrate how nonverbal behavior and affect have been acknowledged but systematically underrepresented in these models of communication. This underrepresentation will be discussed in light of studies in which raters described the features that define communicative success. I will provide evidence that modality matters in the rating of performances, with audiovisual speaking tests consistently resulting in higher scores than audio-only tests, showing that variance in test scores is unaccounted for when raters can see the test taker. The second section will review nonverbal behavior. I will discuss the semantic, cognitive, social, and affective information that behavior can convey, how nonverbal behavior relates to language, and how it is tied to culture. The section will end with a discussion of how nonverbal behavior impacts interpersonal perceptions, with a detailed review of the literature pertaining to language testing. I provide evidence that nonverbal behavior forms an important part of the fabric of what raters attend to. Finally, in the third section, I will discuss in greater detail the affective function of nonverbal behavior. I discuss definitions of affect, emotions, and feelings, and provide evidence that the affective function of nonverbal behavior is simultaneously a cognitive, social, and cultural phenomenon. I discuss its relationship to language achievement, largely in studies from SLA research, and finally I provide evidence that affect can have important implications for interpersonal relationships when it is interpreted by listener-raters. As a core function of nonverbal behavior, affect can also impact language proficiency scores. The review will conclude with a short section on the measurement of nonverbal behavior and affect.
The construct of second language assessment

Second language proficiency

Speaking is an everyday skill. It is used to communicate quickly, solve problems, and engage with others. When people learn languages, the first skill that comes to mind is often speaking. The cultural norm in English is not to ask Do you read any other languages? but rather Do you speak any other languages? Speaking is thus at the heart of how society views L2 competence. When it comes to assessment, however, speaking may be the most difficult skill to assess (Fan & Yan, 2020), given its ephemeral, complex, socially contextualized, and dynamic nature. For this reason, speaking was often neglected in favor of testing more discretized aspects of language such as reading comprehension (Fulcher, 2003). At the heart of any discussion of language assessment is a discussion of the construct. Tests aim to measure aspects of language, and the construct provides a theoretical orientation to what is being measured. Most contemporary speaking tests claim to measure second language proficiency, which is a construct that draws on frameworks of communicative competence (Canale, 1983; Canale & Swain, 1980; Celce-Murcia, 2007; Hymes, 1972) and communicative language ability (Bachman & Palmer, 1996, 2010), and sometimes incorporates elements of interactional competence (Galaczi & Taylor, 2018; Kramsch, 1986; Plough et al., 2018). These frameworks lay the groundwork for what it means to communicate effectively. Local, regional, and national language development standards likewise draw from these same frameworks in their categories and descriptions of language development. Some examples of these are the Common European Framework of Reference (CEFR; Council of Europe, 2020), World-Class Instructional Design and Assessment (WIDA; WIDA, 2020), the Interagency Language Roundtable (ILR; Interagency Language Roundtable, n.d.), and the American Council on the Teaching of Foreign Languages (ACTFL; The National Standards Collaborative Board, 2015). Each test or set of standards covers varying aspects and amounts of the underlying construct of L2 speaking ability, but due to its complexity, it is impossible to cover all constituent aspects. In fact, researchers in SLA argue that many of these models are missing cognitive components of speech processing and production (e.g., Levelt, 1989; Levelt et al., 1999) and prediction (Levinson, 2016; Pickering & Garrod, 2013) that characterize various stages of acquisition (Hulstijn, 2015), as psycholinguistic aspects of speech processing are critical to language proficiency (de Jong, 2023). De Jong (2023) presented perhaps the most comprehensive description of language proficiency to date, containing psycholinguistic, structural, and socio-interactional elements. Psycholinguistic elements of speech processing (Levelt, 1989, 1999) include the following cognitive subskills:

a skill to conceptualize the preverbal message, a skill to retrieve the correct lexical items quickly along with their morphosyntactic characteristics, a skill to retrieve the appropriate sounds with these lexical items and to plan them as connected speech, a skill to send motor programs to the articulatory muscles to produce intelligible sounds, and finally, skills to efficiently monitor one’s speech. (de Jong, 2023, p. 542)
De Jong also stressed a central role for interlocutor input comprehension and prediction as key cognitive elements (Levinson, 2016; Pickering & Garrod, 2013), as these elements are critical predictors of quick, spontaneous, fluent speech. The automatization of the processes mediating comprehension, prediction, and production is what characterizes L2 fluency (Kormos, 2006). Structural elements of speech include elements of linguistic competence drawn from Bachman and Palmer (1996, 2010), Canale and Swain (1980), and Canale (1983). These include grammatical, lexical, and phonological competencies. These hierarchical models also included discursive, pragmatic, and strategic components of speech, but for reasons I explain below, I reimagine these as social-interactional elements. Finally, de Jong’s (2023) last grouping of elements of language proficiency comprises those that have an outwardly social-interactional focus. These include competencies that allow learners to mediate between their core linguistic knowledge and the outside world, requiring a knowledge of contextual appropriateness and audience, and requiring the ability to negotiate and co-construct meaning among conversational interactants. These elements stress interactional competence (Galaczi & Taylor, 2018; Kramsch, 1986; Young, 2011), or “the ability to listen attentively, to design the message for the recipient…, to manage the conversation, and to use appropriate nonverbal behavior” (de Jong, 2023, p. 544). De Jong (2023, p. 545) presented a hierarchical model of these elements, drawing from and adding to models from Bachman and Palmer (1996, 2010). In her model, language proficiency is made up of linguistic competence and strategic competence. Linguistic competences include structural (grammar, vocabulary, phonological forms), predictive, and pragmatic (sociolinguistic and functional) competences. Strategic competences, the other main branch in her model, include self- and other-supporting mechanisms and planning. However, the model is lacking in explanatory power, as there is little distinction between the cognitive and social orientations of competences. Purely cognitive aspects of speech (prediction), for example, are listed alongside structural and organizational features, and social-interactional forms are subsumed within discourse competence. If prediction, comprehension, and production form elements at the core of language use, a reformulated, nested model may better represent language proficiency. Drawing from de Jong’s (2023) descriptions, Figure 2.1 may better represent a sociocognitive model of language proficiency.

Figure 2.1 A Sociocognitive Model of Language Proficiency (Adapted from de Jong, 2023)

Figure 2.1 is organized in such a way as to stress a gradation between individual, psycholinguistic components at the center and socially oriented elements at the edge. The boundaries between psycholinguistic elements are not strict, as indicated by dashed lines. Components within competences are not listed as discrete units, as they represent constructs that overlap to some degree (e.g., grammar and lexis also exist as lexicogrammar; pragmatic speech acts are often interactional in nature). Psycholinguistic and cognitive elements of language use are at the core of language proficiency. These elements exist primarily within the individual speaker and interact with structural components to produce speech (Levelt, 1993, 1999).
As speech is proceduralized and automatized through use and practice of the language, comprehension and prediction become faster, enabling cognitive fluency to develop (Segalowitz, 2010). Thus, the core aspects of production are strengthened. Structural elements, which also exist within the individual, include grammar, lexis, and phonology, and form the bulk of linguistic competence (Bachman & Palmer, 1996, 2010; Canale & Swain, 1980). These begin as declarative knowledge, and as speakers access them repeatedly, they are proceduralized and eventually become part of their automatic language use (DeKeyser, 1997, 2020). Thus, there is a close interrelationship between structural and cognitive competences. Discourse components, at a higher level than lexicosyntactic structural components (see, e.g., Celce-Murcia, 2007), allow speakers to connect structural forms to assemble meaning-making units of language. De Jong (2023) argued that discursive and interactional competences are closely linked, as they draw on similar linguistic forms. All discursive and interactional elements draw on lexical and syntactic forms, but the way speech is organized at a micro level, independent of context (which I define here as discourse competence), helps the speaker convey a coherent and cohesive message. The knowledge and ability to produce coherent sentential and intratext structure in speech and writing is an element that I envisage as primarily psycholinguistic and within the speaker, while the ability to manage conversational exchanges draws more heavily on context and is socially dependent, and is thus an aspect of interactional competence. For this reason, I have used a bolded line around discourse competence, as these components are largely within the cognitive realm of speakers’ abilities. Structural, psycholinguistic, and discourse components of speech represent the majority of what is assessed in many contemporary language tests, such as the rating categories fluency and coherence, lexical resource, grammatical range and accuracy, and pronunciation in the International English Language Testing System (IELTS; IELTS, n.d.). These three core units represent what is traditionally included in psycholinguistic constructs of language proficiency. Social-interactional components, on the outside of the sphere, orient towards meaning-making with others. They allow speakers to use their core language comprehension, knowledge of structures, organizational skills, and production skills to navigate communicative contexts. These social-interactional components are other-focused and aid in the co-construction of meaning (Young, 2011); for this reason, their external boundary is also dashed, indicating that meaning is bound up with the social context. Social-interactional components—including pragmatic, strategic, and interactional competences—go beyond structural forms and fluency. For example, within pragmatic competence, the sociolinguistic context requires changes in structural forms in order for the user to choose the right level of register or politeness (e.g., Brown & Levinson, 1987; Roever, 2021). Pragmatic competence furthermore allows speakers to choose the correct functional language in a social situation (e.g., greeting someone instead of saying goodbye). Any deviation in appropriate register or function can result in unintentional breakdowns in social meaning.
Similarly, strategic competence enables speakers to manage the creation of their messages by planning utterances, moderating their own speech, and self-repairing when necessary. Strategies are meaning-focused and exist to support communication. Interactional competence, as described above, then helps speakers organize conversations with others. The use of these social-interactional components is mediated by the needs of the greater social context surrounding them. These three groups (cognitive, structural, and social-interactional) interact with each other in speech, but may be better seen as organized along these cognitive-social dimensions. Nonetheless, these boundaries exist mainly as a visual metaphor, as in reality all aspects of speech may be context-dependent. Important to this discussion, however, is the extent to which nonverbal behavior and/or affect are included in definitions of language proficiency, and if they are, how they are conceptualized. De Jong’s (2023) description only explicitly mentions nonverbal behavior in the context of interactional competence. However, other frameworks have included nonverbal behavior in their models of communicative competence as well. The following section will review how applied linguists have conceptualized the role of nonverbal behavior and affect within their models. Following the literature review and the dissertation study, I will revisit whether the adapted model in Figure 2.1 adequately represents L2 communication.

Behavior in models of communication

There are many terms in the literature describing L2 communicative ability. Communicative competence (Canale, 1983; Canale & Swain, 1980; Celce-Murcia, 2007; Celce-Murcia et al., 1995; Hymes, 1972), communicative language ability (Bachman & Palmer, 1996, 2010), and language proficiency (Hulstijn, 2015) have all been used to describe broadly similar frameworks that detail components of successful communication. Models of interactional competence (Galaczi & Taylor, 2018; Young, 2011), rather than being full communicative models, describe interactional skills essential to communication that exist alongside communicative competence. This list is not exhaustive either, as many others have written on the topic, though these are the most influential. Other frameworks, such as (interpersonal) communication competence (Morreale et al., 2013) and intercultural communicative competence (Byram, 2021), extended the notion of communication to other sociocultural contexts. The Common European Framework of Reference (CEFR; Council of Europe, 2020) is a framework of L2 communicative language development across multiple scales and subscales related to real-world language use. Most of these frameworks acknowledge the presence of nonverbal behavior, each with different functional attributes ascribed to it. In these descriptions, rather than recounting in full the models of each author, I will instead focus on how they operationalize nonverbal and affective behavior as aspects of communication.

Communicative competence. Hymes (1972) developed a theory of communicative competence at a time when there was heavy debate over differences between competence and performance (cf. Chomsky, 1965). His development of communicative competence incorporated grammatical competence and contextual/sociolinguistic competence, and he further argued that competence was dependent on both knowledge and the ability to use that knowledge in context.
While linguistic knowledge was seen as a result of learning various components of a language (grammar, vocabulary, phonology, and organizational features), “ability for use” was defined as the functional abilities of individuals to deploy their linguistic knowledge. He described ability for use as comprising cognitive, volitive, and affective factors, with affective components (such as motivation) partially determining an individual’s level of competence. Importantly, he also made the case for the inclusion of an interactional domain of competence, drawing from the work of Goffman. Hymes (1972) also explicitly included nonverbal behavior as part of his model:

In both respects the interrelation of knowledge of distinct codes (verbal: non-verbal) is crucial. In some cases these interrelations will bespeak an additional level of competence… Within the view of communicative competence taken here, the influence can be expected to be reciprocal. (p. 284)

This quote suggests that Hymes viewed nonverbal communication as contributing meaning across various aspects of language use and not comprising any separate functions on its own. Canale and Swain (1980), however, took a more restrictive view of the role of nonverbal communication. Canale and Swain (1980) set out to model Hymes’ (1972) ideas with the goal of drafting, essentially, a formalized construct of communicative competence for teaching and testing. They stressed the inclusion of sociolinguistic aspects of competence along with grammatical (syntactic, lexical, and phonological) and strategic (compensatory) competences, as these are critical to real and authentic language use. However, they maintained a separation between these competences and communicative performance, which included “the relationship of these competences and their interaction in the actual production and comprehension of utterances (under general psychological constraints that are unique to performance)” (p. 6). Canale and Swain (1980) thus explicitly teased out the cognitive and affective aspects of Hymes’ (1972) competence. Nonverbal behavior was granted a role only in strategic competence, which comprised the “verbal and nonverbal communication strategies that may be called into action to compensate for breakdowns in communication due to performance variables or insufficient competence” (p. 30). They described these as largely compensatory (e.g., paraphrasing), though they also included some underdefined aspects of sociolinguistic competence when interacting with others, where these were seen as coping mechanisms. Nonetheless, they acknowledged in 1980 that nonverbal behavior might one day be given a larger role in models of L2 ability:

More research on the role of such nonverbal elements of communication as gestures and facial expressions in second language communication may reveal that these are important aspects of communication that should be accorded more prominence in the theory we have adopted. (p. 36)

Canale (1983) reformulated their model slightly to include discourse competence (the organization of language into coherent textual units), again assigning nonverbal behavior a compensatory role, though also acknowledging its ability to enhance communicative effectiveness, such as “deliberately slow and soft speech for rhetorical effect” (p. 10). Celce-Murcia (2007), drawing from her own past work (Celce-Murcia, 1995; Celce-Murcia et al., 1995), attempted to extend Canale and Swain’s (1980) and Canale’s (1983) models with a focus towards language pedagogy. She proposed a model with four key subcompetences.
This model included a reformulation of some of the labels of Canale and Swain’s (1980) subcomponents (linguistic and socio-cultural competence), but also added a new component called actional competence, described as the ability to understand and produce pragmatic speech acts and sets. She also included interactional competence alongside the other three competences, and positioned strategic and discourse competence as uniting competences mediating the other four. Important to this discussion, however, is that she was perhaps the first author to provide a more nuanced and specific vision of nonverbal behavior within a larger conceptualization of communicative competence. She described a non-verbal/paralinguistic competence embedded within interactional competence, with bodily behaviors (gestures, gaze behaviors, nodding, and other body language), proxemics (orientation in space), haptic behavior (touching), and paralinguistic cues (non-verbal sounds) as core components. However, these behaviors were only allocated roles within interactional competence. Other authors also speculated about the inclusion of nonverbal behavior in discussions of communicative competence. Scarcella et al. (1990) added a component of “verbal, non-verbal, and paralinguistic knowledge underlying the ability to organize spoken and written texts meaningfully and appropriately” (p. 72). Almost identically to Celce-Murcia (2007), however, these were given almost exclusively interactional turn-taking roles. Likewise, Savignon (1972) and Jakobovits (1970), both contemporaries of Hymes in the early 1970s, also speculated on the importance of nonverbal behavior to communicative competence. As will be seen later, however, there is evidence for nonverbal behavior at nearly all levels of communication.

Communicative language ability. Bachman and Palmer (1996, 2010) extended the work of Canale (1983) and Canale and Swain (1980) to develop a conceptual framework of L2 language ability, which they termed communicative language ability. Language ability for them was derived from complex interactions between the sociolinguistic context and language use. Language use, “the creation or interpretation of intended meanings in discourse by an individual, or as the dynamic and interactive negotiation of intended meanings between two or more individuals in a particular situation” (p. 61), is essentially language performance as described by Canale and Swain (1980). Language use was described as being made up of language knowledge and strategic competence, which would then interact with topical knowledge, affective schemata, and characteristics of the language use situation to produce meaning. Affective schemata here are the “affective or emotional correlates of topical knowledge” (Bachman & Palmer, 1996, p. 65), and are characteristics of a task that may provoke an affective response in the speaker due to past emotional experiences. These affective responses can then facilitate or hinder the test taker’s ability or willingness to communicate in a given situation, depending on the response’s emotional weight. The ability to discuss affect is included in their list of pragmatic competences, called “knowledge of ideational functions” (p. 70). The ability in question uses linguistic resources to explicitly discuss affect (e.g., “I’m angry”, “I’m disappointed in the results”), rather than leveraging multimodal resources to express affect (e.g., nodding, directing attention to the speaker, and verbally backchanneling to show engagement).
Their presentation of strategic competence, which is a broadened view of metacognitive strategies rather than mainly compensatory strategies, does not include an explicit discussion of nonverbal behavior. Overall, Bachman and Palmer (1996) saw the danger of agitating test takers by using emotionally charged content, but they did not directly discuss any impact of nonverbal behavior on the communication of affect or interaction, as they viewed speaking as a primarily audio-based skill. Fourteen years later, Bachman and Palmer (2010) largely reaffirmed their previous conceptual framework, while giving a greater role to the interactional effects of their affective schemata. Affective schemata were still seen as mostly task-based emotional weights, with an additional caveat added about individuals’ orientations to interpersonal communication. However, in their descriptions of affective schemata in interaction, the authors explicitly invoked the use of nonverbal behavior when encoding and decoding meaning. For example, when describing a customer service encounter, a waiter:

engages his affective schemata when [the customer] reacts to how he seems to be feeling about the conversation—he looks relaxed and not impatient about taking her order… She also takes into account her affective schemata—does she feel confident enough about her ability to speak Thai to participate in a conversation or is she so nervous that she only feels up to pointing to items on the menu. (p. 39)

Implicit here is that both the customer and waiter are encoding and decoding each other’s nonverbal signals to arrive at a conclusion about the communicative intent of the other individual. Nonetheless, the authors only ascribed verbal characteristics of speech to the language abilities used to convey meaning. Similar to their 1996 framework, Bachman and Palmer (2010) did not place much emphasis on nonverbal behavior in their formulation of strategic competence, though they did mention that it can play a role in the appraisal of communication:

In a conversational exchange, the language user can appraise the extent to which his communicative goal has been accomplished by the way his interlocutor responds, with language, with non-verbal communication, or with both… Affective schemata are involved in determining the extent to which failure was due to inadequate effort, to the difficulty of the task, or to random sources of interference. (p. 53)

Nonverbal behavior then appears to have a minor role in navigating coping mechanisms, though not as clearly as with the compensatory strategies in Canale and Swain (1980).

Interactional competence. As a reaction to knowledge-based, accuracy-focused language proficiency assessments, Kramsch (1986) proposed interactional competence as a socially oriented test construct that taps into speakers’ abilities to manage interaction and create meaning in conversational contexts. Interactional competence is “the ability to co-construct interaction in a purposeful and meaningful way, taking into account sociocultural and pragmatic dimensions of the speech situation and event” (Galaczi & Taylor, 2018, p. 226). It consists of abilities that allow speakers to systematically manage turn-taking, repair, topic management, and agreements with sensitivity toward the listener and the interactional context (Hellermann, 2008; Pekarek Doehler & Berger, 2018). Dating back at least to the time of Hymes (1972), interactional abilities were seen as relying heavily on both verbal and nonverbal channels.
This was later reaffirmed in Celce-Murcia’s (2007) framework, where a range of different behaviors was specified in relation to interaction. For Canale and Swain (1980), nonverbal behavior related to strategic competence in regard to compensatory strategies and coping mechanisms, which may be seen as similar in nature to repair strategies in conversation. Frameworks of interactional competence have been developed over the years to detail the range of features speakers employ to manage conversational interactions, and in these, authors have consistently provided a place for nonverbal behavior. Nonetheless, it has been argued that the specification and implementation of nonverbal behavior in interactional assessments still needs further work (Plough, 2021; Plough et al., 2018; Roever & Kasper, 2018). Galaczi and Taylor (2018) provided the most comprehensive visual representation of interactional competence to date, bringing together macro- and micro-level features underlying the construct from a broad reading of the literature. Their interactional competence “tree” (p. 227) visualizes spoken interaction as derived primarily from social context, which defines the speech event and the speech act. These social-contextual variables determine how speakers and listeners interact with each other and how they deploy their interactional abilities. As an example, in a formal social context with a hierarchical power dynamic, one can imagine fewer initiations and turns granted to listeners with lower power status, while in a more informal, balanced social context, there will be more balance in how conversations are co-constructed. Arising from social context, then, are the various functions of interactional competence, seen as branches, while micro-level forms such as initiating and closing a conversation are listed as leaves. The functional categories of interactional competence included turn management, topic management, breakdown repair, and interactive listening, along with their constituent micro-level features. They also listed a specific place for non-verbal behavior as a function, including features of facial expressions, laughter, and posture. This conceptualization is somewhat problematic, however, since nonverbal behavior is not characterized by propositional, form-function relationships (Buck & VanLear, 2002); instead, it co-occurs with speech to accomplish the various functional categories of interactional competence (e.g., achieving intersubjectivity, Burch & Kley, 2020; repair, Burton, 2021a; maintaining progressivity, Hırçın Çoban & Sert, 2020).

Other models. There are other models of L2 communication that also deserve mention in this review, as they include, to some degree, mentions of the roles of affect or nonverbal behavior. Van Ek (1986) developed a model of communicative ability with an emphasis on both communication skills and personal and social development. In his model, he added two additional categories to Canale’s (1983) components of linguistic, sociolinguistic, discourse, and strategic competence: sociocultural and social competence. Sociocultural competence here referred to familiarity with cultural norms in the communicative context and the ability to navigate these. Social competence included elements of interactional competence, but also aspects of affect, namely motivation, attitude, self-confidence, empathy, and the ability to handle social situations (van Ek, 1986, p. 65).
Nonetheless, despite the inclusion of an ability to handle affect when navigating social situations, nonverbal behavior was not mentioned explicitly. Byram (2021) drew on van Ek’s (1986) work in developing an extension of communicative competence to address cross-cultural contact in language teaching. His framework of intercultural communicative competence included van Ek’s (1986) components, but stressed the various skills, knowledge, education, and attitudes necessary for individuals from different cultural backgrounds to engage in an “effective exchange of information” and in “establishing and maintaining relationships” (p. 43). His attitudinal domain of intercultural communicative competence made a case for the reduction of bias in order to respect “people who are different in respect to the cultural meanings, beliefs, values, and behaviors they exhibit” (p. 44). Essential here is navigating affect, both one’s own and the perceived affect of others. This category also included an implicit focus on nonverbal behavior, though he was wary of any explicit inclusion in this model due to the prescriptivism of native speaker norms:

[Nonverbal behavior] is clearly an element of interaction which is crucial, but the challenge to the dominance of the native speaker as model applies just as much here as it does to standards of verbal communication. In other words, any teaching of non-verbal skills and knowledge should enhance competences as an intercultural speaker, not imitation of a native speaker. (Byram, 2021, p. 59)

Morreale et al. (2013) discussed a framework of “communication competence” removed from a focus on language teaching or L2 developmental models. Their notion of competence, essential to communication in any language, was defined as “the extent to which people achieve desired outcomes through behavior acceptable to a situation” (Morreale et al., 2013, p. 25) and was dynamically defined by context. This model hinged on the perceptions of others, as communicative success is interactional in nature: “how we actually behave in most instances is less important than how others perceive us to have behaved” (Morreale et al., 2013, p. 25, emphasis in original). Their model consisted of three main elements—motivation, knowledge, and skills—nested within context. Motivation included the communicative objectives and goals for a communicative act, as well as affective components driving or inhibiting these acts. Knowledge consisted of the content or procedural information relevant to successful communication, including topics, semantics, and discursive functions of language. Skills included the behaviors central to communication itself, covering macro-level (functional) and micro-level (form) skills, and explicitly encompassed both verbal and nonverbal skills at each level. Finally, culture provided the framing of communicative events, including cultural, relational, and situational types, as well as interpersonal, public, and mediational levels. Morreale et al.’s (2013) framework drew from the work of Burgoon (2016, first published in 2010) to offer the broadest description of the functional role of nonverbal behavior within a larger model of communicative competence: “People use nonverbal behavior to complement verbal messages, to regulate interactions, and to define the socio-emotional quality of relationships” (p. 104).
They delineated the specific roles of gesture, eye gaze, posture, facial movements, paralinguistics, haptics, proxemics, and chronemics, although largely describing their affective output. This model, however, is meant for general communicative success (e.g., in an L1) and not to describe L2 use or development, but it provides a useful contrast to models developed in applied linguistics. The Common European Framework of Reference (CEFR) (Council of Europe, 2020) also provides an example of how nonverbal behavior is operationalized in a set of standards concerning language ability. The CEFR includes illustrative scales that detail language development across multiple domains of communication. It is organized principally by skills (reading, writing, speaking, listening), interaction, mediation, and strategies. It also includes scales of linguistic, sociolinguistic, pragmatic, plurilingual, and pluricultural competence. As such, it is to date the most complete view of a developmental understanding of L2 communication. The CEFR does include nonverbal behavior as well, but it is generally limited to descriptions at very low levels (Pre-A1–A1) in a limited number of subscales. For example, in the category Overall Mediation, the A1 descriptor is:

Can use simple words/signs and non-verbal signals to show interest in an idea. Can convey simple, predictable information of immediate interest given in short, simple signs and notices, posters and programs. (Council of Europe, 2020, p. 92)

Descriptors at the A2 level and above do not include a continuation of the development in nonverbal behavior. Similar descriptors can be found in the subscales of Leading Group Work, Facilitating Pluricultural Space, Mediating Concepts, and Mediating Communication. One subscale that deviates from this paradigm is Strategies to Explain a New Concept. There are no descriptors for the A1–A2 levels, with the following starting at B1:

Can make a set of instructions easier to understand by repeating them slowly, a few words/signs at a time, employing verbal and non-verbal emphasis to facilitate understanding. (Council of Europe, 2020, p. 120)

No other descriptors above the B1 level further delineate a cline of nonverbal skills within this subscale. Another scale that deviates from the prior paradigm is the Interaction category within the Qualitative Features of Spoken Language scales. Here, nonverbal behavior is described only at the highest level, C2:

Can interact with ease and skill, picking up and using non-verbal and intonational cues apparently effortlessly. (Council of Europe, 2020, p. 183)

Given that nonverbal cues are generally encoded and decoded automatically (Gifford, 2013), it is unclear which behaviors this descriptor refers to. As with the other scales, there are no descriptors below the C2 level illustrating a development in nonverbal behavior within this category. Interestingly, the inclusion of nonverbal behavior within the CEFR roughly follows patterns from Hymes (1972), Celce-Murcia (2007), and Galaczi and Taylor (2018) in its inclusion within categories of interaction, or Canale and Swain (1980) and Canale (1983) in its inclusion in a category of strategies. Nonetheless, these inclusions generally sit at very low levels, and descriptors offering a view of development in nonverbal patterns across ability levels are not given for any of the subscales apart from those for signed languages, which are beyond the scope of this discussion.
The CEFR does, however, include substantially more information about the encoding and decoding of affect. For example, the Conversation scales describe how a learner "can express how they feel in simple terms" at A1, "can express and respond to feelings such as surprise, happiness, sadness, interest, and indifference" at B1, "can convey degrees of emotion" at B2, and "can use language flexibly for social purposes, including emotional, allusive, and joking usage" at C1 (Council of Europe, 2020, pp. 73–74). Other categories that cover affect to varying degrees are Reading Correspondence, Correspondence, Online Conversation and Discussion, Expressing a Personal Response to Creative Texts, and Sociolinguistic Appropriateness.

Rater reports of behavior in studies of communication

The development of models of L2 communication is generally top-down. That is to say, these models are generally theoretical, developed by scholars based on readings of the literature and their own observations or intuitions about language. On the other hand, a bottom-up approach can be taken by asking raters to listen to L2 speech and describe the various features that comprise communicative effectiveness. The features they notice and are able to describe can then be used as validation evidence for models, and any features they notice that are not included in models may then be included in further revisions. Understanding what untrained, linguistic laypeople describe is important when designing tests of L2 ability because:

the ultimate arbiters of L2 speakers' oral performance are typically not in fact trained language professionals, who have meta-level linguistic insight and are possibly concerned primarily with features of communication that are the focus of their own training as linguists or language teachers, but interlocutors with no specialist training. (Sato & McNamara, 2019, p. 895)

Thus, the views of these individuals can help refine models and test constructs. Early research on oral proficiency assessments consistently showed that raters attend to a range of linguistic and nonlinguistic criteria when scoring, such as features of content and discourse management (Halleck, 1992; Lazaraton, 1996; Neu, 1990; Ross, 1992; Young & He, 1998). There are few studies, however, that have considered the broad range of performance features raters notice, particularly through rater reports, when orienting towards L2 communicative ability. Orr (2002), for example, used trained raters when eliciting speech features that factored into successful performances on the First Certificate in English speaking test (a test linked to the B2 level of the CEFR). The raters commented on the linguistic criteria present in the scales they were trained on, but also noted a range of 12 other characteristics, such as content-related task features, exertion/effort, test preparation, and nonverbal behavior. Orr did not, however, explain how raters oriented towards nonverbal behavior in the study. Brown et al. (2005) also investigated the criteria raters used when rating the TOEFL iBT. Similar to Orr (2002), raters attended primarily to the linguistic criteria given in the rating scales, but the content of speech relating to idea development and task success was the next most important category raters mentioned. Nonverbal behavior was not attended to, given that the TOEFL iBT is rated audio-only, with no visual material present.
Sato and McNamara (2019) also elicited untrained raters' internal scoring criteria in an attempt to reveal underlying factors impacting impressions of communicative effectiveness. They had 23 novice raters (postgraduate students without knowledge of applied linguistics) view and rate 20 speech samples on the speakers' overall communicative effectiveness. The speech samples were drawn from performances on the College English Test-Spoken English Test (CET-SET) delivered in China, and dyadic interactions from Cambridge Assessment. Raters used a seven-point holistic scale of communicative success with endpoints of poor and excellent. Afterwards, raters provided stimulated verbal recalls and retrospective interview data supporting their decisions. The raters discussed linguistic features of communicative success most often, followed by a sizeable number of comments regarding general communicative success (categorized as general comments relating to task success, comprehensibility, etc.). Content-related features made up the third greatest number of comments, echoing Orr (2002) and Brown et al. (2005). Orientations to nonverbal behavior and affect made up the next most sizeable number of features; these were categorized separately but discussed together. The final category of comments related to interaction. While the specific findings related to nonverbal behavior and affect will be discussed in detail later, important to note here is that raters oriented to aspects of communication outside the traditional realm of communicative competence, as "their judgements of communicative ability are based on a wider range of speech features and speaker behaviors than the constructs of current proficiency tests" (Sato & McNamara, 2019, p. 911). In this sense, communicative success as measured by the CET-SET and Cambridge Assessment tests could suffer from construct underrepresentation, as content-, nonverbal behavior-, and affect-related features are not present in these testing organizations' rating scales and could thus "potentially misrepresent the judgements of real-world interlocutors" (p. 912). Similarly, Ducasse and Brown (2009) and May (2011) asked raters to report salient features of successful and unsuccessful interactions. Ducasse and Brown (2009) asked 12 listeners with teaching backgrounds to watch 17 pairs of beginning-level Spanish students taking a discussion-based test. Raters were asked to watch and then record retrospective impressions of the paired interactions; following this, they watched the video a second time and provided stimulated verbal recalls on what constituted interactional abilities. Interaction, notably, was not defined for the raters. The raters primarily oriented to nonverbal interpersonal communication, followed by interactive listening (comprehension and support) and interactional management (topic management and coherence). In a similar design, May (2011) asked four trained raters to discuss the performance features of 12 intermediate- and advanced-level learners taking paired speaking tests. Raters rated the tests in pairs using an analytic rating scale focusing on both linguistic and interactional aspects of language. Afterwards, raters provided retrospective reports, recordings of discussions with the other rater scoring the sample, and interviews with the researcher commenting on features that were salient when scoring interactional effectiveness.
She found that raters focused on whether the dyads understood each other's messages, responded to each other, and used communication strategies. Both nonverbal and affective aspects of interaction surfaced in the responding category, along with listening comprehension, comprehensibility, and other aspects of interactional competence (e.g., turn and topic management, repair):

The ability to work together cooperatively, manage a conversation, communicate with assertiveness, demonstrate effective body language and interactive listening, and thus help to co-construct a collaborative pattern of interaction were regarded by the raters as key aspects of a successful interaction. (May, 2011, p. 140)

Important to note is that raters focused on these criteria despite the fact that they were not present in the original rating scale, again supporting Sato and McNamara's (2019) claim that raters tapped into a larger set of skills and abilities than delineated by the test construct.

Modality effects

Ratings in different delivery modes, namely audio-only and audiovisual rating, have the potential to uncover whether the visual world as a whole may exert an impact on ratings. If differences do exist, then something in the visual world, be it the presence of nonverbal behavior, its interpretation via affect, or some other explanation, must be the driving force of those differences. To date, research in this area has rather convincingly found that scores based on audiovisual tests—that is, where the rater can see the test taker—are higher than scores where raters only hear the speech of the test taker (audio-only) (Choi, 2022; Conlan et al., 1994; Gullberg, 1998; Larson, 1984; Nakatsuhara et al., 2021a; Nambiar & Goon, 1993; Styles, 1993). These effects are group-level effects, however, as individual test takers may have equivalent scores in both modes or even occasionally lower scores in the audiovisual format. Only one study found lower scores in the audiovisual mode (Lavolette, 2013), while others found roughly equivalent scores (Beltrán, 2016; Shohamy, 1994; Thompson, 2016; Uludag et al., 2022). Because the studies that found conflicting results were for the most part underpowered, they will not be discussed here. I will discuss the three most definitive studies on this topic in turn: Choi (2022), Nakatsuhara et al. (2021a), and Nambiar and Goon (1993). Nambiar and Goon (1993) were among the first to speculate on the possible impact of modality differences in language tests. The authors had raters conduct speaking tests with 87 undergraduate students and score them face-to-face, after which the same samples were rated by the same raters as audio-only recordings. The speaking test included two tasks: an interview with the rater and a paired discussion task. The study found that mean scores based on audio recordings were significantly lower than those rated in the face-to-face mode, with the mean audio-only score dropping 1.25 points (out of 20) in the interview task and 0.47 points (also out of 20) in the dyadic task. Students in the first quartile of scorers (the highest scorers) were more negatively impacted by the audio-only mode than students in the bottom quartile (the lowest scorers). The researchers noted that raters found pausing, silence, and grammatical/phonological inaccuracies difficult to interpret in the audio-only format without visual information.
Raters were better able to understand the source of breakdowns in fluency in the face-to-face rating and were less attuned to inaccuracies. However, the authors noted the limitation that the raters also served as interlocutors in the speaking test design, which may have played a role in their scoring tendencies. Nakatsuhara et al. (2021a) investigated the effect of mode on International English Language Testing System (IELTS) scores in three scenarios: live, audio-recorded, and video-recorded. Using Many-Facet Rasch Measurement (MFRM) on six raters' scores, they found that the audio-only rating mode was overall 0.92 logits more difficult than the video rating mode, resulting in a half-band difference in final scores after rounding (where band scores were ordinal units on a 9-unit rating scale). An analysis of the individual criteria showed the same trend, with all four criteria (fluency and coherence, lexical resource, grammatical range and accuracy, and pronunciation) marked lower in the audio-only mode, and lexical resource impacted the most. Nonetheless, these trends were not consistent across all raters, as one rater in particular was found to be biased negatively towards video rather than audio. An analysis of the raters' verbal protocols revealed that in many of the cases, the raters scored the audiovisual samples higher because they helped examiners "a) to understand what the test takers were saying, b) to comprehend better what test takers were communicating using non-verbal means …, and c) to understand with greater confidence the source of test takers hesitation, pauses, and awkwardness" (Nakatsuhara et al., 2021a, p. 19). Finally, Choi (2022) investigated the score differences of 110 test takers on two asynchronous audiovisual recordings conducted on Zoom (with and without the interlocutor present) and asynchronous audio-only recordings. Eight trained raters scored these samples in an anchored dataset, and the results were analyzed with confirmatory factor analysis and MFRM. She found that the data did not support a one-factor model, but instead a three-factor model where the three delivery modes each represented separate latent variables with variance attributable to different sources. In other words, visual elements in the video-recorded samples represented differing relationships between the test scores and the higher-level latent variable of L2 proficiency, which may have expanded the construct of speaking to include nonverbal behavior. Regarding score differences, supporting previous studies, she found that audio-only scores were lower than both video-recorded formats, and the two video formats were approximately equivalent. Audio-only ratings were approximately 0.5 logits more difficult than the video recording with an interlocutor present, and 0.75 logits more difficult than the video recordings with the interlocutor removed. Similar to Nakatsuhara et al. (2021a), these differences represented approximately a half-band difference on the same 9-point rating scale. Overall, these studies point to key differences between audio-only and audiovisual ratings. Audio-only scores were found to be consistently lower than audiovisual scores, with differences representing about half a band score in Nakatsuhara et al. (2021a) and Choi (2022), which used the same rating scale.
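To make the logit metric reported in these two studies concrete, the following is a minimal sketch of the many-facet Rasch model as it is commonly formulated in the Rasch measurement literature; the worked probability below uses assumed values for illustration and is not drawn from either study.

\[
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
\]

% B_n = test taker ability; D_i = difficulty of criterion i;
% C_j = severity of facet j (a rater or, as modeled in these studies, a rating mode);
% F_k = threshold of score category k.
% Illustrative arithmetic with assumed values: a performance with a 0.50
% probability of reaching a given category in the video mode faces odds of
% exp(-0.92) ≈ 0.40 in a mode that is 0.92 logits harsher, i.e., a
% probability of roughly 0.40/1.40 ≈ 0.29 in the audio-only mode.

Because logits are log-odds units, a constant mode effect of this kind shifts every test taker's chances downward on the underlying scale, which is consistent with the group-level, half-band score differences both studies reported after rounding.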
In these studies, raters appeared to judge linguistic criteria in largely consistent ways between the two rating designs, but the presence of visual information helped them to make a more informed assessment of the test taker's language ability, resulting in these score differences. However, as Nakatsuhara et al. (2021a) and Choi (2022) noted, not all raters behaved in the same way. Some were influenced more by visual information than others, and some outright ignored its presence and rated each sample equivalently. Interpreting language proficiency in light of visual information may thus be idiosyncratic, despite the presence of group-level effects. Nonetheless, all authors hypothesized that score differences due to the presence of video were likely the result of the impact of nonverbal behavior.

Summary

The studies reviewed so far all suggest that nonverbal behavior and affective responses during test scenarios play a much larger role than ascribed by models of language proficiency (de Jong, 2023), communicative competence (Canale, 1983; Canale & Swain, 1980), and communicative language ability (Bachman & Palmer, 1996, 2010). Nonverbal and affective behavior play critical roles in test discourse (Ducasse & Brown, 2009; May, 2011; Sato & McNamara, 2019), and they are important meaning-making devices that possibly contribute to variance in speaking test scores (Choi, 2022; Nakatsuhara et al., 2021a; Nambiar & Goon, 1993). The evidence to date points to the fact that these behaviors indeed form a critical aspect of Hymes' (1972) ability for use. Yet, despite wide consensus that nonverbal behavior is an essential aspect of speaking, it has been neglected in models of L2 communication and in speaking test constructs (Plough, 2021). In the next two sections, I consider nonverbal behavior and affect separately. In the section on nonverbal behavior, I review the various functions of nonverbal behavior and how these have been shown to impact speaking test scores. An important question in this section is whether the role of nonverbal behavior is limited to interaction or whether it may also play a role in perceived linguistic competence, which indeed appears to be the case from the studies by Nambiar and Goon (1993), Nakatsuhara et al. (2021a), and Choi (2022). The section on affect will be arranged similarly, detailing the meaning of affect and its impact during interaction, as well as how it may relate to language learning outcomes and ratings on language tests.

Nonverbal behavior

Nonverbal behavior is fundamental to spoken, face-to-face communication (Hall et al., 2019; Hall & Knapp, 2013; Matsumoto et al., 2016). It is always present in spoken communication, predates language, develops prior to verbal ability in infants, and precedes verbalizations in interactional encounters (Burgoon et al., 2016). Nonverbal communication combines with verbal communication in the encoding and decoding of meaning by speakers and listeners (Halberstadt et al., 2013). Complex multimodal Gestalts (Mondada, 2014)—patterns of linguistic constructions and nonverbal behavior in sociocultural contexts—can also inform observers of underlying cognitive, psychological, or (socio)affective states of speakers (Argyle, 1988; Guerrero & Wiedmaier, 2013; Schmid Mast & Cousin, 2013). Nonverbal behavior also plays an important role in conveying the content of speech in conversation (Kendon, 2004; McNeill, 1992, 2005).
Nonverbal behavior has been studied in thousands of articles in fields as disparate as neuroscience, psychology, sociology, anthropology, computer science, engineering, robotics, medicine, communication, and applied linguistics (Plusquellec & Denault, 2018). However, despite the growing critical mass of work recognizing the central role of nonverbal behavior in communication, it has been somewhat neglected in linguistics and applied linguistics—and perhaps especially language testing—as these fields stress the autonomy of language within communication. Mondada (2016), drawing on Derrida's (1967) critique of the Western overreliance on speech as a source of truth, called this restricted focus logocentric, in contrast to an embodied view of language:

Producing talk involves visible breathing and articulating movements not only of the face and the mouth, but of the entire body; moreover, these articulatory movements are not dissociable from other bodily conduct… both talk and gesture originate from the same process. (p. 340)

To account for the full range of communicative skills, she argued that applied linguists should adopt a less logocentric view of language to incorporate a wider range of embodied meaning. Regarding language development, Stam (2008) remarked that "looking at learners' gestures and speech can give us a clearer picture of their proficiency in their L2 than looking at speech alone" (p. 253). If the visual realm is informative about language proficiency, it is thus also important to study the visual realm in the context of language testing and assessment. Setting boundaries on what constitutes nonverbal behavior, however, is a complex endeavor. Although Hall et al. (2019) defined nonverbal communication as "a behavior of the face, body, or voice minus the linguistic content, in other words, everything but the words" (p. 272), others have included other aspects relating to the social context. Argyle (1988), for example, identified eight dimensions of nonverbal communication: facial expressions, gaze behavior, gestures, posture, haptics (touch), proxemics (spatial location and movement), appearance (clothing and other objects), and paralinguistics (nonverbal sounds such as laughing or sighing). Chronemics (Walther & Tidwell, 1995), or the ways in which individuals use or perceive time, can also impact how messages are interpreted, such as when the same message is conveyed at different times of the day. Vocalics—the expressiveness and animation of the human voice, which may or may not be grouped with paralinguistics—can also impact how people are perceived in terms of affect and likeability (Mehrabian, 1981). For the purposes of this paper, I will restrict the focus to Birdwhistell's (1970) kinesics: facial expressions, gaze behavior, gestures, posture, and other bodily movements. I will refer to these broadly as nonverbal behavior. The literature on nonverbal behavior is vast, too vast to cover in any one dissertation. Indeed, entire handbooks, edited volumes, textbooks, and journals have been produced to tackle the various research areas on the topic. For this reason, I will not break down this literature review by nonverbal form (e.g., smiling, gaze shifts; see Givens & White, 2021 for an exhaustive treatment of the subject), but rather by the roles behaviors play and how these roles relate to L2 use and development.
The categorization of nonverbal behavior has a long history but was first seminally codified in Ekman and Friesen's (1969) five categories: emblems (codified signs), illustrators (abstract meanings), affect displays, regulators (interactional signs), and adaptors (non-meaning-making movements). This section will cover similar categories but in terms of function, namely the semantic, cognitive, social-interactional, and socio-affective information conveyed by nonverbal behavior, as well as its cultural origins. I will then turn to how nonverbal behavior factors into interpersonal perceptions in language testing and assessment. The focus here will be on rater-test taker interactions, as these can be a source of score variance.

Semantic information

Dual-process models of language hold that two streams of information make up communication: one is acquired, intentional, propositional, and symbolic; the other is ontogenic, automatic, non-intentional, non-propositional, and made up of signs (Buck, 1984). Communication can then convey both discrete, constructed meanings and implicit, almost unnoticeable information, which occur simultaneously and interact in all social contexts, developing over the course of one's life (Buck & VanLear, 2002). Generally, verbal language is propositional in that statements can be logically negated: the utterance "the plane took off" can be negated using evidence available to the listener (the plane is still on the ground). Verbal language is symbolic as it is made up of units (spoken or signed) that generally exhibit a bound form-meaning relationship with discrete meanings. The ability to produce these meanings is learned. That is to say, although words may be understood in many ways, the broad possibilities of meaning are restricted to a defined number of possible interpretations at one particular moment in time (e.g., "cool" may indicate a lower temperature or an informal statement of approval, but it is unlikely to mean "feline"). Nonverbal behavior, on the other hand, is generally of the spontaneous category. Information encoded by speakers in the face, eyes, sound of the voice, or body may be unintentional, but may reflect ongoing internal or social processes. Listeners may decode this information implicitly and unknowingly. These behaviors are often non-propositional in that they cannot be negated; the surprise that someone appears to show as a plane takes off carries no logical statement to negate (though the perceived emotion may be questioned: "don't act surprised!"). They are non-symbolic in that a gestural flourish may not have any specific meaning the speaker intended to convey; the flourish may instead be a metaphor for the verbal content of the speaker. Although verbal information is conveyed in an almost entirely symbolic, intentional way, nonverbal information can be either symbolic/intentional or spontaneous/unintentional (Kendon, 2004; McNeill, 1992, 2005). For example, if a teacher asks a student a question ("Did you see the plane take off outside?") and the student either nods their head or gives a thumbs-up emblematic gesture, this information is conveyed intentionally and symbolically. The nod and the thumbs-up gesture both convey agreement; they are intentional, and they can be negated if the teacher in fact knows that the student did not see the plane take off.
Likewise, if the student is describing the plane taking off and simultaneously uses a hand movement in a rising fashion, this representational gesture adds intentional and symbolic meaning to the verbal utterance. If the student said "the plane landed" while using the same rising gesture, there would be a breakdown in the propositional meaning of the combined information. The student may also use other symbolic, intentional gestures when speaking to manage the interaction with the teacher. However, alongside these symbolic moves will be unintentional, non-propositional information conveyed by the student's face, hands, and body. The student may show an expressive, excited reaction in the face and use an upright, engaged posture. These behaviors color the information conveyed, adding affective meaning to the verbal utterances. Gestures, in particular, "may reveal systems or richer underlying distinctions than are apparent in speech alone… that is, semantic distinctions not apparent in speech may instead appear in gestures" (Gullberg, 2022, p. 321). In other words, the information nonverbal behavior conveys lies on a cline rather than a strict dichotomy between propositional and non-propositional. The two feed into the same system (Buck & VanLear, 2002) and influence the interpretation of the overall message. A visualization of this relationship is presented in Figure 2.2, which I designed drawing from my understanding of the literature presented above. The ombre color in the background suggests that both verbal and nonverbal information can convey both types of meaning, and the arrows suggest that they feed into each other.

Figure 2.2 The Relationship Between Propositional and Non-propositional Meaning

The example of the thumbs-up gesture showed that nonverbal behavior can convey meaning that is propositional, intentional, and symbolic; one category of such behavior is emblematic gestures. These gestures are those that have a codified symbol and interpretation in a given sociocultural context (Kita, 2009; Morris et al., 1979). Examples include the hand balled into a fist with the thumb raised (thumbs up), the hand closed with the index and middle fingers raised (peace sign), and the palm facing up with all fingers touching together (in Italian, mano a borsa). The first two are common in American English contexts, though the second, if turned so that the palm faces the speaker, is pejorative in British English contexts. The third is commonly used in Italian for emphatic purposes. These gestures have meanings that are culturally bound and shared by particular groups of individuals. Some studies have shown that learners orient towards these culturally bound L2 gestures, facilitating the acquisition of language (Allen, 1995) and also of the gestures themselves (Belío-Apaolaza & Hernández Muñoz, 2021). However, most gestures are not as rigid in their form-function relationship, belonging instead to the category of spontaneous communication. These are not codified or lexicalized, yet they sometimes still provide semantic information through lexical or syntactic forms. Some spontaneous gestures fill linguistic functions, such as identifying referential content in deictic expressions by filling syntactic structural slots (saying, for example, "Go!" while pointing at the bedroom) or substituting for particular pragmatic speech acts (waving and smiling to greet someone) (Gullberg et al., 2010).
Others, such as the hand rising while describing a plane taking off, or an upward motion of the head, reinforce the lexical-semantic content by providing a visual representation of motion events (Cadierno, 2004; Choi & Bowerman, 1991). These iconic gestures (gestures closely aligned with semantic content) always co-occur with speech (Graziano & Gullberg, 2018; McNeill & Duncan, 2000) and synchronize with it at phonological, semantic, and pragmatic levels (McNeill, 1992). Spontaneous, co-speech gestures have the same meaning as the speech utterance, and they have the same pragmatic functions (Kendon, 1980; McNeill, 1985, 1992). McNeill (1992) and others hypothesized that spontaneous gesture and speech form a single integrated system (Clark, 1996; Engle, 1998; Goldin-Meadow & Alibali, 2013; Kendon, 2004), where nonverbal behavior provides imagery in a "language-gesture dialectic" (McNeill, 2005, p. 25). An example of the lexicosyntactic encoding of gesture is found in motion events. Languages differ in how they encode manner and path of motion syntactically and lexically (Slobin, 2006; Talmy, 1985, 2000), and gestures may vary accordingly in how they align with path motion in speech (e.g., Özyürek & Kita, 1999; Slobin, 1996). For example, verb-framed languages encode directionality within the verb itself (e.g., Spanish subir, 'go up'), while satellite-framed languages such as English encode directionality on a unit such as a particle (e.g., up in go up; go has no directional meaning). Spanish speakers thus coordinate gestures corresponding to path with verbs (Negueruela et al., 2004; Stam, 2006), while English speakers tend to coordinate them with the satellite (McNeill & Duncan, 2000; Slobin, 1996; Stam, 2006). Studies of acquisition have found some evidence of transfer in the placement of target-like path gestures among learners of languages typologically different from their own (Gullberg, 2009a, 2009b; Stam, 1998, 2017), as well as in other iconic gestures (McCafferty & Ahmed, 2000), though even highly proficient speakers may remain entrenched in their L1 gestural patterns (Choi & Lantolf, 2008; Stam, 2008, 2010). Thus, it is possible that some types of nonverbal behavior conveying semantic information can be acquired. The bulk of research on semantic aspects of nonverbal behavior has involved gesture because of its special relationship to speech; less is known about the semantic aspects and/or acquisition of other forms, such as nodding and head shaking. In Cienki's (2012) integrated view of language and behavior, spoken language is the primary mode for the generation and communication of ideas; all other modes—including but not limited to gesture, facial expressions, and paralinguistics—take on symbolic and communicative roles depending on contextual needs. The boundaries of language are then flexible, with context determining whether a mode contributes additional meaning. For example, in a telephone call, the unseen body conveys no meaning, while in an emergency, face-to-face setting, the face and body may convey the majority of information necessary to interpret the severity of the encounter. Meaning, in Cienki's (2012) view, can be migratory depending on needs. One can agree with an interactant by saying so, by using an emblem (thumbs up), or by nodding their head. In teleconferencing, for example, hands are generally less visible due to the limited viewing angle of the camera.
If an interactant wishes to show agreement in a group of several others, they may avoid speaking, as this would interrupt the call, and instead purposefully show their hands on the screen in a thumbs-up gesture or nod emphatically, which may be less common in face-to-face settings (Mark et al., 2023). In this framework, "a family of meanings is thus dynamically paired with a family of forms" (Morgenstern & Goldin-Meadow, 2022, p. 6). Cope and Kalantzis (2020) and Kalantzis and Cope (2020) developed a framework for a multimodal grammar that takes into account the transitional, migratory nature of meaning across modalities, including speech, body, sound, space, and even images and text. For them, meaning is unbound from modality and may transition from one to another depending on the constraints defined by context.

Cognitive information

Various theories exist regarding the link between gesture, speech, and thought. The majority of these posit that all three are linked, but they differ in the degree to which speech and gesture form an integrated (McNeill, 1992, 2005) or co-orchestrated system (Kendon, 2004, 2007; Goldin-Meadow & Brentari, 2017). As mentioned earlier, McNeill (1992, 2005) and McNeill and Duncan (2000) theorized that some nonverbal behaviors—in particular, spontaneous gestures—originate in the same cognitive processes as speech. In their growth point theory, gestures (and perhaps other behaviors) thus offer glimpses into cognition (Goldin-Meadow & Alibali, 2013), and gestures may lighten the cognitive burden of the production of speech (Cassell et al., 1999; Goldin-Meadow et al., 2001). The interface hypothesis (Kita & Özyürek, 2003), which also treated gestures and speech as interlinked, considered the interplay between visual and linguistic thought. Other theories considered behavior as a facilitator of lexical retrieval (a compensatory system), such as the lexical retrieval hypothesis (Krauss et al., 2000), or as instrumental in the construction and representation of visual thought to be verbalized, as in the information packaging hypothesis (Alibali et al., 2000). Despite the different ways each theory connects behavior and cognition, all account for some degree of integration:

Co-speech gestures, which are often produced without conscious awareness, are synchronous with speech, cannot be understood independently of speech, perform similar pragmatic functions as speech, and are multifunctional in that they perform both cognitive and communicative functions often at the same time. (Stam & Tellier, 2022, p. 336)

Perhaps because of this very tightly integrated system, gestures even parallel breakdowns during speech disfluencies in L2 speakers (Graziano & Gullberg, 2018; Seyfeddinipur, 2006). Different cognitive mechanisms may influence behavior, such as affect, context, and language proficiency. As will be discussed in coming sections, affective states can result in varying autonomic and behavioral responses in speakers, such as anxiety causing changes in averted gaze, relative expressiveness, and a higher number of self-adapting behaviors (Gregersen, 2005; Lindberg et al., 2021, 2022). Some cognitive mechanisms may be context-dependent, such as pupil dilation increasing as a reflection of greater task difficulty (van der Wel & van Steenbergen, 2018) and stimulus familiarity (Heaver & Hutton, 2011; Otero et al., 2011). Underlying L2 ability, of interest to this study, may also impact the differential production of gestures.
In bilinguals, speakers may gesture more in their less dominant language (Aziz & Nicoladis, 2018; Benazzo & Morgenstern, 2014; Gullberg, 2006, 2012; Krauss & Hadar, 1999; Nicoladis, 2007; Nicoladis et al., 2007), though these results have been contested (Gullberg, 1998; Laurant & Nicoladis, 2015; Nicoladis et al., 1999; Sherman & Nicoladis, 2004). In a study of 75 Spanish language learners at beginner, intermediate, and advanced levels, Gregersen et al. (2009) found a pattern that departed from these previous findings, with gestural output differing by proficiency level. They found fewer illustrators—co-speech gestures that enhance the speaker's meaning—in video recordings of speakers with lower proficiency, while more advanced speakers gestured more often in meaning-enhancing ways: "By using more illustrator gestures, advanced learners reinforced grammaticality, used visual discourse markers, strategically reinforced meaning through the visual channel, and, in general, responded with sociolinguistic gestural dexterity" (Gregersen et al., 2009, p. 205). They also found that learners at beginner and intermediate levels used more self-adapting gestures, such as hand fidgeting and adjusting clothing, than more proficient speakers. There were no group differences in the use of compensatory gestures to convey meaning when lexical retrieval was delayed or resulted in a breakdown. Lin (2022) found similar results in Chinese speakers of L2 English, with a greater number of illustrators and beats (gestures indicating phonological stress and rhythm) at advanced levels, while less proficient speakers used a greater number of deictic gestures (pointing) and compensatory movements. Learners with differing proficiency profiles may also differ in the abstractness or concreteness with which they use deictic gestures (i.e., pointing) when specifying semantic referents (So et al., 2013). Behavior can also affect speech comprehension and production by enhancing the perception, interpretation, reactivity, and memory storage of utterances (Beattie & Shovelton, 1999; Cohen & Otterbein, 1992; Drijvers & Özyürek, 2017; Graham & Argyle, 1975; Holler et al., 2018; Kelly et al., 1999; Tellier, 2008), including L2 speech (Hardison & Pennington, 2021; Morett, 2014). Neurocognitive studies have shown that the brain processes and decodes the two modes in similar ways (Özyürek & Kelly, 2007; Özyürek, 2014). Co-speech gestures may facilitate a reduction in the load on limited working memory resources when speaking (Cook et al., 2012; Krauss et al., 2000) and serve compensatory roles when verbal resources are limited (Frick-Horbury & Guttentag, 1998; Hostetter & Alibali, 2007). Gesture, when restricted, can also lead to the increased production of disfluencies or otherwise limited language use in the weaker language or L2 (Graham & Heywood, 1975; Laurant & Nicoladis, 2015; Rauscher et al., 1996). When cognitive load is increased, such as when listening to a difficult question, speakers may orient away from the visual input available to them by averting their gaze (Doherty-Sneddon & Phelps, 2005; Doherty-Sneddon et al., 2002; Glenberg et al., 1998); this can free up cognitive resources, allowing the speaker to provide an answer to the question. For example, in Burton (2023), greater proportions of averted gaze were found to be the result of increased question difficulty in an online L2 speaking test.
Although speakers looked away from the camera/interlocutor more when preparing their answers to more difficult questions, question difficulty had no relationship with blinking frequency. In these studies of gaze directional changes, the task, or affective changes due to task difficulty, influenced behavior, which was then hypothesized to benefit cognition. Facial behavior can also have unconscious physiological effects on listeners. There is evidence that seeing spontaneous communication in the faces of others creates natural neurocognitive linkages between interactants (Morris et al., 1996; Suslow et al., 2006); this happens when brain states reach a type of unity between speakers that exerts a powerful influence on emotions and social organization (Buck & Powers, 2006). In some hypotheses, the brain interprets behavior in others and, to some extent, attempts to replicate it in the listener, such as in the active intermodal matching hypothesis (Meltzoff & Moore, 1997) or the mirror neuron system (Rizzolatti & Craighero, 2004). Viewing certain facial configurations can lead to differing cardiac (Levenson & Ekman, 2002) or respiratory responses (Boiten, 1996), perhaps due to the tight connection between the face and emotion (Levenson et al., 1990). In short, behavior can have a reactive effect on listeners, resulting in unconscious behavioral changes (such as synchrony, mimicking, or possibly aversion) that may originate at a neurological level (Dimberg et al., 2000). Nonverbal behavior can also provide important information to listeners during conversational exchanges. It can facilitate listening comprehension in L1-L1 encounters (Drijvers & Özyürek, 2017; Goldin-Meadow, 2003; McGurk & MacDonald, 1976) as well as in the context of L2 speakers or listeners of L2 speech (Dahl & Ludvigsen, 2014; Nakatsukasa, 2016; Sueyoshi & Hardison, 2005; Tsunemoto et al., 2022). Visible gestures may help listeners predict or make inferences about the semantic nature of physical objects speakers are thinking about before these are verbalized (Pine et al., 2010). Listeners may furthermore interpret nonverbal behavior to infer cognitive, social, affective, or trait information about speakers, often completely automatically, unconsciously, and extremely quickly, within milliseconds of social interactions (Ambady, 2010; Borkenau et al., 2009; Lakin, 2006). When forming an impression of whether a speaker understood certain questions, listeners may orient towards speakers' gestures (Goldin-Meadow et al., 1992) or facial behaviors (McDonough et al., 2019, 2022b, 2023). More information about interpersonal encounters and how nonverbal communication contributes to state or trait judgements is presented in later sections.

Social-interactional information

Interaction is dynamic, spontaneous, and sequentially organized, and transitions in interactional turns between speakers are organized in observable ways (Sacks et al., 1974). Meaning is co-constructed by interactants through the exchange and reciprocation of ideas (Young, 2011), and in dialogue, speakers simultaneously convey meaning as well as other interactional information about their performance, understanding, and intentions (Clark, 2002).
As discussed in the section on L2 communication and language proficiency, nonverbal behavior has been recognized for its role in the management of social interaction in studies of L2 communication and interactional competence (Celce-Murcia, 2007; Dai, 2023; Galaczi & Taylor, 2018; Hymes, 1972; Kramsch, 1986; Plough et al., 2018; Scarcella et al., 1990). The ability to manage social interactions—interactional competence—is essentially one of organization and pragmatic function, as it describes how speakers assign turns, how these turns and actions contribute to coherent meaning making, how breakdowns are repaired so that all speakers can follow the conversation, and how specific units of interaction such as openings and closings are organized (Schegloff, 2006). Interactional behaviors convey embodied semiotic information that facilitates communication, yet they do not have direct lexicosyntactic functions (Gregersen et al., 2009; Gullberg, 1998). Along with verbal resources, these behaviors allow speakers to reach, restore, and maintain intersubjectivity (Burch & Kley, 2020; Goodwin, 2000, 2018; Hırçın Çoban & Sert, 2020; Mondada, 2014; Streeck, 2009). There are many examples of behaviors in the literature that relate to the management of interaction in both L1 and L2 settings. Some of these are studies of individual behaviors (e.g., the head poke, Seo & Koshik, 2010), while others represent more complex gestalts (Mondada, 2014) of behavior. An exhaustive list would be impractical, but some of these interactional functions and their associated behaviors are listed below:

• turn selection by using facial expressions and gestures (Streeck & Hartge, 1992) or pointing (Mondada, 2007; Nakatsuhara, 2011)
• the management of turn-taking through mutual gaze, averted gaze, shifts in gaze, and head movements (Goodwin, 1980; Greer & Potter, 2008; Rossano, 2012)
• maintaining extended turns in storytelling sequences through gaze direction and paralinguistic features (Tominaga, 2013)
• indications of turn completion through combinations of gestures and body motion with paralinguistic vocalizations (Keevallik, 2014), gesture holds (Groeber & Pochon-Berger, 2014), and gesture retraction (Mondada, 2006)
• the initiation of repair through gestures, and the mediation of meaning through raised eyebrows, mouth movements, and mutual gaze in mediation sequences (van Compernolle, 2013)
• indication of comprehension problems (trouble) through long silences, averted gaze, and smiles (Hırçın Çoban & Sert, 2020) and raised eyebrows and head displays (Oloff, 2018)
• repair initiations with mutual gaze and holding the floor for lexical retrieval with averted gaze (Burton, 2021a; Pekarek Doehler & Skogmyr Marian, 2022; Streeck, 2009)
• communicating trouble sequences and showing resolution through holds and their release (clusters of behavior held static during a period of time) (Burton, 2021a; Floyd et al., 2016; Oloff, 2018), head pokes (Seo & Koshik, 2010), and forward postural leans (Rasmussen, 2014)
• complaining sequences through the use of paralinguistic features, gestures, facial expressions, eye gaze, and posture shifts (Skogmyr Marian, 2023)
• the maintenance of intersubjectivity in paired speaking tests through eyebrow flashes and gestures (Burch & Kley, 2020)
• communicating speech acts of greetings, farewells, and introductions with gestures such as waving and haptics such as hugs and handshakes (Rylander et al., 2013)

These are just a few of the documented cases of how behaviors can contribute to interactional management in L1 and L2 settings.
Because nonverbal behaviors and their associated meanings do not have one-to-one relationships, however, exhaustive lists of behaviors and their possible meanings would not be meaningful, as the situational context provides the interpretative lens for these actions. Linguistic forms and their semantic intent also do not always have one-to-one relationships, but the possible range of associations is much more restricted. Mondada (2014) noted that participants

might choose the way in which they format a particular action—and that these choices might vary, privileging either verbal resources, a combination of verbal/embodied resources or embodied resources alone… these choices might be constrained in interesting ways in situations of multiactivity, where participants distribute resources among various concurrent courses of action and often prioritize one over the other. (p. 140)

Thus, there is no guarantee that any particular action or verbalization will occur during particular interactional-pragmatic sequences. Just as someone can as easily say "goodbye", wave, or both to a departing guest, speakers make choices depending on context and perhaps their own idiosyncratic preferences. Nonetheless, interactionally competent listener-receivers are able to interpret these various signals as particular courses of action.

Socio-affective information

One of the most powerful roles of nonverbal behavior is to convey affective information about emotions, attitudes, and stances (Kappas et al., 2013; Richmond & McCroskey, 2004; Singelis, 1994). These affective responses help to drive social interactions and may reveal information about the underlying psychological states of speakers. Because of the importance of this topic, affect is treated in an entire subsection of this literature review following this one on nonverbal behavior. However, nonverbal behavior and affect are also important drivers of social interaction, particularly through the face, which has been described as "the primary site of affect displays" (Ekman & Friesen, 1969, p. 841). Social interactions are characterized by a wide range of both verbal and nonverbal behavior which conveys meaning about the nature and status of the relationships amongst the speakers (Argyle, 1988; Mehrabian, 1972; Patterson, 1983). If one thinks of a group of friends catching up over coffee, the topic of the conversation will of course take center stage, but the visible reactions amongst the interactants will stand out as secondary sources of information. Smiling at a fellow group member may indicate deference towards that individual, or it may serve as a backchanneling device to acknowledge the other's utterance. A rise of the upper lip might appear as a sneer and could convey disdain or contempt for another member of the group, which could then lead to a topic shift or a turn ending. While these behaviors are both quite noticeable, research has shown that even more subtle, almost imperceptible movements of the body, such as movements of the arms or the face, can also impact views of other individuals (Argyle, 1988; DePaulo & Friedman, 1998). These various behaviors may play an important role in communicating power dynamics such as dominance and submission (Tiedens & Fragale, 2003). In these cases, the information being conveyed can be both affective in its attitudinal and emotional components and social in its orientation towards organizing conversation.
Some socially oriented aspects of nonverbal behavior arise from individual differences among speakers. Culture (discussed in the next section) is an important moderator of such individual differences. Sex and gender, both interacting with culture, also appear to relate to socially oriented nonverbal behavior. For example, there may be differences between men and women in sensitivity in reading nonverbal cues, comfort with proximity and touch, and proportions of mutual gaze (Hall, 1984, 2006). Personality can also exert an effect on the production of nonverbal behavior, with extraverted, relaxed individuals showing higher signs of engagement such as closer proximity, greater proportions of mutual gaze, and higher overall expressiveness such as smiling (Neumann et al., 2009; Patterson, 1983; Patterson & Ritts, 1997). Even socioeconomic status may result in differences amongst speakers. Kraus and Keltner (2009), for example, found that speakers from more advantaged socioeconomic backgrounds conveyed a greater amount of disengagement (averted gaze and attention, doodling on paper), while speakers from more disadvantaged socioeconomic backgrounds generally conveyed a higher degree of engagement, such as head nodding and laughter. In all of these cases, context and culture likely moderate the findings as well. Many other individual differences have also been found to relate to differences in nonverbal behavior (e.g., Gifford, 2013; Hall & Gunnery, 2013; Nestler & Back, 2013; Rule & Alaei, 2016). In interaction, speakers orient towards the behavior of their partner, often adjusting their behavior depending on the affective nature of the interaction. In equilibrium theory (Argyle & Dean, 1965; Patterson, 1973), speakers seek to maintain a balance between intimacy and behavioral expressiveness; behavior that violates norms of closeness, such as standing too close to a person, can be met with an opposing, equalizing behavior of moving away. Likewise, if one person maintains mutual gaze to the point of staring, the interactant may seek balance by looking away. In other cases, however, speakers may reciprocate the direction and valence of a behavior (Burgoon, 1978; Cappella & Greene, 1982); that is to say, people may stand closer together, or a higher proportion of mutual gaze may lead to a corresponding level of mutual gaze in the speaking partner. In general, negative affect can cause a compensatory effect, and positive affect can cause reciprocation, but these naturally flow back and forth between speakers in complex parallel systems (Patterson, 2013). Affect does not drive all interactions, however. To some degree, behavior seen in social settings may be mimicked unconsciously (Chartrand & Lakin, 2013), and affect can be contagious. Behavior can also lead to various subconscious impressions made about speakers (Todorov, 2017), without any corresponding action. The issues of mimicry, contagion, and impression formation will be discussed in the following section on affect. Likewise, culture may be a strong determinant in how many of these socio-affective behaviors are conveyed and understood; it will be discussed next, as well as in the section on affect.

Cultural origins of nonverbal behavior

Anyone with experience talking to people from varying cultural backgrounds quickly realizes that there are differing norms when it comes to nonverbal behavior.
In some contexts it is appropriate to shake hands when greeting someone, while in others (particularly Japan), it is customary to bow with one's hands at one's sides. In the Mediterranean, it is often customary to kiss the cheek of an individual once, twice, or even three times, depending on the country or region that one is in. There are also differences between countries in the appropriacy of proxemics, or the distance one stands from someone else, as well as in how long to hold mutual gaze. There are countless other examples. It is important to acknowledge the role of culture in nonverbal behavior in language testing settings because intercultural communication is often part and parcel of the experience. The test taker may exhibit cultural differences in nonverbal behavior that are systematically misinterpreted by raters from different cultural backgrounds, and vice versa. While the previous sections detailed differences in the various types of information nonverbal behavior can provide, this section will briefly review some of the key issues to consider when dealing with intercultural encounters. Culture provides the overall context for social encounters to occur, and it provides a framework for how social interaction should take place. Within the context of culture, Matsumoto and Hwang (2016) argued that

[t]he function of communication is to allow for the sharing of social intentions, which facilitates social coordination. Cultural norms provide rules for the regulation of expressive behaviors, including nonverbal behaviors, to allow for the sharing of social intentions as part of communication… this underlying function of nonverbal communication vis-à-vis the function of culture is universal; the cultural norms and the manifestation of those norms in actual behavior, however, are different because of the various adaptations different groups have made to survive in their ecological contexts. (p. 77)

The authors made clear here that the semantic, cognitive, interactional, and affective roles of nonverbal behavior do not change (at least substantially) across cultures. What can change, however, are the nonverbal forms presented. Some forms do appear to be mostly universal, such as the use of verbal and nonverbal resources for repair (Dingemanse et al., 2015) and the tempo of turn-taking (Stivers et al., 2009), though learners do not always use these universal forms appropriately while learning a language (Pekarek Doehler & Pochon-Berger, 2015). Most forms, however, are not universal. The same function can be conveyed by different deployments of facial and bodily expressions in different cultures, and the same behavior may be seen in different functional sequences (Crivelli & Fridlund, 2018; Fridlund, 1994). This is also true within individual cultures, as detailed in the section on interaction (mutual gaze can indicate a turn release or a repair initiation depending on context). Even regions within very similar cultural groups may develop slightly different muscle movements when conveying the same information depending on the setting (Elfenbein & Ambady, 2002). That is to say, there appear to be nonverbal "dialects" or "accents" that vary across cultural contexts in ways similar to language (Elfenbein et al., 2007; Marsh et al., 2003). Researchers first began work in the domain of behavior and culture by attempting to show that the nonverbal cues associated with affect displays were universal (Tomkins & McCarter, 1964).
Building on their work, Ekman set out a research agenda to study the universality of affect judgements. In a series of studies, he and others found agreement between images of particular facial expressions and prototypical emotional categories in different cultural settings (Ekman, 1972; Ekman et al., 1969). These studies claimed to have found six universal emotional expressions: surprise, sadness, happiness, fear, disgust, and anger. Replications and meta-analyses have provided support for these findings (Matsumoto, 2001; Matsumoto et al., 2009). Nonetheless, these studies have faced criticism due to various methodological issues (Russell, 1994; Russell & Fehr, 1987), such as the use of Western faces and the types of scales used. In any case, these studies focused on a very narrow range of emotions rather than the much broader use of behavior in society.

Context, then, is critical to how cultural meaning develops. Different cultures may have different sets of display rules that determine whether particular behaviors should be displayed or not (Ekman & Friesen, 1969). For example, it may be appropriate to show anger in a service encounter in the United States, but a display of anger in a similar context in many Asian settings would be far less appropriate. Cultures may also differ, for example, in the relationship between the public and private spheres, and individuals may be more aware of their behavior in public (Matsumoto & Hwang, 2012). Some contexts carry a cultural weight with them that may be similar across international contexts, such as a testing environment. Because testing environments, in particular those that are high stakes, are naturally imbued with uncertainty and anxiety for the test taker, behavior may change as a result of the setting. Guerin (1986) suggested that these types of contexts, where the thoughts, intentions, and feelings of others (such as the examiner) are uncertain, may cause individuals to restrict their behaviors and act more cautiously. Parkinson (2019), when discussing the importance of context and culture, noted that “[t]he emotional meaning of faces may depend on the trajectories of action that they indicate or foreshadow and the centrality of those trajectories to our culturally specific prototypical representation of the emotion in question” (p. 89).

Cultural differences can sometimes lead to differential use of the forms of nonverbal behavior. As discussed earlier, emblematic gestures may differ substantially across cultural boundaries (Morris et al., 1979). Also discussed earlier were the differences in gestures that illustrate path movement (e.g., Kita & Özyürek, 2003), though other gesture forms may also differ depending on culture, such as counting movements (Pika et al., 2009). Gaze may also differ according to national background (Hall, 1963; Watson & Graves, 1966) or ethnic group (LaFrance & Mayo, 1976), such as when some groups maintain mutual gaze longer than others. Paralinguistic cues, in particular prosody, speech rate, silence, and volume, can also convey differential emotional states or affective responses depending on culture (Sauter & Eimer, 2010; Sauter et al., 2010). For example, women in Japan may raise the pitch of their voice during telephone conversations to convey politeness, whereas Americans may be less likely to do so, relying instead on verbal mechanisms alone.
There is relatively less research available on other behaviors such as eyebrow furrowing/raising, mouth movements (e.g., smiling), shoulder movements, head movements, and posture, though these behaviors may also exhibit certain differences around the world. More on how culture impacts nonverbal behavior through the lens of affect will be discussed in the section on affect below.

Nonverbal behavior and second language assessment

From this review so far, evidence has been presented that nonverbal behavior is a core aspect of communication that can convey semantic, cognitive, social-interactional, and affective information embedded in cultural contexts. When nonverbal behavior is seen and decoded by an interactant, either consciously or unconsciously, it becomes a central aspect of interpersonal perceptions. People see others and interpret their intended meaning from speech and the body, but they also infer information about what they might be thinking and feeling. Burgoon et al. (2016) argued that nonverbal behaviors have the capacity to shape and color our perceptions of speakers, even in the presence of contradictory verbal information. Drawing on several decades of past research in the field of human communication, these authors distilled six tenets about the relationship between verbal and nonverbal communication:

1. On average, adults rely more on nonverbal cues than on verbal cues to determine social meaning.
2. Children rely more on verbal cues than adults do.
3. Adults rely more on nonverbal cues when verbal and nonverbal channels conflict than when these channels are congruent.
4. Channel reliance depends on the communication features at stake.
5. When the content in different channels is congruent, the meanings of the cues tend to be averaged together equally; when the content is incongruent, there is greater variability in how information is integrated.
6. Individuals have biases in their channel dependence. (pp. 221–224)

Burgoon et al. (2016) further stressed the co-occurring nature of nonverbal communication with verbal communication, as well as how the two interact:

Because verbal and nonverbal signals arise simultaneously, the nonverbal channels can silently monitor the sender, send and receive feedback, express emotions, and define the interpersonal relationship all the while the verbal stream is conveying linguistic content. The nonverbal cues thus become the frame of reference against which verbal interpretations are checked. (p. 226)

Considering how nonverbal behavior impacts judgements, it is important to uncover the relationships between behavior and perceived L2 proficiency, as raters have found the visual realm salient when understanding certain verbal phenomena, such as breakdowns in fluency (Choi, 2022; Nakatsuhara et al., 2021a; Nambiar & Goon, 1993). One theoretical and methodological framework that has been used when studying interpersonal judgements arising from nonverbal behavior is the Brunswik lens model (Brunswik, 1956; Nestler & Back, 2013; Hall et al., 2019), shown in Figure 2.3. This model accounts for the interpretation and accuracy of interpersonal perceptions between people. There are three core elements in the model that interact together: a speaker’s nonverbal behavior, their true underlying states or traits, and the judgement a listener makes about the individual based on impressions of their state or trait drawn from the visual realm.
The relationship between the speaker’s nonverbal behavior and their actual state or trait (measurable by asking the speaker or using validated physiological measurements such as galvanic skin response) is cue validity. In other words, this relationship defines to what extent a particular cognitive or affective element (e.g., excitement) is truly represented by nonverbal behavior (e.g., mouth agape, eyebrows raised, hands in air). The relationship between the true state or trait of the speaker and the listener’s impression of that trait is interpersonal accuracy. In other words, if the speaker feels excited about a plane taking off, and the listener correctly interprets the speaker’s emotion as excitement, the emotion was conveyed accurately; any misinterpretation results in inaccurate relationships and may suggest that particular affective responses are difficult to interpret accurately. Finally, of importance to this study is the relationship between the visible nonverbal behavior of the speaker and the way listeners interpret these cues when forming impressions, which is called cue utilization. In many non-laboratory settings, the speaker’s actual trait or state is unknown, yet listeners (raters in this study) still use visible cues when making judgements. If, for example, particular facial cues result in ratings of excitement (and perhaps other tangential judgements), one can make inferences about how the cues impact certain perceptions. This model has been used to examine relationships between many interpersonal perceptions, such as intelligence (Borkenau et al., 2009; Reynolds & Gifford, 2001), self-esteem and expressiveness/warmth (Hirschmüller et al., 2018), personality and physical appearance (Naumann et al., 2009), and extraversion and likeability (Back et al., 2011; Borkenau et al., 2009).

Figure 2.3 Brunswik Model (Adapted From Hall et al., 2019)
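To make the three lens-model relationships concrete, the short sketch below simulates them as Pearson correlations, which is how lens-model studies typically operationalize them (e.g., Nestler & Back, 2013). All variable names, sample sizes, and effect sizes here are invented for illustration; this is a minimal sketch of the model’s logic, not an analysis from any of the studies cited above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30  # hypothetical number of speakers

# Hypothetical variables: a speaker's true state (e.g., self-reported
# confidence), one visible cue (e.g., proportion of mutual gaze), and a
# rater's judgement of that state formed from the video alone.
true_state = rng.normal(size=n)
cue = 0.6 * true_state + rng.normal(scale=0.8, size=n)   # cue expresses the state imperfectly
judgement = 0.5 * cue + rng.normal(scale=0.9, size=n)    # rater leans on the visible cue

def r(x, y):
    """Pearson correlation between two variables."""
    return np.corrcoef(x, y)[0, 1]

print(f"cue validity (true state ~ cue):           {r(true_state, cue):.2f}")
print(f"cue utilization (cue ~ judgement):         {r(cue, judgement):.2f}")
print(f"interpersonal accuracy (true ~ judgement): {r(true_state, judgement):.2f}")
```

Note that when the speaker’s true state is unmeasured, as in most operational testing settings, cue utilization is the only one of the three relationships that can be estimated directly, which is why rater-facing studies such as this one focus on it.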
In terms of communicative competence, there is strong evidence for, at a minimum, a supporting role of nonverbal behavior when conveying L2 ability through speech; in other words, listeners utilize the cues of nonverbal behavior when making at least some judgements about speakers, whether broadly holistic (competent in their L2) or specific to a subconstruct, such as interactional or strategic competence. Studies of rating mode showed that the visual domain introduces sources of variance in test scores (Choi, 2022; Gullberg, 1998; Nakatsuhara et al., 2021a; Nambiar & Goon, 1993), but what this area of research still needs is documentation of how specific behaviors relate to language proficiency. A growing body of research, however, has produced findings in this area using rater reports (Choi, 2022; Ducasse & Brown, 2009; May, 2009, 2011; Jenkins & Parra, 2003; Nakatsuhara et al., 2021a; Sato & McNamara, 2019; Thompson, 2016). The authors of these studies have determined that raters notice a range of different nonverbal behaviors, in particular the direction and holding of eye gaze, facial expressions, gestures, and posture. Another line of research has used post hoc analyses of video data to investigate differences in nonverbal behaviors for different proficiency profiles without rater reports (Gan & Davison, 2011; Neu, 1990). A final line of research has taken an empirical approach, documenting nonverbal behavior through raters, discourse analysis, or both to measure the quantitative impact of behavior on L2 proficiency outcomes (Gullberg, 1998; Kim et al., 2023; Thompson, 2016; Trofimovich et al., 2021; Tsunemoto et al., 2022). Overall, although raters primarily use verbal, linguistic information when making scoring decisions, they also use nonverbal information when making decisions about L2 ability.

Rater reports. One of the most seminal papers considering the role of nonverbal behavior in ratings of L2 proficiency is Jenkins and Parra (2003). The study was a qualitative analysis (discourse and video analysis) of eight Spanish- and Chinese-speaking international teaching assistants taking part in a paired-format L2 speaking test. The authors analyzed written comments on the eight test takers’ performances left by their individual examiners, as well as think-aloud protocols on all performances from one rater. They found that several nonverbal features contributed positively to the raters’ perceptions of communicative effectiveness. These included nonverbal backchannels/receipt tokens (e.g., head nodding), displays of affect (e.g., laughing, smiling, frowning), mutual gaze, forward body leaning, and paralinguistic features. Notably, showing engagement and listening comprehension (maintaining eye contact, leaning forward, and backchanneling) was a feature that raters weighed positively when test questions were asked. Features that negatively affected perceptions included extended silence, an overall lack of expressiveness (e.g., an absence of head nodding), a lack of eye contact, a stiff posture, and non-target-like prosody and tone. Another key takeaway of this study was that test takers with borderline-passing linguistic skills were able to compensate for their weaknesses by taking an affective, actively communicative stance, which boosted their scores in comparison to other borderline speakers, while a rigid, inexpressive stance negatively impacted test takers’ scores. “By making use of nonverbal features of conversational interaction, [test takers] can convince the evaluators that they have a higher level of proficiency than may in fact be the case from a purely linguistic perspective” (Jenkins & Parra, 2003, p. 100).

Three other studies discussed nonverbal behavior in regard to interactional communication (Ducasse & Brown, 2009; May, 2009, 2011), using raters to uncover elements that factored into the raters’ judgements. Ducasse and Brown’s (2009) raters oriented strongly towards nonverbal behavior as contributing to interactional competence, noting “the presence or lack of a ‘physical’ nonverbal fluency” (Ducasse & Brown, 2009, p. 433). Raters commented that mutual gaze was a positive attribute, while averted gaze was a negative attribute, without any reference to the conversational context of the gaze behavior. Gesture also contributed to positive and negative judgements: positive for co-speech gestures that illustrated affect, and negative for gesticulation that was disconnected from speech or overly frequent. Similar to Jenkins and Parra (2003), nonverbal backchannels were seen as evidence of attention and listening comprehension and were perceived positively. May (2009) analyzed four raters’ comments about 12 test takers’ interactions, considering elements contributing to co-constructed interaction.
She also found evidence of nonverbal behavior contributing to successful communication. Closed-off body language, such as averted gaze and relative inexpressiveness, was perceived negatively, while establishing eye contact, gesturing appropriately, and nodding helped to sustain successful interaction. May (2011) extended this study and again found strong evidence for the role of nonverbal behavior in interactional competence. Notably, behaviors showing a desire to communicate (maintaining mutual gaze, expressiveness, nodding, and using gestures) were perceived as positive, while behaviors showing disinterest (avoiding eye contact, inexpressiveness, relatively rigid posture, facing away from the interactant) were perceived as negative. She noted that

body language, although not mentioned in the rating scale, featured so prominently in the raters’ evaluation of interactional effectiveness at all stages of the rating process. Raters’ perception of body language… is also linked to assertiveness through communication, working together cooperatively, and contributing to authentic discussion. (May, 2011, p. 137)

May (2011) made a strong case for the interpretation of behavior through affect, which will be described in more detail later. In a similar format, Sato and McNamara (2019) also considered factors that led to impressions of communicative success, but they used untrained, novice raters instead of operational raters. As described earlier, the authors found that a range of nonlinguistic criteria were related to communicative competence, of which nonverbal behavior was a small but important category, making up about 10% of the comments. Similar to the previous studies, co-speech gestures were seen positively, while gesticulation through self-adaptors (e.g., playing with a ring) was seen as irrelevant to the context, appeared to indicate anxiety, and was perceived negatively. Eye gaze was also an important behavior for these raters as it signaled various affective stances. Mutual gaze again was seen positively, and averted gaze was seen negatively without exception. As with May (2011), Sato and McNamara (2019) found that nonverbal behavior and affect were closely linked in the raters’ judgements, as raters oriented to the socio-affective meaning of the test takers’ nonverbal behavior.

Two additional studies (Choi, 2022; Nakatsuhara et al., 2021a) primarily considered the impact of rating mode (video vs. audio-only) and were discussed previously. However, these authors also elicited rater reports to investigate how raters perceived performance under the different rating conditions. Nakatsuhara et al. (2021a) found that a combination of mouth, eye, hand, and body movements helped enhance comprehensibility when pronunciation and fluency were less controlled. Behaviors also led to a clearer understanding of the test takers’ affective stances, such as engagement, confidence, and the desire to communicate. Importantly, their raters mentioned that during disfluencies, nonverbal behavior provides evidence of whether breakdowns are due to a lack of comprehension, inadequate linguistic resources, or considerations of content. Understanding these distinctions was critical for raters when assigning appropriate fluency scores, as the scale categories included rationales for breakdowns. Raters also mentioned that inaccuracies were less noticeable when nonverbal behavior was visible.
Overall, the authors remarked that video-based rating gave examiners a more complete view of the test taker’s communicative competence, which therefore made them more confident in the scores they awarded. Choi (2022), in a similar type of analysis, considered the verbal reports of eight raters when rating in three modes: audio-only, video with the examiner visible, and video without the examiner visible. She also found that video-based rating was more informative than audio-only rating, as raters had a fuller picture of test takers’ performances. “Facial expressions, eye-gaze, and head orientation were commonly mentioned nonverbal cues that affected raters’ perception towards test takers, signaled test takers’ struggle during their responses, and showed test takers’ focus during responses” (Choi, 2022, p. 151). Averting gaze by looking down, for example, was mentioned as evidence of problems with linguistic access, while problems accessing content were shown through shifting gaze back and forth. Similar to Nakatsuhara et al. (2021a), nonverbal behavior provided important information about test takers’ fluency, as seeing behavior allowed raters to make more informed judgements. Also similar to Nakatsuhara et al. (2021a), nonverbal behavior, in particular gaze and head orientation, informed the raters of affective traits, namely engagement, desire to communicate, and confidence. Mutual gaze was preferable to the raters, while averted gaze was seen negatively. An interesting finding in this study was that some raters found the visible rating format to be distracting, as processing nonverbal behavior took focus away from purely linguistic features. This may have been because the raters were used to rating formats in which accuracy was emphasized, which also links back to Nakatsuhara et al.’s (2021a) finding that inaccuracies were less noticeable in the video format.

Behavior at proficiency levels. Neu (1990) investigated nonverbal behavior in relation to communicative success. Drawing from the literature at the time, she posited that the following were important to communicative success:

1) the appropriate use of meaningful gestures
2) nonverbal gestures in aid of speech difficulties
3) nonverbal gestures appropriately synchronized with the verbal channel
4) the degree of synchrony between interactants (Neu, 1990, p. 122)

She conducted a case study of two individuals taking a placement test for international students, choosing these two based on their similar scores yet diverging performances. She analyzed the performance data using the Foster system of discourse analysis (described later), transcribing verbalizations, facial/head movements, gestures/arm movements, and posture. She found that after discourse analysis, one test taker, Yama, had been rated lower than expected. She hypothesized that his lower ratings may have been due to his relatively inexpressive stance, with few facial expressions, shifts in posture, or nonverbal backchannels. Furthermore, this test taker used gestures that were not aligned with speech, were complex, and occurred highly frequently, which gave the impression of struggling. The second test taker, Ahmed, appeared to have inflated scores when compared with his actual discourse. Neu noted that his body posture was much more relaxed and dynamic than Yama’s, giving an air of confidence.
He also used eyebrow raising, foot tapping, nodding as an interactional device, and tilted his head frequently to show engagement, to initiate repair, and to function as gestural beats indicating semantic units of speech. His gestures co-occurred with speech and were overall simpler than Yama’s.

Ahmed… appears to take control of the process…[his nonverbal behaviors] allow Ahmed to compensate for his weak verbal skills. When Ahmed cannot understand something, he bluffs his way through… by being nonverbally strategically competent in conversational interaction, Ahmed gives the impression of being more verbally communicatively competent than he is. Because Yama has not acquired nonverbal strategic competence in conversation, his verbal channel is perceived as less fluent than it really is. (Neu, 1990, p. 136)

Thus, similar to Jenkins and Parra (2003), Neu (1990) largely determined that showing engagement, expressiveness, and confidence through nonverbal behavior were critical elements that helped Ahmed, and the lack of such expressions brought Yama’s scores down. Their nonverbal behavior became critical to their unfolding conversations given its role in managing interactional moves of topic initiation and turn-taking.

Gan and Davison (2011) used multimodal conversation analysis to document gestures that test takers used in a group-based interactive speaking test in Hong Kong. The analysis consisted of interactions within two groups characterized by different score profiles: low and high. Using McNeill’s (1992) categorization system, they coded iconic, metaphoric, beat, and deictic gestures, but not self-adaptors or emblems. They found that the higher scoring group used co-speech gestures to provide detailed lexical meaning to utterances, emphasize particular ideas and suggestions, and aid in interactional management. Although not the focus of the study, these test takers also integrated eye contact, facial expressions, and body posture into their interactional moves. The lower scoring group displayed markedly different nonverbal behaviors in their group interaction. Some interactants were fairly rigid with very few visible nonverbal behaviors, while others used gestures that did not align with speech and were self-adapting in nature (e.g., scratching hair). These unsynchronized gestures indicated problems accessing language and aligned with breakdowns in fluency. These learners, however, did use deictic gestures (i.e., pointing) to assign turns as a form of interactional management. The authors concluded that “synchrony of gestures with speech and other nonverbal acts proves interactionally, linguistically, and cognitively challenging, and their gestures seemed to be utilized predominately at the paranarrative level and to be involved in self-organization processes” (p. 116).

Quantitative studies. Gullberg (1998) was one of the first researchers to use a quantitative analysis to study the impact of nonverbal behavior—limited to gestures—on ratings of overall proficiency. She had a relatively small sample of 20 raters score 20 narrative speech samples in French and Swedish (10 raters and 10 samples per language). In each language set, half of the samples came from L1 speakers of the language and the other half from L2 speakers. The raters scored the samples on a 5-point Likert scale of overall linguistic level with descriptions of only the endpoints (1 was the lowest, 5 was the highest).
She found that the number of perceived gestures (observed by the raters) was the only variable that correlated with proficiency evaluations, and it correlated quite strongly at .75. In other words, the more that raters noticed gestures, the higher they rated the test taker’s L2 proficiency. The actual number of gestures produced (tallied by Gullberg) did not correlate with proficiency for most gesture types; the exception was iconic gestures, which related to meaning-making. An additional interesting finding was that raters’ perceptions of gestures were marked by error in comparison to the actual gestures produced. Raters only noticed some of the gestures, under- and overestimating the amount of gesturing in different speakers. “What constitutes ‘many’ gestures must therefore be assumed to reflect qualitative differences between gestures, or gesture types, with respect to how perceptually salient they are, in a broad sense” (Gullberg, 1998, p. 203).

In a small-scale, mixed methods study, Thompson (2016) analyzed the relationships between the frequency of eye contact, gestures, and smiles and holistic score outcomes using the IELTS rating rubric. She also considered modality differences between audio-only and audiovisual rating, and she followed up with raters by conducting stimulated verbal recall sessions. She recruited four Canadian raters to conduct four practice oral proficiency tests with four mainland Chinese test takers. She found that some behaviors appeared to be associated with score outcomes. For example, a higher rate of self-adaptors (e.g., head scratching), a greater amount of mutual gaze, and non-Duchenne smiles (less authentic smiles in which the muscles around the eyes do not move) were associated with lower scores. In at least one test taker, more representational gestures and authentic, Duchenne smiles may have compensated for limited lexical resources to enhance scores. Additionally, ratings of fluency and pronunciation were found to be consistently higher when rated with visual information present in the audiovisual mode, though broader differences in modality were inconclusive. Although the sample was quite small and the findings largely inconclusive, the study is notable as one of the only studies to consider the impact of specific behaviors on discrete proficiency outcomes.

Tsunemoto et al. (2022) considered the relationships between various facial behaviors and gestures and ratings of comprehensibility, accentedness, and fluency. They had 60 novice raters assess videos of 20 L2 English speakers with Chinese or Spanish as their L1 narrating a story in English. Ratings were conducted in three audiovisual conditions: audio with a static image, audio with only facial behaviors visible, and audio with both face and body visible. For behaviors, they tallied raw frequency counts of head movements (tilts, shakes, and nods), eyebrow movements (raises and frowns), averted gaze (looking away, up, aside, and down), blinking, smiling, laughing, pursed lips, referential gestures (iconic, metaphoric, and deictic), and beat gestures. Access to the face, and to the face and body, was associated with progressively higher comprehensibility ratings. That is, comprehensibility scores were higher with a visible face than with a static image, and higher still when both the face and body were visible. Lower accentedness scores, however, only occurred when the face and body were both visible.
There were no differences in fluency scores as they related to viewing condition, in contrast to other studies on modality differences (e.g., Choi, 2022; Nakatsuhara et al., 2021a). They additionally found no significant differences in behavior production as a result of cultural background, nor did cultural background interact with the scores awarded. Some specific behaviors correlated with speech scores. They found that more frequent eyebrow movements (both raises and frowns) related to less accented and more fluent speech. They speculated that eyebrow movements may have enhanced the prosodic information available to the viewers, signaling content and indicating phrase boundaries, possibly in combination with hand gestures. Averted gaze was associated with higher comprehensibility. They speculated that gaze aversion may have helped speakers in their cognitive processing, leading to enhanced performance. They further considered the possibility that gaze aversion “was expected by an external observer, with the consequence that this visual behavior alleviated at least some processing burden for the rater” (Tsunemoto et al., 2022, p. 678). Other behaviors, including behaviors indicating positive emotions, were not related to speech ratings.

Using data from the same corpus as Tsunemoto et al. (2022), Kim et al. (2023) selected a different set of 40 L2 participants taking part in paired interaction rather than monologue narration. The participants were from a diverse range of L1 backgrounds. The L2 participants recorded perceived fluency scores about their conversational partner on a 100-point sliding scale with disfluent and fluent as endpoints. The authors extracted raw frequency counts of the same nonverbal behaviors as Tsunemoto et al. (2022): head movements, eye gaze direction, eyebrow and mouth movements, and gestures (representational and beats). They found positive relationships of eyebrow movements and mouth movements (smiling) with fluency (.31 and .34, respectively), but a negative relationship between non-beat representational gestures and fluency. They speculated, similar to Tsunemoto et al. (2022), that eyebrow movements served to highlight speech prosody, which tied into judgements of fluency. They further suggested that smiling led listeners to judge their conversational partners as less anxious and more engaged, which served as a proxy for fluency measures. Regarding representational gestures, Kim et al. (2023) speculated that gestures broke the continuity of speech, shifting conversational partners’ attention away from the interaction and making the speaker appear less fluent. However, these findings contradict broad agreement in the literature that representational gestures co-occurring with speech can serve to facilitate speech processing for listeners (Drijvers & Özyürek, 2017; Jenkins & Parra, 2003; Kelly et al., 2008; Gullberg, 1998, 2006).
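The analytic core shared by these frequency-count studies can be illustrated with a short sketch: tally how often each behavior occurs for each speaker, then correlate the tallies with mean ratings. The counts and ratings below are fabricated for illustration and are not data from Tsunemoto et al. (2022) or Kim et al. (2023); published analyses also typically normalize counts for speech duration and model rater variance rather than stopping at simple correlations.

```python
import numpy as np

# Fabricated tallies for 10 hypothetical speakers and their mean fluency
# ratings on a 100-point scale (disfluent to fluent endpoints).
behaviors = {
    "eyebrow movements":         np.array([4, 9, 2, 7, 11, 3, 8, 6, 10, 5]),
    "smiles":                    np.array([1, 5, 0, 3, 6, 1, 4, 2, 5, 2]),
    "representational gestures": np.array([7, 3, 9, 4, 2, 8, 3, 6, 2, 7]),
}
fluency = np.array([52, 78, 40, 66, 85, 47, 72, 60, 81, 55])

# Correlate each behavior's raw frequency counts with the ratings.
for name, counts in behaviors.items():
    r = np.corrcoef(counts, fluency)[0, 1]
    print(f"{name:>25s}: r = {r:+.2f}")
```

In these fabricated data, eyebrow movements and smiles pattern positively with fluency and representational gestures negatively, mirroring the direction (though not the size) of the relationships Kim et al. (2023) reported.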
Similar to the previous two studies, Trofimovich et al. (2021) investigated the impact of various measurements of engagement on comprehensibility in the same corpus as Tsunemoto et al. (2022) and Kim et al. (2023). Amongst these measures were nonverbal backchanneling (nodding) and measures of positive affect. They counted the raw frequency of nods, smiles, and verbalized mentions of positive affect in 36 dyads from various national and linguistic backgrounds. Comprehensibility was measured by each partner in the dyads. The authors found that comprehensibility was predicted by a greater number of nods and the use of encouraging verbal behavior. Positive affective behaviors were not a significant predictor in the model, but the authors noted that these behaviors occurred infrequently. Nonetheless, they did correlate positively with measures of comprehensibility.

Summary. To summarize the findings from studies relating nonverbal behavior to L2 proficiency, Table 2.1 lists behaviors, their effects, and the sources documenting those effects. It can be seen that mutual gaze is generally regarded positively, as it showed engagement and confidence, a desire to communicate with the examiner, and occasionally listening comprehension. Averted gaze is almost always seen as negative, as it indicated some sort of struggle or disengagement. Similarly, expressive behavior (eyebrow movements, head movements, smiles, and other less defined body movements) is almost always seen as a positive attribute. These behaviors may indicate interactiveness and engagement but could also highlight prosodic information that may likewise have indicated greater spoken fluency. Any lack of such behavior (including extended silence) may be regarded as negative, especially if an inexpressive stance characterizes the interaction. The effect of gesture depends on where the gesture occurs and its frequency. Co-speech gestures are almost always seen positively, as they add important lexicosyntactic information to speech and help test takers to manage interactions. Gestures occurring during silences or occurring too frequently are perceived negatively due to their compensatory role during breakdowns. This is frequently the case with self-adaptors, which are seen as a coping mechanism. Importantly, many studies observed that nonverbal behaviors convey affective information, such as confidence and engagement, that raters use when perceiving different aspects of language proficiency.

Table 2.1 Summary of Nonverbal Effects on L2 Perceived Proficiency

Mutual gaze. + Engagement, desire to communicate, confidence, and listening comprehension; – could associate with lower scores (Thompson, 2016). Sources: Choi (2022); Ducasse & Brown (2009); Jenkins & Parra (2003); May (2009, 2011); Nakatsuhara et al. (2021a); Sato & McNamara (2019)

Averted gaze. – Negative impact, indicating anxiety or problems accessing language or content; + can enhance comprehensibility (Tsunemoto et al., 2022). Sources: Choi (2022); Ducasse & Brown (2009); Jenkins & Parra (2003); May (2009, 2011); Nakatsuhara et al. (2021a); Sato & McNamara (2019)

Smiling (mouth). + Enhanced fluency, raised engagement, lowered perception of anxiety; + possible relationship with comprehensibility; + if Duchenne (authentic), – if non-Duchenne (inauthentic). Sources: Kim et al. (2023); Thompson (2016); Trofimovich et al. (2021)

Raised eyebrows. + Interaction management; + lowers accentedness; + raises fluency due to prosodic cues. Sources: Kim et al. (2023); Neu (1990); Tsunemoto et al. (2022)

Eyebrow frowning. + Lowers accentedness; + raises fluency due to prosodic cues. Sources: Kim et al. (2023); Tsunemoto et al. (2022)

Head nods. + Engagement, listening comprehension, and interaction management; + comprehensibility; – negative impact if missing. Sources: Jenkins & Parra (2003); May (2009, 2011); Neu (1990); Trofimovich et al. (2021)

Head tilts. + Engagement and interaction management. Source: Neu (1990)

Leaning forward (posture). + Engagement and listening comprehension. Source: Jenkins & Parra (2003)

Leaning backward (posture). + Confidence, low anxiety. Source: Neu (1990)

Rigid posture. – Negative impact. Sources: Gan & Davison (2011); May (2009, 2011); Neu (1990)

Representational gestures (iconic, metaphoric, and deictic). + Co-occurring with speech, added lexicosyntactic information; + interaction management (deictics); – in silence, gesticulated, or too frequent; – lowers fluency, breaks attention (Kim et al., 2023). Sources: Ducasse & Brown (2009); Gan & Davison (2011); Gullberg (1998); May (2009, 2011); Neu (1990); Sato & McNamara (2019); Thompson (2016)

Beat gestures. + Emphasizing semantic units; + interaction management. Sources: Gan & Davison (2011); Neu (1990)

Self-adaptors. – Indicated anxiety, problems with lexical access, and breakdowns in fluency. Sources: Gan & Davison (2011); Sato & McNamara (2019); Thompson (2016)

Extended silence. – Negative impression if extended. Source: Jenkins & Parra (2003)

Expressiveness. + Positive impact. Sources: May (2011); Jenkins & Parra (2003); Neu (1990)

Inexpressiveness. – Negative impact. Sources: Gan & Davison (2011); Jenkins & Parra (2003); May (2009, 2011); Neu (1990)

As the various authors noted, the findings generally align with those from the broader literature on nonverbal behavior in that behavior can convey cognitive, semantic, social-interactional, and affective information. When L2 speakers use these behaviors in target-like ways that show engagement, confidence, and a desire to communicate, they are seen positively. When behaviors show anxiety, do not align with speech, occur in non-target-like ways (being too frequent), or are generally rigid and absent, L2 speakers are perceived as less proficient. Raters use behavior to infer sociocognitive information about speakers—language proficiency—often through an affective lens. It is also possible that raters are impacted by affective responses, altering their own emotional stance towards the interview and coloring their judgements about the test takers. Because of the importance of nonverbal behavior’s ability to convey affect, this review now turns to that topic in more detail.

Affect

Analogous to the form-function relationship between language and semantics, nonverbal behavior (a set of forms) conveys information to listener-interactants. The previous section covered a sample of the cognitive, affective, and interactional information that nonverbal behavior conveys. Affect, however, has received substantial attention in the literature due to its close connection with body language, and because of its importance in human communication, this review will now cover the topic in more detail. Affect may include mood, emotions, interpersonal attitudes, and other personality states or traits. It is present alongside the semantic content of utterances in all types of communication. We infer stances and feelings from written texts based on the choice of words and discourse formulation on the page, but in speech we are able to draw from the repertoire of both language and nonverbal behavior to make inferences about the thoughts, feelings, dispositions, and motivations of others. These inferences and attributions may then be used when evaluating others in interpersonal encounters. The previous section has discussed some of the forms and functions of nonverbal behavior, as well as how they can impact the evaluation of language proficiency.
In practice, individuals may be less attuned to the specific behaviors that they see and may instead formulate evaluations based on the affective responses they infer from their interactants. In this section, I begin by defining key terms, followed by a discussion of the cognitive, social, and cultural origins of affect. I will describe how affective responses relate to language achievement, how they impact interpersonal relationships, and how affect has appeared in the language testing literature in relation to evaluations of proficiency.

Definitions

In general language use, the terms emotion, mood, and affect are often treated as the same concept. This is also true in the academic literature in theoretical and empirical studies (Briner & Kiefer, 2005), where these terms are often grouped together as affect (Frijda, 1994; Scovel, 1978) or emotion (Pavlenko, 2006). In applied linguistics research, Pavlenko (2006) lamented that emotions have often been reduced to "a laundry list of decontextualized and oftentimes poorly defined sociopsychological constructs, such as attitudes, motivation, anxiety, self-esteem, empathy, risk-taking, and tolerance of ambiguity" (p. 34). Though certainly similar, emotions, mood, and affect are distinct. Each is rooted in varying neurophysiological processes and psychosocial phenomena. Here I provide some definitions of each, though noting the caveat that disagreements on these definitions persist (Russell, 2012).

Emotion. Scherer (2005) characterized emotions as a) being a reaction to some sort of stimulus, b) varying in intensity, but often more intense than moods, c) lasting a relatively short period of time, and d) resulting in some change in behavior. Emotions may consist of cognitive, physiological, motivational, expressive, and experiential factors (Scherer, 2005). Cognitive factors refer to an assessment of the stimulus event; physiological factors are the body's internal changes due to hormonal release and reactions in the nervous system; expressive factors comprise all of the automatic, unconscious behavioral embodiment of the emotion in the body and face; motivational factors are the actions resulting from the emotion; and experiential factors refer to the subjective, personal experiences of an emotion by the individual (Lischetzke & Eid, 2003; Scherer, 2005). Experiential factors are more commonly known as feelings, which may not align exactly with the internal physiological factors at play in emotions, as individuals may misinterpret or be unwilling to communicate their true emotions (Tran, 2007). Plutchik (2001) proposed eight core emotions: joy, surprise, anticipation, trust, sadness, fear, anger, and disgust.

Mood. Affective states that are generally lower in intensity than emotions, rather diffuse in nature, longer lasting than emotions, and often without a definite stimulus event or focus are called moods (Frijda, 1994; Lochner, 2016; Tran, 2007). Moods are also more global in nature and may be comprised of unconscious background sensations (Lischetzke & Eid, 2003). Mood, unlike emotion, is purely subjective: an individual's perception of their internal state (Lochner, 2016). Similar to emotion, mood may also be due to physiological changes and result in behavioral shifts in the body and face (Lochner, 2016). Some examples of mood are general contentment, ennui, or depression.
Affect. Although affect commonly includes emotion and mood (Frijda, 1994; Scovel, 1978), it also refers to a range of characteristics aligned more closely with personality. These characteristics may include broadly pleasant or unpleasant experiences (Frijda, 1994), dimensions of personality traits or states (Diener et al., 1995; Watson et al., 1988), or attitudes (Scherer, 2005). Similar to emotion and mood, affect may also be rooted in physiological mechanisms and result in changes in bodily behavior. Affect is distinguished from emotion and mood in that while emotion and mood are generally internally realized, affect is something observed and perceived by others as a trait or state. An individual may convey a particular personality characteristic or attitude at one particular moment (a state), or they may show consistent dispositions in their reactions to experiences that appear as trait elements of their personality (Lischetzke & Eid, 2011). Affect may thus be transitory or stable for long periods of time, perhaps even a lifetime (Mehrabian, 1996). Trait affect may therefore serve as a lens that moderates the emotions or moods people experience in certain situations (Lischetzke & Eid, 2003). Examples of affect common in the SLA literature are anxiety, motivation, and willingness to communicate.

In this dissertation, the focus is on the perceived, outward-facing emotions, feelings, orientations, and stances of individuals, and for this reason the focus will be on affect. Affect best captures the range of perceptions a listener may have of another person, such as seeing them as happy, confident, or engaged. I will follow trends in psychology and group the various concepts together as affective phenomena (Tran, 2007), and I will refer to them broadly as affect. When discussing a particular instance of visible affect, I will use terms such as affect displays, affective responses, or affective reactions. What is important in the context of this study is not the actual emotion or mood the individual experiences (an internal state), but rather how others perceive and decode the nonverbal behavior of a second language speaker as an external social experience.

Cognitive origins of affect

People continuously observe and react to their environment in their everyday lives. In many of these moments, people experience feelings about what they see, and they may choose to express those feelings for a number of reasons. Perhaps because every human experiences these internal reactions to stimuli, emotions and attitudes have historically been characterized as largely internal cognitive phenomena. A person may feel happy or proud because they won an award, and thus project an outward affective expression of being focused, attentive, and confident. Another person may feel sad, angry, or ashamed because they failed a test, and they may be seen as anxious, disengaged, or non-interactive. People often attribute these affective reactions as living purely within the individual.

Research in psychology has historically supported this supposition. William James (1884) defined emotions as personal, subjective experiences that arise from an individual’s internal senses and observations of the world around them. If James is right, the body should have consistent reactions to particular phenomena, with the body and brain showing particular patterns when feeling sad, angry, or happy. Attempts to identify internal bodily patterns associated with emotions have, however, come up short.
Although there are some patterns that differentiate affective responses (Kreibig, 2010)—such as rising blood pressure’s association with anger rather than happiness—there are no consistent patterns in the autonomic nervous system that characterize each particular emotion (Siegel et al., 2018). Schachter (1964) developed a two-factor theory to compensate for the lack of a one-to-one relationship between autonomic response and emotions. In this theory, the brain manages physiological arousal in light of situational contexts, thus producing emotional responses that take both into account: heightened arousal when receiving good news would be characterized as happiness, while heightened arousal when being slapped would be felt as anger. Context, in other words, retroactively assigns valence to the arousal of the emotional experience. Although validation studies of this theory have failed to replicate the initial findings (e.g., Manstead & Wagner, 1981), the implications are important because the theory recognizes that outside stimuli are the driving force behind the valence of affective responses.

Barrett (2017) extended this nuanced view of outside stimuli interacting with internal changes in the body, showing different patterns of bodily responses characterizing particular expressions of an emotion in different contexts. Her theory moves away from claiming that the body produces a specific response, but it maintains that emotions are subjective experiences with externally classifiable representations. Other researchers have placed an even greater emphasis on the role of situational contexts in their theories. In appraisal theory (Arnold, 1960; Lazarus, 1991), it is the evaluation of particular environmental stimuli that activates emotions. This theory argues that emotions rarely appear spontaneously; they are physiological responses to objects, entities, and actions being witnessed in the world. Emotions, then, are feelings happening inside the body and also a reaction to emotionally primed situations (Frijda, 2005). The “why” behind an emotion plays just as much of a role as the emotion itself; the award the student won is what drives the happiness to begin with. Different emotions in appraisal theory are driven by dimensions of motivational relevance, motivational congruence, and accountability (Smith & Lazarus, 1993). In terms of relevance, situations that are not deemed impactful to an individual do not provoke affective responses. Incongruent events hinder our progress and cause negative emotions, while congruent events help to provoke positive feelings. Finally, accountability determines whether the source of the event is oneself or external. Thus, happiness about an award would be relevant, congruent, and externally caused, while the source of pride would be similar but internal instead of external. Nonetheless, critiques of this model have argued against this type of strict causality, leading to reformulations showing that appraisals can also be consequences of affective responses (Lazarus, 1991). Events are thus appraised in relation to the social context at hand; it is not the award itself that causes happiness, but the appraisal of having earned it after working hard for a period of time.
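As a purely illustrative toy, the appraisal dimensions of Smith and Lazarus (1993) can be read as a small decision procedure. The sketch below hard-codes the worked example from the text (an award that is relevant and congruent, attributed externally or to oneself) plus one common extension to the incongruent case; it is a simplification for exposition, not a validated model of how appraisals combine.

```python
# Toy mapping from appraisal dimensions to emotion labels, following the
# worked example in the surrounding text. The incongruent branch uses one
# common mapping in appraisal accounts (other-blame -> anger, self-blame ->
# guilt/shame) and is illustrative only.
def appraise(relevant: bool, congruent: bool, accountability: str) -> str:
    if not relevant:
        return "no affective response"  # non-impactful events are not appraised
    if congruent:
        # goal-congruent events yield positive emotions; the source matters
        return "pride" if accountability == "self" else "happiness"
    # goal-incongruent events yield negative emotions
    return "guilt/shame" if accountability == "self" else "anger"

print(appraise(True, True, "other"))  # the award, externally caused -> happiness
print(appraise(True, True, "self"))   # the same event, internally sourced -> pride
```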
The lack of characterizable physiological responses and the growing importance of context have led some psychologists to argue that emotions may be more complex than previously thought, occurring instead as social phenomena and arising between individuals. Emotions may then be social, distributed amongst individuals, and thus the result of interpersonal interactions (Parkinson, 1996). Happiness may not arise just because an individual won an award after working hard on a particular project. It may also be the result of the person’s knowledge of their parents’ and teachers’ expectations and resulting pride in response to the student’s award. I turn to the social origins of affect in the next section.

Social origins of affect

The attribution of affect to social settings is intuitive. People are usually able to claim a source for their emotions, such as when children claim that “so-and-so made me mad!” These affective responses may serve to prepare people physically for further action, resulting in certain emotional dispositions (Arnold, 1960). The angry child in this example would then be prepared to take action or flee from the object of their anger. Frijda (1986) made the explicit case that affect serves as a relational device between individuals “that establishes or modifies a relationship between the subject and some object or the environment at large” (p. 13). Some affective responses may then be dependent on social relationships and originate in social encounters, which led Mesquita (2022) to claim that “emotions are for acting, and particular for acting in the social world” (p. 53, emphasis in the original). In one socio-affective model, de Rivera (1977) argued that emotions can be distinguished by the type of activities occurring between individuals rather than within them. These emotions are thus colored and contextualized by the social relationship itself instead of by the physiological responses or the objects and people being appraised. An emotion such as happiness, then, serves to bring people together, and a feeling such as anger would serve to eliminate a threat or challenge. One can imagine a complex attitudinal response such as confidence serving dual roles: an internal appraisal of one’s abilities and also an affective stance that communicates to others that the individual is ready, willing, and able to tackle a challenge. This push and pull between the individual and social directions of affective responses may be an inherent characteristic of all emotions (de Rivera & Grinkis, 1986). According to Parkinson (2019), “emotion’s special ingredient is its capacity to align and realign people’s relations with each other and with objects and events in the shared environment” (p. 2).

In these examples, the emotions underlying affect still to some extent reside in the individual’s responses to their environment. Some have argued, on the contrary, that while this may be the case for some sets of emotions, others originate between individuals (Boiger & Mesquita, 2012; Frijda & Mesquita, 1994; Rimé et al., 1991). Mesquita (2022) differentiated between these two types as MINE emotions (mental, inside the person, and essentialist) and OURS emotions (outside the person, relational, and situated) (pp. 23–24). MINE emotions are those we traditionally think of, such as feeling happy when winning an award. OURS emotions depend on the shared experiences between others. Uchida et al. (2009) provided evidence of OURS-based emotions of happiness in the ways that Japanese and Americans talked about feelings. Japanese individuals were much more likely to talk about the feelings of a group as a whole, and Americans were more likely to talk about their individual emotions.
Japanese participants were also more likely to identify happiness in photographed individuals only when they were surrounded by smiling people, whereas Americans identified happiness in the individual regardless of the facial expressions of the surrounding group. Other findings have corroborated these claims (Masuda et al., 2008, 2012). These findings point to evidence that emotions and the broader realm of affective phenomena are socially constructed. In L2 testing settings, there is some evidence that examiner behaviors exhibiting positive or negative affect can impact the corresponding affective responses of test takers as well (Briegel-Jones, 2014; Plough & Bogart, 2008), such as a test taker feeling less anxious with a friendly examiner. However, as is apparent in the previously cited studies, there may also be important differences in how emotions are conveyed and decoded in different cultural settings. Given that L2 testing contexts are almost always intercultural encounters as well, the next section details some of the ways culture plays an important role in our understanding of affect.

Cultural origins of affect

Emotions and affect have been discussed as having origins that are simultaneously cognitive and social. Related to the social phenomenon of emotions is the outstanding question of whether emotions are universal across all cultures. Ekman’s (1972) neurocultural theory argued that certain basic emotions (e.g., happiness, sadness, anger, fear, disgust, and surprise) are hard-wired into the human nervous system and shared across all cultures. Furthermore, he argued that those emotions had discrete representations in muscle movements in the face. Any deviation from these affective facial behaviors and their underlying emotions was due to social learning or display rules—cultural norms that dictate the appropriacy of expression of certain emotions in social contexts. Ekman’s theory was supported by findings that allegedly showed that diverse cultures were able to accurately identify—at a rate greater than random chance—particular emotions in connection with images of Western faces (Ekman et al., 1969). Ekman et al. argued that this provided evidence of some degree of universality in at least some emotions, thus rejecting cultural relativism. In the section on nonverbal behavior, studies were presented that argued that Ekman’s link between emotions and behavior was consistent in some ways across cultures.

Not all psychologists agree, however, that better-than-chance identification of emotional displays means that emotions are universal. Some reject the premise of emotional universality, arguing that cultures may differ in how emotions are felt (Mesquita, 2022), how they are encoded and decoded (Crivelli et al., 2016; Elfenbein & Ambady, 2002), and how they describe emotions through their internal lexicons of emotional vocabulary (Wierzbicka, 1992). For example, in a study across 2,500 languages, Jackson et al. (2019) found low semantic similarity among 24 (English-based) emotion concepts across cultures. They found that the only term that was similar across nearly all cultures was broadly “feeling good.” Other terms, such as bad, love, happy, and fear, occurred in anywhere from 70% of the languages to as few as 15%. Some languages may lack familiar emotional terms, such as “sadness” in Tahitian (Levy, 1973).
Other languages have emotional terms that have no direct translation in English, such as “amae”—a feeling of coziness and tenderness when being quasi-dependent on a parent—in Japanese (Morsbach & Tyler, 1986). Different cultures may experience and interpret emotions somewhat differently as well. For example, conveying happiness does not appear to be perceived equally across all cultures. In the American cultural tradition, happiness is an integral part of social interactions (Wierzbicka, 1994) and may convey success, achievement, pride, superiority, and self-esteem (Uchida & Kitayama, 2009; Kitayama et al., 2006; Shaver et al., 1987). Happiness is thus seen as positive and desirable, and individuals showing happiness may benefit when dealing with others, solving problems, and negotiating outcomes. This is not the case in all cultural contexts. For example, in one study Japanese students differed from their American counterparts in that they associated happiness with both positive and negative traits; happiness was seen as temporary and fleeting, but also potentially disruptive to social groups because of its potential to cause envy (Uchida & Kitayama, 2009). Differences in “the antecedents, the actions, the reactions from others, the consequences, and arguably the associated feelings” of emotions can be found in nearly every culture and type of emotion expressed (Mesquita, 2022, p. 147).

Cultural interactions with emotion are particularly important for L2 speakers living away from their home cultures. They must navigate emotions that are perhaps different or expressed differently, and then adapt to the cultural norms associated with these. Pavlenko (2014) described the burden on language learners:

To move beyond initial and often faulty assumptions and to understand the emotional world of their host community, L2 learners … have to puzzle out unfamiliar behaviors, to identify what triggers which “emotions” and when, to learn how particular “emotions” may be managed and to discover what cues to pay attention to and how to interpret verbal and non-verbal “emotion displays.” (p. 247)

Adapting to these cultural differences may take a lifetime, or full adaptation may never be reached at all. Mesquita (2022) further described the behavioral interplay between individuals on a social and cultural level as a type of dance:

Like partners in a dance, your emotions and those of others complement and steer each other to form the interaction. And shared cultural knowledge, in the form of language and practices, orchestrates the ways in which different individuals do emotions together. It is like dancing the tango at the rhythm of the music, together with a partner who knows their dance steps, as you know yours. The dance emerges from everyone knowing their moves, and from the moves being in sync with the music. Doing your emotions in a way that fits with the relationships in your culture, and with your position in those relationships, is akin to having the right dance steps. (p. 164)

When L2 learners are out of step and show differences in the encoding and decoding of emotion through nonverbal behavior, there is the potential for breakdowns in intercultural communication. These breakdowns may then straddle the line between emotion and its physical manifestation through nonverbal behavior.

Affect and second language proficiency

Cognitive mechanisms driving SLA and impacting second language assessment have been and continue to be a focus for applied linguists.
Likewise, especially in the language testing literature, features of speech and writing that characterize particular proficiency levels have played a major role in the validation and refinement of instruments that are better able to track the development of language proficiency. Nonetheless, affect has been hypothesized as having a facilitating or limiting effect on test takers' responses, determining "not only whether they even attempt to use language in a given situation, but also how flexible they are in adapting their language use to variations in the setting" (Bachman & Palmer, 1996, p. 65). In recent decades, affective traits making up learner individual differences, such as anxiety, motivation, and learner attitudes, have received substantial attention in the literature, and other complex attributes such as willingness to communicate and grit have been hypothesized to further deepen our knowledge of the mechanisms of language learning. Perceptions of affect, as well as the emotions learners report feeling, have received somewhat less attention in the literature, though this is growing (Dewaele & Li, 2020; Prior, 2019). It is important to uncover their effects because shifting affective stances may impact test performances, which may then translate to test score variance. Also, although nonverbal behavior is often decoded through an affective lens, behavior and affect may be interpreted differently by different interlocutor-listeners.

The most frequently studied affective phenomenon in applied linguistics is most likely anxiety (Gkonou et al., 2017). Anxiety—"the subjective feeling of tension, apprehension, nervousness, and worry associated with an arousal of the autonomic nervous system" (Horwitz et al., 1986, p. ii)—has been studied as a trait-based, more permanent disposition (e.g., test anxiety; Scovel, 1978). It has also been regarded as a state, understood as an affective response to a particular stimulus (Spielberger, 1983). Higher levels of anxiety have been found to hinder the processing of language input, thus having deleterious effects on language output in speakers (Gardner & MacIntyre, 1993; MacIntyre & Gardner, 1994) and possibly delaying language development (Dewaele, 2010). In speaking situations, it has been found to impact a range of output variables, serving as a strong predictor for subjectively scored temporal measures of fluency (Pérez Castillejo, 2019). It can even cause changes in how listeners react to speakers verbally and nonverbally (Gregersen et al., 2014). Nagle et al. (2022) argued that anxiety may be distracting for interlocutor-listeners, interfering with processes of comprehension and resulting in decreased comprehensibility. It is no surprise, then, that anxiety has been found to share a negative relationship with L2 proficiency, competence, and achievement (Botes et al., 2020; Clément et al., 1980; Clément & Kruidenier, 1985; Dewaele & Alfawzan, 2018; Dewaele & Li, 2022; Dewaele & MacIntyre, 2014; Dewaele et al., 2019; Jiang & Dewaele, 2019; Jin et al., 2017; Li et al., 2020; MacIntyre et al., 1997; Teimouri et al., 2019).

Closely related to anxiety is the notion of confidence. The two are not direct opposites: anxiety relates to discomfort and tension during L2 use, whereas confidence, especially L2 self-confidence, "corresponds to the overall belief in being able to communicate in the L2 in an adaptive and efficient manner" (MacIntyre et al., 1998).
Confidence has been found to explain large amounts of variance in various educational, occupational, and other performative outcomes (Ahammer et al., 2019; Cobb-Clark, 2015; Heckman et al., 2006; Judge & Hurst, 2007; Stankov et al., 2012). Stankov et al. (2012) argued that confidence can be an affective state, trait, or disposition, lying somewhere between a measure of cognitive ability and a dimension of personality. Self-confidence formed a core predictor of communicative competence in Clément's (1980) and Clément and Kruidenier's (1983, 1985) socio-motivational models of L2 proficiency. In these models, positive and frequent contact with speakers of the L2 was understood to build speakers' self-confidence, leading to greater communicative competence, acculturation with the target L2 group, and adaptability (Clément et al., 1994; Noels & Clément, 1996; Noels et al., 1996). Confidence, then, was considered a core factor of motivation, and the development of both led to proficiency gains (Labrie & Clément, 1986), leading Clément (1986) to claim that self-confidence was "the best predictor" of proficiency (p. 286). Confidence can also lead to biases in the self-assessment of proficiency, with less confident speakers underestimating and highly confident speakers overestimating their abilities (MacIntyre et al., 1997). The relationship between confidence and L2 proficiency, however, may be cyclical, with bidirectional feedback between the two as learners' language skills develop (Edwards & Roger, 2015). That is to say, confidence may help speakers perform in a second language (e.g., Doqaruni, 2015), and it may help them be perceived as more capable, while at the same time enhanced language development may very well boost these learners' confidence.

Positive emotions have also received attention in the literature, drawing inspiration from interest in positive psychology (Fredrickson, 2001, 2003; MacIntyre et al., 2019; Seligman, 2011). Positive psychology considers the roles of positive emotions and affective stances in people's lives, and how these feelings may impact subsequent performances, achievement, and overall development. Positive feelings, such as happiness, warmth, enjoyment, and well-being, play a critical role in language learning (Oxford, 2016). As opposed to the restrictive, debilitating effects of anxiety, positive emotions can enhance language learning by "broadening a person's perspective and opening the individual to absorb the language" (MacIntyre & Gregersen, 2012, p. 193). Enjoyment has received the bulk of the focus amongst positive emotions in the L2 literature, though positive psychology has close connections to other extensively studied phenomena such as motivation (MacIntyre et al., 2019). Dewaele and MacIntyre (2014) argued that enjoyment and anxiety are not direct opposites; each has distinct functions and characteristics regarding language learning and acquisition. In a handful of studies, both anxiety and enjoyment were analyzed in relation to L2 competence and achievement, with results indicating a positive relationship between achievement and enjoyment (Botes et al., 2020; Dewaele & Alfawzan, 2018; Dewaele & Li, 2022; Dewaele & MacIntyre, 2014; Dewaele et al., 2019; Jiang & Dewaele, 2019; Li et al., 2020). Li et al. (2020) were careful to note that any relationships between enjoyment and achievement/competence may be bidirectional.
That is, enjoyment may drive language development in a way similar to confidence, and language development may simultaneously drive enjoyment. It is also notable that the studies analyzing the relationships between anxiety/enjoyment and achievement/proficiency were correlational in design, making the direction of causality less clear.

Engagement is another positive affective stance that helps to regulate interpersonal communication. Engagement has been broadly defined as an individual's level of interest and evidence of participation in an event (Philp & Duchesne, 2016), and it may be composed of cognitive, social, and emotional elements (Baralt et al., 2016). Engagement may be strengthened by communicating on familiar rather than unfamiliar topics (Qiu & Lo, 2017), especially ones related to a speaker's life or personal experiences (Lambert et al., 2017). It may also be higher when learners speak to higher proficiency interlocutors (Dao & McDonough, 2018) and in face-to-face, human-mediated tasks rather than computer-mediated ones (Baralt et al., 2016). Engagement has also been classified as an integral strategy in interactional communication (Zhu et al., 2019) and has been found to underpin many interactional phenomena such as agreeing, disagreeing, and managing turn-taking (Goturk & Chukharev-Hudilainen, 2023). In language testing contexts, engagement likely drives participation in communication, which intuitively would lead to enhanced performance. The need to show participation despite underlying communication problems has been documented, as some learners have been shown to avoid clarification sequences when they fail to understand their interlocutors by using minimal responses (e.g., simple backchanneling such as nodding or saying yes) (Ducasse, 2010; Lam, 2015, 2021; Luk, 2010). These face-saving responses likely serve to present the test takers as engaged even though they are in reality feigning understanding.

Affect and interpersonal perceptions

Until this point, affective responses have been presented as phenomena that arise either within or between people and that depend on the cultural background of the individual(s). Affect has also been shown to correlate with language achievement, proficiency, or competence, where negative affect (e.g., anxiety, disengagement) tends to align with lower proficiency, and positive affect (e.g., confidence, enjoyment, engagement) may be more characteristic of higher-proficiency speakers. These correlations may reveal broad trends amongst learners, but they are not likely to be causal. A highly proficient speaker can be anxious during an unfamiliar task, and a lower-proficiency speaker may enjoy speaking a language despite facing communication difficulties. A question then arises about the impact of these affective displays on other individuals. As emphasized throughout this paper, communication rarely occurs in a vacuum. Listener-interlocutors ultimately perceive the affective responses of others, and these perceptions may furthermore play a role in how the speaker's language proficiency is perceived.

As discussed in the previous section on nonverbal mimicry, people may produce similar or opposing behaviors when seeing the behaviors of their interactants. Research in psychology has also discussed the ways that the affective displays of one person may impact another. People may converge in their emotional responses, such as when excitement spreads amongst a team, or when someone feels empathy for another's pain.
People may also diverge in their responses, such as with feelings of schadenfreude (gloating at someone else's disappointment). Likewise, one person's emotions may provoke a non-emotional response, such as a detached observation of the other's behavior. In cases of convergence or divergence, people's orientations towards events may be the same, but not necessarily so; likewise, similar orientations may provoke different responses, as when fans of opposing teams watch the same sports game. The emotional influence of one person on another may be due to an emotional contagion (Elfenbein, 2014; Hatfield et al., 1994), contracted when people converge on an undirected emotion (an emotional display without a clear source) immediately and automatically. In other words, the mere presence of a feeling, such as fear, spreads quickly and without appraisal of any particular stimulus (Parkinson, 2011; Parkinson & Simons, 2009). In Hatfield et al.'s (1994) original conceptualization of primitive emotional contagions, seeing the speaker/source's emotional expression through facial and bodily behavior leads to nonverbal mimicry and finally emotional matching in the listener/viewer. Related to these behavioral imitations, emotional mimicry (Hess & Fischer, 2013) involves people producing behaviors that correspond to convergent emotions. Emotional mimicry differs from a contagion in the viewer's interpretation of the affective meaning at play: mimicry involves a degree of awareness, whereas contagions happen unconsciously and without appraisal. Mimicry is more common when two or more individuals have coordinated perspectives on the context of the situation (Bavelas et al., 1986; Bourgeois & Hess, 2008). Importantly, Parkinson (2018) noted:

Mimicry is not an instinctive and automatic process that guarantees facial matching under all circumstances, but instead relates to communicative goals that already imply an emotional orientation to the other person. Emotional mimicry depends on emotional meaning rather than producing it. (p. 166)

Regardless of the underlying processes involved in contagions, mimicry, or social appraisal (Lazarus, 1991), what is clear is that people see others' affective reactions, and these may sometimes produce a convergent response in the viewer, automatically or after inferring the underlying causes.

Research has also considered whether affective contagions may alter not only the way a receiver feels, but also how they perform. In a study of the impact of emotion on test outcomes, Lochner (2016) hypothesized that positive psychological states would result in enhanced performance on a reasoning test in comparison with negatively valenced emotions, and also that anger would result in better performance than sadness. She conducted pre- and post-intervention reasoning tests with 429 participants. In the intervention, she induced five general emotional states in the participants: joy, sadness, anger, contentment, and a neutral state. She measured the emotional states of the participants after the intervention and found that the emotions had successfully been transferred. However, contrary to findings from the literature, she found no effects of emotion on reasoning test performance. She concluded that reasoning may be less susceptible to emotional interference in a laboratory context, but interference may be more noticeable in other types of performance. There is evidence of interpersonal affective impact in the L2 literature, often occurring dynamically as conversations unfold.
Negative feelings, such as anxiety, may be provoked by a speaker's interlocutor (Hashemi, 2011). These feelings may fluctuate when speakers encounter interlocutors with differing social status and familiarity (Shirvan & Talebzadeh, 2017). Affective responses may also vary depending on how individuals perceive their partner's level of engagement, such as through the forms and frequency of their backchanneling (Cutrone, 2005; Heinz, 2003; Lindberg et al., 2022). When a listener hears "mhmm" from an interlocutor, they may interpret this backchannel as an acknowledgement of what they said, as impatience, or even as an interruption (Cutrone, 2005). These various interpretations can be a source of miscommunication, leading to growing anxiety in the listener (Li, 2006). Low engagement in an interlocutor, such as appearing disengaged or uninterested, may also provoke autonomic responses of anxiety (Lindberg et al., 2022). Anxiety may surface when individuals talk to a more proficient speaker of a language (Sevinç, 2018) due to linguistic insecurity (Heng et al., 2012). In L2 testing settings, the perceived affect of examiners may sometimes alter the performance (and underlying psychological response) of test takers (Briegel-Jones, 2014; Plough & Bogart, 2008). Viewing nonverbal behavior may drive emotional contagions or mimicry in the transfer of affect (Blairy et al., 1999), but not necessarily so. The impact of viewing affective displays may be attenuated by perceptive ability. Lindberg et al. (2022), for example, found no evidence of an emotional contagion of anxiety in L2 dyads during a speaking task. These researchers used a correlational design to compare physiological measurements of anxiety (using galvanic skin response) with perceived ratings of the anxiety of their interlocutors. They found no relationship between the two. They speculated that the lack of contagion may have been due to participants not being attuned to their partners' feelings. Thus, sensitivity towards others may be an important variable that drives these effects. While this study was conducted in person, it is unknown whether asynchronous online stimuli could also provoke congruent affective responses. Affective responses, then, can occasionally be a driving force of emotional changes in others.

There is also evidence that affect not only impacts how others feel, but also how they perceive the person who displays the affect. That is to say, if a test taker is perceived as being happy, this happiness may then color the examiner's perceptions of other characteristics and judgements about the speaker, such as their competence. In a study of 200 trait and state notions from 12 dimensions related to organizational psychology, Wojciszke et al. (1998) found that holistic judgements of individuals were predicted almost entirely by morality (defined as an affective, emotional character, e.g., warmth, friendliness, positive affect) and competence (e.g., skillfulness, ability), with morality comprising the bulk of the variance. Drawing from this study, empirical work on social cognition (e.g., Fiske et al., 2007), and their own empirical research in social psychology, Cuddy et al. (2007, 2008) created the Behaviors from Intergroup Affect and Stereotypes (BIAS) map of affective impact, whereby warmth and competence formed two key dimensions, shown in Table 2.2. Here, high competence and high warmth in a speaker may elicit feelings of admiration from a listener.
One can imagine a confident, competent, friendly L2 speaker being received with admiration due to their personability. On the other hand, a competent yet cold individual may provoke feelings of envy, such as when one feels that another's success is undeserved because of their lack of friendliness. When someone is seen as warm yet less competent, this may result in feelings of pity for that individual, such as a happy student who fails an important exam. Finally, cold, less skillful individuals may be seen as a burden and treated with contempt. The use of nonverbal behavior that conveys desirable dimensions of warmth and competence may then benefit less advantaged social groups, perhaps even L2 users, by overcoming judgmental biases (Cuddy et al., 2011). Feigning competence through power posing—intentionally putting on a performative confident stance—can even result in desirable outcomes, such as being perceived as stronger during job interviews (e.g., Cuddy et al., 2015).

Table 2.2 Affective Impact of Warmth and Competence (Adapted From Cuddy et al., 2007, 2008)

                    High warmth    Low warmth
High competence     Admiration     Envy
Low competence      Pity           Contempt

Far less is known about the impact of affect and emotion on listener-raters in the L2 context. There is anecdotal evidence of raters noticing and reporting various dimensions of affect in the testing literature, especially in the context of interaction (Ducasse & Brown, 2009; May, 2009; Nakatsuhara et al., 2021a; Orr, 2002; Sato & McNamara, 2019; Thompson, 2016). Ducasse and Brown (2009) reported raters' observations about candidates' interactional competence. They found that attention and engagement were important affective stances during interactive listening, noting their critical role when candidates displayed comprehension or repaired breakdowns with their speaking partners. Nakatsuhara et al. (2021b) speculated that the significantly higher number of clarification sequences in their comparison of video-conferencing tests with face-to-face tests could have been due to a greater need to signal engagement and solve communication breakdowns in the online format, as gestures and voice inflection may have been less salient online. The raters in May (2009) also appeared especially attuned to affect in the context of asymmetrical interactions, noting that engagement, attention, and confidence were perceived via body language to convey assertiveness. Raters drew on assertiveness to determine the effectiveness of interactions, which May (2009) cautioned "may be of concern, in that these characteristics could be seen as aspects of culture, and L1 usage" (p. 417).

In a subsequent study, May (2011) found evidence of a broader set of patterns indicating interactional competence, several of which related to affect. Raters noted, for example, that confidence, assertiveness, engagement, collaborativeness, and showing a desire to communicate were all positive characteristics of interaction. Theorists have also stressed that affective stances may be important in communication, such as confidence and empathy (Morreale et al., 2013), adaptability to cope with challenging situations (Harding, 2014), and patience, tolerance, and humility to negotiate for meaning in intercultural situations (Canagarajah, 2006). The perception of affect, generally through seeing nonverbal behavior, can impact judgements of L2 proficiency, though empirical research in this area is quite scarce.
Sato and McNamara (2019), as discussed earlier, found evidence for an orientation to affect amongst their raters. They found that composure and attitude made up 5.7% and 6.3% of the comments about the CET-SET and Cambridge exams, respectively. Among the features discussed, confidence was one of the most important affective displays that related to raters' perceptions of proficiency. Confidence was sometimes perceived through the use of mutual gaze and the absence of anxiety-related behaviors (e.g., self-adaptors, averted gaze, low expressiveness), and it was related to a perception of stronger communicative effectiveness. Socially oriented affect such as engagement was also a common observation during moments of collaborativeness, and it too factored into judgements of confidence. Anxiety was perceived negatively, even in cases where the speaker's message was comprehensible, leading to judgements of lower competence. Raters were also attuned to interactional features, finding that their presence (such as active listening, responding, and backchanneling) related to being attentive, interactive, and collaborative, and led to greater perceptions of competence.

However, few studies have attempted to measure the impact affect may have on ratings of L2 speech. Three studies to date have considered subjectively perceived affect (Nagle et al., 2022; Ockey, 2009) and objectively measured affect (Chong & Aryadoust, 2023), as well as their impact on rated outcomes. Nagle et al. (2022) set out to measure the extent to which anxiety and collaborativeness (which the authors noted was a proxy term for perceived social engagement) influenced scores of L2 comprehensibility. The authors analyzed data from twenty dyads taking part in 17-minute interactions. In each interaction, the participants made repeated measurements of their own and their partners' perceived anxiety and collaborativeness, while also rating the comprehensibility of their partner. Anxiety and collaborativeness correlated at roughly .50, and each predicted comprehensibility, explaining 59–60% of the variance in the ratings depending on the task. The authors noted that collaborativeness may have comprised various orientations towards the task, speech content, and interactional competence. Anxiety, the authors posited, was likely perceived through various behaviors reported in the literature, such as a lack of expressiveness, gaze aversion, and the use of self-adaptors (e.g., Gregersen, 2005; Lindberg et al., 2021). They noted that these visual cues "may have made processing the L2 speaker's message more effortful for the interlocutor, leading to lower comprehensibility ratings" (p. 12). Whether perceived anxiety and engagement can have an impact on other ratings of language, such as fluency, grammar, and vocabulary, is unknown.

Ockey (2009) investigated the role of assertiveness in group speaking test performances. Although assertiveness may be considered a personality trait, its discrete appearance in tests may also be interpreted as a type of interpersonal affect. In his study, test takers' levels of assertiveness impacted the scores they subsequently received. Higher assertiveness related to higher scores, but only when assertive test takers were paired with less assertive test takers. When paired with other assertive test takers, on the other hand, the effect was reversed.
Ockey's study highlighted the importance of viewing affect or personality through the lens of context, as merely displaying one affective stance was not enough to change outcomes unilaterally. His study also highlighted the co-constructed nature of interactional performance in group test settings. Nonetheless, the impact of affective personality traits on score outcomes has been widely contested with little consensus to date (Berry, 2007; Davies, 2009; Nakatsuhara, 2011; O'Sullivan, 2004).

Chong and Aryadoust (2023) investigated emotional transfer and impact on L2 performance in an online speaking test. They exposed sixty undergraduate students in Asia to L2 English speaking performances in audio-only and audiovisual modes evoking either sadness or happiness, as determined using sentiment analysis. There were thus four conditions: audio-happy, audio-sad, video-happy, and video-sad. Afterwards, raters asked the participants comprehension questions about the stimuli. The comprehension questions were rated by four trained raters using the TOEFL iBT integrated speaking rubrics, which were (and are) available online (Educational Testing Service, n.d.). Chong and Aryadoust then analyzed videos of the test takers' reactions to the stimuli using FaceReader 8.0 (FaceReader, n.d.), an automated, machine-learning software that measures emotional facial behavior. They analyzed seven basic emotions: happiness, sadness, anger, surprise, fear, disgust, and a neutral state. They found that the happy videos indeed induced happy responses in participants' facial behaviors, and sadness induced negative-valence emotions such as disgust. Videos produced a higher intensity of emotions than audio-only stimuli. However, participants' scores did not change as a result of modality or emotional state in the stimuli on any of the rating categories. There were, however, low correlations between the participants' actual measured emotional reactions and their performance outcomes, though the authors did not interpret these, possibly due to the complexity of the findings, conflicting correlations, or the low variance explained. The authors concluded that there was little to no effect of test takers' visible emotions on their rated performance.

Measurement of affect and nonverbal behavior

There are many considerations in the use of nonverbal behavior and affect as either independent or dependent variables in research, and it is beyond the scope of this literature review to list them all. Reviews such as Gray and Ambady (2006) and Harrigan (2013) provide important methodological considerations for the use of different stimuli (images, thin slices, videos, interactions) to document behaviors. Generally, however, there are three methods for retrieving and coding nonverbal behavior: observational methods, physiological methods, and software methods. There are examples of each method in the L2 literature, and authors have used these somewhat differently.

Observational methods

The oldest form of studying behavior involves observational methods, in which a researcher annotates what they see in some written or numerical form. One common approach is discourse or conversation analysis, where speech is transcribed and nonverbal behaviors are indicated at the points in the transcript where they occur. These annotations generally indicate only the onset of a behavior and not its duration, and they may be standardized or unstandardized.
One of the most well-known forms of gesture annotation is the McNeillian system (McNeill, 1992). This system allows for the description of the occurrence of gestural onsets and releases ([ ]), holds (…), and other phenomena such as filled pauses and silences. Gestural types are described in line. An example of this system is provided in Figure 2.4 from McNeill (1992, p. 95). The text example is of a monologue narrative in which a participant is recounting a story. Various gestures are identified, such as a beat gesture between lines 1 and 2, and the location of the gesture occurs within the bracketed word [ok]. Other gestures are described more fully. This type of system has been used extensively by gesture researchers such as Marianne Gullberg, and a similar system was used by Jenkins and Parra (2003) and Gan and Davison (2011). While this system is powerful for indicating the location of co-speech gestures, its use with other behaviors appears to be quite limited, as their timing and duration are sometimes less clear.

Figure 2.4 McNeillian System of Coding (McNeill, 1992, p. 95)

Other systems annotate behavior in a separate line from speech in an attempt to align their temporal cooccurrence. As an example, Neu (1990, p. 134) used the Foster system of discourse annotation (Foster, 1980, as cited in Neu, 1990) to detail eye gaze behavior in Figure 2.5. In this example, "T" indicates that gaze is in the direction of the conversational partner, "c" is the ceiling, "r" is right, "l" is left, "d" is in the general direction of the interlocutor, and "desk" is looking at the desk. In the same line of annotation, "b", "r", and "up" indicate various movements of the head. The example effectively shows how gaze shifts and head turns unfold during the spoken sequence, giving the reader an understanding of how the two behaviors may align with speech and help manage interaction. There are many disadvantages of using such a system, however. For one, gaze and head movements occupy the same analytical line and may thus be confounded. It is also unclear how co-occurrences of various behaviors would be annotated. Speech articulation rate is not given either, and as such, segments of the discourse do not represent scaled moments in real time. While the duration of behaviors across words can be understood, it is unclear how much time these durations actually occupied. Other systems, such as the Birdwhistell system (Birdwhistell, 1970) and the Ochs system (Schieffelin & Ochs, 1979), are similar but take different approaches to the representation of time and other behaviors.

Figure 2.5 Discourse Analysis From Neu (1990, p. 134)

Similarly, studies of sequential interaction have used multimodal conversation analysis to document the cooccurrence of nonverbal behavior between interactants. In these cases, because of the complexity of the interaction, researchers generally choose a small set of behaviors to document in the unfolding sequence. For example, in Figure 2.6 (Seo & Koshik, 2010, p. 2226), a conversational exchange is documented between two individuals, after which a sequence of nonunderstanding occurs. The researcher here has documented the ensemble of behaviors (head, eye, eyebrow, and posture movement) at the point each behavior occurs, together with a description. This type of analysis is useful when a particular appearance of a behavior at a particular moment is salient, providing the researcher and reader with an emic perspective of micro-level features of interaction management.
Nevertheless, information about the behaviors occurring before and after, as well as the duration of the ensemble, is not captured, and a more complete view of the coincidence with other behaviors is lost.

Figure 2.6 Multimodal Conversation Analysis From Seo & Koshik (2010, p. 2229)

An alternative type of multimodal analysis is possible using ELAN (EUDICO [European Distributed Corpora] Linguistic Annotator) software (Max Planck Institute, 2020). ELAN is open-source, free-of-charge audiovisual transcription software developed by the Max Planck Institute. ELAN has been widely used in the study of behavior in psychological and social research. It employs a tier-based annotation system that facilitates the coding of a wide range of spoken and visible phenomena aligned with the frame-by-frame unfolding of the video. It is thus possible to conduct a micro-level analysis of the timing and alignment of multiple behaviors with linguistic aspects of discourse, with various output styles, including multimodal conversation analysis. It is also possible to aggregate these timings to investigate the frequency and duration of each action or behavior over the course of analyzed segments. An example of such an annotation system is presented in Figure 2.7 (Burton, 2021a, p. 43). In this example, seven tiers of behavior have been annotated, as well as four tiers of speech (two per interactant: one tier for conversation analysis, one tier for individual words). The example demonstrates the appearance of a nonunderstanding sequence and the unfolding of a hold. Because the annotated behaviors align temporally with speech and are scaled, it is possible to observe behaviors happening throughout the sequence and their alignment with trouble sources. In this particular case, a test taker has illustrated their lack of understanding by deploying an ensemble of behaviors at the same time, followed by a self-adaptor gesture. This type of annotation is much more informative and transparent, but it can be extremely impractical due to the amount of time necessary to produce full annotations.

Figure 2.7 ELAN Multimodal Annotation System From Burton (2021a, p. 43)

Non-discourse analytic studies often opt to extract nonverbal behavior using raw frequency counts. For example, Tsunemoto et al. (2022) used a bottom-up approach to determine behaviors that appeared in a set of speech samples from a larger corpus. The authors watched the videos several times, noted behaviors salient to discourse sequences, and then counted the behavior totals for each video. The advantage of this type of analysis is that it can be relatively quick, and interrater agreement can be fairly high. Frequency data, however, provide only one piece of information about the occurrence of behavior. For example, if an individual smiles only once but holds that smile for a long period of time, this instance of behavior would likely have a larger impact than a single smile held only momentarily. Yet, in this type of analysis, these frequency counts would be equivalent. Likewise, important information is lost about the location of behaviors.
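To illustrate how such aggregation might work in practice, the following minimal sketch uses the open-source pympi library to read a hypothetical ELAN (.eaf) file and total the frequency and duration of the annotations on each behavior tier; the file name and tier contents are assumptions for illustration, not materials from this study.

```python
# Minimal sketch: aggregating ELAN tier annotations into frequency and
# duration totals. Assumes the pympi-ling library (pip install pympi-ling);
# the .eaf file name and tier names are hypothetical.
from collections import defaultdict

import pympi

eaf = pympi.Elan.Eaf("test_taker_01.eaf")  # hypothetical annotation file

summary = defaultdict(lambda: {"count": 0, "duration_ms": 0})
for tier in eaf.get_tier_names():  # e.g., tiers for gaze, smiles, gestures
    for ann in eaf.get_annotation_data_for_tier(tier):
        begin_ms, end_ms = ann[0], ann[1]  # annotations start with (begin, end, ...)
        summary[tier]["count"] += 1
        summary[tier]["duration_ms"] += end_ms - begin_ms

for tier, stats in summary.items():
    print(f"{tier}: {stats['count']} occurrences, "
          f"{stats['duration_ms'] / 1000:.1f} s total")
```

Retaining duration alongside frequency addresses the limitation just noted: a single long-held smile and a single momentary smile remain distinguishable in the aggregated data.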
In non-linguistic studies, counts of individual muscle movements have been used, especially when documenting their relationship with emotion. Ekman et al.'s (2002) Facial Action Coding System (FACS) has been used to provide a more complete picture of facial anatomy, including intensity and timing. Using FACS requires significant training, and the system has been used in a number of studies outside of applied linguistics (Ekman & Rosenberg, 2005).

Each of these systems requires reliability checks to ensure the integrity of the annotations. Some studies have two individuals annotate the entire dataset, and any disagreements are discussed and resolved. This type of dual coding was reported by Jenkins and Parra (2003) and Tsunemoto et al. (2023). For more complex analyses, however, resources are often insufficient, as annotation can take weeks at a time. Duncan and Fiske (2015) recommended that 10% to 25% of a dataset be double coded prior to coding the entire dataset, after which disagreements can be resolved, and the rest of the dataset can then be coded by one individual.

Finally, observations of affect may be gathered using scales. Scales can be a quick and intuitive tool for observers to make decisions about affective phenomena. Nagle et al. (2022), for example, had raters use simple scales of collaborativeness and anxiety to score their own and their partner's affect. These scales were set on a sliding bar representing a total of 100 points (one point per millimeter), allowing the researchers to use a bounded continuous variable as an independent variable. Similar scales were used in Tsunemoto et al. (2022) and Kim et al. (2023) to measure comprehensibility, accentedness, and fluency. Other scales may be set up in a Likert format, with a limited number of ordinal points, such as semantic differentials (Osgood et al., 1957; Snider & Osgood, 1969).

Physiological methods

In some cases, physiological methods may be preferable as an objective (or quasi-objective) measure of behavior. For example, eye-tracking technology (Godfroid, 2019) can be used to track the eye gaze patterns of individuals interacting with language when reading or listening, but it can also be used to track interpersonal communication. For example, McDonough et al. (2015) used eye tracking to measure gaze location and duration during recast episodes. Likewise, Batty (2021) measured areas of the face that L2 users attended to when watching and listening to speakers in video-based listening tests. Quasi-objective measures may also be informative when studying affect. These measures provide electrical information from the body that has been linked to either muscle stimulation or autonomic arousal. For example, facial electromyography (EMG) detects electrical signals from muscle activation that may not be otherwise visible. It can be used to detect signals occurring with particular expressions of emotion, but it is generally not used to identify facial movements without prior knowledge of the affect involved (Cacioppo et al., 2000). It is, however, quite obtrusive, requiring equipment on the face or even below the skin. Other physiological measures, such as galvanic skin response, can be much less intrusive. Galvanic skin response is thought to measure autonomic arousal, which is associated with affective responses (for example, anxiety), but it is a rather crude measure with no direct relationship to any particular emotion (Afifi & Denes, 2013). As an example, Lindberg et al. (2021) used galvanic skin response as a proxy for anxiety, which allowed them to correlate their findings with the occurrence of certain nonverbal behaviors. A benefit of this type of measure is that it is relatively unobtrusive, but the validity of its inferences may be limited to arousal rather than specific affective output.
Software methods

The final system for measuring nonverbal behavior and affect is through computer vision and machine learning. Decades of work have resulted in greater and greater accuracy in systems that can measure these indices without the participant's awareness, thus potentially enhancing the ecological validity of research studies. With software-based measurements, participants are recorded (often sitting in front of a computer with a front-facing camera) conducting a study online. Participants need not wear any equipment, as the software is able to infer muscle movements from video alone. Another important advantage of these systems is that they offer speed and objectivity in measurement. These systems work by using machine learning to identify specific points on the face corresponding to Ekman et al.'s (2002) Facial Action Coding System. The software then produces probability measures of the activation of individual facial muscles. Using models trained on classified banks of individuals exhibiting certain emotions, the software can then produce a probability measure of the type and strength of various emotional states. These models, however, sometimes operate somewhat opaquely and have limited published validity evidence of their internal functioning, and it is also sometimes unknown to what extent they work with individuals from varying cultural backgrounds. Additionally, a 2020 comparison of various automated facial recognition systems showed that, although better than chance, these systems were not as accurate as judgements made by human observers (Dupré et al., 2020).

Chong and Aryadoust (2023) authored the only study to date, at the time of writing, that has used an automated facial recognition system in the study of L2 proficiency. They used the FaceReader emotion recognition system (FaceReader, n.d.) to extract indices of happiness and sadness, which they then analyzed in comparison to language proficiency subscores. FaceReader (FaceReader, n.d.) is able to produce indices of seven basic emotions and individual action units pertaining to facial muscle movements. Although FaceReader was purportedly more accurate than many other systems (in 2020; Dupré et al., 2020), it fails to recognize many emotions identified by human observers (Hirt et al., 2019). Additionally, FaceReader is unable to produce omnibus measures of overall expressiveness, attention, or a general measure of emotional valence. The literature to date does not point to a direct effect of specific emotions on proficiency outcomes, but rather of overall expressiveness and positivity (Jenkins & Parra, 2003), more complex affective measures (e.g., anxiety and engagement; Nagle et al., 2022), or specific behaviors (e.g., eyebrow movements and smiles; Kim et al., 2023; Tsunemoto et al., 2022). An alternative system that addresses some of these limitations is iMotions Affectiva (iMotions, 2017). iMotions is a behavioral analysis application that uses a complex array of computer vision and machine learning algorithms to detect faces, automatically code facial expressions, and classify emotional states. iMotions is able to detect head orientation, facial landmarks, action units or expression metrics, seven emotional states (joy, anger, surprise, fear, contempt, sadness, and disgust), and three omnibus measures of valence, engagement, and attention (iMotions, 2017).
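To illustrate how omnibus indices of this kind might be carried into a statistical analysis, the sketch below collapses per-frame output into one observation per video. The CSV schema (one row per video frame, with engagement, attention, valence, and joy columns on a 0–100 scale) is a hypothetical stand-in for illustration; actual iMotions or FaceReader exports use their own naming and structure.

```python
# Minimal sketch: collapsing hypothetical per-frame facial-coding output
# into per-video summary variables. Column names and the 0-100 scale are
# assumptions; real iMotions/FaceReader exports differ.
import pandas as pd

frames = pd.read_csv("video_01_facial_export.csv")  # hypothetical export

video_summary = {
    "engagement_mean": frames["engagement"].mean(),
    "attention_mean": frames["attention"].mean(),
    "valence_mean": frames["valence"].mean(),
    # Share of frames where the joy estimate exceeds a cutoff: a crude
    # index of how often the test taker appeared to smile.
    "joy_rate": (frames["joy"] > 50).mean(),
}
print(video_summary)
```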
Although Dupré et al. (2020) found iMotions Affectiva to be somewhat less accurate than FaceReader, the software has been found to be more accurate than physiological methods such as facial EMG (Kulke et al., 2020). To my knowledge, only one study in applied linguistics to date has used iMotions software, and it used the eye-tracking component with no facial behavior analysis (Suvorov, 2015).

Summary

The studies I have reviewed suggest that there is a complex and natural relationship between nonverbal and verbal communication. The two modes may offer complementary semantic information that strengthens speakers' intended messages. When engaged in decoding meaning, individuals are informed by these two physically separate yet conceptually intertwined lines of cognitive, affective, and social information. With judgements in performance tests, raters may use the nonverbal information available to them to complement their understanding of someone's L2 proficiency, thus making use of external nonverbal criteria to decide whether a performance merits a higher or lower score. If a low-proficiency speaker uses nonverbal behavior to express strong positive affect, or if a higher-proficiency speaker does not display much affect at all, it may be the case that nonverbal information plays a role in decisions about language proficiency. In these cases, the direction of the effect of nonverbal behavior may be positive in the first case and negative in the second. It is also possible it may have no effect at all. The effect of nonverbal information on judgements of verbal ability may also not be uniform across all proficiency levels or all criteria included in a rating scale. Scant research informs these areas of inquiry.

Overall, research studies on the effects of nonverbal behavior on raters' scores are inconclusive. Simply comparing mean differences between rating samples with video and without video may mask underlying interactions between raters, proficiency levels, and criteria scores. For example, one rater in Nakatsuhara et al. (2021a) was found to exhibit greater severity in the video mode, while other raters showed more severity in the audio-only mode. If this type of differential severity were split evenly in a group of raters, any mean group differences would possibly be cancelled out. Nonetheless, evidence largely points to a benefit of the visual mode of rating on scores, but it is unknown whether the benefit, ostensibly due to nonverbal behavior, is consistent across proficiency levels. It is important to consider such interactions where possible. Studies that took a fine-grained, discourse analytic perspective provided evidence that raters indeed appeared to behave in ways similar to those documented by Burgoon et al. (2016): when test takers' linguistic skills fall into borderline categories, or if nonverbal information conflicts with verbal information, nonverbal communication may take precedence when raters make decisions about L2 ability. In particular, expressiveness and attentiveness (mutual gaze) were cited as important criteria that brought borderline scores up, while relative inexpressiveness and inattentiveness had the effect of bringing scores down. Gestures, posture, and paralinguistic features also contributed to raters' impressions. In all studies that asked raters to provide reports of their decision-making processes, nonverbal behavior was often mentioned through its functional output as an affective response.
Raters were sensitive to behaviors that conveyed confidence, anxiety, engagement, attention, interactiveness, and positivity, especially when these expressed a desire to communicate. To date, the studies reviewed here have taken either a small-scale, discourse-analytic approach or a larger-scale, score-based approach. The gap here, then, is to analyze scores with a larger bank of raters and a more diverse sample of test taker behavior, and to produce a rich amount of qualitative data that can triangulate findings between scores, behaviors, and rater comments. In the past, this type of analysis would have been impractical: detailed behavioral analyses of even one minute of speech can take hours. Today, it is possible to leverage machine learning technology to extract behavioral indices that can be used in statistical models to better understand the moderating impact of nonverbal behavior and affect on proficiency scores.

CHAPTER 3: RESEARCH QUESTIONS

The literature shows that nonverbal behavior is a core aspect of communication that conveys semantic, cognitive, affective, and social-interactional information. Speaking test constructs, including those based on communicative competence, have been shown to tacitly acknowledge nonverbal behavior, especially in its social-interactional roles of conversation management and strategic planning, but these constructs lack a full exploration of the functional output of nonverbal behavior. Nonetheless, variance has been found in testing contexts that is attributable to the visual realm, with links both to behavioral forms and to their informational output, often as affect. Missing from the literature to date is an exploration of score variance across various language proficiency outcomes with a dataset large enough to measure the strength of associations. Likewise, there are methodological gaps in the measurement of affect and behavior, with most studies measuring few variables or using a limited measurement style, such as frequency. This dissertation attempts to bridge those gaps by answering three key sets of research questions.

Proficiency scores and interpersonal affect

This question block focuses on interpersonal affect. Using a range of perceived affective variables, the aim is to describe trends and patterns in the variables that may explain language proficiency outcomes. These variables will be observed by the raters in the study. There is one question in this section:

RQ1: What is the relationship between interpersonal affect and language proficiency?

No hypotheses were generated prior to formulating this question, as it was purely exploratory. However, the literature points strongly to confidence, anxiety, and engagement correlating with language proficiency, so it is anticipated that some aspects of language proficiency would correlate with these variables. Less is known about the relationship between positive affective variables and outcomes. Research on foreign language enjoyment has suggested a link to L2 achievement.

Proficiency scores and nonverbal behavior

The second block of questions relates to externally measured variables of nonverbal behavior and language proficiency. As described in the methods, these variables will be measured objectively using automated software. Base proficiency measures will be obtained by using official scores from a testing agency. These questions are confirmatory in nature.

RQ2.1: Do externally measured indices of nonverbal behavior predict language proficiency scores?
RQ2.2: Do nonverbal behaviors impact outcomes differentially depending on the base proficiency levels of test takers?

After generating these research questions, I proposed three hypotheses relevant to this block of questions, which I pre-registered in the Open Science Framework (Burton, 2021b).

H2.1.1: Indices of attention and expressiveness will have significant but moderate correlations with language ability.

H2.1.2: Higher values of attention and expressiveness will result in significant positive regression coefficients of fixed effects, indicating an overall positive impact on impressions of second language proficiency across ability levels.

H2.2: Significant interaction coefficients of base language proficiency with attention and base proficiency with expressiveness will indicate that the effect of nonverbal behavior on rated outcomes depends on the base proficiency of the test taker.

Raters and nonverbal behavior

The third block of questions is a mixed-methods inquiry into rater behavior. The primary goal of this block is to triangulate findings from the quantitative analyses. These questions are largely exploratory.

RQ3.1: Which nonverbal behaviors are most salient and informative to raters when scoring?

RQ3.2: How do raters understand language proficiency in light of nonverbal behavior?

One confirmatory hypothesis was pre-registered for question 3.1 based on a close reading of the literature reviewed previously.

H3.1: Gaze aversion, eyebrow raises, smiling, head tilts, and inexpressiveness will be mentioned more often by raters, as indicated by higher relative frequencies of comments. Gesture and posture will be mentioned fewer times due to the online format of the speech stimuli.

CHAPTER 4: METHOD

This dissertation is, at its heart, a study of speech perception, as the core questions surround how listeners make use of visible nonverbal behaviors and affect while thinking about speech. It is also a study of speech assessment, as the main method of the study is for participants to assign a score that represents a particular level of language proficiency. By analyzing scores, it may be possible to observe the direction and size of an effect of the visual realm on people's perceptions of language proficiency. Score analysis alone, however, will not be able to determine whether raters' use of nonverbal behavior when assigning scores is explicit (i.e., raters are aware of behavior during rating) or implicit (i.e., behavior impacts rating unconsciously). Uncovering this explicit/implicit distinction is only possible by probing raters' thought processes through rater reports. For this reason, this study is also one of rater cognition. In order to explore all of these elements, the study takes a mixed-methods approach.

I adopted an explanatory sequential design (Creswell & Plano Clark, 2017) as a mixed-methods framework for this study. In explanatory sequential designs, quantitative data are gathered first to identify cases that may be indicative of or exemplify the phenomenon under consideration in the research study. Following this, qualitative research methods are used to explain the nature of the quantitative results in much more depth and detail. Explanatory sequential designs are most appropriate "to explain the mechanisms through qualitative data that shed light on why the quantitative results occurred and how they might be explained" (Creswell & Plano Clark, 2017, p. 77). The design of the study consisted of three main phases. Phase 1 took place in Fall 2021.
It included two components. In the first, I piloted the online Qualtrics survey used by the raters, using data from my first qualifying review paper (Burton, 2023). In the second component, I selected the speech samples from those provided to me by IELTS. Phase 2 began in December 2021. This phase included the recruitment of rater participants and the collection of quantitative rating data on both proficiency and affect. This phase lasted until April 2022. During this phase, I selected speech sample stimuli that exhibited rating patterns suggesting a greater impact of nonverbal behavior on scores for further qualitative analysis. In this phase, I also collected automated visual output of the speech sample videos using iMotions (iMotions, n.d.). Phase 3 overlapped with Phase 2 in the last weeks of data collection. In this phase, 20 raters were invited to take part in stimulated recall sessions within 24 hours of completing the online rating study. These stimulated recalls targeted a subset of the speech sample stimuli selected in Phase 2 of the study. Phase 3 took place during March and April of 2022. Together with two research assistants, I also transcribed and annotated nonverbal behavior in the speech samples to analyze together with the stimulated recall data.

Figure 4.1 Study Design

The explanatory sequential design of the study is visible in Figure 4.1. The top half of the figure illustrates the quantitative aspects of the project, including the rating of speech samples with scales and the iMotions analyses. The quantitative part of the study served to answer the first two blocks of research questions. The explanatory sequential mixed-methods aspect of the study is described in the bottom half. Here, speech samples selected from the quantitative component formed the basis of the stimulated recall sessions, which were analyzed thematically and, together with samples of multimodal discourse, served to answer the third block of research questions. This third block, however, was also used to reflect back on the quantitative findings. The design of the study was thus triangulated. The dataset included sources of data from both raters and speech stimuli. Figure 4.2 further illustrates the four main data sources and their relationship to the data analysis.

Figure 4.2 Data Sources for Dissertation

Participants

There were two groups of participants in this study. The first group of participants completed the rating study. A subset of this group comprised the smaller second group of stimulated recall participants. The two groups are described below.

Rating study

Participants for the rating study were recruited from a pool of undergraduate students at Michigan State University called SONA (https://psychology.msu.edu/undergraduates/sonaparticipation.html) and from the university's undergraduate student email list. Participants were allowed to participate as long as they were traditional undergraduates (in the age range of approximately 18–22), born in the United States, and had grown up with English as their L1.
These conditions were meant to ensure that participants had a similar background; that is, they stemmed from a shared cultural and linguistic context, as individuals from differing L1 or national backgrounds may vary in their interpretations of language proficiency (Barnwell, 1989; Kim, 2009; Marefat & Heydari, 2016; Shi, 2001; Wei & Llosa, 2015; Xi & Mollaun, 2011; Zhang & Elder, 2011) and may possess culturally or linguistically different nonverbal behavior and affect (Crivelli & Fridlund, 2018; Fridlund, 1994; Matsumoto & Hwang, 2016). Recruiting young undergraduates also ensured that no individual had operational experience rating speaking tests or any familiarity with rating scales of language proficiency; that is to say, the rater participants were untrained, often referred to as novice or naïve raters, or linguistic laypersons (Sato & McNamara, 2019). The principal benefit of using untrained, novice raters in designs such as this study is that scores and other mediating variables extrapolate more strongly to the target language use domain:

in the real-world context, the ultimate arbiters of L2 speakers' oral performance are typically not in fact trained language professionals, who have meta-level linguistic insight and are possibly concerned primarily with features of communication that are the focus of their own training as linguists or language teachers, but interlocutors with no specialist training (Sato & McNamara, 2019, p. 895).

The design of this study also included no rater training, as judgements were meant to be quick impressions grounded in the participants' general experience with interpersonal interactions in life. This in turn meant that scores would likely be less reliable and could exhibit major differences in rater severity (Attali, 2016; Barkaoui, 2010; Cumming, 1990; Lim, 2011; Shaw, 2002; Weigle, 1994, 1998). However, this variance was anticipated and desirable in this study, as it best reflects the way listeners in the target language use domain may process multimodal input.

To determine the sample size necessary to detect possible small effects in this study, I referred to previous literature on sample size requirements for mixed-effects designs (that is to say, designs where raters provide multiple observations per case), and I also conducted a simulation study. Past literature suggested that having more second-level cases (in this case, rater participants) than first-level cases (here, visual stimuli) would provide greater power (Hox et al., 2018), as long as the number of stimuli is large enough (Westfall et al., 2014). I ran the simulation analysis with 10, 20, 30, and 40 visual stimuli. I found that a stimuli size of 30 and a rating sample size of at least 80 would have a power of .95 to detect regression coefficients of .2 for the iMotions variables, which I considered the smallest meaningful effect size. This power analysis was similar to Westfall et al.'s (2014) finding of a standardized effect size d of .4 with a participant pool of near 100 and with 30 stimuli. In the power analysis I ran, smaller stimuli sizes required fewer raters to arrive at sufficient power (40 and 60, respectively), but reducing the stimuli size was not desirable for this study, as there would be less variation in proficiency levels and nonverbal behavior. Using 40 speech stimuli would have required 120 raters to reach the same power, which would have furthermore increased the cost of the study.
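The logic of such a simulation can be illustrated with a simplified sketch. The version below generates rating data for a fully crossed raters-by-stimuli design, fits a mixed-effects model, and reports the proportion of simulated datasets in which a stimulus-level effect of .2 is detected. The random-effects structure (rater intercepts only) and the variance parameters are illustrative assumptions, not the pre-registered specification.

```python
# Simplified sketch of simulation-based power analysis for a crossed
# raters-by-stimuli rating design. Assumes rater random intercepts only
# and illustrative variance parameters.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2021)

def simulate_power(n_raters=100, n_stimuli=30, beta=0.2, n_sims=100):
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(size=n_stimuli)            # standardized behavior index per stimulus
        severity = rng.normal(0, 0.5, n_raters)   # rater random intercepts (severity)
        rows = [
            {"rater": r, "x": x[s],
             "y": beta * x[s] + severity[r] + rng.normal(0, 1)}
            for r in range(n_raters) for s in range(n_stimuli)
        ]
        fit = smf.mixedlm("y ~ x", pd.DataFrame(rows), groups="rater").fit()
        hits += int(fit.pvalues["x"] < 0.05)
    return hits / n_sims  # estimated power across simulations

print(simulate_power())
```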
For these reasons, I set the desired number of stimuli at 30 and the rater sample size at 100, as I felt this struck the right balance between desirable variance and practicality. It also allowed enough flexibility to drop problematic cases or outliers without losing excessive statistical power.

As referred to above, in order to recruit participants, I used both the SONA system and e-mail invitation blasts to all domestic undergraduate students on campus. In total, 2,340 individuals signed up to take part in the study. I invited individuals who fit the requirements of the study randomly from this pool until I reached the targeted number of participants. In total, 281 individuals received invitations and 100 successfully completed the study. All participants completed the consent form and signed the non-disclosure agreement in Appendix A. One individual experienced technical problems while completing the study, which resulted in incomplete data, so I removed this case from the dataset. The final dataset contained complete observations from 99 raters.

The mean age of the participant raters was 20.92 years (SD = 1.48). Gender was relatively balanced, with 41% identifying as male, 53% as female, and 6% as other (participants were not asked to specify further). All participants were from the USA and spoke English as their L1, and 38% reported speaking a second language, though participants were not asked to identify their level of proficiency or familiarity with their L2. School year was fairly balanced as well, with each year (freshman, sophomore, junior, senior) representing approximately 25% of the dataset. Finally, participants reported studying in 37 degree programs, with three individuals reporting not having decided on their program at the time. The full demographic reporting is in Table 4.1.

Table 4.1 Demographics of Rater Participants

Gender: Male (41), Female (52), Other (6)
Nationality: USA (99)
L1: English (99)
Speaks L2: Yes (38), No (61)
Year of school: Freshman (25), Sophomore (22), Junior (21), Senior (29), Other/non-traditional year (2)
Degree: Accounting (1), Biochemistry (1), Business (1), Communications (4), Computer Science (3), Creative Advertising (1), Criminal Justice (1), Economics (1), Education (2), Engineering (12), English (1), Environmental Studies (1), Finance (3), Fisheries and Wildlife (1), Genomics and Molecular Genetics (1), History (2), Human Biology (7), Human Capital and Society (1), Humanities (1), Interdisciplinary (1), Interior Design (1), Japanese (1), Kinesiology (4), Linguistics (3), Mathematics (1), Neuroscience (9), Nursing (5), Packaging (1), Physiology (2), Psychology (7), Social Work (4), Spanish (2), Supply Chain Management (4), Theatre (2), Zoology (4), Unspecified (3)

Stimulated recall

After 60 raters had completed the rating study, rating scores on language proficiency were used to identify stimuli to analyze further in the stimulated recall sessions. The process used to identify samples is described in a separate section below. I invited individuals who had not yet taken part in the rating study but had indicated willingness to participate in the stimulated recall sessions when signing up. I randomly invited participants until I reached a target sample size of 20. I chose 20 as this would represent a rather robust number of stimulated recall sessions at 20% of the total number of participants. A total of 42 participants were invited to participate in the stimulated recall sessions, and 20 successfully completed the sessions.
Stimulated recall participants signed an additional consent form allowing them to be audio recorded; the consent form is available in Appendix A. The demographic breakdown was similar to the figures from the larger rater group. Demographics for the stimulated recall sessions are in Table 4.2.

Table 4.2 Demographics of Stimulated Recall Participants

Gender: Male (8), Female (11), Other (1)
Nationality: USA (20)
L1: English (20)
Speaks L2: Yes (7), No (13)
Year of school: Freshman (7), Sophomore (4), Junior (3), Senior (6)
Degree: Computer Science (2), Education (1), Engineering (2), Finance (2), Genomics and Molecular Genetics (1), History (1), Human Biology (1), Japanese (1), Neuroscience (3), Nursing (1), Psychology (1), Physiology (1), Spanish (1), Zoology (1), Unspecified (1)

Materials

Materials used in this study included speech samples for rating, the rating scales, a survey of demographic information, a follow-up survey after the rating took place, the online Qualtrics platforms that hosted these materials, and a stimulated recall session for a subset of the raters using a subset of the speech samples. Each of these will be described below.

Speech samples: Rating study

The International English Language Testing System (IELTS; IELTS, n.d.) provided me with previously recorded, high-stakes, live speaking test samples for use as rating stimuli for this study. The videos were originally collected as part of the IELTS remote speaking research project (Nakatsuhara et al., 2017), and as such were recorded in operational settings as part of research into the feasibility of online speaking tests delivered through Zoom. IELTS is an international, high-stakes academic English proficiency test used primarily by universities and colleges for admissions purposes. IELTS includes four sections corresponding to the language abilities of reading, writing, speaking, and listening.

The speaking test is composed of three parts. Part 1, which lasts 4–5 minutes, includes a basic introduction and personal questions. The format of Part 1 is that of an oral proficiency interview; the examiner asks scripted questions followed by the test taker's answers, without scaffolding or elaboration on the part of the examiner. This part primarily targets familiar, non-abstract topics. Part 2 is in the format of a monologue. Test takers are given a task card and are asked to speak on their own, without support, for up to two minutes on a topic. These topics generally elicit familiar information (e.g., tell a story about your life) with generalized, abstract conclusions (e.g., how does this apply to society in general?). Part 3, which comprises the final 4–5 minutes of the test, is a semi-scripted oral proficiency interview. It is distinguished from Part 1 by having the examiner produce unscripted follow-up questions and allowing the examiner to use a broader range of functions with the test taker; the examiner may scaffold, gloss, probe, repair, and follow up on answers the test taker has given. These topics are meant to be unfamiliar to the test taker, and the concepts gradually grow in abstraction. For illustrative purposes, example test questions can be seen at https://takeielts.britishcouncil.org/take-ielts/prepare/free-ielts-practice-tests/speaking.

IELTS provided me with 46 video recordings of Part 3 of the speaking test for use in this study. Overall speaking test scores, as well as subscores for the four analytical scoring categories (fluency and coherence, lexical resource, grammatical range and accuracy, and pronunciation; see Table 4.3), were provided for each test taker.
The dataset that IELTS provided for this study included a range of total speaking scores (an average of the four subscores) from 2.5 to 6.5 (M = 4.97, SD = 1.07, approximately A2–C1 on the CEFR). All test takers were young Chinese adults in Shanghai. The dataset skewed female (71%) and featured nine different examiners (staff members who administered the speaking test to the test takers) and seven test topics (Education, Websites, Communication, Success, Events, Leisure, and Travel). Prior research found no impact of examiner, test topic, or their interaction on test scores (Nakatsuhara et al., 2017). All tests were conducted in the same controlled setting on standardized laptop computers. The tests were recorded using Zoom's internal recording feature, which displayed both the examiner and the test taker side-by-side. Test takers faced the laptop directly, which had a camera embedded in the top bezel, thus simulating direct eye gaze. In order to protect the test taker's identity, I have cartoonized what the stimulus input looked like for the raters in Figure 4.3. Note that the raters did not see cartoons, but the actual video-recorded sample.

Figure 4.3 A Video Frame of Sample 30, Cartoonized to Protect the Test Taker's Identity

As mentioned above, I had decided to use 30 speech samples with the raters based on the simulation study. Thus, I devised a method to select 30 of the 46 available samples. I first watched each sample to document the quality of each video and the overall expressiveness of each test taker. I removed videos with extended periods of silence or videos in which language production was too limited to be meaningful for rating language. I then sorted the videos into three proficiency groups of (a) 3.5 to 4.5, (b) 5 to 5.5, and (c) 6 and 6.5 using the overall rounded IELTS scores. The scores are rounded because this is how the test producer calculates the total score in operational settings. A critical aspect of this research project was to observe the impact of nonverbal behavior on raters' scores, and for this reason it was also necessary to include a range of expressiveness and behavior for the raters to observe. I classified each video crudely as less expressive (0), moderately expressive (1), and expressive (2) based on the presence of active facial behaviors such as smiling, frowning, nodding, and mutual eye gaze, as well as gesture. To compile the dataset, I selected stimuli from each scoring category with a balanced distribution of expressiveness and an overall balance of score distributions. This resulted in eight stimuli with scores of 3.5 to 4.5, 14 stimuli with scores of 5 to 5.5, and eight stimuli with scores of 6 to 6.5 (M = 5.2, SD = 0.88). Furthermore, 12 of the samples showed a high degree of expressiveness, 11 showed markedly low expressiveness, and seven fell somewhere in between. A table listing the 30 resulting stimuli and their features is presented in Table 4.3.
Table 4.3 Speech Sample Data

Sample  Gender  FC  V  G  P  Total*  Examiner  Topic          Expressiveness  Time
S01     M       3   4   4   4   3.5     8         Education      Low             02:23
S02     M       4   4   4   4   4       2         Success        Low             02:31
S03     M       4   4   4   5   4       4         Traveling      Low             02:19
S04     F       4   4   4   5   4       4         Events         High            01:43
S05     F       4   4   4   4   4       6         Success        High            02:09
S06     F       4   4   4   5   4       7         Success        High            02:18
S07     F       4   4   4   4   4       9         Success        Low             02:11
S08     M       5   4   4   5   4.5     4         Communication  High            01:53
S09     F       5   5   5   5   5       1         Education      Low             02:00
S10     F       5   5   5   5   5       1         Success        Low             02:31
S11     F       6   5   5   5   5       2         Success        Low             01:46
S12     F       5   5   5   5   5       2         Communication  High            01:49
S13     M       6   5   5   5   5       6         Cinema         Mid             02:13
S14     F       5   5   5   5   5       6         Traveling      Low             02:02
S15     F       5   6   6   6   5.5     1         Success        High            02:22
S16     M       5   6   6   6   5.5     1         Communication  Mid             02:19
S17     F       5   6   6   6   5.5     1         Traveling      High            01:51
S18     F       5   6   5   6   5.5     2         Events         Mid             02:18
S19     F       6   6   5   5   5.5     2         Events         High            02:05
S20     F       6   5   6   6   5.5     5         Success        Mid             01:57
S21     F       6   5   6   5   5.5     5         Events         Mid             02:23
S22     M       6   6   5   5   5.5     6         Education      High            02:13
S23     F       6   6   6   7   6       2         Cinema         Low             02:20
S24     F       7   6   6   6   6       6         Communication  Low             02:43
S25     F       6   6   6   6   6       6         Cinema         High            02:14
S26     F       7   6   6   6   6       6         Communication  High            02:13
S27     F       7   6   7   7   6.5     3         Traveling      High            02:14
S28     F       7   7   6   6   6.5     4         Cinema         Mid             01:53
S29     F       7   7   6   7   6.5     5         Events         Low             02:22
S30     F       7   6   6   7   6.5     9         Traveling      Mid             02:15

Note. FC = Fluency and coherence. V = Lexical resource. G = Grammatical range and accuracy. P = Pronunciation. *Total = the average of the four subscores, rounded down to the nearest .5.

After selecting the 30 video files, I then trimmed each file to a length of approximately two minutes (M = 2m 11s, SD = 14s). The rationale for reducing the length of the stimuli was primarily practicality. The rating design of this study was fully crossed, meaning that each rater viewed and rated each speech sample. As the original videos were each up to five minutes long, watching thirty full samples would have taken the raters a minimum of 180 minutes for the rating alone (150 minutes viewing, 30 minutes rating after the videos), plus time to enter personal information, do the short practice session, and answer the final survey questions. Because this study dealt with impressions of language ability and affect without using an empirically developed rating scale, decisions could be quick and intuitive, and made on the basis of much less information than is required by a traditional rating scale. Past studies have found that impressions of affect can be formed after viewing stimuli for as little as 100 milliseconds (Willis & Todorov, 2006), but a longer sample was necessary to form an impression of language ability. I chose two-minute segments because the test taker would have the opportunity to answer one or two test questions, thus providing my study's rater-participants a quick snapshot of the test takers' language abilities without overwhelming them with input. The two-minute mark reached approximately the mid-point of Part 3 of this test section.

I trimmed each speech sample using the following procedure. Each original raw file included a set of instructions delivered by the examiner for this part of the test and an introduction to the topic, also delivered by the examiner, before each question was asked. These instructions and introductions were removed. I then trimmed endpoints for the files where there was a natural segue between test questions. That is to say, for all speech samples, the test takers had reached the end of their turn naturally before the examiner prepared the next question. I trimmed these as close to the two-minute mark as possible.
In some cases, this meant trimming the end of the file before the two-minute mark (e.g., sample S04, 1m 43s) because the next question produced an extended turn that went far beyond the two-minute mark. In other cases, this meant trimming the end of the file beyond the two-minute mark (e.g., sample S24, 2m 43s), as a segue before the two-minute mark would have left an insufficient amount of speech to be rated. As noted earlier, however, most samples lasted between 1m 57s and 2m 25s.

Speech samples: Stimulated recall

Stimulated recall formed an integral part of the mixed-methods qualitative data analysis in this study. A definition, rationale, and discussion of the methods underlying stimulated recall are presented later in this materials section. The stimulated recall sessions used a subset of the rating study samples as stimuli for eliciting the memories and thought processes of the raters. I used 10 of the 30 samples for the stimulated recall sessions for practical reasons. The main rationale for 10 samples was that I anticipated that the minimum number of rated samples that would provide enough data for the recall would be one third of the whole dataset, as this would allow me to choose a range of different performances but not take too much of the stimulated recall participants' time. Any fewer than 10 would have drastically limited the range of performances available. Also, from previous experience with stimulated recall sessions based on rating two-minute samples (Burton, 2020), I anticipated that each session with 10 samples would last no more than 1.5 hours (I did not want the sessions to go beyond 1.5 hours), which was indeed the case. Recall sessions are generally much shorter than this amount of time, so this was already a substantial cognitive burden on the rater participants. A full recall session with all 30 files would have lasted approximately 4.5 hours. Importantly, analyzing 10 of the 30 samples fit squarely within sequential explanatory mixed-methods design principles (Creswell & Plano Clark, 2017), as I would be able to choose samples that appeared to exhibit the greatest impact of nonverbal behavior on scores in order to better target these samples for elicited responses.

Sample selection for stimulated recall. The choice of samples used in the stimulated recall was an important methodological decision for this study. It needed to be informed by the quantitative data, yet the qualitative recall sessions were inherently a subpart of the quantitative data collection itself because the stimulated recall participants completed the rating study prior to the recall sessions. I thus needed to conduct the recall sessions after a notable amount of quantitative data had been collected, yet sufficiently prior to the end of the study to ensure that enough funding was available to pay participants and that adequate sample size estimates were reached. I settled on an analysis of the scores from the first 60 raters in the study to determine the speech samples to use. This gave me the flexibility to recruit 20 stimulated recall participants and to continue collecting data from non-recall participants. The IELTS scores from Table 4.3 allowed me to rank the speech samples by proficiency level. With the IELTS scores as the baseline ranking, I was able to select stimulated recall files based on the degree to which the ranking changed between the baseline and a ranking of the samples based on the scores from the undergraduates; a sketch of this rank comparison follows below.
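As a minimal illustration of this rank-comparison logic in R (with invented placeholder values, not the study's data):

    # Rank-change computation: official IELTS totals vs. Rasch ability
    # estimates derived from the undergraduate raters' scores.
    # All values below are invented placeholders.
    ielts <- c(S13 = 5.0, S18 = 5.5, S29 = 6.5)       # official IELTS totals
    rasch <- c(S13 = 0.85, S18 = -0.10, S29 = -0.61)  # Facets ability estimates

    rank_change <- rank(rasch) - rank(ielts)  # positive = rose under novice raters
    rank_change[order(abs(rank_change), decreasing = TRUE)]  # biggest movers first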
The samples selected in this way would then represent the greatest deviation between official IELTS proficiency scores and undergraduate-rater ratings: I hypothesized that the deviation would be due at least partially to non-linguistic criteria. To rank the samples based on scores from the undergraduates, I gathered the raw rating data from the first 60 participants to finish the study. I then conducted many-facet Rasch measurement (MFRM) using Facets (https://www.winsteps.com/facets.htm). I ran a partial credit model with raters, samples, and criteria (fluency, vocabulary, grammar, and comprehensibility) as facets. I then extracted the Rasch ability estimates, which I used to rank the speech samples: I set these in a spreadsheet in one column, alongside the same samples' IELTS scores in a second column, and the rank change (the difference between the two rankings) in a third. For the stimulated recall sessions, I chose the 10 speech samples that changed ranking the most: that is, the samples that contrasted the most between the IELTS speaking score ranking and the ranking of the undergraduate-rater-determined scores calculated as Rasch ability measures. For example, Sample 29, which shared the highest IELTS score of 6.5, dropped the most of all test takers, to 17th place in the Rasch ability ranking, a drop of 12 points. On the other hand, Sample 13, who scored 5 on the IELTS test and was ranked in the bottom half of the test takers, rose to 24th place in the rating study according to the Rasch ability estimates. The final selected files are listed in Table 4.4. It is notable that all but two samples (S25 and S29) were in the intermediate range of 5 to 5.5 on the IELTS range. Test takers in the lower range of scores had very low rank changes within ±2 points, apart from sample S08, who dropped four places. Likewise, apart from S25 and S29, test takers in the upper range had rank changes within ±4 points.

Table 4.4 Speech Samples Selected for Stimulated Recall

Sample  Gender  IELTS score  Expressiveness  Rank change
S09     F       5            Low             +10
S12     F       5            High            +6
S13     M       5            Mid             +11
S15     F       5.5          High            +6
S16     M       5.5          Mid             +7
S17     F       5.5          High            +9
S18     F       5.5          Mid             -6
S21     F       5.5          Mid             -11
S25     F       6            High            -10
S29     F       6.5          Low             -12

Rating scales

Performance test scores can be affected by features that are not present in the rating scale (Douglas, 1994; Schoonen, 2005). Raters take into account a range of other factors, such as their own perceptions of the test takers or their perceptions of the quality of the content of the test taker's response (Kuiken & Vedder, 2014). However, empirically derived rating scales and rater training typically try to reduce this type of construct-irrelevant variance in order to measure constructs and subconstructs more precisely. In this study, the opposite was true; that is to say, I wanted to measure the amount of natural variance present when individuals watch, listen to, and make judgements about L2 speaking ability in the target language use domain. By not restricting this variance through the use of detailed rating rubrics or rater training, I hypothesized that stronger insights could be gathered about how potentially construct-relevant features such as gaze, facial expressions, posture, and gesture might impact perceptions of language ability. Knowing how these features impact non-linguist listeners may provide more generalizable information than studying trained raters, and for such purposes much simpler, impressionistic rating scales may be more useful.
As hypothesized in this study, individuals may use both verbal and nonverbal information to make decisions about L2 ability. They may also use both channels of information when perceiving certain affective states and traits, such as confidence and warmth (e.g., Cuddy et al., 2011). There is some indication that these perceptions of affect may correlate with judgements of language ability (Nagle et al., 2022). In fact, it may be the case that listener-raters' judgements of L2 ability are directly impacted by these perceptions of affect, while nonverbal behavior itself has an indirect or even minimal effect. Thus, in order to investigate how perceptions of affect might be related to raters' scores and test takers' nonverbal behavior, I also included a short set of affect scales for the raters to use after assigning language scores. The inclusion of affect scales was also intended to prevent raters from listening only and not watching the video samples, given that language may be judged by listening alone. By including categories such as warmth and attentiveness, raters were implicitly required to watch the video as well as listen, because these judgements would otherwise be harder to make.

I built a measurement instrument using semantic differentials to tap into raters' impressions of the test takers. Semantic differentials (Osgood et al., 1957; Snider & Osgood, 1969) allow researchers to gain insight into evaluations of individuals in performance settings. They are generally single-word adjectives or short descriptions which are paired with their antonyms (e.g., engaged/unengaged, anxious/at ease) set on an ordinal scale of 5 to 10 points. These allow observations of a broad spectrum of attitudes and impressions, as well as the intensity and directionality of each category. Semantic differentials have a number of benefits for rating, as they are simple to understand and generally do not require training. Previously built scales (e.g., Zahn & Hopper, 1985) have largely focused on opinions and evaluations of individuals (e.g., good/bad), though semantic differential scales targeting features of speech have been used in the L2 literature as well (e.g., Burton, 2020; Harding, 2011).

I constructed 14 semantic differential scales based on both features of language and affect. Each scale contained seven points, including a midpoint. The choice of a midpoint is largely one of stance rather than psychometric properties; without a midpoint, raters must choose a direction of an effect (anxious or not anxious), whereas with a midpoint it is possible to see someone as relatively neutral. I chose a scale with seven points because this is the smallest scale with a midpoint that retains desirable psychometric properties, as scales with five or fewer points may have attenuated precision (Simms et al., 2019). Language features on the scale included fluency, vocabulary, grammar, and comprehensibility. Comprehensibility was chosen rather than pronunciation because there is ongoing work on comprehensibility in this area (e.g., Nagle et al., 2022; Tsunemoto et al., 2022). I also believed the novice raters would understand comprehensibility better than pronunciation, as constituent parts of skillful pronunciation (e.g., phonemic control, appropriate stress) may be unfamiliar to linguistic laypeople, and they may have resorted to judgements of accent.
I chose affect measures based on the L2 literature, selecting those that have a confirmed relationship with proficiency, including engagement (Ducasse & Brown, 2009; Jenkins & Parra, 2003; May, 2009; Nakatsuhara et al., 2021a; Sato & McNamara, 2019), anxiety (Sato & McNamara, 2019; Thompson, 2016), confidence (Ducasse & Brown, 2009; May, 2009, 2011; Thompson, 2016), and expressiveness/happiness (Jenkins & Parra, 2003; Thompson, 2016). I also chose two measures related to engagement that reflect the socio-affective orientations of the test taker: attentiveness (Ducasse & Brown, 2009; May, 2011) and interactiveness (related to interactional competence; Galaczi & Taylor, 2018; Plough et al., 2018), as these stances may also factor into raters' judgements. I included warmth, attitude, and competence (Cuddy et al., 2011), as these traits were found to relate to positive or negative outcomes in organizational psychology and could possibly relate to outcomes in this study as well.

The scale was piloted with 25 participants (students in the target demographic, enrolled in a teacher education course) and analyzed using Facets. It was found to function as intended, with scale steps ordered correctly, meaningful separability amongst the seven scale steps, and no misfitting scales overall. Pilot participants indicated that the scales were simple and intuitive to use. The full scale is provided in Figure 4.4.

The scales were presented in the online survey system in the following way. First, the polarity of the adjectives alternated in the operational scales, so that for some scales the positive adjective was on the left, while for other scales the negative adjective was placed on the left. This was to prevent survey acquiescence bias (Iwaniec, 2019), such as marking all positive answers (7s, in this study) for all categories, as raters would have to carefully read each scale to know whether a 1 or a 7 was positive or negative. Second, the language categories were always presented first, in the same order as in Figure 4.4, followed by the affect categories in a randomized order for each speech sample. Randomizing the order was meant to reduce order/primacy/recency effects across the various affect scales. Primacy was desirable for language features, as I wanted raters to focus primarily on the features of L2 ability.

Figure 4.4 Language Features and Affect Rating Scales

Rate the speaker's language on the following elements:
Fluent __:__:__:__:__:__:__ Disfluent
Strong vocabulary __:__:__:__:__:__:__ Weak vocabulary
Strong grammar __:__:__:__:__:__:__ Weak grammar
Comprehensible __:__:__:__:__:__:__ Incomprehensible

Rate the speaker on the following elements:
Engaged __:__:__:__:__:__:__ Disengaged
At ease __:__:__:__:__:__:__ Anxious
Confident __:__:__:__:__:__:__ Not confident
Warm __:__:__:__:__:__:__ Cold
Attentive __:__:__:__:__:__:__ Inattentive
Expressive __:__:__:__:__:__:__ Inexpressive
Happy __:__:__:__:__:__:__ Unhappy
Competent __:__:__:__:__:__:__ Incompetent
Interactive __:__:__:__:__:__:__ Non-interactive
Positive attitude __:__:__:__:__:__:__ Negative attitude

Sign-up survey

Prior to being invited to take part in the study, individuals interested in participating completed a questionnaire which included a range of demographic variables used to determine their eligibility.
The questionnaire had fields for the participants' names (used to address participants in automated e-mails and piped text in the survey), e-mail address, year of birth, gender, nationality, L2 status, L1, year of study at Michigan State University, and major. There were additional questions asking whether participants had access to a quiet, distraction-free space to complete the survey and whether participants would be willing to take part in the stimulated recall. The survey text is presented in Appendix B.

Online rating platform

The online rating platform was constructed and hosted in Qualtrics. I created two formats for the survey: a two-day format used with the majority of the raters, and a one-day format used with the stimulated recall participants. The rationale for the two-day design, which spread the rating out in equal halves over two days spaced 24 hours apart, was to reduce rater fatigue. The novice raters were unaccustomed to rating speech samples, and rater fatigue can reduce the quality of ratings. The one-day design, rather than the longer two-day design, was used with the stimulated recall participants to ensure that the rating of all samples was fresh and recent in the raters' minds when they conducted the stimulated recall. Both designs contained the same components: an introduction, instructions and practice, the speech samples and rating scales, and the follow-up questionnaire. The follow-up questionnaire was only included on the second day of the two-day format. Images of the instructions and practice for the study are presented in Appendix B. These images mirror how the study looked overall, with the exception that feedback was not presented in live rating. I am not able to share direct links to the survey, as it includes proprietary information from IELTS protected by a non-disclosure agreement.

Introduction. The introduction to the online rating platform included a description of the study, instructions on how to set up for the study, data verification, and the consent/non-disclosure agreements (Appendix A). The instructions asked participants to secure time and a quiet place to conduct the study in one sitting, and to use headphones if available. Information was given to participants on follow-up activity after the study (e.g., stimulated recall, compensation), as well as contact information in case problems arose. This section also verified participants' names and e-mail addresses using piped data from the original survey sign-up (see Appendix B). Finally, participants were asked to agree to the consent form and the non-disclosure agreement separately. Any response of "no" to either the consent or non-disclosure agreement closed the survey.

Introductions and practice. The introduction section introduced the task for participants and the terms used in the rating scales. Terms were not explicitly defined because it was desirable for participants to bring their own internal definitions of the terms to the rating scenario. While this can have a negative impact on the reliability of scores, it was advantageous here because, in the target language use domain, individuals make judgements about others using their own internalized understanding of the world around them. Providing extended definitions may have introduced unnecessary confusion or difficulty with the task if raters were unfamiliar with certain ideas.
For example, defining interactiveness as showing evidence of interactional competence may have confused raters unnecessarily, as they would not be accustomed to thinking about interactional skills as defined in applied linguistics.

The practice section immediately followed the introduction. Each participant viewed the same two cases, one of a higher-ability speaker and one of a lower-ability speaker. The practice videos were drawn from samples I had recorded in previous studies with consent from the speakers. Participants were encouraged to consider the performance and rate the sample on the 14 scales. After rating, which was not scored, raters received minimal feedback about the performance (but not about their ratings). For example, the feedback after the first performance was: "Although the speaker struggles to understand the question at first, overall her language is fairly strong once she begins speaking. She manages to communicate fairly effectively." Explicit benchmarked scores were not provided to the raters, as it was hoped that raters would develop their own internal definitions of the scale categories. Both the introduction and practice materials are available in Appendix B. The introduction and practice sections were available to participants on both days of the study for the two-day format, but participants were able to skip the practice on the second day if they desired.

Speech samples and rating scales. After the practice session concluded, participants were directed to the rating portion of the study. The speech samples were presented in a random (but even) order for all participants, meaning that all participants rated all samples one time, but each participant was presented the samples in a different order. The speech sample stimuli were presented on individual pages without the presence of rating scales. Rating scales were presented only after the videos had finished, and on a separate page. The samples and scales were divided onto separate pages to reduce distractions and to encourage participants to watch the video the entire time it was playing, as otherwise the raters may have looked through the rating scales while listening. I embedded JavaScript code so that the samples could be played only one time, with no pausing, no other video controls, and no ability to download the files. I also entered JavaScript code so that the videos would be presented in the maximum size possible (rather than the smaller default versions) within Qualtrics in the participants' internet browsers. I encoded a large video size so that participants would have a much larger visual area to pay attention to. The videos would have taken up the majority of the participants' browser windows. Unfortunately, it is impossible to estimate the exact display dimensions for each person, because screen sizes differed and each person could control their browser window size. As mentioned earlier, the rating scales following the samples were presented with language features first in the same order (fluency, vocabulary, grammar, and comprehensibility), while affect scales were presented in a random order for each participant.

Participants completing the two-day study and the one-day study had slightly different rating designs. For the two-day study, I divided the 30 samples into blocks of 15 by odd and even sample numbers. Dividing by odds and evens ensured the same distribution of proficiency levels for each day (a minimal sketch of this split follows).
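The sketch below illustrates the odd/even block split and a random block assignment in R; the sample IDs mirror S01–S30, but the code is illustrative rather than the script actually used in Qualtrics.

    # Odd/even block split and random block assignment for the two-day design.
    samples <- sprintf("S%02d", 1:30)
    block_a <- samples[c(TRUE, FALSE)]  # odd-numbered samples (S01, S03, ...)
    block_b <- samples[c(FALSE, TRUE)]  # even-numbered samples (S02, S04, ...)

    # Each participant starts day 1 with a randomly assigned block, presented
    # in a random order; day 2 then receives the remaining block.
    day1_block <- if (runif(1) < 0.5) block_a else block_b
    day1_order <- sample(day1_block)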
The two blocks were counterbalanced by randomly assigning participants one of the two blocks to begin with on the first day of the study. Twenty-four hours after completing the first day of the study, participants received an automated e-mail from Qualtrics giving access to the remaining block of questions. This design is shown in Figure 4.5. Participants completing the one-day study were presented all 30 speech samples in random order. After they finished 15 samples, however, a break screen was presented to encourage the raters to rest for a moment in order to reduce rater fatigue.

Figure 4.5 Counterbalanced Blocks for Day 1 and Day 2 of the Two-Day Study

Follow-up survey. After completing the online rating study, participants were directed to an optional brief questionnaire about their experience while participating in the study. The follow-up questions were used to monitor ratings as they were submitted during the study to ensure that participants were not experiencing technical issues. The questions also served as a way to verify how raters felt about the scales themselves and allowed them to comment on any doubts or difficulties while rating. A final section in the questionnaire elicited information regarding how much the raters felt that various facets of the testing situation impacted their ratings of language and affect separately. The follow-up survey items are presented in Appendix C but were not analyzed in this dissertation.

Stimulated verbal recall

Second language research has traditionally relied on product data, such as the rating scores in this study, to explain phenomena relating to language development. Product data, however, give limited insight into the nature of how and why individuals perform in particular ways, and thus process data may support investigations of human behavior through the triangulation of the two data sources. While retrospective interviews, questionnaires, and other forms of qualitative data collection may be used to investigate the internal workings of cognition while performing tasks, one of the more popular research methods over the past 40 years has been the use of verbal protocols. Verbal protocols, arising from research in cognitive psychology, seek "to understand in detail the mechanisms and internal structure of cognitive processes" (Ericsson & Simon, 1993, p. 1). They may occur as concurrent think-aloud protocols, which take place while a task is performed, or they may take place after performing a task as retrospective or stimulated verbal protocols. Because the focus in this study is on the rating of speech, concurrent protocols were unmanageable, as participants' verbalizations of their thought processes would overlap with the speech being heard. In addition, retrospective recalls, which take place after an entire task is performed, were not ideal because the task in this study was to listen to a series of speech samples, and human memory is limited in what it can store and reproduce. For this reason, I chose stimulated verbal recalls for this study. Stimulated verbal recall (or stimulated recall, for short) aims to enable a participant "to relive an original situation with vividness and accuracy if he is presented with a large number of the cues or stimuli which occurred during the original situation" (Bloom, 1953, p. 161).
This type of method generally uses recordings of a task in progress to stimulate the memory of participants, who are then encouraged to verbalize their thoughts by pausing the task recordings when memories occur. Stimulated recall designs are useful for the analysis of rating process data because they can be relatively non-intrusive and intuitive to perform, do not have problems of reactivity (the scoring data have already been collected, so the recall cannot influence them), and are generally considered to be veridical (revealing the true nature of cognitive processes while rating rather than other aspects of cognition) as long as they follow strict guidelines (Bowles, 2018; Egi, 2008). There is a well-established tradition of the use of stimulated recall in L2 research in both SLA and language testing, with guidelines for best practices (Gass & Mackey, 2016), which I used in the design of this study.

The stimulated recall sessions in this study formed an integral part of the sequential explanatory design I adopted, as the speech samples in the recall had been identified as exhibiting scoring patterns that merited further study. The method that I previously described involved identifying samples that were ranked much higher or lower by the study's raters than by the original IELTS scores, indicating a potential effect of behavior on the resulting scores. Ten of the 30 videos showing the largest differences in ranking were chosen for this study.

Each session included carefully drafted instructions, which were piloted operationally with one participant. I drafted the instructions using examples from Gass and Mackey (2016), making sure to emphasize that participants were to recall their thoughts from the time of rating and not their current thoughts during the recall session. Participants were not asked to speak about any particular rating categories, nor were they limited to discussing only verbal aspects of speech in the samples. Participants were shown an example of how to operate the recall session using the space bar of an iMac computer to pause and start the videos. There was no practice session, as recall sessions are generally intuitive, and I wanted to limit any unnecessary time given the number of videos to be watched. Participants were provided with an iPad next to the iMac computer that displayed their score reports for the 10 samples obtained from Qualtrics. That is to say, the PDF printout from Qualtrics showed the same format as the rating scales, which I hoped would help stimulate the raters' memories further. I sat to the right of each participant with instructions and score results available on my laptop computer. The setup is displayed in Figure 4.6. I drafted a set of probe questions to be used when pausing the sample videos. Each session concluded with a semi-structured interview to target specific aspects of my research questions, as well as to follow up on particular comments made by the participants during the session. The drafted instructions, probe questions, and interview questions are available in Appendix D.

Figure 4.6 Stimulated Recall Setup

As explained in the above sections, I used ongoing rating data to identify a subset of 10 samples to be analyzed using stimulated recall. The videos were presented in a random order for each participant. I invited a subset of 20 raters to take part in these sessions. Each recall session lasted an average of one hour and 11 minutes (SD = 18m 58s).
The resulting dataset included 200 unique recalls, 20 follow-up interviews, and 23 hours and 53 minutes of data. The transcribed dataset, including both stimulated recall and interview content, contained 157,894 words.

Procedures

Rating study procedures

The rating study procedures are outlined in Figure 4.7. Data collection took place between December 2021 and April 2022. Participants first indicated their interest in the study by completing the sign-up survey in Appendix B, which collected demographic information used to identify eligible participants. Following this, I invited participants in batches of 10 to 50 individuals to participate in the two-day rating design. Participants were sent e-mail invitations through Qualtrics that piped their unique participant IDs, names, and e-mails into the rating flow. Twenty-four hours after completing day 1 of the study, participants were automatically notified that day 2 of the study was available. Participants were not obliged to complete day 2 of the study immediately but could choose a day and time of their preference. Attrition (participants who did not complete day 2 of the study) was low, at 8%. The follow-up survey was included at the end of the day 2 survey. Upon completion of this study, participants were provided a $30 Amazon e-gift card.

Figure 4.7 Data Collection Procedures

After identifying samples to be targeted for the stimulated recall sessions, I invited participants in batches of 5 to 10 individuals to participate in the one-day rating plus stimulated recall sessions. Those who indicated interest in completing both the rating and stimulated recall sessions were sent links to choose a date and time for their recall sessions, after which they received instructions on how to proceed. Three days before the stimulated recall, I sent each participant detailed instructions outlining each step of the process to ensure that the rating was completed less than 24 hours before the stimulated recall session. One day before the recall session, participants received a link to the rating study. When the rating study was complete, I sent each participant directions to the laboratory used for the session. All e-mail communications to participants for both the two-day rating and one-day rating designs are included in Appendix E.

Stimulated verbal recall session procedures

The stimulated recall sessions took place in March and April 2022. The sessions took place in person at the SLA Knowledge and Production Lab in Wells Hall. All participants had completed the rating session no more than 24 hours before the recall session began. Participants were provided with a bottle of water and invited to sit at a Mac terminal. I sat to the right of the participants and used my laptop to conduct the session. I welcomed each participant with small talk to put them at ease, and then explained the content of the session. Participants then signed a second consent form agreeing to the audio recording of the session, indicated at the end of Appendix A. I then gave the participants instructions on the stimulated recall session and demonstrated how to start and stop each video. I provided participants with an iPad that showed the scores they had awarded for each speech sample. I began the audio recording by both recording a screen share on the Mac computer (in order to capture timestamps of the video when the participant paused) and using an external digital recording device. I then began the session.
Participants watched all 10 speech samples and recalled their thought processes; each session lasted an average of one hour and 11 minutes (SD = 18m 58s). I then debriefed each participant in a semi-structured interview. Each stimulated recall participant was compensated with a $50 Amazon e-gift card. Stimulated recall instructions, probe questions, and interview questions are available in Appendix D.

Analysis

Software

iMotions. I analyzed the video speech samples using iMotions software (Version 9.0; iMotions, 2017). I extracted three indices for analysis in this study: engagement, valence, and attention. I chose these three measures because each captured complex combinations of facial movements that related to features raters identified in the literature (e.g., expressiveness, positivity, and gaze direction). Affectiva (n.d.), the company that produces the Affdex facial-expression algorithm used in iMotions, provided descriptions of how each measure is compiled. Engagement is defined by Affdex as a measure of overall expressiveness derived from the participant's facial muscle activation. Facial muscles contributing to this measure include eyebrow raising and furrowing, cheek raising, nose wrinkling, mouth movements, and chin raising. Engagement is thus not an indication of positive or negative emotion but rather an indication of how non-neutral a participant may appear. Valence, on the other hand, is defined by Affdex as a measure of the positive or negative emotions exhibited by the participant. Valence is calculated from smiling and cheek raising for the positive end of the scale, and brow raising/furrowing, nose wrinkling, mouth frowning, lip pressing, and chin raising for the negative end. Attention is a measure of gaze and head turns directed towards the stimulus source (in this case, an examiner visible on a laptop computer with a camera embedded in the upper bezel). Data output for these action units/expression metrics, emotions, and states is in the form of a probability-based confidence score of 0 to 100 for each video frame. For example, if a frame receives a value of 87 on engagement, the algorithm has classified it as highly likely that the individual is engaged. Valence, or the strength of positive or negative emotions, is the only measure that uses a scale from -100 to 100, where -100 is a highly probable negative overall response and 100 is a highly probable positive response. The final output for each respondent is a table of probability measures for each frame of video analyzed.
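For concreteness, frame-level output of this kind can be pictured as a simple table. The R sketch below uses invented values and placeholder column names, not iMotions' actual export format.

    # Illustrative shape of frame-level output: one row per video frame.
    # Engagement and attention run 0-100; valence runs -100 to 100.
    frames <- data.frame(
      sample     = "S30",
      frame      = 1:5,
      engagement = c(12, 35, 60, 58, 40),
      valence    = c(-5, 10, 42, 30,  8),
      attention  = c(90, 92, 88, 95, 91)
    )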
NVivo. I used the qualitative data analysis software NVivo for Mac (Version 12; qsrinternational.com) to analyze the stimulated recall data. NVivo is a popular tool used to organize qualitative data using cases, which in this dataset were of two kinds, (a) the stimulated recall raters and (b) the speech samples, and nodes, which are annotations of themes that arose in the dataset. The power of NVivo is that once the dataset is annotated, it allows researchers to analyze frequencies and to cross-reference instances of nodes, such as fluency and eye gaze, to find deeper patterns in the dataset.

ELAN. I used ELAN (Max Planck Institute, 2020) to annotate both speech and nonverbal behavior in the 30 speaking test files. For this study, I drew on an annotation system that I developed for the study of nonverbal behavior used in repair sequences (Burton, 2021a). This system included four tiers for verbal information (simplified conversation analysis and individual words) plus seven tiers of nonverbal behavior (gaze, blinks, mouth movements, eyebrow movements, head position, posture, and gesture). For this dissertation study, I refined and added to this annotation scheme as behaviors became salient in the example performances. I added a tier for head gestures (nods and shakes), and I added additional behaviors to the system, such as rocking back and forth, shifting posture, and shifting eye gaze. I also added a category for occasional behaviors that were otherwise uncategorized, such as shoulder shrugs and swallowing. The full annotation scheme is provided in Appendix F.

ELAN produced two types of data useful for the analysis of the stimulated recall data. These were used to compare the raters' recall transcripts with evidence of what the test takers actually did during the samples. One data type was the multimodal transcripts. These were fine-grained, tier-by-tier descriptions of the unfolding interaction between the examiner and test taker. One example, taken from Sample 30, is shown in Figure 4.8. The top line in this sample is the word-by-word annotation for the examiner, who is finishing his question. The next line is the test taker in Sample 30, likewise annotated word-by-word, including breath marks (.hhh) and filled pauses (umm). Next, there are nine tiers of nonverbal behavior, not all of which are annotated because not all behaviors appeared in this excerpt. Gaze was only annotated when averted or shifting, and mutual gaze was left uncoded. In this sample, the test taker averted her gaze when she began her turn, reestablishing mutual gaze with the examiner at the word fact. She smiled during her filled pause and blinked six times. She began tilting her head as she began her turn, first to the right, followed by a full head turn right. Her posture began leaning forward at the word fact. There were no examples of eyebrow movement, head gestures (e.g., nods), or gestures in this excerpt. In the analysis, I removed empty tiers to shorten the transcripts and make them more readable, as appropriate.

Figure 4.8 Multimodal Transcript from ELAN (Sample 30)

The second data type that ELAN produces is the annotation density graph. These graphs show the appearance of behaviors throughout the entire two-minute sample and are useful when considering the sample behavior as a whole. The annotation density plot for Sample 30 is shown in Figure 4.9. This plot is arranged similarly to the multimodal transcript, but solid bars represent the spans during which behaviors were coded within a tier. In this example, the test taker spends most of the time talking, with the examiner asking questions or backchanneling six times. The test taker used variable gaze patterns, withdrawing and reestablishing gaze frequently. She did not sustain smiles for long periods of time, but she did smile frequently. She used few eyebrow movements, though she notably moved them in a sustained manner during the examiner's second question. She turned her head frequently during this sample, rarely keeping a neutral position. She nodded or shook her head (head gestures) six or seven times, particularly during the examiner's questions. She changed her posture frequently as well and produced no visible gestures.

Figure 4.9 Annotation Density Plot from ELAN (Sample 30)

Data Preparation

Rating data. The rating data for the 100 participants were extracted from Qualtrics. I prepared the dataset for analysis using R (R Core Team, 2022). Because the scales were presented to participants with alternating polarity (1 and 7 represented both positive and negative trait judgements, depending on the scale), I recoded all scales so that 1 always represented the negative endpoint (e.g., weak vocabulary, anxious, cold) and 7 always represented the positive endpoint (e.g., strong vocabulary, at ease, warm).
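A minimal sketch of this recoding step, with a hypothetical data frame and column names standing in for the actual variables in Appendix G:

    # Reverse-key the scales whose negative pole was coded as 7, so that
    # 7 is always the positive endpoint on the 7-point scales.
    # `rating_df` and the column names are hypothetical placeholders.
    reverse_keyed <- c("anxious", "cold", "weak_vocabulary")
    rating_df[reverse_keyed] <- lapply(rating_df[reverse_keyed],
                                       function(x) 8 - x)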
The final dataset included the variables listed in Appendix G.

iMotions. Raw iMotions data were prepared prior to use in the study. Raw scores "represent the classification results of the facial expression engine for a certain respondent compared to the facial expressions stored in the global database" (iMotions, 2017, p. 31). Baseline corrections were not applied to these samples, as these would center all participants identically, thus blurring the meaning of neutral states (one participant's neutral state may be highly aroused while another's may appear more neutral). This correction is recommended by iMotions (2017) for the investigation of individuals' relative changes in their own behavior, but in this study the focus was to contrast differences between individuals, not within them. I also avoided thresholding output indices on either time or amplitude. Thresholding is important when determining the likelihood of the appearance of an individual feature such as a smile or brow furrow (analogous to the boundaries of fixations in eye-tracking). These boundaries may be determined by the strength of the facial muscles when producing the behavior (amplitude) or the amount of time the behavior is held (time). Thresholding allows researchers to investigate longer-lasting emotions or emotions that exceed a certain strength (iMotions, 2017). However, in the context of this study, it was unknown whether small variations in change might affect the overall perception of test takers, or whether rapid bursts are noticeable, both of which could be eliminated if time or amplitude thresholding were implemented. For this reason, I used the raw probabilistic data in this study. Raw scores are most appropriate when comparing different individuals or groups of respondents (iMotions, 2017).

iMotions extracted 117,221 frames of data from the 30 speech samples at 30 frames per second, totaling 66 minutes of video. I averaged the engagement, valence, and attention values for each speech sample for analysis. The averages represented the mean of the confidence measures for each value, in other words, the overall strength of engagement, valence, or attention for each speech sample across the entire video. However, the values across the videos were constantly in flux as the test takers' behavior shifted. For this reason, I also computed standard deviations of each value to represent the amount of variation for each test taker, as this would capture the degree of change in behavior, which may also be meaningful in analyses.
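Given a frame-level table shaped like the `frames` sketch shown earlier, this aggregation step might look as follows in R; the use of dplyr is an assumption, and the study's actual script is not reproduced here.

    # Collapse frame-level indices into per-sample means and SDs.
    library(dplyr)

    imotions_summary <- frames %>%
      group_by(sample) %>%
      summarise(across(c(engagement, valence, attention),
                       list(mean = ~mean(.x, na.rm = TRUE),
                            sd   = ~sd(.x, na.rm = TRUE))))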
Stimulated recall transcripts. I prepared the stimulated recall audio recordings for analysis by first conducting an automated transcription of the individual speech samples in Otter.ai (https://otter.ai), which I then listened to individually and corrected. Otter.ai automatically transcribes audio files and automatically recognizes speakers throughout the transcripts. It includes speaker names, timestamps, and the text produced. These are transcribed orthographically without the inclusion of paralinguistic features or filled pauses, except where these were extended. Where words were undetectable or difficult to understand, I indicated these with brackets [Unknown] or [???]. Otter.ai provided commas where natural pauses occurred in the recordings. After the speech samples were fully transcribed, I uploaded the stimulated recall audio recordings to Otter.ai and repeated the same process. In the final transcript, I included the text from the original speech samples up to the point where the recordings were paused in the stimulated recall, rather than the entire block of text. A sample of the transcript is provided in Figure 4.10. The speech sample is labeled at the top, in this case Sample 21. The transcribed sample text is shown in the text box, where Examiner 5 and the test taker in Sample 21 are talking. The timestamps in the text box refer to the times the text in the speech sample was produced. Following this, Rater 14, the 14th stimulated recall participant, provided their recall, with a separate timestamp that refers to the stimulated recall session. Following their comment, I extended the recall as PI (principal investigator) with a standardized question. Rater 14 then answered the question, provided an extended recall, and continued the audio. Note that when Rater 14 referred to an explicit quotation of text from the test taker, I used quotation marks in the transcript to identify the quoted words. Once these transcripts were complete, I uploaded them to NVivo.

Figure 4.10 Stimulated Recall Transcript

NVivo. Data preparation in NVivo involved uploading the transcripts for the stimulated recall participants. Each participant's file received a case coding of the participant number, and within these files each discussion of a speech sample received a case coding of the sample number. This resulted in 20 cases for rater participants (1–20) and 10 cases for speech samples. Final debriefing interviews were coded as a third case, but these data are substantially different from stimulated recall data and beyond the scope of analysis for this dissertation.

ELAN. Multimodal speech and behavior transcriptions for the 30 speech samples were conducted in ELAN. Using the Otter.ai transcriptions, I added speech tiers to the dataset. The only difference in transcription style was that I transcribed filled pauses, audible inhales/exhales, and laughing for each sample. I also annotated a tier marking the production of each individual word. Because the behavioral data were to serve for analysis with the stimulated recall data (and possible subsequent analyses post-dissertation), it was paramount to ensure that the transcripts with behavioral data were as accurate and precise as possible. At the same time, dense multimodal transcription is extremely time consuming. I hired two research assistants (Master of Arts graduate students in Teaching English to Speakers of Other Languages, TESOL) to assist with the behavioral annotations. Instead of a standard 20% reliability check, I designed an iterative training course covering six samples (20% of the dataset) to ensure that the research assistants and I annotated the samples as similarly as possible prior to annotating the entire dataset. During these training sessions, additional behaviors became salient (e.g., throat movements during gulping, shoulder shrugging, tongue sticking out), and I updated the annotation scheme to include these behaviors.
The first stage of training included practice, feedback, and consensus. I trained the research assistants in how to use ELAN for multimodal transcription and annotation. I showed them how to annotate each line of behavior, providing information on how to define boundaries for each. For this stage, we worked with two samples, S01 and S16. I annotated both samples fully, and on separate days the research assistants each submitted their annotations for individual tiers of behavior. In other words, these feedback sessions focused individually on gaze, blinking, eyebrow movements, and so on. In this round of feedback, I compared my annotations with each assistant’s annotations, and we discussed agreements and disagreements, coming to a consensus on each tier of these files. In the second stage of training, each research assistant annotated a different set of two files (06 and 21; 11 and 26), and I annotated all four. The assistants completed all tiers of behavioral annotation, and then I gave feedback to each assistant after comparing my annotations with theirs. I calculated reliability separately for the number of annotations per tier (e.g., gaze, blinks, posture) and the duration of annotations per tier. This was because the number of discrete annotations could be misleading, as each annotation could last from 1 millisecond up to several seconds. The average Pearson correlation between each research assistant and me was .95 for the total number of annotations and .91 for the duration of annotations, which I considered to be a strong measure of agreement. For each file, I offered feedback on both our agreements and disagreements. For each annotation in which we did not align, we reached consensus following feedback and analysis. Following training, the research assistants independently annotated the remaining 24 speech samples. One research assistant annotated 8 files, and the second annotated 16. The numbers differed due to the assistants’ differing work-availability times. After annotation, the research assistant who annotated 16 of the files and I reviewed all files. Multimodal transcripts were then compiled in ELAN for use in the stimulated recall analysis. Data Analysis Rating data cleaning. All variables used in this study are listed in Appendix G. The first part of the statistical analysis involved ensuring the dataset was sufficiently reliable for analysis. Because the study involved novice raters who completed the study entirely online for a gift certificate, it was anticipated that some individuals would be careless or outright negligent in their ratings. It was thus necessary to inspect the dataset for undesirable responses. I cleaned the dataset using a combination of methods. First, I calculated Spearman correlations for each of the 99 raters to determine whether they agreed with the rest of the group on stronger and weaker performances overall. For each rater, I averaged that rater’s four language-outcome correlations with all other 98 raters. I then calculated a grand mean of all raters’ average correlations to produce a rough estimate of each rater’s overall agreement with the group. I flagged raters with correlations lower than .4.
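This agreement screen can be sketched in a few lines of R. The sketch below is a minimal illustration, assuming a hypothetical sample-by-rater score matrix `scores` for a single language criterion; it is not the study’s exact script.

# Spearman correlations between every pair of raters (columns = raters).
rho <- cor(scores, method = "spearman", use = "pairwise.complete.obs")
diag(rho) <- NA                                # drop self-correlations
mean_agreement <- rowMeans(rho, na.rm = TRUE)  # each rater's mean correlation
flagged <- names(mean_agreement)[mean_agreement < .4]  # raters to inspect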
Second, I ran an MFRM analysis—with 99 raters, 30 speech samples, and four language criterion measures as facets in a partial credit model—to identify misfitting raters using fit statistics. I used only the language criterion measures of fluency, vocabulary, grammar, and comprehensibility in this model, as there is an expectation of unidimensionality in Rasch models (Aryadoust et al., 2021), and I posited that including the affect scales would introduce multidimensionality. I used Bond et al.’s (2020) and Linacre’s (2002) criterion of fit between .5 and 1.5 for infit mean square values, and Wright and Linacre’s (1994) criterion of a maximum of 2.0 for outfit mean square values. Infit statistics report the mean squared residuals of raters on inlying performances, while outfit statistics identify outlying performances (Linacre, 2002). Finally, as outliers can have a negative impact on ordinal regression (Tabachnick & Fidell, 2013), I checked the dataset for patterns of multivariate outliers amongst participants. I used the four language ability scales to check for outliers in the four-dimensional data. I used Mahalanobis distances, inspected in a Q-Q plot, to detect values that veered from the expected distribution. These values corresponded to individual sets of scores on the four language criterion measures. I then produced a frequency table of the raters who produced the highest number of multivariate outliers so that they could be flagged for possible erratic behavior.
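A minimal sketch of that outlier check in R follows, assuming a hypothetical data frame `profiles` with one row per rater-by-sample score profile; the .999 cutoff shown is illustrative rather than the exact criterion used.

X <- profiles[, c("fluency", "vocabulary", "grammar", "comprehensibility")]
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Under multivariate normality, squared Mahalanobis distances follow a
# chi-squared distribution (df = 4), so outliers rise above the Q-Q line.
qqplot(qchisq(ppoints(length(d2)), df = 4), d2,
       xlab = "Chi-squared quantiles", ylab = "Squared Mahalanobis distance")
abline(0, 1)

# Tally flagged profiles per rater, here using the .999 quantile as a cutoff.
profiles$outlier <- d2 > qchisq(.999, df = 4)
table(profiles$rater[profiles$outlier])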
With these three methods (low rank correlations with the overall group, Rasch misfit, and multivariate outliers), I removed 16 raters whose scores did not align with the majority of the group. Integrity of rating data. The second part of the analysis involved providing evidence that the rating data could be used to produce meaningful inferences. I calculated descriptive statistics to show overall trends in the dataset. I report Cronbach’s alpha for each scale and the distributions of rating scale scores. I also report scale, sample, and rater data derived from the partial credit Rasch analysis (using only the language criterion scales due to dimensionality concerns) to provide evidence that the scales, raters, and samples functioned within acceptable parameters. In addition, I report intraclass correlation coefficients (ICCs) as a measure of interrater reliability. ICCs provide a measure of both correlation and degree of agreement, taking into account the mixed-effects nature of the dataset (raters x samples). ICCs below .5 are low, between .5 and .75 are moderate, between .75 and .9 are good, and above .9 are high (Koo & Li, 2016). Affect analysis. To answer the first research question, I began by calculating inter-scale correlations to determine the associations amongst all scale categories. I used polychoric correlations because Pearson correlations are likely to attenuate relationships amongst ordinal or ordered categorical variables (Winke et al., 2022). I used the polychoric function in the psych package (version 2.0.8) in R. In order to produce inferential data about the relationship between affect and language proficiency, it was desirable to reduce the dataset to a smaller number of components for analysis. Although principal components analysis is mathematically more appropriate for data reduction and summarization when there are no theoretical relationships underlying the variables (Tabachnick & Fidell, 2013), the variables in this study were semantically related, and I hypothesized that an underlying factor structure existed that would enhance interpretability after data reduction (rather than a component structure that cannot be interpreted). Given that interpretability was a key aim of this study, I checked the factor structure underlying the 10 affect scales. Because violating assumptions such as multicollinearity can result in unstable factor scores, I checked the assumptions for factorability to make sure the factor solution would be robust. Afterwards, I reduced the dataset using exploratory factor analysis on the polychoric correlation matrix, with maximum likelihood estimation and a promax rotation.
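The extraction step can be illustrated with the psych package. This is a minimal sketch, assuming a hypothetical data frame `affect` holding the affect scales as ordinal integer scores; the number of factors shown reflects the solution reported in Chapter 5, and this is not the study’s exact script.

library(psych)

# Exploratory factor analysis on the polychoric correlation matrix,
# with maximum likelihood estimation and an oblique (promax) rotation.
efa <- fa(affect, nfactors = 4, fm = "ml", rotate = "promax", cor = "poly")
print(efa$loadings, cutoff = .45)  # suppress trivial loadings for readability

# Factor scores, later entered as predictors in the ordinal regressions.
f_scores <- efa$scores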
I then used the factor scores from the factor analysis to conduct mixed-effects ordinal regression, as described in detail below, to observe relationships between the factors and language proficiency outcomes. iMotions behavioral data. Descriptive information on the means and standard deviations is presented for each of the iMotions variables: valence, engagement, and attention. In addition, graphical data exploring the distributions of these variables are presented for descriptive analysis. I then built models using the means of the iMotions variables as predictors in mixed-effects ordinal regression, as described below, to answer research question 2.1. To answer research question 2.2, I used the scaled scores from IELTS as interaction terms in the same models. Throughout the dissertation, these scores will be referred to as base proficiency scores, as they were external and measured prior to the research taking place. The analyses described above were confirmatory and preregistered in the Open Science Framework (https://osf.io/u6243). An exploratory analysis, not hypothesized a priori, involved using a separate set of predictors based on the standard deviations of the variables rather than the means. Using standard deviations allowed testing whether variance in behavior (instead of the average level) would predict proficiency measures. Ordinal regression. I used cumulative logit mixed-effects models—a proportional odds model with ordinal outcomes—to test for relationships amongst the extracted factor scores and the language variables, as well as amongst the iMotions behavioral indices and the language variables. I used the clmm function in the ordinal package (v. 2019.12-10) in R to account for the mixed effects of raters and samples. Most assumptions for the four models were met: the dependent variable was ordinal, the independent variables were continuous (factor scores and iMotions scores), and there was no evidence of multicollinearity. The fourth assumption, that of proportional odds, or parallel regression slopes, was checked with Brant’s tests (Brant, 1990) using the brant package (v. 0.3-0) in R on models with random effects removed (polr models). This is because no packages are currently able to check for parallel regression slopes with clmm, and current recommendations suggest using base models without random effects as sufficient evidence of proportional odds. Not all of the models supported the proportional odds assumption, which is a claim that the predictor variables have an equal impact on the outcome variable at each score level. However, this may not necessarily be problematic when estimating average odds ratios across samples/raters, and when the aim of the study is not to predict discrete outcomes for individuals. As Harrell (2020) wrote, When [the proportional odds assumption] does not hold, the odds ratio from the proportional odds model represents a kind of average odds ratio… a unified [proportional odds] model analysis is decidedly better than turning to inefficient and arbitrary analyses of dichotomized values of Y (Conclusion section, para. 1). For this reason, I estimated model effects using mixed-effects ordinal regression rather than comparable multinomial models, which would hinder parsimony and interpretability. In each model, I entered variables in order of their correlation with the dependent variable, creating four models including the null model. Each analysis began with the null model, entered as clmm(Language Score~1+(1|Rater)+(1|Sample)) where Language Score refers to each dependent variable (fluency, grammar, vocabulary, and comprehensibility), 1 indicates that no fixed effects were entered, and rater and sample were entered as separate random effects. Following this, the remaining variables were entered in each model one by one, with the final model including all three: clmm(Language Score~Var1+Var2+Var3+(1|Rater)+(1|Sample)) All models used a logit link and flexible thresholds for the most accurate estimation of score probabilities. I then selected the best-fitting model based on comparisons using likelihood ratio tests. I also tested the final selected model against the same model with random effects removed (a clm model) to verify that the random effects contributed meaningfully to the model. Bonferroni corrections were applied to the four sets of analyses to control for multiple hypothesis tests, resulting in a more conservative Type I error threshold of α = .0125. Plots of regression slopes, with points jittered to avoid overlap at scale scores, are presented to illustrate relationships with the final scores. Interactions. Interactions between base language proficiency level (using rescaled IELTS scores) and the behavioral variables were also tested in clmm. In this model, Prof refers to the base proficiency, scaled IELTS score, and interactions were tested against all measured variables in the model simultaneously. The model is as follows: clmm(Language Score~Prof*Var1+Prof*Var2+Prof*Var3+(1|Rater)+(1|Sample)) Significant interactions were explored using the marginaleffects package in R (version 0.8.1). However, the post hoc analysis using this software required two key modifications. This package computes marginal effects of interaction terms for cumulative logit models without mixed effects—clm models. I used clm models for the post hoc analyses because clmm lacks a predict method, making it impossible to run marginaleffects (Arel-Bundock, personal communication); this method is thus the best approximation for analyzing these interactions. The second modification is that marginaleffects can only handle categorical interaction terms. I thus dichotomized proficiency into two levels, low and high, using the rescaled IELTS scores. I decided not to split proficiency into three levels (low, medium, and high), as this would have created two groups with only 8 cases, and roughly 15 samples per group in the dichotomized interaction was already a limited size. The low proficiency group included scaled scores below 4, comprising 14 samples equivalent to IELTS scores below 5.5, which is accepted to be below B2 level on the CEFR (Lim et al., 2013). High proficiency included scores greater than or equal to 4, which accounted for 16 scores at IELTS 5.5 or above, generally considered to be at B2 or above. The interaction models were then: clm(Language Score~Prof*Interaction_Var) I then used the comparisons function to compute the coefficients for each comparison. This function gives comparisons between proficiency levels at each score level, 1–7. This gives insight into the differential effects of the interaction according to the scores raters assigned each sample. I used the plot_cap function to visualize the conditional adjusted predictions of the interactions, which show the predicted probabilities of a score assignment at a given proficiency level for a single interaction variable.
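The following R sketch illustrates this workflow for one outcome, assuming a hypothetical long-format data frame `d` (with the outcome stored as an ordered factor and Var1–Var3 standing in for the predictors); it is an illustration of the procedure, not the study’s exact script.

library(ordinal)
library(marginaleffects)

# Sequential model building with likelihood ratio tests (the logit link and
# flexible thresholds are the clmm defaults).
m0 <- clmm(fluency ~ 1 + (1 | rater) + (1 | sample), data = d)
m1 <- clmm(fluency ~ Var1 + (1 | rater) + (1 | sample), data = d)
m2 <- clmm(fluency ~ Var1 + Var2 + (1 | rater) + (1 | sample), data = d)
m3 <- clmm(fluency ~ Var1 + Var2 + Var3 + (1 | rater) + (1 | sample), data = d)
anova(m0, m1, m2, m3)  # likelihood ratio tests for model selection

# Compare the selected model against its fixed-effects-only counterpart by
# computing the likelihood ratio statistic by hand (df = 2 variance terms).
m1_fixed <- clm(fluency ~ Var1, data = d)
2 * (logLik(m1) - logLik(m1_fixed))

# Post hoc probe of a significant interaction using a clm model with a
# dichotomized proficiency factor, as described above.
d$prof2 <- factor(ifelse(d$prof < 4, "low", "high"))
mi <- clm(fluency ~ prof2 * Var1, data = d)
comparisons(mi, variables = "prof2")  # contrasts at each response category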
Stimulated recall. After transcribing the stimulated recall data, I devised a coding scheme to begin analyzing the dataset. I began by reading a sample of transcript files carefully and assigning codes based on items from the rating scales, behaviors identified during the ELAN file preparation, and other topics the raters raised (comments on the content of what was discussed, desire to communicate, humor, etc.). This resulted in a set of 40 preliminary codes. In the process I describe below, these codes were reduced to 38 and arranged thematically before being applied to the whole dataset. I began by segmenting the dataset into idea units based on a semantic analysis of the textual content. I used Brown et al.’s (2005) definition of idea units as “single or several utterances, either continuous or separated by other talk but falling within the same turn, with a single aspect of performance as the focus” (p. 14). Most recalls consisted of a single idea unit, though some longer recalls contained multiple units. An example is Rater 9’s recall, a comment on Sample 9 made after playing the file for 1 minute. This recall was complex and contained multiple ideas. Idea units in this example are indicated with double slashes (//), which mark idea unit boundaries. Three idea units appeared in this recall. Note that these idea units contain multiple references to concepts, but each idea itself is generally focused. // (Unit 1) Um, I would say I know that I put vocabulary pretty strong. And I think it was not necessarily just from the beginning, but throughout the video, I could tell that she was saying some words that I may have not expected someone who doesn't know English very well to use, which is why I also put fluency higher, because from that, I would assume that she knows English better than other people. // (Unit 2) And I was also comparing these videos to each other a lot, which I think probably impacted what my scores were like, from the first videos I watched towards the ones at the end, because I was able to compare them to other ones. I don't, I'm not sure when I watched this video in comparison to the others. // (Unit 3) But I don't know from what I can tell so far, even from just like the first sentence of what she said, it seems like she knew what she was saying. And she understood the question very well. // The first idea unit focused on the relationship between overall language ability, vocabulary, and fluency. The second idea unit was an observation on metacognitive rating strategies, namely comparing performances and the effects of order on the rater’s scores. The final idea unit contained observations about the content of the utterance and its relationship to test question comprehension. I segmented the entire dataset myself following the above scheme because segmentation “require[s] subjective interpretation, contextualization, and especially a thorough understanding of the theoretically motivat[ed] questions guiding the study” (Campbell et al., 2023, p. 304). Following segmentation, I began refining the coding scheme I had developed earlier.
I first ran a reliability check to determine whether my codes were logical and applicable to this dataset. For the check, I input 10% of the idea units—a total of 186 units—into an Excel sheet for double coding. This percentage was chosen rather than the more standard 20% due to the substantial size of the dataset (with precedent in the literature, e.g., Sato & McNamara, 2019). The exact agreement between the two coders was 77%. Based on the areas of disagreement, I refined the coding scheme to reduce ambiguity; for example, I had initially included codes for various aspects of content (amount of discussion, breadth of discussion, naturalness of content, truthfulness of ideas), which I collapsed into one test content category. Afterwards, the research assistant and I discussed the remaining areas of disagreement, and each made final decisions on the remaining idea units. The final agreement rate was 98%. The final coding scheme I developed is presented in Table 4.5.

Table 4.5 NVivo Coding Scheme
Affect: Anxiety, Attentiveness, Attitude, Competence, Confidence, Desire to Communicate, Engagement, Expressiveness, Happiness, Humor, Interactiveness, Warmth
Behavior: Eyebrows, General Face Behaviors, General Body Language, Gaze, Gesture, Head, Mouth, Paralinguistics, Posture
Language: Comprehensibility, Comprehension, Fluency, Grammar, Organization, Overall ability, Pronunciation, Vocabulary
Test Interaction: Active listening, Content, Examiner, Relevance-Contingence, Repair, Thinking, Turn-taking, Visual Artifacts

I coded the entire dataset in NVivo for Mac (Version 12). First, I assigned attribute coding, identifying sections of text by stimulated recall rater and by test taker. This resulted in 200 unique observations (20 raters by 10 stimulated recalls). Following this, I coded the entire dataset using the coding scheme in Table 4.5. This resulted in 4,251 decisions on 1,213 idea units (M = 3.5 codes per idea unit). To illustrate, I provide an example of the coding process in Table 4.6. Unit 1 contained only language-related evidence, so it was coded as vocabulary, fluency, and overall ability. Unit 2 did not contain any information of relevance to this study, as mentions of metacognitive strategies were beyond the scope of this analysis; thus, Unit 2 was not coded. Unit 3 contained multiple codes, namely a focus on comprehension (language), the content of the utterance (they knew what they were saying; test interaction), and an implication that the test taker was competent (they knew what they were saying; affect).

Table 4.6 Coding Example
Unit 1 (Rater 9 / Sample 9): “Um, I would say I know that I put vocabulary pretty strong. And I think it was not necessarily just from the beginning, but throughout the video, I could tell that she was saying some words that I may have not expected someone who doesn't know English very well to use, which is why I also put fluency higher, because from that, I would assume that she knows English better than other people.” Codes: LANGUAGE (Vocabulary, fluency, overall ability)
Unit 2 (Rater 9 / Sample 9): “And I was also comparing these videos to each other a lot, which I think probably impacted what my scores were like, from the first videos I watched towards the ones at the end, because I was able to compare them to other ones. I don't, I'm not sure when I watched this video in comparison to the others.” Codes: N/A
Unit 3 (Rater 9 / Sample 9): “But I don't know from what I can tell so far, even from just like the first sentence of what she said, it seems like she knew what she was saying. And she understood the question very well.” Codes: TEST INTERACTION (Content), LANGUAGE (Comprehension), AFFECT (Competence)

While these codes allowed for a broad overview of where raters directed their attention in the stimulated recalls, sufficient for analyzing patterns underlying the relationship between nonverbal behavior and language proficiency (RQ3.2), an extra level of granularity was necessary to identify the nonverbal behaviors raters found most salient during their recalls (RQ3.1). Once the entire dataset was coded, I added an additional layer of subcodes to the Behavior category based on the raters’ comments. These nonverbal behavioral subcodes are listed in Table 4.7 and resulted in 505 additional coding decisions.

Table 4.7 Subcodes of Nonverbal Behavior
Eyebrows: Furrowed, Movement, Raised
Gaze: Averted, Blinking, Eyes grow wide, Mutual, Shifting (Movement), Staring, Unfocused
General Face Behaviors: No subcodes
General Body Language: No subcodes
Gesture: Lack of hand movement, Random movement, Representational gestures, Self-adaptors
Head: Turns, Nodding
Mouth: Frowning, Lack of smile, Lip movements, Mouth barely open, Nervous smile, (Genuine) smile, Swallowing
Paralinguistics: Audible breathing, Backchanneling, Filled pauses, Laughing, Speed, Tone-prosody, Volume
Posture: Adjusting posture, Leaning back/Slouching, Leaning forward, Moving around, Rigid/Straight posture, Rocking-Shaking, Shoulder movements

The coding scheme allowed me to view the frequencies of different comments regarding nonverbal behavior and affect, the extensiveness of their appearance across the 20 raters, and the relationships between comments and judgements of language. By analyzing areas where language ratings intersected with behavior, I was able to extract patterns and themes from the dataset. I generally followed Corbin and Strauss’ (2015) method of constant comparisons, whereby intersections of data are repeatedly checked for similarities and differences in order to build theory. This analysis, however, deviates from Corbin and Strauss’ grounded theory in that I approached the topic with clear hypotheses. I present the themes, frequencies, and extracts illustrating the patterns I found in Chapter 7. In order to further illustrate the raters’ observations, where appropriate, I also include the multimodal transcripts from ELAN alongside quotes.

CHAPTER 5: AFFECT AND LANGUAGE PROFICIENCY
The purpose of this chapter is to report the analyses of the rating data from the online survey. First, I will describe the method I used to enhance data integrity by removing participants who exhibited irregular or undesirable scoring tendencies. Second, I will describe the dataset of observed rater judgements, including scores on language elements and impressionistic judgements, with the aim of providing evidence that the online survey provided meaningful data. I will then consider the structure of the dataset through an exploratory factor analysis to determine whether the 14 rating categories showed an underlying factor structure that could be used to reduce the dimensionality of the dataset.
Finally, I will use the reduced dataset to explore relationships amongst the observed variables. The research question guiding Chapter 5 is: RQ1: What is the relationship between interpersonal affect and language proficiency? Participant selection In planning this study, it was anticipated that some raters would exhibit undesirable rating tendencies, given that they were not trained and had only minimal practice before completing the study. More raters were therefore recruited than necessary so that problematic participants could be removed, but it was also desirable to keep as many raters as possible, so fairly liberal exclusion measures were adopted. This allowed the integrity of the dataset to be strengthened without losing substantial statistical power. I considered three key threats to rating quality: 1) raters with low correlations against other raters on each language criterion, 2) rater misfit, and 3) multivariate outliers. The analyses below led to the exclusion of 16 raters. Spearman correlations The mean Spearman correlations showed that 63 raters correlated with the group in their ranking of the test takers at .5 or above, and 89 raters correlated at .4 or above. Given the exploratory nature of this study, I opted for the less conservative correlation estimate in order to retain the higher number of raters. Table 5.1 shows the correlations of the eighteen raters with the lowest grand means, of which 11 had means below .4. These 11 raters were flagged to be removed from the dataset.

Table 5.1 Lowest Means and Grand Means of Spearman Correlations
Rater  Fluency  Vocabulary  Grammar  Comprehensibility  Grand Mean
86*    .24      .17         .21      .20                .20
72*    .15      .30         .29      .14                .22
16*    .36      .33         .15      .15                .25
14*    .46      .32         .37      .28                .36
50*    .45      .45         .20      .34                .36
94*    .39      .45         .18      .47                .37
47*    .46      .35         .33      .37                .38
37*    .43      .35         .31      .44                .38
96*    .50      .42         .22      .42                .39
21*    .50      .41         .34      .34                .39
58*    .53      .45         .21      .35                .39
44     .49      .44         .39      .31                .41
30     .42      .41         .39      .44                .41
78     .47      .46         .27      .43                .41
51     .48      .44         .25      .46                .41
68     .51      .42         .40      .29                .41
15     .54      .45         .36      .34                .42
53     .63      .55         .02      .49                .42
Note. * indicates the rater’s grand mean was less than .4 and the rater was dropped from the study.

Misfit The second criterion for exclusion was the presence of possible rater effects, visible through misfit in the MFRM model. Figure 5.1 shows the fit measures at the upper and lower ends of the rater measurement table. Eight raters underfit the model, with infit over 1.5, and only one rater overfit the model with fit indices below .5. Although infit values lower than .5 indicate more stable rating patterns than expected, there was only one rater in this category, so this individual was left in the dataset. Three of the underfitting raters were also in the set of raters with low correlations. The five additional raters were flagged to be removed from the dataset. Figure 5.1 Fit Measures of Raters Multivariate outliers The Q-Q plot of the Mahalanobis distances is shown in Figure 5.2. The numbers in this figure correspond to a single set of language scores (all four criteria) awarded to one individual by one rater in the long-form dataset. Points veering substantially from the diagonal line represent multivariate outliers. These were often cases of extremely jagged score profiles (such as 1, 1, 7, 1 for fluency, vocabulary, grammar, and comprehensibility), which could have been due to carelessness in not noticing the shifting polarity of the semantic differentials.
Each rater produced 30 score profiles, so there were 2,970 profiles in total. Figure 5.2 Q-Q Plot of Values and Multivariate Outliers I then compiled a list of the raters with whom the multivariate outliers were associated. The frequency table of raters and the number of outliers associated with them is presented in Table 5.2. Each rater awarded 120 language ability scores (30 samples x 4 criteria), grouped into 30 four-score profiles, and the percentage of outlying profiles (of 30) is presented in the third column. I speculated that a small degree of carelessness (one or two ratings) would probably not impact the overall dataset, and that dropping raters over a small number of cases could be detrimental if the raters’ ratings were otherwise careful. There were only two raters with more than two outliers (46 and 72), and both had already been identified as misfitting in the Rasch analysis or correlation analysis. These raters produced 4 and 6 outliers respectively, and as such were flagged to be removed. The rest of the raters in Table 5.2 were kept in the study, as they may have represented important aspects of the population being sampled (Tabachnick & Fidell, 2013).

Table 5.2 Frequency Table of Multivariate Outliers
Rater  Number of outliers  % of outlier profiles (of 30)
2      2                   6.67%
9      1                   3.33%
13     1                   3.33%
14     1                   3.33%
21     1                   3.33%
25     1                   3.33%
26     1                   3.33%
29     1                   3.33%
32     1                   3.33%
46     4                   13.33%
48     1                   3.33%
55     1                   3.33%
56     1                   3.33%
63     2                   6.67%
66     2                   6.67%
68     1                   3.33%
72     6                   20.00%
80     1                   3.33%
82     1                   3.33%
83     1                   3.33%
90     1                   3.33%
92     2                   6.67%
94     2                   6.67%
97     1                   3.33%

As mentioned previously, the cleaning analysis resulted in 16 raters being dropped from the dataset. Four of these raters showed low alignment with the greater group of raters through grand mean Spearman correlations (30, 71, 80, 82). Eight showed misfit in the Rasch model (14, 16, 21, 37, 47, 50, 58). Four additional raters showed a combination of more than 10% of scores being outliers, misfit, or low correlations (46, 86, 72, 94). Additionally, one rater had been dropped prior to the analysis due to technical problems in the recording. The remaining analyses were thus conducted with 83 raters. Dataset description Scales Mean scores on each of the language and affect scales were situated towards the scale midpoint of four or slightly higher, with raters assigning the full range of scores on each scale. Descriptive statistics are presented in Table 5.3. The lowest mean score was for grammar (4.14), while the highest was for attention (5.34). Standard deviations indicated somewhat less variance for most of the affect-related scales, with the lowest variance in attitude (1.22), while language-related scores showed greater variance, with a high SD for vocabulary (1.67). Alpha levels varied between .65 (anxiety) and .85 (fluency). These levels, although a little low for operational testing (especially anxiety), indicate a fair degree of consistency despite the unlabeled scale levels and lack of rater training. Indeed, .85 is remarkably good and on par with high-stakes, standardized tests (Nunnally & Bernstein, 1994; Zhang, 2010). Interrater reliability estimates, estimated using intraclass correlation coefficients (ICCs), will be presented in the section on raters below.
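As a pointer to how such ICCs can be obtained, the following is a minimal sketch using the psych package, assuming a hypothetical samples-by-raters score matrix for a single outcome; the exact model specification used in the study is the one described in the methods.

library(psych)

# rows = 30 speech samples, columns = raters; one matrix per outcome.
# ICC() reports single-rater and average-rater coefficients under
# one-way and two-way (random/mixed) models.
ICC(fluency_matrix)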
Table 5.3 Descriptive Statistics and Cronbach’s Alpha of Scales
Scale              Mean  SD    Skewness  Kurtosis  SE    α
Fluency            4.58  1.65  -0.40     -0.86     0.03  .85
Vocabulary         4.33  1.67  -0.13     -1.07     0.03  .81
Grammar            4.14  1.59  -0.03     -1.03     0.03  .72
Comprehensibility  4.83  1.64  -0.47     -0.81     0.03  .79
Engagement         5.33  1.31  -0.79      0.37     0.03  .79
Anxiety            4.12  1.54   0.06     -0.92     0.03  .65
Confidence         4.45  1.59  -0.23     -0.86     0.03  .82
Warmth             4.76  1.34  -0.28     -0.43     0.03  .73
Attention          5.34  1.23  -0.83      0.71     0.02  .74
Expressiveness     4.56  1.57  -0.44     -0.64     0.03  .77
Happiness          4.77  1.32  -0.23     -0.30     0.03  .75
Competence         4.87  1.63  -0.51     -0.66     0.03  .84
Interactiveness    4.98  1.39  -0.58     -0.17     0.03  .78
Attitude           5.05  1.22  -0.23     -0.36     0.02  .74

Table 5.4 and Figure 5.3 show the distribution of rater choices of individual score categories. While raters used the whole range of possible scores, some scales such as engagement and attention showed negatively skewed and highly kurtotic patterns, while others such as warmth and happiness showed a distribution appearing more “normal” (normal is in quotation marks as normality is not considered for ordinal data). For the four language categories, raters appeared more likely to avoid a midpoint of four and instead lean towards a more positive or negative end of the scale. In fact, a selection of 4 was never the most common choice on any of the scales, even though it was the default selection on the scales in Qualtrics.

Table 5.4 Frequency Counts of Scale Categories
Scale                 1   2    3    4    5    6    7
1. Fluency            82  265  400  247  637  599  260
2. Vocabulary         82  322  541  246  587  477  235
3. Grammar            83  342  600  307  572  447  139
4. Comprehensibility  49  230  370  224  598  617  402
5. Engagement         17  70   194  204  803  706  496
6. Anxiety            74  289  682  349  555  392  149
7. Confidence         75  237  492  331  640  485  230
8. Warmth             18  106  360  466  799  501  240
9. Attention          14  61   161  209  840  787  418
10. Expressiveness    83  229  389  263  790  517  219
11. Happiness         23  89   311  584  761  463  259
12. Competence        61  196  334  259  644  551  445
13. Interactiveness   25  115  295  279  837  620  319
14. Attitude          4   45   206  529  810  567  329

Figure 5.3 Scale Histograms

Scale correlations Next, I considered the relationships amongst the scales. All correlations were positive, ranging from medium (.4) to strong (≥ .6) (Plonsky & Oswald, 2014). The weakest correlation was between anxiety and attention (.40), while the strongest correlation was between fluency and vocabulary (.85). The full correlation matrix is detailed in Table 5.5.

Table 5.5 Polychoric Correlations
                      1    2    3    4    5    6    7    8    9    10   11   12   13
1. Fluency
2. Vocabulary         .85
3. Grammar            .75  .74
4. Comprehensibility  .81  .74  .67
5. Engagement         .63  .58  .51  .59
6. Anxiety            .56  .54  .48  .50  .43
7. Confidence         .73  .70  .61  .63  .62  .72
8. Warmth             .48  .45  .41  .50  .63  .43  .54
9. Attention          .59  .55  .47  .57  .84  .40  .58  .59
10. Expressiveness    .58  .55  .47  .58  .66  .47  .61  .72  .60
11. Happiness         .52  .48  .42  .52  .62  .45  .58  .82  .59  .73
12. Competence        .84  .77  .68  .76  .68  .54  .70  .53  .66  .59  .56
13. Interactiveness   .63  .58  .50  .59  .76  .45  .62  .63  .72  .68  .63  .67
14. Attitude          .51  .47  .42  .52  .69  .41  .58  .79  .63  .70  .82  .55  .64
Note. All correlations significant at p < .05.
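The correlations in Table 5.5 can be computed as in the following minimal sketch, assuming a hypothetical wide data frame `scales14` with one column per rating scale (ordinal integers 1–7); this is not the study’s exact script.

library(psych)

# Polychoric correlations treat the 7-point ratings as ordinal categories,
# avoiding the attenuation that Pearson correlations show with ordinal data.
pc <- polychoric(scales14)
round(pc$rho, 2)  # the matrix reported in Table 5.5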
Scale functioning in Rasch The model used to interpret scale functioning was the same partial-credit model as in the outlier analysis, with samples, language criteria, and raters as facets. I did not run the model with the affect variables included due to possible problems with multidimensionality (Aryadoust et al., 2021). Overall, the language-related scales appeared to function as anticipated in an MFRM analysis of the final, outlier-removed dataset. The MFRM summary statistics are available in Table 5.6. Average fit statistics were all very close to 1.00, with standard deviations within the recommended cutoffs of .5 to 1.5. The separation index was high at over 10, suggesting that the scale could reliably separate ability levels into 10 different categories.

Table 5.6 MFRM Summary Statistics
                                  Rater    Sample   Criteria
N                                 83       30       4
Measures
  M                               -0.38    0.00     0.00
  SD (pop.)                       0.59     1.12     0.35
  SE                              0.08     0.05     0.02
  RMSE (pop.)                     0.08     0.05     0.02
  Adjusted (true) SD (pop.)       0.39     0.80     0.19
Infit MS
  M                               1.00     1.03     1.01
  SD (pop.)                       0.31     0.20     0.13
Outfit MS
  M                               1.04     1.04     1.04
  SD (pop.)                       0.33     0.21     0.16
Homogeneity index (χ2)            1770.00  6363.40  458.40
  df                              82       29       3
  p                               < .001   < .001   < .001
Separation (pop.)                 4.75     15.89    10.72
Reliability of separation (pop.)  .96      > .99    .99
Interrater reliability
  Observed exact agreement %      25.7
  Expected %                      25.3
  Rasch κ                         .001

The Wright map for the final dataset on the four language categories is presented in Figure 5.4. The figure shows that Grammar was indeed the most difficult criterion, as suggested by Table 5.3, and Comprehensibility was the easiest. These differed by .52 logits. Each criterion demonstrated adequate measurement properties based on its category statistics. I have included these statistics and their category probability curves in Appendix H. Figure 5.4 Wright Map of Dataset Note. S.1 = Fluency, S.2 = Vocabulary, S.3 = Grammar, S.4 = Comprehensibility. Samples The means and confidence intervals for each sample by rating scale are displayed in Figure 5.5. Descriptive statistics of the ratings of the speech samples appeared to correspond roughly with their original ordering based on IELTS scores (Sample 1 was the least proficient candidate, while Sample 30 was the most proficient), though as noted in the methods section, some samples dropped substantially in their ranking (e.g., Sample 29) while others rose (e.g., Sample 9). Surprisingly, this appeared to be the case for all rating scales. Nevertheless, there was substantial variance across the criteria, with scales such as Engagement appearing much more linear than others such as Anxiety or Expressiveness. Similar to the scale MFRM statistics, average fit statistics for the samples were all very close to 1.00, with standard deviations within the recommended cutoffs of .5 to 1.5, as seen in Table 5.6. Figure 5.5 Means and CIs of Speech Samples Raters The final reduced set of raters exhibited rating patterns that fit the Rasch model well but showed substantial disagreement and variability. Average fit parameters were close to 1, and standard deviations fell within the recommended boundaries of .5 and 1.5. Fit statistics are dependent on the sample of individuals measured, however, so when the previously misfitting raters were removed, the truncated dataset showed some newly misfitting raters. Four additional raters misfit the model, as can be seen in Figure I.1 in Appendix I, but because this was anticipated, these additional raters were not removed. Regarding estimates of rater severity, the raters were as a group severe, with a mean logit score of -0.38. Figure I.2 shows that the vast majority of raters fell within a one-logit range (-.50 to .50). There were, however, raters who were lenient to a large degree, up to -1.65 logits (rater 50). This resulted in a logit range of 2.17, and raters could be separated into 4.75 different severity levels overall.
Regarding consistency, the raters showed disagreement. Exact rater agreement was 25.7%, only slightly higher than the Rasch expected agreement of 25.3%. This resulted in a Rasch Kappa index of .001, which is a model-expected level of agreement according to Linacre (n.d.). I also computed ICCs as a measure of interrater reliability. The ICCs, shown in Table 5.7, indicate that the correlations for Fluency and Vocabulary were moderate, while the ICCs for Grammar and Comprehensibility were poor, especially for Grammar.

Table 5.7 Intraclass Correlation Coefficients
Outcome            ICC
Fluency            .58
Vocabulary         .54
Grammar            .37
Comprehensibility  .49

Data Structure To understand the relationships amongst the variables measured in this study, I ran an exploratory factor analysis to reduce the dimensionality of the scales. I first checked assumptions prior to conducting the factor analysis. The sample size of 83 raters with repeated measurements of 30 test takers resulted in 2,490 sets of observations, which is above the threshold suggested by Tabachnick and Fidell (2013). The correlations in Table 5.5 were above .30 but below .90, which suggests factorability and a lack of multicollinearity. I confirmed this by calculating the Kaiser measure of sampling adequacy, which was .95 overall, with all scales exceeding .93. Furthermore, I checked for multicollinearity by building multiple linear regression models with all affect variables as predictors and the four proficiency variables as outcomes. All of the variance inflation factors (VIF) were lower than 4, and all tolerance values (1/VIF) were higher than .25, which indicates a lack of collinearity issues. There was a linear relationship amongst all scales, as shown in Figure 5.6. Outliers were dealt with previously. Figure 5.6 Linear Relationships Amongst Variables A calculation of the eigenvalues resulted in only two values above 1, suggesting the presence of two factors under the Kaiser criterion. I used parallel analysis to investigate this further, as the Kaiser criterion is often a conservative estimate of the number of factors. Parallel analysis and the resulting scree plot in Figure 5.7 suggested four factors. Given that the dataset is ordinal, I used the polychoric correlation matrix for the factor analysis rather than a Pearson correlation matrix. I then extracted four factors using exploratory factor analysis with maximum likelihood estimation and a promax rotation. Figure 5.7 Scree Plot
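The assumption checks and factor-retention decision can be sketched as follows, again assuming the hypothetical `scales14` data frame; the VIF check via the car package is one common way to obtain the tolerance values and is an assumption here, not necessarily the study’s exact procedure.

library(psych)
library(car)

KMO(polychoric(scales14)$rho)  # Kaiser measure of sampling adequacy

# Parallel analysis on the polychoric matrix to decide the number of factors.
fa.parallel(scales14, cor = "poly", fm = "ml", fa = "fa")

# Tolerance = 1/VIF from a linear model with the affect scales as predictors.
m <- lm(fluency ~ engagement + anxiety + confidence + warmth + attention +
          expressiveness + happiness + competence + interactiveness + attitude,
        data = scales14)
1 / vif(m)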
The pattern matrix, shown in Table 5.8, indicates that communality values were quite strong, and all variables loaded on factors above the common cutoff value of .45 with no substantial cross-loadings.

Table 5.8 Pattern Matrix
Variable           Factor 1  Factor 2  Factor 3  Factor 4  Common  Unique
Fluency            .96       -.05      .01       .03       .91     .09
Vocabulary         .88       -.05      -.01      .06       .79     .21
Grammar            .83       -.02      -.06      .02       .63     .37
Comprehensibility  .84       .11       .00       -.08      .73     .27
Engagement         .00       .03       .92       .00       .88     .12
Anxiety            .17       .07       -.09      .60       .53     .47
Confidence         .02       -.04      .10       .94       .99     .01
Warmth             -.03      .94       -.02      -.03      .79     .21
Attention          .00       .02       .88       -.01      .79     .21
Expressiveness     .14       .62       .10       .03       .67     .33
Happiness          .01       .99       -.10      .00       .85     .15
Competence         .74       .01       .19       .00       .80     .20
Interactiveness    .14       .20       .54       .03       .69     .31
Attitude           -.08      .82       .13       .03       .79     .21
SumSq loadings       3.95    3.18      2.30      1.43
Proportion variance  .28     .23       .16       .10
Cumulative variance  .28     .51       .67       .78

The factor loadings indicated a possible interpretation of the factor structure. Factor 1 consisted of strong loadings of fluency, vocabulary, grammar, comprehensibility, and competence. These variables together formed a factor I called language. The second factor consisted of strong loadings of warmth, happiness, and attitude, and a medium loading of expressiveness. I called this factor positivity, as it appeared to relate mostly to positive affect (warmth, happiness) indicated through behavior (expressiveness). The third factor was composed of strong relationships to engagement and attention, with a medium relationship to interactiveness. I called this factor involvement, as these three scales all related to the relationship the test taker established with the rater during the test. Finally, the fourth factor comprised a strong relationship with confidence and a medium relationship with anxiety. I called this factor assuredness, which approximates the relationship between these two affective states. Correlations amongst the four factors, shown in Table 5.9, were quite strong, suggesting either meaningful relationships between these categories or an artefact of rater effects (e.g., a halo effect). The path diagram of the overall structure of the model is presented in Figure 5.8. Similar to path diagrams in confirmatory factor analysis, this model constrains the strongest factor loading to 1 to identify each factor, and for this reason the loadings and correlations look slightly different from Table 5.8.

Table 5.9 Correlations Amongst Factors
             Language  Positivity  Involvement
Positivity   .63
Involvement  .71       .76
Assuredness  .74       .63         .60

Figure 5.8 Path Diagram of EFA Structure

Relationships between affect and proficiency measures Because each affect-related factor correlated with language, I then investigated the relationship between each factor and the four language components of fluency, grammar, vocabulary, and comprehensibility. I did not investigate the relationships between these factors and competence, as there was no theoretically informed reason to do so, so this variable was not included in the analysis. The ICCs had suggested that there was substantial variance amongst raters, and any regression model would need to account for random effects at the second level (Tabachnick & Fidell, 2013). For this reason, I used ordinal mixed-effects regression. For interpretability, the factor scores were rescaled to span 1–7, for comparability with the 7-point scales the raters used.
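One way to perform that rescaling is a simple linear map of each column of factor scores onto the 1–7 interval; this is an illustrative sketch rather than the study’s exact transformation.

# Map each column of factor scores linearly onto [1, 7].
rescale17 <- function(x) 1 + 6 * (x - min(x)) / (max(x) - min(x))
f_scores17 <- apply(f_scores, 2, rescale17)  # f_scores from the EFA above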
Fluency Factors were entered into the model in order of their correlations with fluency, listed in Table 5.10. Assuredness had the strongest correlation with fluency, while positivity had the lowest. The relationships are illustrated in Figure 5.9 using regression lines to best approximate the relationships across various levels of the predictors. While fluency increased relatively monotonically with assuredness, the relationships with the other factors were not as strong. The four models, shown in Table 5.11, indicated that the best-fitting model was the one that included only assuredness, χ2(1) = 74.58, p < .001. This model fit significantly better than the model with random effects removed, χ2(2) = 518.05, p < .001, clm model AIC = 7458.10, clmm model AIC = 6944.10.

Table 5.10 Polychoric Correlations Between Factors and Fluency
Factor       Polychoric Correlation
Assuredness  .72 [.69, .74]
Involvement  .70 [.67, .72]
Positivity   .57 [.55, .60]

Figure 5.9 Relationships Between Factors and Fluency

Table 5.11 Model Comparisons for Fluency
Model       AIC      χ2     df  p
Null Model  7016.70
Model 1     6944.10  74.58  1   < .001
Model 2     6942.40  3.65   1   .06
Model 3     6939.20  5.25   1   .02
Note. Adjusted α = .0125.

The final model, shown in Table 5.12, included only assuredness as a fixed effect. Assuredness significantly predicted fluency, β = 3.09, p < .001. The odds ratio was 21.90, indicating that a one-point increase in assuredness multiplied the odds of a higher fluency score by 21.90. There was more variance in raters, .97, than in samples, .32, which is not entirely unexpected, as the raters were novice and untrained. Assuredness explained 20% of the variance in fluency scores, Nagelkerke’s Pseudo R2 = .20.

Table 5.12 Final Fluency Model
Coefficients  β     95% CI        SE   z      p       OR     95% CI
Assuredness   3.09  [2.74, 3.44]  .18  17.23  < .001  21.90  [15.42, 31.12]
Random effects
Groups   Variance  SD
Raters   0.97      0.98
Samples  0.32      0.57
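The reported odds ratios are simply the exponentiated logit coefficients, so they can be verified directly in R; the small discrepancy below reflects rounding of the printed coefficient.

# In a proportional odds model, exp(beta) gives the multiplicative change in
# the odds of a higher score per one-unit increase in the predictor.
exp(3.09)  # 21.98; reported as 21.90 because the coefficient is rounded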
Vocabulary Factors were entered into the model in order of their correlations with vocabulary, listed in Table 5.13; these were similar to, and identical in order with, the correlations with fluency. These relationships are illustrated in Figure 5.10; the regression graphs were very similar to those for fluency as well. The four models, shown in Table 5.14, indicated that the best-fitting model was the one with all three factors, χ2(1) = 7.61, p = .006. This model fit significantly better than the model with random effects removed, χ2(4) = 396.68, p < .001, clm model AIC = 7638.80, clmm model AIC = 7250.20.

Table 5.13 Polychoric Correlations Between Factors and Vocabulary
Factor       Polychoric Correlation
Assuredness  .69 [.67, .72]
Involvement  .66 [.63, .68]
Positivity   .53 [.50, .56]

Figure 5.10 Relationships Between Factors and Vocabulary

Table 5.14 Model Comparisons for Vocabulary
Model       AIC      χ2     df  p
Null Model  7325.30
Model 1     7255.50  71.84  1   < .001
Model 2     7255.80  1.71   1   .19
Model 3     7250.20  7.61   1   .006
Note. Adjusted α = .0125.

The final model, shown in Table 5.15, showed that all three factors predicted vocabulary. Unit changes in assuredness and involvement significantly and positively predicted vocabulary scores (assuredness, β = 1.98, p < .001; involvement, β = 1.90, p = .001), while changes in positivity had an inverse relationship with vocabulary scores (β = -0.96, p = .003). The odds ratios for assuredness and involvement indicated that one unit of change in each predictor multiplied the odds of a higher vocabulary score by between 6.67 and 7.27. The odds ratio for positivity was quite low at 0.38. There was less variance in raters in this model than in the fluency model, 0.65, and likewise in samples, 0.21. The three factors explained 15% of the variance in vocabulary scores, Nagelkerke’s Pseudo R2 = .15.

Table 5.15 Final Vocabulary Model
Coefficients  β      95% CI          SE    z      p       OR    95% CI
Assuredness   1.98   [1.24, 2.73]    0.38  5.22   < .001  7.27  [3.45, 15.31]
Involvement   1.90   [0.74, 3.05]    0.59  3.23   .001    6.67  [2.11, 21.12]
Positivity    -0.96  [-1.60, -0.32]  0.33  -2.94  .003    0.38  [0.20, 0.72]
Random effects
Groups   Variance  SD
Raters   0.65      0.80
Samples  0.21      0.45

Grammar Factors were entered into the model in order of their correlations with grammar, listed in Table 5.16; these were ordered the same as for fluency and vocabulary, but were somewhat weaker. These relationships are illustrated in Figure 5.11. The relationships were also similar to fluency and vocabulary, especially for assuredness, but the associations with involvement and positivity were weaker. The four models, shown in Table 5.17, indicated that, similar to fluency, the best-fitting model was the one with only assuredness as a predictor, χ2(1) = 59.51, p < .001. This model fit significantly better than the model with random effects removed, χ2(2) = 426.67, p < .001, clm model AIC = 8130.30, clmm model AIC = 7707.60.

Table 5.16 Polychoric Correlations Between Factors and Grammar
Factor       Polychoric Correlation
Assuredness  .55 [.52, .58]
Involvement  .52 [.49, .56]
Positivity   .43 [.39, .46]

Figure 5.11 Relationships Between Factors and Grammar

Table 5.17 Model Comparisons for Grammar
Model       AIC      χ2     df  p
Null Model  7765.20
Model 1     7707.60  59.51  1   < .001
Model 2     7708.30  1.31   1   .25
Model 3     7707.80  2.56   1   .11
Note. Adjusted α = .0125.

The final model, shown in Table 5.18, showed that only assuredness predicted grammar scores, β = 2.04, p < .001, indicating that a one-point increase in assuredness multiplied the odds of a higher grammar score by 7.67. The variance in this model was similar to the previous models, raters = 0.80, samples = 0.24. Assuredness explained 16% of the variance in grammar scores, Nagelkerke’s Pseudo R2 = .16.

Table 5.18 Final Grammar Model
Coefficients  β     95% CI        SE    z      p       OR    95% CI
Assuredness   2.04  [1.74, 2.34]  0.15  13.32  < .001  7.67  [5.69, 10.36]
Random effects
Groups   Variance  SD
Raters   0.80      0.89
Samples  0.24      0.49

Comprehensibility Factors were entered into the model in order of their correlations with comprehensibility, listed in Table 5.19; these were ordered differently from the previous models, with involvement coming first, followed by assuredness. The relationships between comprehensibility and the factors are illustrated in Figure 5.12 and were similar to those in the previous models. The four models, shown in Table 5.20, indicated that the best-fitting model was the one with only involvement as a predictor, χ2(1) = 48.08, p < .001. This model fit significantly better than the model with random effects removed, χ2(2) = 695.76, p < .001, clm model AIC = 7892.60, clmm model AIC = 7200.80.

Table 5.19 Polychoric Correlations Between Factors and Comprehensibility
Factor       Polychoric Correlation
Assuredness  .60 [.57, .64]
Involvement  .61 [.59, .64]
Positivity   .53 [.50, .56]

Figure 5.12 Relationships Between Factors and Comprehensibility

Table 5.20 Model Comparisons for Comprehensibility
Model       AIC      χ2     df  p
Null Model  7246.90
Model 1     7200.80  48.08  1   < .001
Model 2     7196.80  5.99   1   .014
Model 3     7198.60  0.22   1   .64
Note. Adjusted α = .0125.

The final model, in Table 5.21, showed that involvement significantly predicted comprehensibility scores, β = 2.53, p < .001. The variance in this model was larger than in the previous models, raters = 1.17, samples = 0.57. Involvement explained 25% of the variance in comprehensibility scores, Nagelkerke’s Pseudo R2 = .25.

Table 5.21 Final Comprehensibility Model
Coefficients  β     95% CI        SE    z      p       OR     95% CI
Involvement   2.53  [2.07, 2.99]  0.24  10.71  < .001  12.57  [7.90, 19.98]
Random effects
Groups   Variance  SD
Raters   1.17      1.08
Samples  0.57      0.73

Summary This chapter has considered two aspects of the dataset: structural integrity and relationships amongst the rated variables. I first described how the dataset was cleaned to ensure the ratings were as reliable as possible prior to analysis. Of the 99 raters, I removed 16 due to low reliability, misfit, multivariate outliers, or a combination of these issues. I then checked the integrity of the dataset prior to analysis using descriptive statistics and Rasch measurement. The scales and samples functioned appropriately, without misfit or erratic behavior.
The raters, as expected, showed limited consistency and agreement, yet despite their lack of training, they assigned scores that were reasonably consistent. I then considered the interrelationships amongst the 14 rated variables of interest to RQ1. I found that all variables correlated, and some relationships were stronger than others. Four factors emerged from the exploratory factor analysis: language, assuredness, involvement, and positivity. Using ordinal mixed-effects regression, these factor scores showed different relationships with the language proficiency outcomes. Assuredness alone was found to predict changes in fluency and grammar scores to a fairly high degree. Involvement, on the other hand, was the sole predictor for comprehensibility, which also showed a strong relationship. All three factor scores—assuredness, involvement, and positivity—predicted vocabulary, but only assuredness and involvement played a substantial role in predicting this outcome. Although rater effects such as the halo effect are undoubtedly part of the reason for these relationships, the results show that affect, in particular assuredness and involvement, may be tightly bound to language proficiency outcomes, while positive affect was less strongly related. These results describe relationships between variables observed by the raters in the study, which are subjective and prone to human error. In the next chapter, I turn to externally measured variables that are closely tied to nonverbal behavior. These measures were produced by a machine-learning computer vision algorithm, and they will lend insight into whether omnibus indices of behavior also relate to proficiency outcomes. CHAPTER 6: NONVERBAL BEHAVIOR AND LANGUAGE PROFICIENCY While the rating data explored in Chapter 5 reveal interesting trends and patterns, one major limitation is that all variables were observed by the raters themselves and are thus prone to halo effects across categories. Likewise, understanding nonverbal behavior through affect is rather indirect, as affect may be derived from verbal information as well. In this chapter, I will explore the effects of external measures of nonverbal behavior derived from iMotions Affectiva, an automated pattern recognition algorithm that uses computer vision to classify the probability of positive or negative valence, engagement (a measure of expressiveness), and attention (a measure of eye gaze and head turn towards the camera). The research questions guiding this chapter are as follows: RQ2.1: Do objectively measured indices of nonverbal behavior predict language proficiency scores? RQ2.2: Do nonverbal behaviors moderate scores differentially depending on the proficiency levels of test takers? In this chapter, I will first describe the data resulting from iMotions. I will present the distributions of each measure to illustrate differences and similarities amongst the speech samples, along with graphical information to illustrate these trends. Following this, I will report inferential analyses of the variables on the language proficiency measures. Description and distribution of variables Engagement I first calculated mean scores and standard deviations for each of engagement, valence, and attention from the iMotions output files for the 30 samples, listed in Table 6.1. As can be seen in this table and Figure 6.1, mean engagement ranged from 5.15 (Sample 29; an overall low probability of exhibiting facial expressions) to 62.18 (Sample 23; a relatively high probability of exhibiting facial expressions).
Figure 6.2 illustrates differences in engagement using anonymized, cartoonized images of Samples 23 and 29. Again, the actual video recordings of the test takers were used in the rating design, but for illustrative purposes only, I produced cartoonized images to protect the test takers’ identities.

Table 6.1 iMotions Means and SDs by Sample
        Engagement      Valence          Attention
Sample  M      SD       M       SD       M      SD
S01     15.50  24.08    -4.41   12.91    98.21  0.59
S02     11.09  20.54    -3.42   10.50    61.17  46.74
S03     7.21   17.19    -2.18   12.69    96.40  7.15
S04     50.50  39.55    33.38   44.54    97.93  2.59
S05     46.55  44.96    2.28    42.65    92.16  19.42
S06     11.60  24.60    -1.09   19.02    97.13  2.28
S07     22.44  31.42    -3.65   26.41    96.06  7.76
S08     42.16  41.61    -32.07  38.75    85.72  20.94
S09     11.45  18.09    -0.18   1.70     97.85  1.36
S10     19.71  28.87    -7.64   13.34    97.03  1.99
S11     7.95   16.00    -2.46   9.98     95.67  4.43
S12     19.24  28.09    2.21    16.26    97.10  1.24
S13     17.69  26.49    4.05    16.53    96.18  8.64
S14     56.84  37.36    40.22   39.26    96.01  6.76
S15     46.08  39.38    23.48   43.13    93.85  13.22
S16     10.15  19.14    -0.84   4.45     92.00  19.26
S17     19.77  25.30    -6.38   18.55    84.55  31.66
S18     25.16  37.91    13.88   33.55    93.60  9.68
S19     43.74  40.70    30.56   36.83    92.11  7.75
S20     16.61  28.21    0.24    29.32    60.24  45.85
S21     45.16  36.39    19.97   49.19    96.83  3.06
S22     23.71  33.39    10.16   28.72    97.15  1.46
S23     62.18  38.10    -1.33   11.46    96.86  3.10
S24     10.22  18.93    3.92    12.14    80.22  36.84
S25     39.31  40.20    12.31   32.16    80.33  22.40
S26     59.09  39.32    30.93   40.40    95.81  6.18
S27     19.13  31.44    6.71    27.73    74.82  38.17
S28     32.47  36.78    6.41    22.84    97.34  2.82
S29     5.15   11.78    -0.47   5.91     97.97  0.88
S30     30.28  30.91    0.19    17.45    93.78  20.17

Figure 6.1 Distribution of Engagement Means and SDs by Participant
Figure 6.2 Illustration of High and Low Engagement

The standard deviations of engagement were quite large. This suggests that the test takers in the speech samples varied substantially in their (probabilistic) display of facial expressions. Standard deviations ranged from a low of 11.78 (Sample 29) to 44.96 (Sample 5). Figure 6.3 visualizes these changes through time series graphs of the measurements of engagement during each speech sample. Here, the differences between Samples 23 and 29 are even more striking. Sample 23 retained a high probability of facial expressiveness, especially during the middle of the test. However, despite a few peaks, the detection of facial expressiveness in Sample 29 was rather flat. Mean engagement and its standard deviation did not, however, correlate strongly with baseline proficiency level (.10 and .01, respectively). Figure 6.3 Time Course Data for Engagement Note. Red bar indicates mean engagement across the entire sample. Valence The mean scores and distribution of valence were markedly distinct from engagement. Mean scores, which can range from -100 to 100, did not generally deviate substantially from 0. These scores can be seen in Table 6.1 and are visualized in Figure 6.4. The lowest mean valence score was -32.07 (Sample 8, SD = 38.75), while the highest was 40.22 (Sample 14, SD = 39.26). Anonymized, cartoonized images of these two samples are presented in Figure 6.5. Standard deviations also varied widely, with Sample 21 exhibiting the most variance in valence (M = 19.97, SD = 49.19) and others, such as Sample 9, showing the least (M = -0.18, SD = 1.70). These trends are also visible in the time series graphs in Figure 6.6. These data suggest that most test takers did not exhibit strong evidence of positive or negative emotions throughout the test, as can be seen in examples such as Sample 9.
“Flatline” time courses suggest a somewhat unchanging emotional appearance.

Figure 6.4 Distribution of Valence Means and SDs by Participant

Figure 6.5 Illustration of High and Low Valence

Figure 6.6 Time Course Data for Valence
Note. Red bar indicates mean valence across entire sample.

Attention

The distribution of attention scores also differed from those of engagement and valence. As seen in Table 6.1 and visualized in Figure 6.7, attention tended to be high across the samples, and variance tended to be low, with the exception of a small number of samples. The sample with the highest mean attention was Sample 1, with a mean probability of 98.21 (SD = 0.59). This test taker moved his head very little, especially side to side, and spent much of the time looking towards the camera. When gaze was broken, his head stayed fixed towards the camera. The sample with the lowest mean attention was Sample 20, with a mean probability of 60.24 (SD = 45.85). This test taker frequently turned her head to one side or the other, and also frequently broke gaze with the interlocutor. Figure 6.9 shows how attention changed throughout the speech samples. The test takers in Samples 20 and 24, for example, appeared to shift their attention frequently, while those in Samples 1 and 29 tended to hold their attention more steadily.

Figure 6.7 Distribution of Attention Means and SDs by Participant

Figure 6.8 Illustration of High and Low Attention

Figure 6.9 Time Course Data for Attention
Note. Red bar indicates mean attention across entire sample.

Correlations with affect scores

Finally, I calculated Pearson correlations between the three sets of iMotions variables and the factor scores from Chapter 5, shown in Table 6.2. These correlations showed that the iMotions variables were somewhat interrelated. Engagement and valence correlated at .54, a medium-sized correlation. This is logical, as showing positive valence depends on showing a certain amount of expressiveness. Engagement also correlated with attention at .20, and valence and attention correlated at .19. While related, these variables did appear to be measuring different aspects of performance. The means and standard deviations also correlated quite strongly. Mean engagement correlated with its standard deviation at .87, which indicates that more expressive individuals were also more likely to vary the intensity of their expressiveness throughout the sample, which is logical. Mean valence correlated with its standard deviation at .56, which may be interpreted similarly. Mean attention correlated with its standard deviation at -.94, which indicates that individuals with lower mean attention varied more in how they established and broke attention with the interlocutor, while individuals with higher mean attention were less likely to break their attention frequently.

Table 6.2
Pearson Correlations Amongst iMotions Variables and Affective Factors

                     1      2      3        4      5       6      7      8
1. Engagement (M)
2. Valence (M)       .54
3. Attention (M)     .20    .19
4. Engagement (SD)   .87    .42    .08
5. Valence (SD)      .74    .56   (-.001)   .85
6. Attention (SD)   -.19   -.22   -.94     -.08   (-.01)
7. Assuredness       .05    .11   -.02     -.02   -.13     .12
8. Involvement       .21    .24    .09      .18    .09    (.01)   .92
9. Positivity        .48    .39    .06      .43    .41    (.01)   .79    .91

Note. All correlations are significant except those in (parentheses).
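A matrix like Table 6.2 is straightforward to compute once the iMotions summaries and the Chapter 5 factor scores sit in one data frame. The sketch below uses simulated stand-in data and assumed column names, since the study's data are not reproduced here; scipy's pearsonr supplies the significance tests behind the parenthesized entries.

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Illustrative stand-in data: one row per speech sample, with the six iMotions
# summaries and the three factor scores. Column names are assumptions.
cols = ["eng_m", "val_m", "att_m", "eng_sd", "val_sd", "att_sd",
        "assuredness", "involvement", "positivity"]
rng = np.random.default_rng(0)
per_sample = pd.DataFrame(rng.normal(size=(30, len(cols))), columns=cols)

# Pairwise Pearson r with p-values, as reported in Table 6.2.
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        r, p = pearsonr(per_sample[a], per_sample[b])
        flag = "" if p < .05 else " (n.s.)"   # parenthesized entries in Table 6.2
        print(f"{a} x {b}: r = {r:.2f}{flag}")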
Regarding the correlations between the iMotions variables and the factor scores, there was some indication that mean engagement and positivity were measuring similar attributes, as they correlated at .48. Note, however, that expressiveness was one of the variables that factored into positivity. Mean valence, which would logically have correlated with positivity more, correlated somewhat less, at .39. Involvement had weak correlations with engagement (.21) and valence (.24), while the remaining correlations were quite negligible. The correlations with the standard deviations followed similar patterns.

Nonverbal behavior and facets of proficiency

Fluency

Correlations. Table 6.3 shows the polychoric correlations between each iMotions variable and fluency. The only significant correlation between the mean behavior scores and fluency was for valence, at .12, indicating a small positive relationship between positive valence and fluency level. The standard deviations of valence and attention also correlated with fluency, at -.11 and .11, respectively. These correlations indicate a negative relationship between variance in valence (e.g., alternating between positive and negative valence) and fluency, and a positive relationship between higher attentional variance and fluency. These correlations are illustrated in Figure 6.10 as regression lines. As in Chapter 5, the vertical lines in these figures represent the distribution of scores for each test taker.

Table 6.3
Polychoric Correlations with Fluency

Variable     Mean               SD
Engagement   .04 [0, .08]       -.01 [-.06, .03]
Valence      .12 [.08, .16]     -.11 [-.15, -.07]
Attention    -.02 [-.07, .03]   .11 [.07, .15]

Figure 6.10 Relationships Between Nonverbal Measures and Fluency

Regression of mean predictors. Factors were entered into the model based on the correlations with fluency listed in Table 6.3. The five models, shown in Table 6.4, indicated that the best fitting model was the interaction model. The interaction model, presented in Table 6.5, fit significantly better than the other models, χ²(4) = 50.05, p < .001. It also fit significantly better than the model with random effects removed, χ²(2) = 712, p < .001. The only significant predictor in this model was base proficiency (the scaled IELTS scores), which had a sizeable effect on fluency (β = 2.27, odds ratio = 9.66). The model, however, explained only minimal variance, Nagelkerke's pseudo R² = .02.

Table 6.4
Model Comparisons for Fluency (Means)

Model               AIC       χ²      df   p
Null Model          7016.70
Model 1             7018.10   0.56    1    .45
Model 2             7020.10   0.05    1    .83
Model 3             7021.90   0.12    1    .72
Interaction model   6980.00   50.05   4    < .001

Table 6.5
Interactions Between Base Proficiency and Mean Behavioral Indices on Fluency

Coefficients       β       95% CI            SE     z       p      OR     95% CI
Base proficiency   2.27    [0.76, 3.78]      0.77   2.95    .003   9.66   [2.14, 43.60]
Valence            0.04    [-0.03, 0.11]     0.04   1.05    .30    1.04   [0.97, 1.10]
Engagement         -0.04   [-0.10, 0.01]     0.03   -1.45   .15    0.96   [0.91, 1.00]
Attention          0.07    [-0.004, 0.14]    0.04   1.85    .06    1.07   [0.996, 1.20]
Val:Prof           -0.01   [-0.03, 0.007]    0.01   -1.25   .21    0.99   [0.97, 1.00]
Eng:Prof           0.01    [-0.003, 0.02]    0.01   1.53    .12    1.01   [0.997, 1.02]
Att:Prof           -0.02   [-0.03, 0.001]    0.02   -1.83   .07    0.98   [0.97, 1.00]

Random effects
Groups    Variance   SD
Raters    0.97       0.99
Samples   0.77       0.88

Note. Adjusted α = .0125.
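The models in this chapter were cumulative-link (ordinal logistic) regressions with crossed random effects for raters and samples, compared through likelihood-ratio tests. Python's statsmodels offers no mixed-effects ordinal model, so the sketch below drops the random effects and runs on simulated stand-in data; it illustrates only the general shape of the model building and the χ² comparison, not the study's actual code or estimates.

import numpy as np
import pandas as pd
from scipy.stats import chi2
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Simulated stand-in data (the real models also had crossed random effects
# for raters and samples, which statsmodels cannot fit for ordinal outcomes).
rng = np.random.default_rng(1)
n = 600
d = pd.DataFrame({
    "proficiency": rng.normal(size=n),   # scaled base-proficiency scores
    "valence": rng.normal(size=n),
    "engagement": rng.normal(size=n),
    "attention": rng.normal(size=n),
})
d["fluency"] = pd.Categorical(rng.integers(1, 8, size=n), ordered=True)
for v in ("valence", "engagement", "attention"):
    d[v + "_x_prof"] = d[v] * d["proficiency"]   # interaction terms

def fit(cols):
    # Cumulative-logit (proportional odds) model on the listed predictors.
    return OrderedModel(d["fluency"], d[cols], distr="logit").fit(
        method="bfgs", disp=False)

base = fit(["proficiency"])
inter = fit(["proficiency", "valence", "engagement", "attention",
             "valence_x_prof", "engagement_x_prof", "attention_x_prof"])

# Likelihood-ratio comparison of nested models, as in Table 6.4.
lr = 2 * (inter.llf - base.llf)
df_diff = inter.model.exog.shape[1] - base.model.exog.shape[1]
print(f"chi2({df_diff}) = {lr:.2f}, p = {chi2.sf(lr, df_diff):.3f}")
print(np.exp(inter.params["proficiency"]))   # odds ratio for one coefficient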
Regression of predictor standard deviations. Similarly, the predictor standard deviations were entered into this secondary model based on the absolute values of their correlations with fluency listed in Table 6.3. Because the absolute values for valence and attention were equivalent, I entered valence in the first model for comparability with the models of mean indices. The five models, including the interaction model shown in Table 6.6, indicated that the best fitting model was the interaction model, χ²(4) = 53.48, p < .001. This model also fit significantly better than the model with random effects removed, χ²(2) = 647, p < .001. This model, presented in Table 6.7, contrasted with the mean model in that one interaction term, attention with base proficiency, was significant, β = 0.02, p = .002, with a very small effect size (odds ratio = 1.02). The main effect of attention was also significant, β = -0.06, p = .01, but I will not interpret the main effect given the significant interaction term. This model explained minimal variance in the outcome, Nagelkerke's pseudo R² = .03.

Table 6.6
Model Comparisons for Fluency (SDs)

Model               AIC    χ²      df   p
Null Model          7017
Model 1             7018   0.79    1    .37
Model 2             7018   1.40    1    .24
Model 3             7019   1.38    1    .24
Interaction model   6974   53.48   4    < .001

Table 6.7
Interactions Between Base Proficiency and Behavioral Index SDs on Fluency

Predictors         β        95% CI           SE     z       p      OR     95% CI
Base proficiency   0.76     [0.07, 1.46]     0.35   2.15    .03    2.15   [1.07, 4.30]
Valence            -0.08    [-0.25, 0.08]    0.09   -0.95   .34    0.92   [0.78, 1.09]
Attention          -0.06    [-0.11, -0.01]   0.02   -2.45   .01    0.94   [0.89, 0.99]
Engagement         0.09     [-0.15, 0.33]    0.12   0.73    .46    1.09   [0.86, 1.38]
Val:Prof           0.01     [-0.03, 0.04]    0.01   0.35    .73    1.01   [0.97, 1.04]
Att:Prof           0.02     [0.01, 0.03]     0.01   3.04    .002   1.02   [1.01, 1.03]
Eng:Prof           -0.005   [-0.05, 0.04]    0.02   -0.20   .84    1.00   [0.95, 1.04]

Random effects
Groups    Variance   SD
Raters    0.97       0.99
Samples   0.61       0.78

Note. Adjusted α = .0125.

To explore the interaction, I dichotomized proficiency and reran the simple ordinal model using only the interaction terms (fluency~sd_Attention*proficiency), as detailed in the methods section. Significance tests of the differences in the coefficients between the two proficiency groups for all seven levels of the fluency score are presented in Table 6.8. The comparisons use the high proficiency group as the reference group. They show that the impact of attention on the proficiency groups was not uniform across score levels. For example, at a score level of 3, the standard deviation of attention had less of a positive effect on the higher group than on the lower group (β = -0.17). The direction of impact switched between scores of 4 and 5. At a score of 6, for example, the high group was more positively affected by the variance in attention than the lower group (β = 0.25). Figure 6.11 shows the probabilities of a particular score assignment according to the variance in attention. At lower levels of proficiency, greater variance in attention corresponded with lower probabilities of reaching a score of 6 to 7. At higher levels of proficiency, however, the effect was the opposite, with greater variance in attention leading to greater probabilities of scoring in higher brackets.
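The probability curves in Figure 6.11 follow directly from the cumulative-link formulation: each score category's probability is the difference between adjacent cumulative logistic probabilities. The sketch below illustrates that calculation with invented cutpoints and with group slopes loosely based on Table 6.7; none of these values are fitted output.

import numpy as np

def score_probs(xb, cutpoints):
    # Cumulative-logit model: P(Y <= k) = logistic(c_k - xb);
    # P(Y = k) is the difference of adjacent cumulative probabilities.
    cdf = 1.0 / (1.0 + np.exp(-(np.asarray(cutpoints, float) - xb)))
    cdf = np.concatenate(([0.0], cdf, [1.0]))
    return np.diff(cdf)

# Illustrative values only: six cutpoints for the 7-point scale, and a
# per-group slope for SD attention built from the main effect (-0.06)
# plus the interaction (+0.02) for the high-proficiency group.
cuts = [-3.0, -1.5, -0.5, 0.5, 1.5, 3.0]
slopes = {"low": -0.06, "high": -0.06 + 0.02}

for group, b in slopes.items():
    for sd_att in (5, 25, 45):          # variance in attention, iMotions scale
        p = score_probs(b * sd_att, cuts)
        print(group, sd_att, np.round(p, 2))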
Table 6.8
Post hoc Comparisons for Fluency with SD Attention

Score level   β       95% CI            SE      z        p        OR     95% CI
1             -0.08   [-0.09, -0.06]    0.01    -9.33    < .001   0.93   [0.91, 0.94]
2             -0.18   [-0.21, -0.16]    0.01    -17.87   < .001   0.83   [0.81, 0.85]
3             -0.17   [-0.19, -0.15]    0.01    -16.92   < .001   0.84   [0.83, 0.86]
4             -0.04   [-0.05, -0.03]    0.005   -8.60    < .001   0.96   [0.95, 0.97]
5             0.08    [0.06, 0.10]      0.01    9.57     < .001   1.08   [1.07, 1.10]
6             0.25    [0.23, 0.27]      0.01    22.33    < .001   1.29   [1.26, 1.31]
7             0.14    [0.12, 0.15]      0.01    15.82    < .001   1.15   [1.13, 1.17]

Note. Adjusted α = .00714 for 7 comparisons.

Figure 6.11 Probability of Fluency Score Given SD Attention

Finally, I illustrate the effects of this interaction on the dataset as a whole in Figure 6.12. Although the effect size was in reality quite small, it can be seen that more varied attention related to differential outcomes for the two proficiency groups.

Figure 6.12 Visualization of Impact of SD Attention on Fluency

Vocabulary

Correlations. Table 6.9 displays the polychoric correlations between each behavioral index and vocabulary, and Figure 6.13 visualizes these relationships. These relationships were identical in direction to those for fluency, but slightly different in size. The correlation between vocabulary and mean valence was slightly weaker (.08 rather than .12), while the correlation with the standard deviation of valence was slightly stronger (-.15 rather than -.11). The correlation with the standard deviation of attention was the same as for fluency (.11).

Table 6.9
Polychoric Correlations with Vocabulary

Variable     Mean               SD
Engagement   .02 [-.01, .06]    -.03 [-.07, .01]
Valence      .08 [.03, .12]     -.15 [-.19, -.11]
Attention    -.01 [-.06, .03]   .11 [.06, .15]

Figure 6.13 Relationships Between Nonverbal Measures and Vocabulary

Regression of mean predictors. Factors were entered into the model based on the correlations with vocabulary listed in Table 6.9. The results were nearly identical to those for fluency. The five-model comparison, shown in Table 6.10, indicated that the best fitting model was the interaction model. The interaction model, presented in Table 6.11, fit significantly better than the other models, χ²(4) = 52.44, p < .001. It also fit significantly better than the model with random effects removed, χ²(2) = 539.09, p < .001. As with fluency, the only significant predictor in this model was base proficiency, which had a sizeable effect on vocabulary (β = 1.90, odds ratio = 6.70). The model explained only minimal variance, Nagelkerke's pseudo R² = .02.

Table 6.10
Model Comparisons for Vocabulary (Means)

Model               AIC       χ²      df   p
Null Model          7325.30
Model 1             7327.00   0.33    1    .56
Model 2             7328.90   0.03    1    .86
Model 3             7330.90   0.02    1    .90
Interaction model   7286.60   52.44   4    < .001

Table 6.11
Interactions Between Base Proficiency and Mean Behavioral Indices on Vocabulary

Coefficients       β       95% CI            SE     z       p      OR     95% CI
Base proficiency   1.90    [0.56, 3.24]      0.68   2.78    .005   6.70   [1.75, 25.47]
Valence            0.04    [-0.03, 0.10]     0.03   1.08    .28    1.04   [0.97, 1.11]
Engagement         -0.05   [-0.10, 0.01]     0.03   -1.72   .09    0.96   [0.91, 1.01]
Attention          0.06    [-0.001, 0.13]    0.03   1.93    .05    1.07   [1.00, 1.14]
Val:Prof           -0.01   [-0.03, 0.005]    0.01   -1.45   .15    0.99   [0.97, 1.00]
Eng:Prof           0.01    [-0.001, 0.02]    0.01   1.83    .07    1.01   [1.00, 1.02]
Att:Prof           -0.01   [-0.03, 0.003]    0.01   -1.62   .11    0.99   [0.97, 1.00]

Random effects
Groups    Variance   SD
Raters    0.65       0.81
Samples   0.60       0.77

Note. Adjusted α = .0125.
Regression of predictor standard deviations. Similarly, the predictor standard deviations were entered into this secondary model based on the absolute values of their correlations with vocabulary listed in Table 6.9. The five models, shown in Table 6.12, indicated that the best fitting model was the interaction model, χ²(4) = 54.57, p < .001. This model, shown in Table 6.13, also fit significantly better than the model with random effects removed, χ²(2) = 647, p < .001. Similar to the fluency model, one interaction term, attention with base proficiency, was significant, β = 0.01, p = .005, with a very small effect size (odds ratio = 1.01). This model explained minimal variance in the outcome, Nagelkerke's pseudo R² = .02.

Table 6.12
Model Comparisons for Vocabulary (SDs)

Model               AIC       χ²      df   p
Null Model          7325.30
Model 1             7326.10   1.22    1    .27
Model 2             7327.40   0.65    1    .41
Model 3             7327.10   2.30    1    .13
Interaction model   7280.60   54.57   4    < .001

Table 6.13
Interactions Between Base Proficiency and Behavioral Index SDs on Vocabulary

Predictors         β        95% CI           SE     z       p      OR     95% CI
Base proficiency   0.69     [0.06, 1.31]     0.32   2.16    .03    1.99   [1.07, 3.70]
Valence            -0.08    [-0.23, 0.08]    0.08   -0.97   .33    0.93   [0.80, 1.08]
Attention          -0.06    [-0.10, -0.01]   0.02   -2.37   .02    0.95   [0.90, 0.99]
Engagement         0.07     [-0.18, 0.29]    0.11   0.69    .49    1.08   [0.87, 1.33]
Val:Prof           0.003    [-0.03, 0.03]    0.02   0.21    .83    1.00   [0.97, 1.03]
Att:Prof           0.01     [0.004, 0.02]    0.01   2.79    .005   1.01   [1.00, 1.02]
Eng:Prof           -0.001   [-0.04, 0.04]    0.02   -0.05   .96    1.00   [0.96, 1.04]

Random effects
Groups    Variance   SD
Raters    0.68       0.81
Samples   0.49       0.70

Note. Adjusted α = .0125.

As for fluency, I calculated the differences in the coefficients and odds ratios between the proficiency groups for each vocabulary score, presented in Table 6.14. The comparisons show nearly identical trends to the fluency model, except that at a score level of 4 there was no difference between the proficiency groups in how the standard deviation of attention impacted vocabulary. Figure 6.14 shows the probabilities of a particular score assignment according to the variance in attention, with trends analogous to fluency. This indicated that higher variance in attention, such as shifting gaze frequently, increased a less proficient individual's chance of receiving a 1 or 2 and lowered their chances of being awarded higher scores.

Table 6.14
Post hoc Comparisons for Vocabulary with SD Attention

Score level   β       95% CI            SE      z        p        OR     95% CI
1             -0.07   [-0.09, -0.06]    0.01    -9.24    < .001   0.93   [0.91, 0.94]
2             -0.21   [-0.23, -0.19]    0.01    -19.48   < .001   0.81   [0.79, 0.83]
3             -0.17   [-0.19, -0.15]    0.01    -15.96   < .001   0.85   [0.83, 0.86]
4             -0.01   [-0.02, -0.001]   0.004   -2.15    .03      0.99   [0.98, 1.00]
5             0.13    [0.11, 0.14]      0.01    15.32    < .001   1.13   [1.12, 1.15]
6             0.21    [0.19, 0.23]      0.01    20.32    < .001   1.23   [1.21, 1.26]
7             0.12    [0.11, 0.14]      0.01    14.85    < .001   1.13   [1.11, 1.15]

Note. Adjusted α = .00714 for 7 comparisons.

Figure 6.14 Probability of Vocabulary Score Given SD Attention

I illustrate the effects of this interaction on the dataset as a whole in Figure 6.15. As with fluency, it can be seen that more varied attention related to differential outcomes for the two proficiency groups.

Figure 6.15 Visualization of Impact of SD Attention on Vocabulary

Grammar

Correlations. Table 6.15 and Figure 6.16 again show similar trends between grammar and the behavioral indices. Here, mean valence had an even weaker correlation with grammar (.07, rather than .08 or .12), while the standard deviations of valence and attention correlated with grammar at magnitudes similar to the previous outcomes (-.10 and .10).
Table 6.15
Polychoric Correlations with Grammar

Variable     Mean               SD
Engagement   .04 [0, .09]       0 [-.05, .04]
Valence      .07 [.04, .11]     -.10 [-.14, -.06]
Attention    -.01 [-.06, .04]   .10 [.05, .15]

Figure 6.16 Relationships Between Nonverbal Measures and Grammar

Regression of mean predictors. The results for grammar were analogous to those for fluency and vocabulary. The five-model comparison, shown in Table 6.16, indicated that the best fitting model was the interaction model. The interaction model, presented in Table 6.17, fit significantly better than the other models, χ²(4) = 52.32, p < .001. It also fit significantly better than the model with random effects removed, χ²(2) = 460, p < .001. As with fluency and vocabulary, the only significant predictor in this model was base proficiency, which had a sizeable effect on grammar (β = 1.43, odds ratio = 4.19). Like the previous models, this one explained only minimal variance, Nagelkerke's pseudo R² = .02.

Table 6.16
Model Comparisons for Grammar (Means)

Model               AIC    χ²      df   p
Null Model          7765
Model 1             7767   0.38    1    .54
Model 2             7769   0.01    1    .93
Model 3             7771   0.05    1    .82
Interaction model   7726   52.32   4    < .001

Table 6.17
Interactions Between Base Proficiency and Mean Behavioral Indices on Grammar

Coefficients       β       95% CI              SE      z       p      OR     95% CI
Base proficiency   1.43    [0.44, 2.42]        0.51    2.83    .005   4.19   [1.55, 11.30]
Valence            0.03    [-0.01, 0.09]       0.02    1.49    .14    1.04   [0.99, 1.09]
Engagement         -0.02   [-0.06, 0.01]       0.02    -1.24   .21    0.98   [0.94, 1.01]
Attention          0.04    [-0.004, 0.09]      0.02    1.79    .07    1.05   [1.00, 1.10]
Val:Prof           -0.01   [-0.03, -0.0002]    0.01    -2.00   .05    0.99   [0.97, 1.00]
Eng:Prof           0.01    [-0.001, 0.02]      0.004   1.68    .09    1.01   [1.00, 1.02]
Att:Prof           -0.01   [-0.02, 0.001]      0.01    -1.62   .10    0.99   [0.98, 1.00]

Random effects
Groups    Variance   SD
Raters    0.80       0.90
Samples   0.31       0.56

Note. Adjusted α = .0125.

Regression of predictor standard deviations. The five models using standard deviations as predictors, shown in Table 6.18, indicated that the best fitting model was the interaction model, χ²(4) = 55.51, p < .001. This model, shown in Table 6.19, also fit significantly better than the model with random effects removed, χ²(2) = 425, p < .001. This model contrasted with the mean model in that the interaction of attention with base proficiency was significant, β = 0.01, p = .002, with a very small effect size (odds ratio = 1.01). This model explained minimal variance in the outcome, Nagelkerke's pseudo R² = .03.

Table 6.18
Model Comparisons for Grammar (SDs)

Model               AIC    χ²      df   p
Null Model          7765
Model 1             7766   0.85    1    .36
Model 2             7767   0.90    1    .34
Model 3             7767   2.98    1    .08
Interaction model   7719   55.51   4    < .001

Table 6.19
Interactions Between Base Proficiency and Behavioral Index SDs on Grammar

Predictors         β       95% CI           SE      z       p      OR     95% CI
Base proficiency   0.47    [0.02, 0.92]     0.23    2.07    .04    1.61   [1.03, 2.51]
Valence            -0.02   [-0.13, 0.09]    0.06    -0.43   .67    0.98   [0.87, 1.09]
Attention          -0.04   [-0.07, -0.01]   0.02    -2.40   .02    0.96   [0.93, 0.99]
Engagement         0.03    [-0.12, 0.19]    0.08    0.43    .67    1.03   [0.89, 1.21]
Val:Prof           -0.01   [-0.03, 0.02]    0.01    -0.46   .65    1.00   [0.97, 1.02]
Att:Prof           0.01    [0.004, 0.02]    0.004   3.09    .002   1.01   [1.004, 1.02]
Eng:Prof           0.01    [-0.02, 0.03]    0.01    0.38    .71    1.01   [0.98, 1.03]

Random effects
Groups    Variance   SD
Raters    0.80       0.89
Samples   0.23       0.48

Note. Adjusted α = .0125.

As in the previous analyses, I calculated the differences in the coefficients and odds ratios between the proficiency groups for each grammar score, presented in Table 6.20.
The comparisons were analogous for grammar, with no difference in effect at a score of 4. Figure 6.17 shows the probabilities of a particular score assignment according to the variance in attention, which were effectively the same as for fluency and vocabulary. A higher variance in attention increased a less proficient individual's chance of receiving a 1 or 2 on grammar, while lowering their chances of receiving a 5, 6, or 7.

Table 6.20
Post hoc Comparisons for Grammar with SD Attention

Score level   β       95% CI            SE      z        p        OR     95% CI
1             -0.06   [-0.07, -0.04]    0.01    -8.64    < .001   0.94   [0.93, 0.96]
2             -0.18   [-0.20, -0.16]    0.01    -17.19   < .001   0.83   [0.82, 0.85]
3             -0.15   [-0.17, -0.13]    0.01    -14.57   < .001   0.86   [0.85, 0.88]
4             0.002   [-0.01, 0.01]     0.004   0.37     .70      1.00   [0.99, 1.01]
5             0.13    [0.11, 0.14]      0.01    15.43    < .001   1.14   [1.12, 1.16]
6             0.19    [0.17, 0.21]      0.01    18.14    < .001   1.20   [1.18, 1.23]
7             0.07    [0.06, 0.08]      0.01    11.28    < .001   1.07   [1.06, 1.08]

Note. Adjusted α = .00714 for 7 comparisons.

Figure 6.17 Probability of Grammar Score Given SD Attention

I illustrate the effects of this interaction on the dataset as a whole in Figure 6.18. As with fluency and vocabulary, it can be seen that more varied attention related to differential outcomes for the two proficiency groups, though the impact appears somewhat smaller for grammar.

Figure 6.18 Visualization of Impact of SD Attention on Grammar

Comprehensibility

Correlations. Table 6.21 and Figure 6.19 illustrate the relationships between the iMotions means and standard deviations and comprehensibility. Here the trends differed slightly from fluency, vocabulary, and grammar. In addition to a stronger correlation with mean valence (.18, rather than .12, .08, or .07), mean engagement was also a significant positive correlate, at .11. This suggests a positive relationship between both overall expressiveness and positivity and comprehensibility. Correlations with the standard deviations also diverged from fluency, vocabulary, and grammar. Here, valence was not a significant correlate, but engagement and attention were, correlating positively at .06 and .11, respectively. These findings indicate a positive but small relationship between more varied overall expressiveness and attention and comprehensibility.

Table 6.21
Polychoric Correlations with Comprehensibility

Variable     Mean               SD
Engagement   .11 [.08, .15]     .06 [.02, .10]
Valence      .18 [.14, .21]     -.03 [-.07, .01]
Attention    -.03 [-.08, .02]   .11 [.07, .15]

Figure 6.19 Relationships Between Nonverbal Measures and Comprehensibility

Regression of mean predictors. Factors were entered into the model based on the correlations with comprehensibility listed in Table 6.21. The five-model comparison, shown in Table 6.22, indicated that the best fitting model was the interaction model. The interaction model, presented in Table 6.23, fit significantly better than the other models, χ²(4) = 48.00, p < .001. It also fit significantly better than the model with random effects removed, χ²(2) = 697.42, p < .001. This model of mean indices varied substantially from the previous three. Here, the interaction between mean valence and base proficiency was significant (β = -0.02, odds ratio = 0.98), a small effect size, and the main effects did not reach significance. Similar to the other models, this model explained minimal variance in the score outcome, Nagelkerke's pseudo R² = .02.
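Nagelkerke's pseudo R², reported for each model in this chapter, is the Cox-Snell statistic rescaled so that its maximum possible value is 1. The following small function applies the standard formula to a model's null and fitted log-likelihoods (a generic sketch, not code from the study):

import numpy as np

def nagelkerke_r2(ll_null, ll_full, n):
    # Cox-Snell R^2 = 1 - exp((2/n) * (ll_null - ll_full)); Nagelkerke
    # rescales it by its maximum possible value, 1 - exp((2/n) * ll_null).
    cox_snell = 1.0 - np.exp((2.0 / n) * (ll_null - ll_full))
    return cox_snell / (1.0 - np.exp((2.0 / n) * ll_null))

# Illustrative log-likelihood values (not from the study):
print(round(nagelkerke_r2(-3508.4, -3483.4, 3000), 3))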
Table 6.22
Model Comparisons for Comprehensibility (Means)

Model               AIC       χ²      df   p
Null Model          7246.90
Model 1             7247.50   1.41    1    .23
Model 2             7249.40   0.05    1    .82
Model 3             7251.00   0.36    1    .55
Interaction model   7211.00   48.00   4    < .001

Table 6.23
Interactions Between Base Proficiency and Mean Behavioral Indices on Comprehensibility

Coefficients       β       95% CI            SE     z       p     OR     95% CI
Base proficiency   1.48    [0.21, 2.75]      0.65   2.28    .02   4.38   [1.23, 15.59]
Valence            0.08    [0.01, 0.14]      0.03   2.42    .02   1.08   [1.01, 1.15]
Engagement         -0.01   [-0.06, 0.03]     0.02   -0.56   .57   0.99   [0.94, 1.03]
Attention          0.03    [-0.03, 0.09]     0.03   0.87    .38   1.03   [0.97, 1.09]
Val:Prof           -0.02   [-0.04, -0.01]    0.01   -2.57   .01   0.98   [0.92, 0.99]
Eng:Prof           0.01    [-0.004, 0.02]    0.01   1.18    .24   1.01   [1.00, 1.02]
Att:Prof           -0.01   [-0.02, 0.01]     0.01   -1.09   .28   0.99   [0.98, 1.01]

Random effects
Groups    Variance   SD
Raters    1.17       1.08
Samples   0.53       0.73

Note. Adjusted α = .0125.

I calculated the differences in the coefficients and odds ratios between the proficiency groups for each comprehensibility score, shown in Table 6.24. For comprehensibility, the point at which the direction of impact switched was much higher, between scores of 5 and 6. Figure 6.20 shows the probabilities of a particular score assignment according to the mean valence value. Here the trends were quite different. For less proficient speakers, an increase in mean valence, or the overall positivity of a person's expressions, resulted in lower probabilities of being assigned a 1–3 on comprehensibility and a higher probability of receiving a 5–7. For more proficient speakers, higher valence corresponded with a greater probability of score assignments between 1–5 and, surprisingly, a lower probability of scoring a 7.

Table 6.24
Post hoc Comparisons for Comprehensibility with Mean Valence

Score level   β       95% CI            SE      z        p        OR     95% CI
1             -0.03   [-0.04, -0.02]    0.004   -6.90    < .001   0.97   [0.96, 0.98]
2             -0.13   [-0.15, -0.12]    0.01    -14.64   < .001   0.88   [0.86, 0.89]
3             -0.17   [-0.19, -0.15]    0.01    -17.71   < .001   0.84   [0.82, 0.86]
4             -0.07   [-0.08, -0.06]    0.01    -12.68   < .001   0.93   [0.92, 0.94]
5             -0.03   [-0.04, -0.01]    0.01    -3.30    < .001   0.97   [0.96, 0.99]
6             0.19    [0.17, 0.21]      0.01    18.17    < .001   1.21   [1.19, 1.24]
7             0.24    [0.21, 0.26]      0.01    20.02    < .001   1.27   [1.24, 1.30]

Note. Adjusted α = .00714 for 7 comparisons.

Figure 6.20 Probability of Comprehensibility Score Given Mean Valence

The effects of this interaction on the dataset as a whole are illustrated in Figure 6.21. The differential effects are much more apparent here, with positive valence corresponding to greater comprehensibility scores in less proficient speakers and with lower comprehensibility scores in the more proficient group. It must be said, however, that the range of mean valence was much more restricted in the more proficient group.

Figure 6.21 Visualization of Impact of Mean Valence on Comprehensibility

Regression of predictor standard deviations. The predictor standard deviations were also entered into a secondary model based on the absolute values of their correlations with comprehensibility listed in Table 6.21, with the order being attention, engagement, and then valence. The five models, shown in Table 6.25, indicated that the best fitting model was the interaction model, χ²(4) = 46.25, p < .001. This model, shown in Table 6.26, also fit significantly better than the model with random effects removed, χ²(2) = 692.65, p < .001. As opposed to the previous models of behavioral standard deviations, none of the predictors in this model were significant. The model explained minimal variance in the outcome, Nagelkerke's pseudo R² = .02.
Table 6.25
Model Comparisons for Comprehensibility (SDs)

Model               AIC       χ²      df   p
Null Model          7246.90
Model 1             7247.60   1.25    1    .26
Model 2             7249.30   0.27    1    .60
Model 3             7248.80   2.52    1    .11
Interaction model   7210.60   46.25   4    < .001

Table 6.26
Interactions Between Base Proficiency and Behavioral Index SDs on Comprehensibility

Predictors         β        95% CI           SE     z       p     OR     95% CI
Base proficiency   0.63     [-0.02, 1.27]    0.33   1.90    .06   1.87   [0.98, 3.58]
Attention          -0.03    [-0.08, 0.02]    0.02   -1.27   .20   0.97   [0.93, 1.02]
Engagement         0.02     [-0.20, 0.24]    0.11   0.17    .86   1.02   [0.82, 1.27]
Valence            0.005    [-0.15, 0.16]    0.08   0.06    .95   1.00   [0.86, 1.18]
Att:Prof           0.01     [<.001, 0.02]    0.01   1.98    .05   1.01   [1.00, 1.02]
Eng:Prof           0.01     [-0.03, 0.05]    0.02   0.44    .66   1.01   [0.97, 1.05]
Val:Prof           -0.001   [-0.04, 0.02]    0.02   -0.62   .54   0.99   [0.96, 1.02]

Random effects
Groups    Variance   SD
Raters    1.17       1.08
Samples   0.53       0.73

Note. Adjusted α = .0125.

Summary

The analysis in this chapter utilized cutting-edge pattern recognition software—iMotions, running Affectiva—to analyze human behavior in an online speaking test. At the time of writing, this study is only the second of its kind to use automated facial expression analysis in a study of nonverbal behavior in applied linguistics (after Chong & Aryadoust, 2023). The software extracted behavioral indices of engagement (overall expressiveness), valence, and attention from the 30 video samples. These objectively derived indices, as opposed to the subjective ratings of affect in Chapter 5, were used as predictors in models to determine their overall impact on language proficiency scores. The indices were calculated as mean overall values of each behavior over the course of the speaking test, which were hypothesized in the preregistration of this study to impact language proficiency scores. As an additional exploratory analysis, I also ran models with the standard deviations of the predictors as a measure of how much each behavior varied during the test sample.

The study showed that engagement (as measured by facial expressiveness), valence, and attention do not have an impact on language scores unless base proficiency is taken into account. Base proficiency, which consisted of the scaled IELTS scores, interacted with these measures to produce differential outcomes. Greater positive valence, or the overall positivity of a test taker's expressions, related to less proficient speakers being perceived as more comprehensible, while having a negligible relationship for more proficient speakers. On the other hand, more varied attentional focus, which was measured through head and gaze turns away from the examiner/camera, corresponded with lower fluency, vocabulary, and grammar scores in less proficient speakers, and somewhat higher scores in more proficient speakers.

In the next chapter, I turn to the raters themselves. While the first two studies extrapolated about the effects of various metrics on proficiency outcomes, Chapter 7 considers what the raters were thinking as they assigned scores to the speaking test samples. This source of data is critical to a fuller understanding of the phenomena at play, as rater reports may offer insight that triangulates and thus supports findings from the quantitative analyses.

CHAPTER 7: NONVERBAL BEHAVIOR AND RATER COGNITION

The past two chapters have explored trends within the variables observed by the raters and between the externally measured iMotions data and the language scores.
While the quantitative analysis has revealed interesting trends, developing a greater understanding requires introspection in the form of rater reports. In this chapter, I describe the results of the stimulated verbal recall I conducted in order to triangulate the findings of the quantitative analyses. I conducted recalls with 20 of the undergraduate, untrained participants from the main study. These raters provided reasons for their judgements on 10 files. The files were selected using Rasch analysis, with which I determined which files had shifted the most in relative ordering compared with the original test scores. I hypothesized that these files, where the lay raters disagreed the most with the trained raters, would be impacted to a greater degree by criteria outside the language test construct. The research questions guiding this chapter are as follows:

RQ3.1: Which nonverbal behaviors are most salient and informative to raters when scoring?
RQ3.2: How do raters understand language proficiency in light of nonverbal behavior?

In this chapter, I will first describe general trends in the stimulated recall data relating to the comments raters made as a whole. I will then provide descriptive data on the frequency of occurrence of the nonverbal behaviors that the raters observed. Following this, I will investigate deeper patterns in the dataset relating to how raters used nonverbal behavior to formulate judgements related to language proficiency.

General rating trends

The general aim of the stimulated verbal recall in this study was to elicit memories from participants about their thought processes during the online sample rating. At the same time, an undisclosed interest of the recall was to understand how raters used nonverbal behavior in their judgements, of which the raters were not aware. Because the rating scale used in this study contained elements relating to language followed by affect, post-hoc evidence of the validity of the procedure should show that raters focused primarily on language but also on affect. While raters focusing primarily on nonverbal behavior would not necessarily be evidence of their awareness of the questions guiding the study, any such cases should be inspected carefully. In this section, I provide validation evidence for the recall while highlighting overarching trends in the raters' comments.

The raters produced comments that resulted in 4,184 coding decisions. The total count of each coding category, presented in Table 7.1, provides evidence of where raters directed their attention. Raters commented the most on language features (38%), followed by affect (31%), aspects of the test interaction (20%), and finally nonverbal behavior (11%). These results indicate that the participants indeed provided the most commentary on elements related to the rating scales, as intended by the procedure and indicated in the instructions. However, raters also made a sizable number of comments about the test interaction, in particular concerning the content of the test takers' discourse (e.g., ideas mentioned, the topic, truthfulness, amount and breadth, etc.). For example, when discussing sample 21, rater 2 made comments about the content of speech and overall comprehensibility:

She seems like she's having a really bad time trying to come up with an answer to his question. She says that people can do the same thing. That's it's pretty vague and nonspecific. It's hard to tell what she may mean by what she's saying. So some ideas are there, but they're not very descriptive.
So they can't be easily understood. (Content, comprehensibility)

Nonverbal behavior occupied the smallest number of coded comments. This suggests that raters were not, as a whole, previously exposed to or aware of the research questions. While 11% is a relatively small percentage in comparison, it aligns with findings from Sato and McNamara (2019), in which lay raters offered recall on their rating decisions of communicative ability. In that study, comments on CET-SET language features comprised 36.7% of the total, content 15%, and nonverbal behavior 9.4%. Affect, coded as composure/attitude, made up only 5.7% of the comments, but the discrepancy with the present findings is almost certainly due to the inclusion of affect on the rating scales used here. Similarly, in May (2011), (trained) raters' comments on features of nonverbal behavior relevant to interactional competence reached a similar figure of 12%. Based on these results, the findings in this study align closely with previous work, which provides evidence of the veridicality of the nonverbal observations within the stimulated recall data.

Table 7.1
Code Counts for Stimulated Recall Data

Code                      Count   %*
Affect                    1,278   31
  Anxiety                 215     17
  Confidence              183     14
  Happiness               126     10
  Competence              113     9
  Engagement              109     9
  Expressiveness          121     9
  Warmth                  96      8
  Attentiveness           92      7
  Attitude                93      7
  Interactiveness         66      5
  Desire to Communicate   37      3
  Humor                   27      2
Nonverbal Behavior        477     11
  Gaze                    115     25
  Mouth                   96      21
  Paralinguistics         75      16
  Face (General)          53      11
  Posture                 52      11
  Gesture                 35      7
  Head                    30      6
  Eyebrows                11      2
  Body (General)          10      2
Language                  1,576   38
  Fluency                 308     19
  Comprehension           244     15
  Comprehensibility       228     14
  Vocabulary              294     14
  Grammar                 224     13
  Overall Ability         145     9
  Pronunciation           101     5
  Organization            32      2
Test Interaction          853     20
  Content                 368     43
  Thinking                150     18
  Turn-taking             99      12
  Examiner                67      8
  Repair                  65      8
  Relevance-Contingence   57      7
  Active Listening        35      4
  Visual Artifacts        12      1

*Note. Percentages for the four grouping codes are percentages of the total 4,184 comments; percentages for each subcode are percentages of the total for its grouping code.

In this dataset, all raters made comments in these four broad categories, albeit to different degrees. Table 7.2 shows the distribution of these comments as percentages of each rater's total number of comments. Raters made an average of 124.6 coded comments each (SD = 39.01), ranging from as few as 87 to as many as 224. All but four raters focused on language the most, with language comprising 29–66% of their comments. Raters 03, 08, 14, and 20 deviated from these trends somewhat, focusing more on test interaction (rater 03; 36%) or affect (raters 08, 14, and 20; 31–34%). No rater focused the most on nonverbal behavior, and there were no apparent patterns attributable to gender in how raters commented on the speech samples. Raters were idiosyncratic, however, in how much they commented on nonverbal behavior, with such comments ranging from as little as 5% to as much as 24% of a rater's total. Overall, despite these differences, raters appeared to adopt a similar focus across their stimulated recalls.
Table 7.2
Percentages of Coded Comments by Raters

SVR Rater   Raw total   Language   Affect   Test Interaction   Nonverbal Behavior
01          101         66         5        24                 5
02          95          39         18       37                 6
03          84          27         29       36                 8
04          88          36         18       30                 16
05          142         42         20       29                 10
06          133         39         31       20                 10
07          224         31         29       24                 17
08          222         30         31       22                 17
09          87          44         21       29                 7
10          107         31         27       18                 24
11          129         37         31       21                 11
12          110         34         34       20                 13
13          120         38         23       34                 5
14          136         28         32       18                 22
15          97          38         22       28                 12
16          152         33         26       24                 18
17          93          33         30       18                 18
18          115         43         23       23                 10
19          131         29         27       26                 18
20          125         32         34       22                 12

Note. The highest of the four coding categories per rater appeared in bold in the original table.

Salience of nonverbal behavior

As documented in the previous section, all raters indeed found nonverbal behavior salient to their rating processes. The features the raters noticed, however, differed substantially. Table 7.3 lists the nonverbal behaviors raters commented on throughout the stimulated recalls, categorized by area of the body and type of behavior. In each set, the percentage for the categorical code is a percentage of the total number of behaviors observed, 477; the percentages for the subcodes of each specific behavior pertain only to that grouping. For example, comments on gaze made up 24% of the 477 comments on nonverbal behavior, while shifting gaze made up 42% of the 115 comments on gaze. Importantly, this table also includes the extensiveness of the comments, representing the number of raters who commented on each area. Thus, for example, 19 raters made comments on some aspect of gaze, while 16 raters commented specifically on shifting gaze.

Table 7.3
Coding Counts of Nonverbal Behavior

Code                          Count   %    Ext
Gaze                          115     24   19
  Shifting                    50      42   16
  Mutual                      31      26   10
  Averted                     28      24   8
  Staring                     6       5    4
  Eyes grow wide              2       2    2
  Blinking                    1       1    1
  Unfocused                   1       1    1
Mouth                         96      20   20
  (Genuine) smile             64      63   20
  Lack of smile               12      12   4
  Lip movements               9       9    6
  Nervous smile               6       6    4
  Swallowing                  6       6    4
  Mouth barely open           2       2    2
  Frowning                    2       2    1
Paralinguistics               75      16   19
  Production speed            88      49   19
  Filled pauses               35      19   14
  Laughing                    27      15   15
  Tone-prosody                23      13   8
  Audible breathing           5       3    2
  Backchanneling              1       1    1
  Volume                      1       1    1
General face behaviors        53      11   17
Posture                       52      11   17
  Rocking/Shaking             14      24   10
  Leaning forward             12      20   7
  Moving around               11      19   5
  Rigid/Straight posture      9       15   7
  Leaning back/Slouching      6       10   5
  Adjusting posture           4       7    4
  Shoulder movements          3       5    2
Gesture                       35      7    13
  Self-adaptors               15      43   9
  Representational gestures   9       26   5
  Lack of hand movement       7       20   4
  Random movement             4       11   4
Head                          30      6    13
  Nodding                     23      74   11
  Turns                       8       26   5
Eyebrows                      11      2    6
  Movement                    6       55   3
  Furrowed                    3       27   2
  Raised                      2       18   2
General body language         10      2    6

Note. "Ext" refers to the extensiveness of appearance: the number of raters who mentioned the feature.

Table 7.3 shows several patterns relating to what raters discussed when observing test takers. Gaze and eye behaviors were the most common behavior topic in this dataset, at 24% of the comments. Nearly all raters commented on this behavior, likely due to its visual salience in the online space, where generally only the face is seen (Batty, 2021). Closely following gaze were mouth behaviors, making up 20% of the comments and being commented on by all 20 raters. This is also anticipated, as the mouth serves to produce language, and individuals often look at the mouth to enhance comprehension (in L1 speakers, Krason et al., 2022; and L2 speakers, Batty, 2021; Hardison, 2018; Hardison & Pennington, 2021; Ockey, 2007; Worster et al., 2018) and to interpret affect (Coniam, 2001).
Paralinguistic features, which I defined quite broadly as sounds or features of speech not explicitly associated with verbalizations, made up 16% of the comments on nonverbal behavior and were observed by nearly all raters. Similarly, non-specific references to the face were also quite extensive, being observed by 17 raters and making up 11% of the comments. Posture (11% of comments, 17 raters), gesture (7% of comments, 13 raters), and head observations (6% of comments, 13 raters) were also noted by more than half of the raters. This is somewhat surprising for gesture and posture, as each is sometimes harder to see in the online format. Finally, eyebrow movements and non-specific references to body language each comprised 2% of the dataset and were mentioned by fewer than half of the raters. Thus, the most salient behaviors related to either the eyes, the mouth, or sounds from the mouth, which together made up over half of all observations. Comments about inexpressiveness or the lack of particular behaviors were present, but not extremely common.

In this dataset, however, comments about nonverbal behavior were almost always accompanied by evaluations related to affect. There is precedent for this, as Sato and McNamara (2019) noted that in their study "[t]he speakers' [nonverbal behavior] and composure/attitude were intertwined, since confidence was judged primarily through observed [nonverbal behavior]" (p. 908). It is then worthwhile, when describing behaviors in the dataset, to highlight where each behavior intersected with judgements of affect. These intersections are available in Appendix J. Table J.1 displays intersection percentages across behaviors; that is to say, for each behavior it indicates the percentage of intersections with each of the 11 affect judgements. For example, a value of 33% for anxiety with body language indicates that, of the 19 comments linking body language and affect, 33% concerned anxiety or relaxedness/ease. The only other large intersection in this row is with confidence, at 28%. Table J.2 considers the same data within each affect category; that is to say, it indicates which comments about behaviors coincided the most with each individual affect judgement. For anxiety, for example, shifting gaze coincided with anxiety in 13% of all behavioral comments, out of a total of 217 intersections. It is important to note that a coincidence with anxiety may indicate either valence, anxious or at ease, and likewise for all other affect judgements. Thus, it is critical to consider the comments themselves when extrapolating about whether evaluations were positive or negative.

Gaze

Gaze was the most frequently mentioned behavior in this dataset, making up a quarter of the comments and being observed by 19 of the 20 raters. There was an average of 5.75 comments about gaze per rater in the pool of 19, with a standard deviation of 5.23. Stimulated recall participant-raters observed seven types of gaze behaviors, but three made up the bulk of all comments: shifting gaze, mutual gaze, and averted gaze. Less frequently mentioned were staring, eyes growing wide, blinking, and unfocused gaze. Shifting gaze, described as looking around or eyes darting all over, was the most salient gaze phenomenon, making up 42% of the gaze comments and being observed by 16 raters. For example, rater 17 remembered aspects of sample 16's performance almost immediately:

Um, I remember him kind of looking around a lot.
And I thought of that as like, you know, he's trying to remember things, you know. And he was stuttering a little bit. So I remember thinking he was anxious. (Shifting gaze, anxiety, fluency, thinking)

Figure 7.1 Annotation Density Plot for Sample 16

Figure 7.1 shows the annotation density plot for sample 16. The tiers labeled Examiner and Participant show units of sustained speech, while the Gaze tier shows all the moments the test taker was looking away. The plot shows that the test taker frequently shifted between looking away and looking at the examiner. The rater found the test taker's shifting gaze to signal that the test taker was searching for information, and its coincidence with the fluency pattern of stuttering (likely referring to false starts or repair) led to an overall negative view of the test taker. In fact, shifting gaze was almost always considered negatively: it coincided with anxiety in 36% of their co-occurrences and with (a lack of) confidence in 32%. This type of eye movement, however, differed markedly from mutual or averted gaze.

Mutual gaze, maintaining eye contact with the unseen examiner in the video, was also mentioned frequently, at 26% of the gaze comments. It was generally seen positively, often coinciding with mentions of engagement (52%), attention (35%), and confidence (19%). For example, when describing sample 21, rater 16 said:

Okay. So here, just like in the first 10… like seconds, or whatever this is, she's like clearly focused in on what he's saying. Like her eye contact is dead on the screen. And she's also saying like, "yes", like after he says something. So it clearly shows that she is an active listener, and she is like engaging herself in the conversation. (Mutual gaze, engagement, comprehension, turn-taking, active listening)

Figure 7.2 Sample 21's Gaze Patterns During Repair Sequence

Here, even though this test taker struggled throughout the sample to understand test questions, the rater's view of the test taker was generally positive because the test taker engaged with the examiner by maintaining eye gaze and actively backchanneling. This is visible in Figure 7.2, where the test taker averted her gaze three times while asking for a clarification and at the beginning of the examiner's repair sequence. This was followed by 16 seconds of unbroken mutual gaze (mutual gaze is identified by unannotated gaze segments) accompanied by verbal backchanneling ("yeah"), which was only averted once comprehension was resolved ("ahh"). In other words, mutual gaze, along with the test taker's verbal interactional moves, contributed to the rater's evaluation of the test taker's listening skills and engagement in the test, despite ongoing problems with comprehension.

Averted gaze, maintaining gaze away from the examiner but without shifting the gaze location, made up the third most frequently mentioned gaze behavior, at 24%. Depending on the context, however, it was viewed differently. When seen as an act of processing the test content and preparing an answer, it often coincided with engagement, being mentioned as a sign of thinking about the question. For example, rater 4 described sample 17 as:

Really engaged, being close. She's heavily thinking about the question, and she understands it usually pretty quickly. Like it doesn't take her too long to respond. Even though she's looking off to the side, she's still like, she's really engaged and thinking about the question. And like I've heard...
hers felt more natural when she was talking, like she really understood him. (Averted gaze, engagement, thinking, comprehension, turn-taking, speed, posture, content)

Figure 7.3 Sample 17's Averted Gaze Upon Comprehension

The test taker's gaze patterns the rater mentions are visible in Figure 7.3. Before the question ended, the test taker showed her comprehension of the question by averting her gaze, followed by two filled pauses before she began her response. During this entire sequence, she maintained averted gaze, thus signaling that she was thinking and preparing her response. She reestablished mutual gaze when she arrived at the direct answer to the question ("Korea"), indicating to the examiner the importance of this key word. In this example, however, it is likely the co-occurrence of the head turns and the furrowed eyebrows along with averted gaze that led the rater to conclude that the test taker was listening and thinking. The fact that the response was natural and contingent on the question likely led to an overall positive evaluation. However, a positive evaluation of averted gaze was not always the case, as demonstrated by rater 14 when describing sample 21:

She seemed very unconfident in even trying to, like, try and answer it. She asked right away. … Right as he started talking, she looked away, and was immediately thinking. And then, as soon as he was done, she said, "Sorry", so she apologized for not knowing something. (Averted gaze, confidence, turn-taking, comprehension, repair, speed)

Here, averted gaze was also tied closely to thinking, but because the sequence was followed by a failure to comprehend the question and other-initiated repair, the rater ultimately considered this sequence to relate to a lack of confidence. Averted gaze has also been reported in the L2 literature as relating to anxiety (Gregersen, 2005). Thus, while averted gaze related to evidence of the same cognitive process (thinking), the evaluations based on it were distinct.

The other gaze behaviors—staring, eyes open wide, blinking, and having unfocused gaze—were all mentioned relatively infrequently, occurring only 10 times in this dataset. Apart from blinking, which was an observation with no value judgement, these behaviors were seen negatively, as signs of internal struggle to perform during the test. For example, rater 12 commented on sample 16's intense stare, finding it to be so focused on the test question that it came off as somewhat distant:

I think this stare, kind of more unhappy, but more, because I think he was so focused and like, so attuned to what he was doing that he kind of like. Yeah, you just get so focused, and you want to get it all out kind of, but a little bit colder. (Staring, happiness, warmth)

Similarly, sample 18's wide-eyed looks prompted rater 14 to remark that:

There is where I kind of noticed she wasn't completely at ease, because every time he asked a question, she got a little wide eyed. And it seemed like she was worried she wasn't going to know what it was after the second time. (Eyes grow wide, anxiety, comprehension)

For this rater, the test taker's eyes growing wide was not necessarily a sign of a breakdown in understanding, but rather an affective response to the intensity of listening and the social expectation to respond contingently. In both cases, the lack of these behaviors was also noted as a positive attribute.

Mouth

Mouth movements were the second most frequently observed category of behavior, making up 20% of the comments.
These were observed by all 20 raters (M = 4.80 comments per rater, SD = 4.00). Rater-participants observed seven mouth behaviors, with most comments concerning smiles, a lack of smiling, and lip movements, and fewer comments on nervous smiles, swallowing, a relatively closed mouth, and frowning. The vast majority of these comments, 63%, were about smiles, in particular smiles that were perceived as genuine and not nervous. All raters noted the presence of smiles, making this the most extensively mentioned behavior. Smiling, when not perceived as a nervous smile, was almost always associated with positive evaluations, notably being happy (20%), warm (11%), and having a positive attitude (13%). This is illustrated by rater 13 discussing sample 17:

So I put happy more towards happy just because she seemed like, I don't know, she seemed like she was willing to answer things. And like, she obviously like laughed and like, smiled when she was talking about the makeup in Korea. … And I said, positive attitude, because I don't know, I guess it's like, kind of the same thing is like, the warmth scale is like, if somebody's like smiley and like, outgoing. I'd say that they have a good attitude. (Smiling, happiness, desire to communicate, content, attitude, warmth)

Figure 7.4 Sample 17 Smiling and Laughing

As rater 13 noted here, the test taker began laughing once she brought up the idea about makeup, as seen in Figure 7.4, accompanied by a head turn and leaning back in her chair. This laughing was followed by an episode of smiling, and she smiled once again when she mentioned cosmetic surgery. The presence of smiling and laughing, as well as possibly the relaxed posture and head movements, led the rater to perceive this test taker as happy, warm, and having a positive attitude. It is notable that the test taker's desire to communicate, identified as "she was willing to answer things," was also a factor in perceiving this performance as happy. Smiling was also associated with being at ease rather than anxious or nervous, as mentioned by rater 10 on sample 18:

So at first, she definitely seemed nervous, but then as it went on, she starts kind of smiling. So she definitely seemed more at ease. (Smiling, anxiety)

In this case, smiling appeared to be evidence of the exact opposite of anxiety, though raters also detected nervous smiles in the dataset. Smiling also led some individuals to see performances as more engaged overall (6%), as was the case for sample 15, when rater 4 remarked:

I was, I think I felt really, really good. Because she was smiling the whole time. And as she was talking about her mom and stuff. It felt like she was very engaged in what she was talking about. (Smiling, content, engagement)

The rater here noted the close association between content (a memory about her mother) and the test taker's reaction of smiling. This led the rater to feel the story was genuine, and that the test taker was engaging with something real that had happened in her life. Interestingly, the positive affect from the video appears to have impacted the rater as well, as when the rater noted "I felt really, really good." Although an infrequent observation in the dataset, this comment may be evidence of emotional contagion (Elfenbein, 2014; Hatfield et al., 1994), with the rater being emotionally influenced by the behaviors witnessed.

The lack of a smile made up the second highest number of mouth comments, at 12%, although this was mentioned by only four raters.
This behavior, or rather its absence, did not indicate that the test taker was frowning or exhibiting any other mouth behavior. Rather, it was usually mentioned when raters were struggling to understand the test taker's underlying happiness or attitude, such as rater 7 when describing sample 17:

I don't think he smiled too much. Yeah, that one was kind of a hard one to gauge at the same time with the happy/unhappy. But you know, there's a lot of things that go into that for sure. … I think I gauged a little bit on their actual responses, because that's a time you could see if their answer is competent and they spell it out and everything. You can see they're happy to be answering that question. But there was a question about, for example, the girl who didn't understand "ceremony" you know, she didn't seem super pleased to be answering…. (Smiling, content, competence, happiness, turn-taking)

In this extract, the rater lacked nonverbal evidence of happiness in the form of smiling yet was hesitant to conclude that the test taker was unhappy based on this evidence alone. The rater resorted to inferring the test taker's affective state through their ability to understand and answer the test question, and through the effectiveness of the response. Competence for this rater, in the form of overall ability, was an alternate line of evidence for inferring underlying attitude.

Lip movements formed the third highest number of comments, at 9%. These constituted their own range of behaviors, such as lip biting, "dry mouthing," smacking lips, and random lip movements. These were generally associated with anxiety, such as with sample 21, described by rater 14:

And then, you, I could see her, she's back to touching her face. And she like bit her lips, making her seem anxious and unconfident. But she is always engaged when he's talking. She's looking at the screen and thinking ahead to what she's going to try and say about what he's talking. (Lip movements, self-adaptors, anxiety, confidence, engagement, mutual gaze, thinking)

In this case and in others, anxiety and a lack of confidence were perceived through the test taker's lip biting and self-adaptors, but the overall evaluation was balanced by the fact that the individual maintained eye contact and appeared engaged with the test. Thus, though some of these behaviors often led to negative impressions, the overall impression was positive, likely because the test taker was actively participating and engaging in the speaking test. Like averted gaze, the impact of the behavior was mediated by context.

A smaller number of comments concerned other, less frequently mentioned mouth behaviors. These behaviors also coincided largely with being anxious and lacking confidence, but they did not always result in an overall negative impression. For example, when rater 20 described sample 21's nervous smile, he still found her to be overall positive, albeit not necessarily happy:

I did rate her a little bit higher on this part, you know, personableness and warmth. But I don't think that she was particularly happy in this experience, though. Despite her smiling, that definitely feels like an anxious smile to me, which again plays into anxiety score. (Nervous smile, warmth, happiness, anxiety)

Common throughout these comments is that stimulated recall participants were regularly evaluating multiple lines of evidence, often both nonverbal and verbal, when understanding the test takers' affective states.
Paralinguistics

Paralinguistic features made up the third most frequently commented phenomenon, making up 16% of the total comments and spread across 19 raters (M = 3.75 comments per rater, SD = 2.45). These features comprised sounds made by test takers that were not verbal; that is, not composed of words with specific form-meaning relationships. While silence or a reticence to speak was mentioned in the dataset, I did not code this phenomenon because it was inherently not a sound and indicated only the absence of speech rather than the presence of any specific behavior.

The most common paralinguistic feature mentioned by the raters was that of production speed, or the rate of articulation of utterances. This feature was commented on 88 times (49%) by 19 raters. It was often mentioned in relationship to fluency or the test taker's overall ability. Rater 1 made this connection quite explicitly when discussing sample 9:

It was a pretty complicated question. And so when she answered it quickly, it made me feel a little bit better about her fluency. Because I guess for me, I kind of think of the ability to think and then speak quickly a little bit fluency, I guess. So when she answered that pretty deeply complicated question, so quickly, I marked her up on fluency. (Production speed, fluency, turn-taking)

As noted by this rater, comments on speed were often made based on how quickly a test taker answered the interlocutor's question. Answering quickly was often a sign of comprehending the test question, and a quick response was furthermore a sign of stronger proficiency. In other cases, the opposite was true, with slower speaking being related to the test taker thinking about language and preparing a response, thus relating to lower fluency. Speed was not always related to language ability, however. In rater 20's view of the same sample 9, it related to affect:

I feel like part of why she's talking so quickly, because she's being anxious, and like just watching her eye movement and stuff, and kind of she reads more anxiously, but at the time, I was just like, wow, she's really going. (Production speed, anxiety, gaze)

In this example, speed was seen as possibly excessive or unnecessary, and for this reason it related to anxiety. It was not the sole piece of evidence, though, as the rater also noted that eye movements factored into this decision.

The next most common paralinguistic feature in the dataset also related to language processing. Filled pauses, particularly "ums," "uhs," and so forth, appeared in 19% of the comments on paralinguistics by 14 raters. Filled pauses are a hallmark of developing fluency as they are assumed to reflect breakdowns in speech processing (Kormos, 2006; Segalowitz, 2010; Suzuki & Kormos, 2020), and raters noted them as such. For example, when rater 4 was describing sample 13, they observed: "I think with him saying, um, it just, it didn't give me, I wasn't gonna give him like full fluency with him saying 'um' and sometimes stopping his speech" (Filled pauses, fluency). Raters also frequently noted the absence of filled pauses as a sign of more fluid and proficient speech. However, comments also frequently overlapped with anxiety (34%) and confidence (23%):

[W]hen they were talking, you know, if they were stuttering a lot so, like people that had a lot of "uhs," you could tell she wasn't super confident and at ease. But once again, that could be much different things.
(Filled pauses, fluency, confidence, anxiety)

Here, rater 7 observed that the filled pauses produced by sample 12 led to impressions of a lack of assuredness, though noting that the behavior may be linked to other underlying causes. When speaking about sample 9, the same rater mentioned that these filled pauses "could have just been her trying to think of her answer." In other words, this rater viewed these pauses as serving either a cognitive role by creating space to think or a social-affective role by conveying a lack of comfort in the testing situation. The use of filled pauses as continuers is also attested in the literature (Suzuki et al., 2021), as these alone do not predict fluency. Raters frequently mentioned that these pauses were, however, distracting, and may have caused problems with comprehension.

The third largest set of comments on paralinguistics was laughing, which frequently coincided with smiling. Laughing, appearing in 15% of the comments about paralinguistics, was mentioned by 15 raters and was generally seen as positive (except in one personal anecdote in which nervous laughing was highlighted). Laughing coincided with comments about being at ease (21%), warm (15%), and expressive (14%). Rater 4 noted the appearance of positivity and warmth in sample 18: "And then I was saying that she was positive and warm, because she was like, laughing and smiling with the questions. So like, it felt more natural when she was talking" (Laughing, smiling, positive attitude, warm). It is notable in this example that the ability to laugh made the conversation seem more authentic to the rater. Rater 7 made a similar observation about sample 18 as well, noting that confidence and anxiety were impacted:

Confidence could be, you know, she's a little bit anxious about answering these questions. But at the same time, I gave her a five because laughing. You know, she, like, she was laughing she was… at ease… because that just showed, like, you know, I mean, people laugh when they're nervous, too. But I mean, it wasn't. It didn't seem like a nervous laugh. To me. It seemed like relaxing, getting into it. (Laughing, confidence, anxiety)

Although anxiety was anticipated because of the testing context, the rater noted that laughing helped make the candidate appear more relaxed, at ease, and eager to communicate her message. It also may have helped the test taker overcome problems with comprehension and save face (Matsumoto, 2018; Pitzl, 2010). Overall, laughing helped to establish an evaluation of being involved in the conversation in a genuine way.

Another notable category of comments regarding paralinguistics concerned tone and prosody, making up 13% of the comments across 8 of the 20 raters. Comments referred to vocal inflection, emotionality, tone, shakiness in the voice, being monotone, and stiffness. Comments regarding a wider range of prosodic features coincided largely with expressiveness (21%), warmth (13%), and happiness (13%). Rater 15 illustrated this observation about sample 13:

And like, he was saying it's sort of, you know, his tone changed a little bit because he's like, "Oh, well, how else will you decorate your home?" … I think I did score happier on, or higher on like, happier and it makes him seem, like more personable.
(Tone-prosody, content, turn-taking, happiness)

In this particular case, the rater had previously discussed how the test taker's response was confusing, but it was the shift in tone that caused a more positive impression in terms of how approachable the individual was. The opposite also appeared to be true, as noted by rater 18 on sample 25 that "the reason why I put cold or inexpressive is I think that she's not really at the level of being able to convey that much emotion through speaking in English" (Tone-prosody, warmth, expressiveness, overall ability). Within this comment, the rater noted that the lack of prosodic features led to the test taker being perceived as less approachable, while also noting that conveying warmth or emotion in the voice might have been related to overall language ability.

Tone and prosody, on the other hand, also aligned with anxiety in 13% of the intersections. These comments, although few, concerned shakiness or timid-sounding voices. Rater 16 noted this about sample 12:

[I]t does sound like she has a shakiness in her voice. I can usually just recognize that. So I did put her higher on the anxious scale. But just because she's anxious doesn't mean that she's not like, she's not … competent. (Tone-prosody, anxiety, competence)

The rater here had previously noted shakiness as a feature of their own voice when nervous or anxious and thus related it to the test takers in the testing context. However, as can be seen, the rater noted that this does not necessarily imply anything about the individual's language level. Overall, paralinguistic features were somewhat less uniform in how they were perceived than gaze or mouth behavior, but these also varied quite substantially in form. In this dataset, while laughing and tone-prosody related to expressiveness and warmth, filled pauses most often related to anxiety and confidence. The remaining three paralinguistic features raters mentioned—audible breathing, backchanneling (e.g., mhmm), and voice volume—were mentioned far fewer times and will not be discussed here.

General face behavior

Comments concerning a general "look" on the face, the overall expressiveness of a face, or blank looks were coded as general face behavior. These broad comments comprised 11% of the comments on nonverbal behavior and were made by 17 raters (M = 2.65 comments per rater, SD = 2.03). These comments generally intersected with observations about anxiety (19%), confidence (13%), or expressiveness (13%). Rater 18, for example, noted that despite filled pauses, sample 17 was not perceived as anxious because of the combination of general facial expressiveness and use of prosodic features to convey emotion:

I think she's got more of like, a little bit more emotion in her voice and kind of like … her facial expressions. That's kind of like even though like she is kind of like stuttering and stopping and saying, "um", a lot. I think that's like less because she's nervous. So that's why I kind of had more at the at ease side. (Face [general], tone-prosody, fluency, filled pauses, anxiety)

The rater in this example was keen to use a variety of features to better understand the test taker's possible underlying affective state. This was also true for rater 4 on sample 12, who commented that "for the hand movements I was said she was very expressive. So yeah, I would have given her a seven but her facial expressions didn't always match her body movements though" (Face [general], gesture, expressiveness).
In this case, a lack of expressiveness in the face actually helped balance the rater's understanding of the test taker's affect from gestures. Expressiveness was judged mainly through movement in the face, as noted by rater 6 on sample 16: "I rated him a little bit more inexpressive just because his facial expression doesn't really change much" (Face [general], expressiveness). Likewise, a lack of movement in the face often led directly to negative evaluations. When speaking about sample 19, rater 10 said, "So right away just facial expression, she already seems like, like very anxious and borderline to the point of being like uncomfortable and unhappy, stressed" (Face [general], anxiety, happiness). A more serious, stoic facial configuration has been reported as being linked with anxiety in other studies as well (Gregersen, 2005; Lindberg et al., 2021). General face movements often served as evidence of question comprehension.

Posture

Features about the position of the body relative to the camera, as well as comments about the movement of the torso, made up 11% of the comments and were made by 17 raters (M = 2.60 comments per rater, SD = 2.11). The most common postural feature commented on was visible rocking or shaking (24% of comments by 10 raters), largely due to comments about sample 16. This behavior almost exclusively intersected with comments about anxiety (65%). In this test sample, after hearing the test question, the test taker almost immediately began rocking back and forth at the six-second mark, which raters found visually salient. The behavior stood out partly because it began so early in the interview, continued for 10 seconds, and recurred in the sample. The multimodal transcript in Figure 7.5 shows the onset of this behavior in context.

Figure 7.5 Sample 16 Rocking and Eyebrow Behavior

The raters unanimously found this behavior to be a sign of anxiety, as with rater 3:

I rated it higher for anxious because of the body rocking. Which, at least to me is like, it's very common body language is something that I'll see a lot. A lot of people will do as him when they're anxious. (Rocking-shaking, anxiety)

Others were more willing to balance behavior across the interview, with rater 8 noting that:

[J]ust from the first impression that since he was going, like, back and forth, I thought he was anxious, not very at ease. … But I think later, he was anxious at first, but then later, he kind of like, calms down. That's why I initially rated it as at ease. (Rocking-shaking, anxiety)

Having seen sample 16, some raters also commented on the lack of rocking or shaking when building a rationale for the relative calm or ease of a test taker. This was one of the few behaviors in the dataset with such a clear relationship to a judgement of affect. Whether this rocking was a sign of stimming, a self-stimulatory behavior typical of neurodevelopmental conditions such as autism (American Psychiatric Association, 2013), is unknown. The relationship between behavior and rater training for neurodiverse individuals will be explored further in Chapter 8.

A second set of comments, made by 7 raters and comprising 20% of the comments on posture, concerned leaning forward. Observations of leaning forward intersected most with engagement (29%) and attentiveness (24%). For example, rater 14 noted that further into the interview, sample 16 used his body posture to convey that he was engaging with the speaking test by leaning in:
"And I noticed he was very attentive. When the guy would ask a question, he leaned right in and listened" (Leaning forward, attentiveness). A similar comment was made on sample 9 by rater 19: "And then how close she is and how quickly she responds, as well as the, in the middle of the question. She did one of those like she said, 'yeah', so I said she was relatively engaged" (Leaning forward, turn-taking, speed, active listening, engagement). As seen here, these comments state explicitly or implicitly that leaning forward was related to the test taker listening carefully to the examiner's question. This behavior generally aligned with positive impressions regardless of whether the leaning happened during moments of clear comprehension or comprehension difficulties, possibly because an attempt to understand a test question (rather than avoidant behaviors or answering without understanding fully) was always seen as a desirable behavior. Forward leaning behavior has been attested as a sign of involvement and rapport (Burgoon et al., 1984; Mehrabian & Williams, 1969), as well as engagement and listening comprehension (Jenkins & Parra, 2003; Neu, 1990). The topic of listening behavior, however, will be explored more in the next section.

Moving around, a very general postural behavior that made up 19% of the comments and was mentioned by 5 raters, did not directly correlate with any given judgement of affect due to its general lack of description. As seen in the comments on face behavior, some of these comments about general movement indicated expressiveness. Others, such as movement associated with leaning forward, indicated engagement and interaction with the test. Rater 5 noted that the test taker in sample 13 "felt engaged in the conversation like moving around. Like I was saying before, the head movements and stuff like that, and like he's interested in the question, actually, like, thoughtful about it" (Moving around, engagement, head). This type of movement, synchronous with responses to the test, was viewed positively and has also been noted as a positive aspect of performance in the literature (Jenkins & Parra, 2003; Neu, 1990). Other postural movements were seen as indicating anxiety, such as rater 5's observation about sample 12:

[S]he seemed kind of anxious just a little bit based on like, all the movement. It seemed like she was kind of getting her thoughts out as they came into her head, which was probably like, why it seems so rushed. (Moving around, anxiety, thinking, speed)

Although this comment goes into little detail, the movement this rater observed may very well be an unsynchronized movement, which has often been linked to struggle in previous studies (Ducasse & Brown, 2009; Gan & Davison, 2011; Gullberg, 1998; May, 2009, 2011; Neu, 1990; Sato & McNamara, 2019; Thompson, 2016). A similar behavior, that of adjusting the body's position or shifting in the chair, made up very few comments but was often also tied to anxiety.

Straight or rigid posture, comprising 15% of the comments by 7 raters, and leaning back/slouching (10% of comments by 5 raters) made up the final set of comments. Like many of the behaviors, comments on straight posture intersected mainly with anxiety (21%), expressiveness (21%), and confidence (16%). This category of behavior included cases where the person was seen as positive due to their posture, such as rater 7's comment about sample 15:
"Confidence would be how quick they answer each question, personally, and just like their body language. And like, you know, if they're like sitting up or if they're like, kind of hunched" (Straight/rigid, confidence, body language). The rater here again combined postural style with the speed of response and overall body language to formulate an impression of confidence. Sitting up straight was seen as confident, while being hunched over was seen as lacking confidence, but only if other behaviors corroborated the same affective interpretation. In other cases, leaning back or slouching was seen as being engaged and at ease, such as in sample 13, in which the test taker visibly leaned back in his seat for a large part of the interview. Rater 9, for example, said of this test taker that "I think I noticed very much just he looked comfortable in the way that he was sitting not like tense and anxious" (Leaning back/slouching, anxiety). In some cases, sitting up straight contrasted with restless shifting around the seat or other body movements. Rater 5 noted that this lack of movement appeared to show a lack of expressiveness in sample 16:

[H]e didn't really show too much emotion. … He didn't really use any hand signals when he talked …. didn't really I guess move around a lot. He was just there, maybe looked around a little bit, but he didn't like show any other type of expressive like behavior. (Straight/rigid, gesture, gaze, expressiveness)

Again, this rater made their observation based on multiple body movements when dealing with posture. Sitting up too straight, or too rigidly, however, could also be a sign of anxiety, as rater 19 observed of sample 29: "she appears like a very upright demeanor. I think sure her hands might be folded as well. Her shoulders are very, like close together. So she appears to be quite anxious to a certain degree" (Straight/rigid, lack of gesture, shoulders, anxiety). Straight or rigid posture and leaning backward, then, could mean many things to raters depending on the overall observation of behavior in the individual. This rater also mentioned the final category of shoulder movements, which was infrequent in the dataset but always related to anxiety.

Gesture

Comments about gestures—broadly speaking, any movement involving the hands—made up only 7% of the total comments on nonverbal behavior, mentioned by 13 raters (M = 1.75 comments per rater, SD = 1.59). Although relatively small, this figure is notable because the hands are generally not visible in Zoom recordings, including in this dataset. Nonetheless, seven of the samples did demonstrate behaviors that raters found salient.

The most frequently mentioned gesture was the self-adaptor, which made up 43% of the gesture comments by 9 raters. The self-adaptors observed included hair touching, face touching, head scratching, and playing with the lips. These almost always coincided with judgements of anxiety (55%). Rater 15, for example, described two self-adaptors when rewatching sample 17:

She seems to like get more nervous towards the end because she starts, like, picking at her forehead and stuff. And at the beginning, she seemed like, just more comfortable. But I did notice that just because she started like fiddling like with her hair and stuff. (Self-adaptors, anxiety)

Figure 7.6 is an annotation density graph of three lines of behavior in this sample. The top line represents the examiner's speech in teal, showing the test question followed by backchannels and one short follow-up question ("Why?").
The second line in light red is the test taker's speech, which continued throughout most of the sample here. The bottom line in dark red represents the appearance of self-adaptors. The rater's observation was consistent with the nonverbal annotations, as the test taker transitioned from a calmer demeanor to one with visible hand movement to the face. The picking and hair touching signaled to the rater that the test taker was experiencing anxiety as this part of the test unfolded. Indeed, self-adaptors are often associated in the literature with coping mechanisms during times of stress in adults (Ekman & Friesen, 1974; Gregersen, 2005; Kikuchi & Noriuchi, 2019). In the context of an oral proficiency interview, this could align with the general course of tests becoming more difficult as they proceed, possibly representing a spike in cognitive load.

Figure 7.6 Annotation Density Graph of Sample 17's Self-Adaptors

Representational gestures—iconic and metaphorical gestures that refer to some sort of object, event, action, or idea—comprised the next-largest group of comments on gesture (26%), mentioned by 5 raters. These probably included beat gestures as well (as in the example below), but raters did not make a distinction between them. Fine-grained distinctions between iconic and metaphoric gestures were also not possible in such a small sample, but raters made a clear distinction between desirable hand movements that aligned with speech (e.g., talking with hands) and those that were seen as more random and far less desirable (e.g., fidgeting). Representational gestures aligned with positive judgements of expressiveness (31%), happiness (19%), and engagement (13%). For example, when describing sample 12, rater 3 mentioned:

[S]he was using her hands to talk … as someone who speaks with their hands, it is a little bit more expressive to me to speak with your hands because that's … an indicator that you're engaged in the conversation like expressing what you're thinking. (Representational gestures, expressiveness, engagement, thinking)

This rater observed that using gestures during speech was something desirable, possibly helping to convey additional meaning through these hand movements. The sequence is presented in Figure 7.7, slowed down so that the gesture is fully visible. The test taker here was discussing the impact of the internet on online shopping. When she said, "has a very" and "influence," she used beat gestures to emphasize these semantic units. When she said, "in our (life)", she opened her arms wide, just slightly visible at the bottom of the video. This opening-arm metaphorical gesture may be a sign of inclusivity, strengthening the meaning of "our" as non-exclusive "we" (English only has one form of "we" that can include or exclude the second-person interlocutor, interpretable only from context; Chinese uses 我们 for inclusive/exclusive "we" and 咱们 to indicate only inclusive "we"). Importantly, the rater noted that the use of these gestures (both beat and representational) is a sign of thinking about concepts (rather than language), an important dimension of McNeill's (1992, 2005) growth point hypothesis. Beat gestures may also be an important sign of prosodic control, revealing key information about the speaker's fluency and automatization of L2 use (McCafferty, 2006). The rater's observations led to the test taker being viewed as more expressive and engaged.
The co-alignment with speech and connection to topic content also appeared to make the test taker appear more proficient to the rater, aligning with Gan and Davison's (2011) findings.

Figure 7.7 Sample 12's Representational and Beat Gestures

As noted earlier, however, gestures need to align with other nonverbal behaviors to form a positive holistic impression of the test taker. Random movements, in contrast with representational gestures, were unanimously seen as negative and related to anxiety. These were different from self-adaptors in that they had no noticeable form; self-adaptors were identifiable by their action (e.g., touching hair), while random movements did not add a dimension of meaning to speech and were often unclassifiable. There were only 4 observations in the dataset. For example, rater 2 described sample 9's fidgeting gestures:

I remember thinking that she was pretty, she must have been pretty nervous at the beginning. She sort of relaxes later on, if I'm remembering this video correctly, but she's saying "um" a lot, she seems to be like a little fidgety. (Random movement, anxiety, fluency)

In this example, it is likely the combination of filled pauses and random movement that led the rater to form an impression of anxiety in this test taker.

The final category of gestures, that of lacking hand movement, made up 20% of the comments about gesture. The comments were generally split in this category. There were comments that noted the absence of fidgeting, which was seen as a positive attribute. An example of this type of comment was rater 6's observation of sample 13:

[He] wasn't looking around and adjusting and fidgeting a ton. So his voice seemed to remain like, I guess not shaky. So that's why I said, he seems to be pretty confident and not anxious. (Lack of gesture, shifting gaze, tone-prosody, confidence, anxiety)

The rater here used multiple lines of evidence from the lack of behaviors that have so far been associated with anxiety to form a positive impression of the test taker. This contrasted with the lack of desirable representational gestures, which was generally seen as negative, as noted by rater 5 about sample 16:

[H]e didn't really show too much emotion. So it was like everything else was just kind of in the middle. He didn't really use any like hand signals when he talked, didn't really I guess move around a lot. He was just like there, maybe looked around a little bit, but he didn't like show any other type of expressive like behavior. (Lack of gesture, rigid/straight posture, expressiveness)

The rater explicitly noted that the lack of overall movement in both the hands and the body made the test taker appear far less emotive and less expressive. This did not lead to a negative evaluation, however, as the rater later noted that the test taker was likely focused on the task at hand rather than trying to communicate or express a certain idea. It is important to note that lacking co-speech representational gestures can be associated with lower proficiency (Gan & Davison, 2011; Gregersen et al., 2009).

Head

Head movements made up a small category of two behaviors (6% of the dataset) observed by 13 raters (M = 1.50 comments per rater, SD = 1.76). The vast majority of these comments concerned head nods (74%). Head nods were positively associated with most judgements of affect, most notably attentiveness (18%) and engagement (18%).
For example, rater 6 commented that sample 29's nodding was positive:

I think it was the nodding, and like, I guess just saying, "okay", as the interviewer was explaining, which made me rate both attentive and expressive, even though she did not understand, I guess, the question the beginning. That was her actively trying to understand it as it was being explained, so I think the attention was paid to the interviewer. (Nodding, turn-taking, attentiveness, expressiveness, comprehension, active listening)

The rater observed in this extract that active listening, in the form of both head nodding and verbal backchanneling ("okay"), was an important tool to show active visible and audible engagement in the communicative event. Figure 7.8 illustrates the extensiveness of this test taker's use of head gestures, most of which were nods, though these included one extended nod and two head shakes. The rater saw this behavior as positive despite the ongoing comprehension difficulties the test taker had with one unfamiliar word at the beginning. Sometimes nodding added to the level of confidence someone exuded, such as rater 7's comment on sample 16: "Also, just something small, head nodding once he finished his sentence. Like, 'I know what I said. I'm confident my answer. I didn't think I messed up and on anything.'" The test taker's nod in this context was seen as both an affirmation and a skillful turn-taking device to give the floor back to the examiner. This interactional skill was almost always seen positively.

Figure 7.8 Sample 29's Head Behavior

The second category of head behaviors concerned head turns, which only comprised 8 comments (26%, by 8 raters). These comments were largely heterogeneous, ranging from "moving the head back and forth" (rater 8, sample 13) and "turning her head" (rater 5, sample 17) to "bobble head movement" (rater 10, sample 13), "constant little head movements" (rater 10, sample 29), and "head bob" (rater 20, sample 16). In each of these examples, these movements were seen as positive, indicating some level of comfort, ease, and expressiveness in the display of movement. There was only one case where "mini head movements" (rater 5, sample 9) combined with shifting gaze led to an impression of anxiety. In all cases the movements were factored together with other behaviors to form a holistic impression of the test taker.

Eyebrows

Of all the specific areas of the body mentioned by raters, the eyebrows were the least frequently commented on. There were only 11 comments in total, made by 6 raters, representing just 2% of this dataset (M = 0.55 comments per rater, SD = 0.94). For this reason, the discussion will be brief. The most common comment about eyebrows was their general movement, making up 6 of the 11 comments. Some raters, such as rater 6, found eyebrow movements to represent engagement and interactiveness, such as with sample 12: "And like eyebrows in general, and eyes are moving up and down rather than just straight face. So I said that was engaged and interactive" (Eyebrow movement, engagement, interactiveness). Others, such as rater 12, found them to be a visual indicator of thinking, as in sample 17:

[S]he was expressive with the eyebrows, but not so much on the face, which is why I thought colder … But I do remember like, she is thinking very hard, maybe not as expressive, but the eyebrows are like, the eyes are like okay, she's definitely thinking, she's definitely having, you know, like, visual reactions.
(Eyebrow movements, expressiveness, gaze, thinking)

In this example, the movement of the test taker's eyebrows was seen as a positive indicator of internal cognitive processes, but not of overall positive affect because they did not align with other visible behaviors. These findings appear to align with the positive impact of eyebrow movements on fluency (Kim et al., 2023; Tsunemoto et al., 2022), as these movements were able to mark prosodic stress in ways that aligned with fluent speech.

Comments on furrowed and raised brows made up the remaining five observations. Furrowed brows (3 comments) were associated with comprehension problems. For example, rater 15 observed the following about sample 16:

[A]lready he seems confused, like, right when he started talking. He like furrowed his brow. So I remember thinking that even before he started talking, I was like, it seems like he's gonna have trouble from just his reaction to being asked something. (Furrowed brow, comprehension)

The behavior rater 15 mentioned can be seen in Figure 7.5. The test taker began furrowing his brow just after hearing "What effect has the internet had", before the completion of the examiner's question. Rater 15 used this information to preemptively prepare to hear a repair sequence, even though there was no such repair sequence in this sample. Raised brows, of which there were only two comments, were associated with attention, such as the following comment by rater 7 about sample 9:

They're engaged, at ease. … Body language, you know, she's facing the camera, obviously, her back is straight. She's not frowning or anything. Okay, her eyebrows are pretty raised. She just looks ready, attentive. (Raised brows, engagement, body language, posture, furrowed brows, attentiveness)

In this extract, the rater mentioned several lines of evidence that the test taker was relatively at ease and engaged. The rater indicated that the raised brow aligned with these behaviors, but seemed to imply that the eyebrow raising might be a sign of anxiety. It could be, then, that raised eyebrows were an interactional device to signal to the examiner that full comprehension was not reached, as was the case with furrowed brows.

General body language

The final category was the most general of all and included observations about any movement of the body without specific details. There were only 10 of these comments by 6 raters, making up 2% of the dataset (M = 0.50 comments per rater, SD = 0.83). Raters generally referred to this category using the specific term "body language," but also included "change of mannerisms," "physical movements," "how they were in the video," and "their looks." These comments were heterogeneous in nature, providing evidence for a range of affective states, such as anxiety, confidence, engagement, and others. Two comments also indicated that body language provided evidence of an individual's overall ability before the test taker started speaking. Rater 11, when asked why they initially thought that sample 13 "was not going to be good," said:

I don't know, I feel like I would have had to base it off looks, which is something I didn't do throughout the rest, the whole thing. But I felt like I have to mention it … as soon as I was like, "Oh, he's not going to be good," I just remember feeling like that, and literally all I did was see his face. So it had to be maybe about the way that he was looking about it first.
(Body language, face behavior, overall ability)

In this example, the rater later recalled that the test taker performed well on the test, and by the end the overall judgement was positive. It may be the case in other situations, though, that observing nonverbal behavior prior to hearing speech colors the raters' interpretation of language proficiency throughout a particular test.

Behavior, affect, and language proficiency

In this study, I had anticipated patterns emerging about nonverbal behavior and proficiency judgements. The reality, however, was more complex because judgements of affect and nonverbal behavior were often intertwined. In fact, while there were some behaviors that were generally associated with positive judgements of affect (e.g., mutual gaze, smiling, use of prosodic features) and others with negative judgements (e.g., shifting gaze, filled pauses, self-adaptors), broadly speaking, raters did not often indicate explicit ties between specific behaviors and evaluations. Instead, they considered nonverbal behaviors as a context-bound cluster of phenomena that interacted with linguistic features and the content of utterances to create a sense of affect, which occasionally impacted impressions of language ability. Raters considered multiple lines of evidence when forming their impressions and balanced these to arrive at their judgements.

While specific behaviors were not found to impact language proficiency, comments were made about the utility of nonverbal behavior when forming impressions of language. For example, rater 1 made the observation that:

I think it is important to kind of integrate facial reactions and like, not reactions, but expressions on how they're reacting physically to statements as a reflection of fluency, or maybe even vocabulary, like whether or not they're understanding what you're saying, is a reflection of their own lexical knowledge.

This statement indicates that facial behavior has the capacity to provide extra information about an individual's linguistic competence and language processing that would otherwise be unavailable to someone just listening to the recording without visuals. Rater 12 made a similar observation about the use of behavioral information when judging language:

Just a lot more movement, a lot more thinking. I think with her when she said, it was her trying to think. The eyes too, the eyebrows, definitely more expressive. I think she smiled a couple times. But I think maybe it was like, the accent, so I couldn't definitely tell how to rate her, a weaker vocabulary, or weaker grammar. I think it was just easier with her specifically to see her face.

From this extract, the rater did not associate the test taker's positive nonverbal expressiveness with any concomitant positive features of language. In fact, the rater noted that the language skills, displayed through pronunciation, vocabulary, and grammar, appeared somewhat less proficient. However, the presence of nonverbal behavior aided in their final decision, thus providing additional information that would have been unavailable in an audio-only format. In one extract mentioned previously, rater 11 noted that the presence of nonverbal behavior might also create an impression of language ability even prior to hearing the individual. Although the rater noted that the look on the test taker's face conveyed a negative impression in terms of language ability, they also said that this was uncommon.
Nevertheless, it was the first instance of behavior that elicited this type of response from the rater and shows how certain visible phenomena may impact proficiency ratings.

Comments specifically relating nonverbal behavior to proficiency were, however, few. To discover deeper meaning in the dataset, I used axial coding of the already coded utterances to uncover other patterns. After repeated readings of the dataset and analyzing intersections of behavior and language, I devised, revised, combined, and distilled sets of thematic codes that represented patterns within the dataset related to language proficiency judgements. The final set of thematic codes and code counts is presented in Table 7.4, and each code is discussed separately below. For the sake of coherence, the order of the discussion is thematic rather than in order of pattern strength.

Table 7.4 Thematic Codes

Theme                                                        Count    %    Ext
Multimodal assessment of listening comprehension                99   46     19
Assuredness impacts perception of proficiency                   46   21     15
Approachability moderates perception of comprehensibility      42   19     15
Adaptability moderates impact of breakdowns                     30   14     14

Note. Ext = number of raters whose comments contributed to the theme.

Multimodal assessment of listening comprehension

Although raters were not asked to rate or even consider whether the test takers understood the examiners' questions, comprehension factored into all 20 raters' decision-making processes. On its own, it was the fourth most commented code, following speech content, fluency, and vocabulary. Comprehension of test questions emerged as an early signal of a test taker's overall proficiency, and raters frequently used this to assess language skills. However, unlike speech, listening comprehension is not directly observable. Raters used multiple lines of evidence, most importantly nonverbal behavior, but also receipt tokens such as oh, ok, and yes (Heritage, 1984, 1998), to assess listening comprehension. This multimodal assessment of comprehension emerged as the most common pattern in the dataset, occurring 99 times across 19 raters (M = 4.75 times per rater, SD = 3.70). For example, rater 10 described evidence of sample 29's breakdown:

So right away just facial expression, she already seems like, like very anxious and borderline to the point of being like uncomfortable and unhappy, stressed. But then I did notice that like, right away, she tried asking "ceremonies?" which a lot of people just kind of like, let it go until like I said before, like it was just like that uncomfortable that it's clear, they didn't understand.

The rater in this example noted that the test taker's facial expression indicated quite negative affect, signaling discomfort. This discomfort was interpreted as an immediate indication of an underlying problem, even before the problem was made explicit. Figure 7.9 shows the behavior that led to this interpretation. The test taker held a furrowed brow from even before the test question began, and shortly after the question started held her mouth in an open position without speaking. This combination of behavior likely conveyed a sense of anxiety and unhappiness, though the mutual gaze held throughout indicated the test taker remained focused and engaged. The explicit communication of non-understanding and identification of the trouble source appeared 850 ms after the test question ended. The test taker used a restricted repair initiation ("Ceremonies?"; Dingemanse et al., 2016) to convey that she had not understood this specific lexical unit.
The rater in this example appeared to imply that identifying the trouble source with this restricted repair initiation was favorable and, combined with the behavioral and affective evidence, solidified their assessment of listening comprehension.

Figure 7.9 Sample 29's Comprehension Breakdown

Assessing listening comprehension was frequently associated with judgements of overall competence. If competence is understood as being able to complete a "task" appropriately, this is a logical connection, as answering a question contingently would satisfy its requirements. Rater 14 made this observation about sample 18:

Here's when I was like, Oh, she's very fluent, because her words are flowing right out, and I could understand everything… when she started, she's smiling, and she's attentive and understood the prompt. So I also said she was competent, because she did the right thing.

Competence also related to judgements of overall language ability. Rater 7 commented that "[h]onestly, for the competent/incompetence. I viewed that that one as like an overall, like an overall scale." These comments show the importance of assessing listening comprehension through multimodal evidence, as it allowed raters to form a holistic impression of the test taker's language proficiency.

Assessing listening comprehension also appeared to give raters specific information about the test taker's vocabulary knowledge. For example, rater 9 commented on this issue with sample 18:

This one I do remember thinking the way she hesitated, looked around. Definitely seemed like she didn't understand what the question was, which I think, I mean, I've been saying vocabulary a lot based off of how they've been responding. But I think that's also what they're able to understand. So her vocabulary score was, it wasn't low, but it wasn't super high.

Instead of basing their decision on actual vocabulary produced in the test, the decision this rater arrived at involved vocabulary that was not understood and not produced. In other words, this rater and others used deficiencies as a line of evidence rather than basing decisions on what the individual was able to do. The rating scale used in this study likely lent itself, at least partially, to this type of behavior, as decisions were mostly binary (positive/negative) and the raters had not been trained in any particular understanding of language proficiency. Contemporary rating scales for L2 speaking tests, on the other hand, generally draw on can-do descriptors or descriptions of language use at target levels, and these do not generally include a description of language deficiencies (see, e.g., Council of Europe, 2020). Whether trained raters engage in similar deficiency-based thinking is a question that remains open and is beyond the scope of this study.

Adaptability moderates impact of breakdowns

The raters in this study always viewed breakdowns in understanding as negative. As discussed in the previous section, breakdowns impacted overall impressions of language proficiency as well as impressions of lexical competence. This negative impression, however, was sometimes attenuated by the behaviors and affect the test taker exhibited during the breakdown sequence. These attenuating behaviors included verbal aspects of interactional competence, the test takers' turn-taking moves, repair initiations, the relevance-contingency of their responses, and active listening.
Many comments also focused on identifying whether the test taker was thinking about content or language, and these decisions were primarily made based on the nonverbal behavior test takers exhibited in their facial expressions. Likewise, affective stances such as a desire to communicate, engagement, and focused attention also factored into whether the impact of the breakdown was stronger or weaker. Together, this cluster of interactional, cognitive, and affective behavior provided the raters an impression of their adaptability, or "how a candidate copes in a novel or challenging language situation in real time" (Harding, 2014, p. 192). Adaptability mediated evaluations of test takers in light of breakdowns, with more adaptable test takers benefitting. This pattern emerged 30 times across 14 raters (M = 1.50 times per rater, SD = 1.61).

Adaptability was sometimes characterized by how personable a test taker was during the breakdown sequence. For example, the test taker in sample 21 failed to initially understand the test question in Figure 7.10. She displayed this nonunderstanding 880 ms after the end of the question by averting her gaze and turning her head left. She held this behavior until she produced an other-initiation of repair by laughing, re-establishing mutual gaze, and returning to a neutral head posture. This display of positive affect, attention, and engagement may have established her comfort in asking for help, making her appear more personable. As she began verbally requesting repair, she smiled while leaning forward toward the camera to continue showing her engagement and attention to the examiner.

Figure 7.10 Sample 21's Breakdown and Repair Sequence

Rater 6 noted the positive impact of these behaviors because the test taker remained warm and friendly, but also attentive, throughout the breakdown, despite its potential stress:

I remember saying warm because even though she was confused, she didn't seem overly anxious about it. Like she laughed about it and asked the interviewer to repeat the question rather than like, eyes darting around, eyes was getting bigger. Plus she was attentive and expressive as well with a smile and overly seems happy or overall seemed happy despite the confusion. (Laughing, warmth, comprehension, anxiety, repair, shifting gaze, eyes growing wide, attention, expressiveness, smiling, happiness)

In this scenario, despite the ongoing comprehension difficulties she faced, the test taker's positive attitude, shown by smiling and laughing throughout the repair sequences, resulted in an overall positive impression. Likewise, the rater noticed that the test taker kept mutual gaze with the examiner during the repair sequence without her gaze drifting excessively. The rater viewed these behaviors as oriented towards communication, as she was paying attention and trying to collaborate with the examiner towards responding.

Behaviors that indicated active listening also factored into the perception of adaptability during breakdowns. For instance, the test taker in sample 29 initiated a repair sequence after hearing the unfamiliar word "ceremony," which was shown in Figure 7.9. After this, the examiner provided a clarification sequence, shown in Figure 7.11. As seen here, after the examiner began speaking, the candidate relaxed her brow and soon after her body posture, leaning forward from a backwards lean during her repair request.
She then showed that she was actively listening not by smiling or laughing but by reestablishing gaze with the examiner and nodding just after key words the examiner repeated (when/celebrate/important/events/life). As illustrated in Figure 7.8, this test taker used nods frequently to backchannel and show active listening with the examiner. Her slight turn to the left may have also been a head gesture of directing her ear to the camera, which could have been a sign of active listening.

Figure 7.11 Sample 29's Display of Active Listening During Clarification

Rater 7 noted the head nodding and mutual gaze in this sequence, relating them both to active listening:

Just something little, her head nodding, she was attentive. She was under... She was trying to understand what he was trying to say. I mean, there's that you can see it, kind of see how, that goes on with facial expressions. But yeah, and a little bit of confidence to my opinion, because, you know, she's not in her head. And, you know, obviously she's looking at [the] computer.

Breakdowns in communication were almost always seen as leading to a loss of confidence, but the rater here noted that the test taker's adaptability helped her regain some of that confidence. Again, being attentive was seen as vital in this moment of breakdown and integral to her adaptability.

Even in cases where comprehension resolution was not fully reached, some raters still formed a positive evaluation of the test taker due to their adaptability during their response. For example, sample 15 failed to comprehend a question about rewards. Instead of answering the question about rewards in general, she responded with a personal anecdote. Rater 8 detected the non-contingent response, but still arrived at a positive impression of the test taker:

I didn't totally think that she [comprehended the question] at this time. Okay, just because she's talking about herself when the speakers asking about children and their parents' relationships. I still think she's at ease even though she doesn't really look like she knows what she's talking about yet. She's still calm. I think she's not like anxious, looking around, playing with her hair or anything like that. That also goes back to confidence as well, because she's still. Like, she looks confident in what she's saying. Even though what she's saying makes no sense to the question, but that's okay. We'll get there.

Despite the test taker's somewhat irrelevant answer to the question, the rater noted her general feeling of calmness. She adapted to the communicative event despite not fully understanding the question, thus showing a desire to communicate. Her gaze was held on the examiner and she did not use self-adaptors during her response, reinforcing this impression of ease and confidence. Confidence for this rater was an important sign of overall language proficiency given the comment "but that's okay. We'll get there." This echoes a finding from Jenkins and Parra (2003) in which an off-topic response was perceived positively in a similar manner: "[Alejandro's] use of nonverbal features during his inaccurate response were perceived as confident behavior and had a positive effect on the evaluators" (p. 98).

On the other hand, breakdowns in comprehension combined with a lack of adaptability resulted in negative evaluations overall. For example, in sample 18, the word "ceremony" also led to a breakdown in understanding.
The test taker's behavior, however, was far different from the earlier samples and showed much less adaptability. Immediately after the end of the test question, the test taker averted her gaze by looking side to side repeatedly, maintaining this shifting gaze until the examiner began a confirmation sequence. She initiated a repair, but indirectly, by repeating the word "ceremony" with a falling tone, which was not a direct clarification question. The word was surrounded by filled pauses. She did not smile and showed little positive affect. Instead, she held her mouth slightly open, which may have conveyed confusion. Behavior showing active listening was also absent, as there were no nods during the examiner's clarification sequence or after reaching some resolution at the end when she said "yeah." This sequence is displayed in Figure 7.12.

Figure 7.12 Sample 18's Breakdown Sequence

Rater 19 noted this relatively unadaptable behavior in the following extract. Here, the rater pointed out that the test taker's repair sequence was not explicit, which provided evidence about her communicative intent:

[S]he wasn't necessarily willing to communicate that she didn't understand ceremony if that is the direct issue that appears to be. I said that she wasn't quite confident. I noticed because previous people, like asked what ceremony meant or like, they didn't understand the word and they communicated that where she's kind of just trying to work through it and find the meaning. And then also quite expressive, because you can tell she has kind of a confused look. And she's her eyes are darting trying to find the answer.

The rater noted that the candidate's reluctance to admit to the breakdown indicated a certain lack of confidence. Combined with an observation about the overall facial behavior, indicating confusion, and the shifting gaze, the rater deduced that the test taker was thinking about language rather than content. This type of struggle likely provided more evidence of the lack of confidence, resulting in an overall negative evaluation.

Approachability moderates perception of comprehensibility

Comprehensibility was one of the scale categories for raters, and as such it was the fifth most commented feature with 228 comments, even more than grammar. I anticipated that comprehensibility would be easier for the raters to interpret than pronunciation yet would be rated similarly. The raters indeed scored this category on whether they understood what the test taker was saying, but because comprehensibility is broader than pronunciation, the raters also referenced elements of fluency, vocabulary, grammar, and content (amount of talking, breadth of knowledge, relevance, truth, etc.), together with pronunciation. Thus, this measure is not a clear representation of pronunciation alone. For example, when describing the ease of understanding sample 29, rater 2 mentioned:

It gets a little hard to understand. After she says, "they choose that day to get married." From that to this point is a little bit difficult to understand. Like it was hard for me to sort of figure out what she was trying to say… I think that it's a combination of like, misused grammar and accent, I do think the accent plays a factor here too. But I just think they're not. It's not a fluid conversation. Very, it's very choppy.

Rater 2 mentioned a combination of grammatical errors, accent, fluency, and possibly some confusion about the content of the test taker's speech all as factors that made understanding her difficult.
Accent was not always a main factor in decisions, however, as rater 13 noted when asked whether their observations about accent impacted any of their scoring decisions:

I don't think it really does, because I think, like, as long as you can convey the idea, the idea is mostly conveyed through, in my opinion, grammar and vocabulary, and how well you're able to express yourself.

Again, grammar and vocabulary formed the majority of the rater's understanding about comprehensibility, as the key evidence appeared to be idea generation. Content, then, was a main focus. Other raters echoed this, saying that they anticipated hearing an accent because they knew in advance that the speakers were second language learners, and for this reason accent played less of a role. These findings were in line with research on the differences between accent and comprehensibility in SLA (Munro & Derwing, 1995; Trofimovich & Isaacs, 2012).

One pattern that emerged amongst these comments about comprehensibility was that test takers who were more difficult to understand sometimes benefitted from behaviors showing greater approachability. Approachability in this case arose largely from being seen as personable: having a positive attitude, being expressive, and actively engaging with the examiner. There was some crossover with adaptability in the sense of appearing at ease and confident, but resolving breakdowns was not a key element in these comments. This pattern appeared 42 times across 15 raters (M = 2.10 times per rater, SD = 1.86). For example, rater 1 explicitly mentioned affect and nonverbal behavior as key elements making the test taker in sample 8 more comprehensible:

I was just thinking about it's more of an affect thing, kind of able to laugh off his mistake it's a nice little moment… the stuttering obviously made me think that maybe it's not... his fluency isn't quite there. He's not able to get to his mouth what's in his brain. But his presentation is really not bad. That's why this was interesting to me because his grammar and vocab and his overall fluency is really weak but his pronunciation is pretty comprehensible.

Rater 1 noted that despite weak fluency, vocabulary, and grammar, the test taker's pronunciation was largely comprehensible. The rater appeared to attribute this, at least partially, to the test taker's affective stance during the test, taking a more relaxed approach and being willing to show a good-humored nature by laughing and recognizing his mistake.

Likewise, sample 15 was often discussed as an example where there were multiple problematic moments, including a breakdown in comprehension and speech that was somewhat irrelevant at points. Nonetheless, this speaker was very approachable during the entire interview. She smiled, showing a positive attitude, and she often maintained mutual gaze with the examiner while he spoke and clarified her misunderstandings, showing engagement. A gaze density plot for the entire interview is provided in Figure 7.13, which shows averted gaze in the gaze tier and mouth movements in the mouth tier. The test taker spent 25% of the overall sample time smiling at the examiner. It is also notable that while the examiner was speaking, the speaker almost always maintained mutual gaze.

Figure 7.13 Sample 15's Smile Density

This approachable stance was noted by multiple raters.
In the following example, rater 2 made the observation that her positive affect impacted how communicative, and thus comprehensible, the candidate appeared:

I think she was one of the first ones where it seemed like, you can meet her on the street and have a conversation with her… It's very it's natural seeming. She's not always perfect, but she's really like, her point comes across, which I think is more important than being perfect. She's very communicative… They're really expressive. But the vocabulary is not great. But they do get their point across. So like, people could understand them… So I do think that I rated her better, because she was able to get our point across and was very engaged and focused.

Again, in this comment the rater noted weaker elements of language proficiency, but her expressive, engaged, and focused demeanor helped her to convey her message effectively. Maintaining this stance was then an important part of being perceived as comprehensible. Rater 10 echoed a very similar sentiment:

So I remember thinking about her personality, I don't know how much like to relate this to like the language skills, but her personality is very easygoing, comfortable. And then to that, like she's able to kind of express it. So she might have an easier time communicating with, like, even if someone else might know English better than her. I think she's very expressionate [sic] in the way how she moves her face and stuff. So I think she might have an easier time communicating.

In this example, the rater explicitly noted that communication would be facilitated by her approachable stance, even despite language difficulties. A lack of expressiveness, positivity, and engagement likewise factored into decisions that ultimately resulted in evaluations of lower comprehensibility. Namely, a lack of vocal expressiveness, such as speaking in a monotone manner, was associated with lower comprehensibility. Rater 1 noted this about sample 9, saying that “that little chain right there, I remember feeling that I think this is where I really scored for comprehensibility, because, like it's all so monotone.” As seen in the annotation density plot in Figure 7.14, this test taker’s demeanor likewise lacked indications of warmth and engagement through mouth movements, eyebrow movements, or head gestures. Her mouth movements included holding her mouth open for three seconds in the middle of her response and three quick moments where she licked her lips. This test taker conveyed a sense of disengagement to some raters due to her gaze and facial behavior. This can also be seen in Figure 7.14, where there were frequent interruptions in her mutual gaze throughout. Rater 17 mentioned that this behavior made it difficult to follow what the test taker was saying:

She's like, looking everywhere. And it's not like she's holding it, and then maybe moving. It's like, bam bam bam. And like, it was like everywhere. Like, anxious or not quite sure where she's going with her sentence.

In other words, her behavior was seen as rushed and unnatural, thus neither connecting with the examiner nor creating an approachable stance. Rater 4 echoed this, saying her lack of engagement, observed through eye movements, a relatively static facial expression, and monotone delivery, made her discourse distracting:

I was thinking about how much she was saying "um" and looking around… I guess it was just distracting.
And then it was also like a break every time she was saying a word, which was, like, distracting me from what she was trying to say… I felt like she wasn't very engaged in the question… Um, facial expression, um, and then like, no inflection in the voice, like she's just talking the same way the whole time.

Figure 7.14 Annotation Density Plot for Sample 9

Assuredness impacts perception of proficiency

The final pattern of comments directly related to perceptions of language proficiency. Raters frequently mentioned that assuredness—greater confidence and lower anxiety—aligned with their perceptions of overall language ability. This pattern was in fact the second most frequent, occurring 46 times and mentioned by 15 raters (M = 2.30 times per rater, SD = 2.39). In some cases, this perception was quite broad and holistic, such as rater 2’s comment about sample 15 (who likewise appeared more personable and approachable in the previous section): “I think that her seeming comfortable, made me want to rate her as more fluent.” The rater appeared to refer to the broader meaning of fluency here as language ability (Lennon, 1990). Simply appearing more at ease led the rater to a greater estimation of her language ability. Likewise, confidence led rater 4 to a broad, holistic impression of ability of the same test taker:

I think I instantly remember her sounding really, really good? I think I remember her face. I remember [her] sounding really good. I gave her a seven for being, me being able to understand her. And so she was calm and confident and warm. All good scores.

This rater formulated a very strong impression of the test taker based on her confidence, relative lack of anxiety, and also her warmth. This observation also relates to her comprehensibility, which was noted in the previous section. Some raters mentioned that nonverbal behavior led to an understanding of assuredness, and assuredness then led to an overall understanding of language ability. Rater 7 referred to ability as competence when describing sample 25:

Right there, I mean, that's where she kind of lacks some confidence in her answer. So I mean, that's why I didn't give her the full confidence. Her facial expressions, she's not super expressive. I mean, you can see she's kind of like looking off, you know, she's definitely trying to think of her answer. So I mean, that's where, you know, somewhat competence and incompetence comes into. Honestly, for the competence/incompetence, I viewed that that one as like an overall, like an overall scale. Like, is she competent at speaking in English? Or is she incompetent at speaking English? So that's how I viewed that slide on, you know, obviously, I gave her a kind of low score, because I mean, once again, I could understand what she was saying, but feel like each of her sentences that she was trying to speak, she was just like, jumping from word to word.

Here the rater started with an impression of confidence based on facial expressions and averted gaze. This impression of low confidence, derived from a lack of expressiveness and attention, led the rater to understand that the test taker was struggling. The rater mentioned that she was thinking, and once she started speaking, fluency problems became evident through the lack of connected speech. However, the judgement of lower competence appeared to take place prior to any observations about language and was based purely on what the rater could see in the test taker’s demeanor.
In other cases, assuredness was deduced directly from nonverbal and verbal behavior, but the relationship with overall proficiency was not necessarily causal. Rater 20 commented that for sample 17:

Yeah, there's also just, I mean, generally there's no sign of like, lack of confidence or anxiety. If she looks away or hesitates, its because she's thinking. I detected very little doubt of her own abilities. Or, there were very few like awkward pauses. Not much awkward, like mouth movements or, you know, eyes darting around for no reason, things like that. And she stays very centered. Very, she seems very confident when she expresses herself and her language skills are very high. And again, that kind of combination of confidence and language skills gave made me give her high competency.

Although the rater did not explicitly mention a direct relationship between the two, the characteristics of being confident, at ease, and stronger language were all related, leading to a strong overall impression of the test taker. In some cases, the direction appeared to be reversed, with markers of higher fluency combining with nonverbal behavior to produce an impression of greater confidence. When asked what made sample 13 appear more confident, rater 6 said:

[He] didn't seem to be stumbling a lot, and wasn’t looking around and adjusting and fidgeting a ton. So his voice seemed to remain like, I guess not shaky. So that's why I said, you seem to be pretty confident and not anxious. Um, overall, the sentence seem to flow pretty well, too.

The rater noted that behaviors such as shifting gaze, the use of random gestures and self-adaptors, particular features of tone, and fluency features all combined to produce a confident impression of the test taker. In another case, rater 14 mentioned that sample 29’s confident affect improved once the breakdown sequence concluded:

So when she started rolling with her response, it was very fluent. And there weren't very many hesitations. She had some of those same grammar issues with the missing plurals and stuff. But she was very comprehensible. And then her confidence seemed to gain once the, I'm sure that, I think the person was like, "Yes," or... And she seemed like very pleasant to be there. The inflection her voice was warmer and happier.

In this case the reverse direction from fluency to confidence was much more apparent. This comment also links back to the pattern of adaptability. The test taker’s adaptable response to the breakdown sequence led to greater confidence, and hence an overall stronger impression of language ability. These bidirectional relationships between fluency and assuredness were almost always linked to overall positive impressions of the individual. On the other hand, observations of anxious or unconfident test takers did not frequently relate to a direct negative impression of language ability, as the two were seen as interrelated. For example, when discussing sample 21, rater 11 said:

Yeah, here definitely was where I establish she just wasn't, I think I literally put it like all the way down for anxious… I accept, pretty much established she was completely unconfident and anxious at this point, because she just looked like kind of, like her facial expressions and stuff. And also, she just kind of seemed like she just didn't understand like, what was going on? So that's that that to me, put, why I put her so far.
Here the rater noted that the test taker appeared quite anxious due to her nonverbal behavior but suggested that this might have been related to her lack of comprehension. Although the overall judgement appeared negative, it did not relate directly to language proficiency. In fact, raters appeared to be able to separate affect from proficiency in at least some of these instances, such as rater 16’s comment about sample 16: “He seems very unconfident. But the words that he's using, he clearly knows. Like, he clearly has a good vocabulary. He just, it just doesn't seem like he's confident enough to use it.” Despite being seen as lacking confidence, the rater still found the test taker’s vocabulary level to be strong. Thus, judgements of higher proficiency may be facilitated by a presentation of assuredness, but they are not necessarily hampered by the lack of such affect.

Summary

This chapter has described the patterns and trends that emerged from raters as they described their thought processes while rewatching and listening to stimuli from the main rating project. As opposed to the analysis of their scores, using stimulated verbal recall provided insight into where the raters directed their attention and what information they used when making judgements about language proficiency. The raters’ comments on nonverbal behavior made up a small but meaningful portion of the dataset, aligning with previous studies that used raters to elicit comments about L2 communicative competence. The raters in this study focused primarily on behaviors related to eye gaze, mouth movements, and paralinguistic cues, with a smaller focus on body posture, gesture, and eyebrow movements. The raters’ comments revealed that they rarely focused on just one behavior when making decisions related to language proficiency or affect. The raters considered ensembles of behavior, weighing the often conflicting information from verbal and nonverbal channels when making decisions.

The raters’ comments revealed four main patterns related to how they used nonverbal behavior when rating language proficiency. The first was that they used multimodal information from both nonverbal and verbal channels to gauge whether a test taker understood their interlocutor. Non-understandings always led to a negative impact on perceived language ability. The second pattern, however, was that this negative impact could be attenuated by engaging in adaptable behaviors—behaviors that showed a desire to communicate, engagement, active thinking, and interactional competence. The third pattern was that comprehensibility was judged using both verbal and nonverbal criteria. Fluency, vocabulary, grammar, and pronunciation worked together with positive affect and engagement to aid in smooth communication that was easy to understand. Finally, the fourth pattern was that assuredness in the form of confidence and low anxiety was closely related to judgements of language proficiency; the more assured someone appeared, the more proficient they were perceived to be. This relationship was, however, bidirectional.

CHAPTER 8: DISCUSSION

In this chapter, I will discuss the findings from Chapters 5 through 7. For each chapter, I review the findings and offer an interpretation of their significance in terms of previous findings and theory. For Chapters 6 and 7, I will discuss the results of the a priori hypotheses from Chapter 3. These three sections will be followed by a triangulation of the three sets of findings, focusing on the themes these studies have in common.
I will then discuss the implications of this study for L2 assessment, SLA research, and research methodology.

Study 1: Affect and language proficiency

Dataset integrity

Chapter 5 began by ensuring the dataset was as robust as possible, as survey and questionnaire data are notoriously prone to problematic rater responses (Iwaniec, 2019). In this study, I planned several a priori measures to minimize the impact of such behavior. For example, raters were not allowed to speed through the survey, as they could not move forward through the samples without watching the videos in their entirety. Raters were also not allowed to pause or replay the videos, as this could have caused differential effects in the ratings if they focused too much on particular language elements during replay. To reduce straightlining (the selection of all categories on one side of the scale) and acquiescence (the selection of categories that a participant feels the researcher desires), I randomly reversed the polarity of scales. This encourages raters to consider the meaning of the scale for each case and, in theory, reduces this type of bias. To reduce order effects of the presentation of the samples, the samples were counterbalanced by day and randomized. To reduce primacy bias in the affect scales, scale ordering was randomized for each rating sample as well.

However, these measures do not guarantee the integrity of rating data, as these data must be inspected a posteriori for problematic responses. I described how the dataset was cleaned to ensure the ratings were as reliable as possible prior to analysis. Of the 100 original raters, one was removed because of technical problems, and 16 were removed due to low reliability, misfit, multivariate outliers, or a combination of these issues. Although other methods exist to detect problems in survey-type data, such as the careless package in R to detect careless responding (Yentes & Wilhelm, 2018), some of these are not effective for data with a nested structure (raters by samples with multiple outcomes). For this reason, I relied on the methods highlighted here, which I decided would be the most straightforward way to strengthen the dataset without unnecessarily losing power. I then checked the integrity of the dataset prior to conducting analysis using descriptive statistics and Rasch measurement. The scales functioned appropriately without misfit or erratic behavior, showing desirable Rasch measurement characteristics. Raters used the full range of scores, and the standard deviations for each rating category indicated that raters showed a satisfactory degree of variance across the 7 score categories. Language-related traits had the highest amount of variance, while affect measures, in particular positive affect, had the lowest. For this study, this type of variance was important as it suggested that raters as a whole were not assigning a restricted number of scores for the language categories, which would have attenuated correlations and weakened inferences from regression analyses. The distribution of the scale scores furthermore showed desirable characteristics. Overall, raters awarded midpoint scores of 4 less frequently than scores indicating a scale direction (e.g., 3 or 5, marking direction towards a descriptive endpoint such as positive or negative). I understood this as indicating that raters did not exhibit central tendency.
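As a concrete illustration of the a posteriori screening described above, the sketch below flags raters who are multivariate outliers using Mahalanobis distance. This is a minimal sketch in R under stated assumptions—a hypothetical wide-format data frame rater_means with one row per rater and one column per rating category, averaged across samples—and not the full cleaning procedure, which also drew on reliability and Rasch fit criteria.

    # Columns holding each rater's mean scores across the rated samples
    score_cols <- c("fluency", "vocabulary", "grammar", "comprehensibility")
    X <- as.matrix(rater_means[, score_cols])

    # Squared Mahalanobis distance of each rater from the group centroid
    d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

    # Flag raters beyond a conventional chi-square cutoff (alpha = .001)
    cutoff  <- qchisq(.999, df = length(score_cols))
    flagged <- rater_means$rater[d2 > cutoff]

A screen of this kind is transparent and works at the rater level, which suits the nested rater-by-sample structure better than row-wise indices designed for flat survey data.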
In terms of the score distributions, language scores were fairly balanced overall, with somewhat bimodal distributions (3 and 5 were the most frequent categories for fluency, grammar, and vocabulary; comprehensibility was negatively skewed). Anxiety and confidence were similar. Other affect scores tended to be skewed negatively, showing that raters assigned higher scores for these phenomena. All scale categories correlated at a medium to strong level according to Plonsky and Oswald's (2014) benchmarks. These correlations suggest that there was very likely a halo effect, with raters assigning similar scores across all categories, despite the fact that the scales were randomized and presented with random polarity. In terms of severity, the Rasch partial credit model showed that grammar was the most difficult scale, and comprehensibility was the easiest.

I also checked the integrity of the samples. The samples showed desirable Rasch statistics, with no sample showing any misfit. This was positive as it indicated that there were no problematic samples that caused large degrees of inconsistency in the ratings. The samples were selected to represent a wide range of abilities (on the CEFR, A2–C1), and the language proficiency scores the raters awarded were largely consistent with the base proficiency scores awarded by IELTS. The scores on the samples also showed that the raters detected a wide variety of differences amongst the samples in affect measures. Curiously, as noted with the correlations above, these affect scores showed a linear tendency with ability level, especially with categories such as confidence. Other categories, such as attention and engagement, showed much less of a linear relationship with base proficiency.

The raters also largely showed desirable Rasch fit statistics. As expected, however, their consistency and agreement were limited. Exact agreement was low, and ICCs were largely low or medium. This was anticipated, given the minimal instructions and practice, and the lack of benchmarked samples for rater training. In fact, I considered this degree of agreement positive given that the raters were novices and the scales were simplistic. The raters had never rated language categories before, and likely had never explicitly thought about these characteristics when listening to an L2 speaker. The fact that ICCs were near .5 suggests that there was a limited but shared understanding of some of the underlying characteristics of language. Rater training could have boosted these ICCs, but an explicit focus on language would have narrowed the raters’ focus to the linguistic code. This might have removed necessary score variance from the use of information in the visual realm, and it would have distanced the participants from real world listeners.

RQ1: What is the relationship between interpersonal affect and language proficiency?

To answer this question, I first reduced the dataset using exploratory factor analysis. This resulted in four factors, which I named language, assuredness, involvement, and positivity. Assuredness was a combination of confidence and low anxiety, while involvement was a measure that represented engagement, attention, and interaction. Positivity represented positive affective measures of warmth, happiness, expressiveness, and positive attitude. Each of these factors was found to correlate with language proficiency measures (polychoric correlations .43–.72).
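The factor reduction described above, and the regression models discussed next, might be specified along the following lines. This is a minimal sketch in R using the psych and ordinal packages, assuming a hypothetical data frame scores that holds the ordinal rating scales as numeric columns (one row per rater-by-sample rating) and a matching long-format data frame dat; the four-factor solution follows the description above, but all object, column, and factor names are illustrative rather than the originals.

    library(psych)
    library(GPArotation)  # needed for the oblique (oblimin) rotation
    library(ordinal)

    # Exploratory factor analysis on polychoric correlations, since the
    # rating scales are ordinal; four factors emerged in the study
    efa <- fa(scores, nfactors = 4, rotate = "oblimin", fm = "ml", cor = "poly")
    print(efa$loadings, cutoff = .30)

    # Factor scores to serve as predictors in the regression models
    # (the factor order and labels here are illustrative)
    fs <- data.frame(factor.scores(scores, efa)$scores)
    names(fs) <- c("language", "assuredness", "involvement", "positivity")
    dat <- cbind(dat, fs)

    # Ordinal mixed-effects model: one language outcome regressed on the
    # affect factors, with crossed random intercepts for raters and samples
    dat$fluency <- factor(dat$fluency, ordered = TRUE)
    m_fluency <- clmm(
      fluency ~ assuredness + involvement + positivity +
        (1 | rater) + (1 | sample),
      data = dat
    )
    summary(m_fluency)

A cumulative link mixed model of this kind respects the ordinal nature of the 7-point scales while the crossed random intercepts absorb rater severity and sample difficulty.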
Using factor scores, I then ran ordinal mixed-effects regression to determine which of these three factors related the most to the four language proficiency measures. By doing so, I provided evidence of the differential impact of these measures on the language outcome variables. Assuredness alone predicted changes in fluency and grammar scores, with a strong effect size for fluency and a smaller effect size for grammar. Assuredness was also one of three significant predictors of vocabulary, with a smaller effect size. These findings indicated that as an individual is perceived as more confident and less anxious, they are also seen as having stronger fluency, grammar, and vocabulary.

This finding is largely consistent with the literature. Anxiety, for example, has frequently been found to relate to lower L2 proficiency or achievement outcomes (Botes et al., 2020; Clément et al., 1980; Clément & Kruidenier, 1985; Dewaele & Alfawzan, 2018; Dewaele & Li, 2022; Dewaele & MacIntyre, 2014; Dewaele et al., 2019; Jiang & Dewaele, 2019; Jin et al., 2017; Li et al., 2020; MacIntyre et al., 1997; Teimouri et al., 2019). Confidence, on the other hand, has been found to predict educational achievement (Ahammer et al., 2019; Cobb-Clark, 2015; Heckman et al., 2006; Judge & Hurst, 2007; Stankov et al., 2012) and L2 achievement outcomes (Doqaruni, 2015; Edwards & Roger, 2015; Labrie & Clément, 1986). Confidence is an affective stance that raters frequently observe and factor into positive evaluations of test takers in the language testing literature (Jenkins & Parra, 2003; Neu, 1990; May, 2009, 2011). Given the close relationship confidence and anxiety have with cognitive, psychological, and personality elements (e.g., Stankov et al., 2012), raters may have drawn on nonverbal cues to extrapolate about the test takers’ underlying cognitive fluency and lexicogrammatical competence through the affect they demonstrated. Seeing an individual as anxious may have led raters to doubt the individual’s level of ability, resulting in lower scores awarded. Likewise, seeing a confident performance could inform raters that the test taker believed in their own abilities, thus leading to higher scores.

There are caveats that should be mentioned, however. Through the stimulated recall in Chapter 7, I found that most raters used a broad definition of fluency (Lennon, 1990), which they often understood as overall language ability (that is to say, “He speaks English fluently” for linguistic laypeople generally indicates that the speaker is perceived as proficient across multiple aspects of language). For this reason, it may be erroneous to think of fluency in this case in the narrow, psycholinguistic sense of an “impression on the listener’s part that the psycholinguistic processes of speech planning and speech production are functioning easily and efficiently” (Lennon, 1990, p. 391), such as discrete forms of utterance fluency (e.g., pauses, repair, articulation speed; Segalowitz, 2010). The fact that assuredness impacted three similar areas of language proficiency is perhaps evidence that confidence, anxiety, and overall communicative ability are closely bound together. It is also important to note that this is not evidence of a causal relationship, as confidence and proficiency likely have a bidirectional relationship (Edwards & Roger, 2015).
Evidence of confidence and low anxiety may lead to an impression of greater proficiency; at the same time, greater proficiency and the ability to handle a communicative event likely lead people to display a greater amount of confidence and ease when speaking.

Involvement, a combined measure of engagement, attention, and interactiveness, was the sole predictor for comprehensibility, with a large effect size. Involvement also predicted vocabulary scores, along with assuredness and positivity, with a smaller effect size. The findings here show that as individuals were seen as more engaged, attentive, and interactive, they were easier to understand and their vocabulary was perceived as stronger. Engagement has been defined as a person’s level of interest and participation in an event (Philp & Duchesne, 2016), which may likewise be perceived as a desire to communicate. In the language testing literature, engagement has been detected by raters through mutual gaze, smiling, head nods, and a forward leaning posture (Ducasse & Brown, 2009; Jenkins & Parra, 2003; May, 2009, 2011; Nakatsuhara et al., 2021a; Neu, 1990; Sato & McNamara, 2019), all of which are also closely related to attention and interaction. Studies in SLA have furthermore linked both subjective observations of engagement (as collaborativeness; Nagle et al., 2022) and objective measurements of head nods (Trofimovich et al., 2021) with comprehensibility. Trofimovich et al. (2021) found that different dimensions of engagement, including social involvement (e.g., using encouraging language) and backchanneling (through nodding), successfully predicted comprehensibility. Nagle et al. (2022) likewise found that comprehensibility was predicted by collaborativeness (social engagement) and low anxiety. The present study supports these findings in that involvement was a strong predictor of comprehensibility. Assuredness, which included measures of low anxiety, was not a predictor of comprehensibility in this model, although the correlations between these measures were positive.

Regarding the association between involvement and vocabulary, no study to date, to my knowledge, has shown a link between these measures. It could be postulated that when individuals are more involved, they show a stronger degree of willingness to communicate. If this link stands, there is some recent literature that supports the connection between involvement and vocabulary, namely a greater amount of productive vocabulary use (Heidari, 2019) and receptive vocabulary knowledge (Şen & Oz, 2021) in learners who are more willing to communicate. An alternative explanation for the impact of involvement could reside within the literature on affective contagion (Elfenbein, 2014; Hatfield et al., 1994). When a rater sees a test taker as involved, it is possible that some of that involvement transfers to the rater, making them feel more invested in paying attention and listening to the test taker. In other words, willingness to communicate may inspire a willingness to listen. Raters who are more willing to listen to a particular speech sample would likewise be more likely to find it easier to understand, because they were already making the effort to do so. This interpretation would have important implications for research on measuring comprehensibility in cases where purely linguistic forms are the object of study but the speaker can be seen.
It would also open the door to more literature on this topic in language testing: the impact of interlocutors or examiners on test takers’ performance outcomes has been studied, but there is a need for work considering the impact of test takers’ behavior and language on rater behavior (Briegel-Jones, 2014; Brown, 2003; Plough & Bogart, 2008).

Regarding positivity (positive affect), the only significant relationship in these models was a negative relationship with vocabulary. This finding is curious, as the correlation between positivity and vocabulary was not negative, but positive, with a medium effect size. I suspect that this negative coefficient is a statistical artifact or the result of a complex relationship with assuredness and involvement, and for that reason I hesitate to interpret this effect as meaningful without more data. Past literature states that positive psychology plays an important role in language learning (MacIntyre et al., 1998; MacIntyre et al., 2019; Oxford, 2016), helping learners overcome anxiety and creating opportunities to learn (MacIntyre & Gregersen, 2012). Several studies have also found links between language achievement and enjoyment, which may be a facet of positive affect (Botes et al., 2020; Dewaele & Alfawzan, 2018; Dewaele & Li, 2022; Dewaele & MacIntyre, 2014; Dewaele et al., 2019; Jiang & Dewaele, 2019; Li et al., 2020). Trofimovich et al. (2021) also showed that positive affect related positively (to a very small degree) with comprehensibility gains in some of the participants, but not all. Chong and Aryadoust (2023), on the other hand, found no evidence of an impact of positive affect on language proficiency measures. Thus, the null findings in Chong and Aryadoust (2023) and the negligible effects in Trofimovich et al. (2021) align with the findings in this study that positivity was not a significant predictor of fluency, grammar, and comprehensibility. It is perhaps the case that positive affect relates to achievement outcomes in classroom-based settings, where teachers and students interact with each other in a much more relaxed environment, but not in a high stakes situation where language is being scrutinized. More research is necessary to determine whether an effect indeed exists.

A key question at this point is whether the findings here are meaningful in a language testing context. If one considers what is construct relevant or irrelevant, it is intuitive to consider positive affect as being construct irrelevant to any measure of language. There are many reasons a highly proficient candidate may exhibit markedly lower positive affect in a testing scenario. For example, if a test taker is somewhat less expressive, it may be because of the stressful testing context. Concentrating on producing accurate language while maintaining rapport with the examiner in a social setting may spike cognitive load, inhibiting the display of particular affective stances. In addition, a broad range of neurodiverse test takers may display differing patterns of affective phenomena or differing patterns of nonverbal behavior, such as gaze differences and repetitive motion in individuals with autism (American Psychiatric Association, 2013). Language ability should not be judged lower in these individuals than in someone who smiles more or maintains more mutual gaze, on that basis alone. Moreover, if positive affect did influence scores, test takers might find it beneficial to train their behavior in order to get a higher score.
It is fortunate, then, that positivity was not a major predictor in any of the models. The construct relevance of assuredness and involvement, however, is more complex. If indeed assuredness is a bidirectional result of language proficiency, and if assuredness is partially a cognitive mechanism, being perceived as more confident or less anxious may in fact reveal something about the person’s ability to speak. I say this with caution, however, as I would not argue the opposite: testing situations are anxiety-producing, and the mere fact of feeling anxious or having lower confidence should not reveal anything about lower language ability.

A similar interpretation is possible for involvement. Engagement, attention, and interactiveness are all critical skills in L2 communicative competence, especially within subdomains such as interactional competence (Galaczi & Taylor, 2018; Plough et al., 2018) or goal-directed communicative effectiveness (Morreale et al., 2013). Being engaged, interactive, and attentive is crucial to effective communication amongst neurotypical individuals, as these stances help people build bridges in interpersonal encounters, especially when problems or breakdowns occur. Displaying this type of affect alone is not enough to overcome very low proficiency, but it can help facilitate intercultural encounters. Appearing withdrawn, disengaged, and inattentive can have an opposite effect, causing communication to deteriorate or fail. Neurodiverse individuals, however, may not exhibit the same range of attention and engagement as neurotypical individuals, especially in a testing context. Neurodiverse test takers, such as those on the autism spectrum, may exhibit full communicative competence within a different set of behavioral repertoires. To my knowledge, it is unknown how raters interpret these stances in L2 settings, and more research is needed in this area. Raters would need to be trained on how to work with these test takers, as particular accommodations may be necessary to ensure a bias-free, equitable speaking test (Randez & Cornell, in press). In any case, raters naturally pick up on both assuredness and involvement, as attested in many studies, and for this reason these stances are likely a natural part of the construct of speaking; however, any operationalizations of these types of affect would have to be carefully implemented, and accommodations for various populations would have to be thoroughly researched.

Study 2: Automated measures of nonverbal behavior

Measurement patterns

In Chapter 6, analyses of the iMotions behavioral output showed substantial variance in the nonverbal behavior of the samples. The samples varied the most in engagement (in the iMotions data, a measure of expressiveness, or amplitude of facial muscle activation), both in the mean and standard deviation values. This showed a spread of different profiles in overall facial movement and how much the facial movement varied within each sample. The variance in this measure indicates that as test takers completed the test, they expressed a range of different visible facial movements, ranging from more neutral to highly active. Note, however, that engagement was not a measure of whether expressions were positive or negative in direction. Measures of valence indicated whether expressions were positive or negative, yet this measure showed very little overall variance.
Most test takers maintained a fairly neutral stance, and the time series graphs showed generally stable patterns for most test takers within each sample. This is not altogether unexpected, however, as it is possible that the institutional nature of the testing context encourages some test takers to manage impressions of how they are perceived (Luk, 2010), which may then restrict strong expressions of positive or negative emotions. Culture may have also played a role in these differences, as it is possible that Chinese individuals may manifest a different set of display rules regarding their emotional expression than their American counterparts (e.g., Ekman & Friesen, 1969; Matsumoto & Hwang, 2012). Nonetheless, there were individuals that displayed visibly positive valence during the tests. Samples 14 and 15, for example, smiled regularly throughout and had high mean valence scores. In contrast, few test takers showed clear negative emotions during their samples. Although the test taker in sample 8 did smile at times, he struggled with the test questions and appeared to express a more worried or concerned look during the test, which resulted in his valence score being the lowest of the group.

In terms of attention, the test takers generally had high mean attention measurements, with most scores near the maximum. Only seven samples showed values less than 90 (of a maximum of 100). This suggests that the test takers in these samples largely directed their gaze and head turns towards the camera. The time course graphs showed that test takers who scored lower on attention generally varied their attention more; in other words, these individuals tended to withdraw and reestablish attention by looking at or turning towards the camera more frequently. For example, samples 20 and 24 broke their attention regularly throughout and scored the lowest on attention. Lower attention with high variance is quite common, however, with speakers in interactional settings. Generally, listeners maintain attention with their interlocutors, and speakers may look back and forth at their listener while speaking, using direct and averted gaze to manage conversational structure (Goodwin, 1980; Rossano, 2012). Thus, lower attention scores are not necessarily a negative characteristic, given that these samples showed the test takers speaking for the majority of the duration.

It is also important to note the differences between these three measures and the similar measures of assuredness, involvement, and positivity as scored by the raters in Chapter 5. The raters’ subjective observations of behavior correlated quite strongly with most other observed measures, including language proficiency. Although the factor analysis was able to identify unique patterns in these measures, there is a strong argument to be made that raters exhibited a halo effect across the scales. The correlations between these factor scores and the iMotions variables, however, showed that there were commonalities in what was being measured, but these were somewhat unexpected. Positivity showed a medium correlation with valence, which is logical, but it correlated even more strongly with engagement (facial muscle activation). Given that the positivity factor included the measured variable expressiveness, this may be the reason for the medium correlation with both measures. I consider these correlations evidence that the iMotions data were indeed measuring aspects of nonverbal behavior.
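For concreteness, per-sample indices like those analyzed here can be derived from frame-level output as sketched below. This is a minimal sketch in R, assuming a hypothetical long data frame frames with one row per video frame and columns sample, engagement, valence, and attention; the actual iMotions export format may differ, and the column names are illustrative.

    library(dplyr)

    # Collapse the frame-level time series into one row per sample, with a
    # mean and a standard deviation for each behavioral channel
    behavior_indices <- frames |>
      group_by(sample) |>
      summarise(across(c(engagement, valence, attention),
                       list(mean = ~ mean(.x, na.rm = TRUE),
                             sd  = ~ sd(.x, na.rm = TRUE))))

The means capture each test taker's overall behavioral profile, while the standard deviations capture within-sample variability, which becomes the focus of the exploratory models later in this section.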
The medium rather than strong correlations between the rater-derived factors and the iMotions indices (for example, between positivity and valence) may be a result of differences in what is perceived and what is measured, as there is evidence that externally measured and rater-perceived aspects of behavior do not always align (Gullberg, 1998). Given that the positivity variables (warmth, happiness, attitude) were quite subjective for most raters, a medium correlation with the iMotions variables is positive for this study. I also believe this makes a strong argument for the use of objective measures of behavior here, as these were measured independently of language, unlike the rater variables.

RQ2.1: Do externally measured indices of nonverbal behavior predict language proficiency scores?

H2.1.1: Indices of attention and expressiveness will result in significant but moderate correlations with language ability.

H2.1.2: Higher values of attention and expressiveness will result in significant positive regression coefficients of fixed effects, indicating an overall positive impact on impressions of second language proficiency across ability levels.

There were two broad patterns that characterized the relationships between the iMotions behavioral mean predictors and the language proficiency outcomes. The first involves fluency, vocabulary, and grammar ratings. Each of the sets of predictors behaved quite similarly with these three outcome variables. In each case, the only behavior that correlated significantly with each outcome was valence, with correlations ranging from .07 (grammar) to .12 (fluency). These correlations suggest a positive relationship between overall positive expressions of emotion and how the raters perceived language ability. Correlations with engagement (.02–.04) and attention (-.02 to -.01) were low and non-significant. These findings align somewhat with the literature, as test takers who appear more open, outgoing, and expressive (perhaps through positive expressions) in the testing situation may be seen as more proficient overall (Jenkins & Parra, 2003; May, 2011; Neu, 1990) and display greater fluency (Kim et al., 2023; Tsunemoto et al., 2022), though whether this is a direct reflection of proficiency is a question still up for debate.

There was a second pattern that emerged regarding comprehensibility. Valence and engagement both emerged as significant positive correlates of comprehensibility (.18 and .11, respectively), but attention did not (-.03). These correlations suggest that both the amount of expressiveness and its positive direction related to greater comprehensibility. In other words, more expressive or more positively emotive test takers were perceived as easier to understand. These correlations may also reflect a willingness to listen to the samples on the part of the raters, as positive affect may have encouraged raters to pay more attention to what the test takers were saying. In both of these cases, it is possible that the raters saw positive valence and expressiveness as a sign of confidence or even task engagement, of which positive affect has been theorized as a component (Philp & Duchesne, 2016; Trofimovich et al., 2021). This engagement may have led the raters themselves to be more engaged, thus leading to higher comprehensibility scores. The fact that attention did not correlate with any of the proficiency outcomes (-.03 to -.01) was unexpected.
Studies have repeatedly found that mutual gaze relates to positive impressions of test takers, while averted gaze exerts a negative influence (Choi, 2022; Ducasse & Brown, 2009; Jenkins & Parra, 2003; May, 2009, 2011; Nakatsuhara et al., 2021a; Sato & McNamara, 2019). I can hypothesize two possible reasons for this lack of correlation in this study. First, the iMotions index of attention is not restricted to eye gaze behavior alone, as it also includes head turns. Thus, this measure did not capture gaze alone, which might have been a more informative index of attention in this case. Second, it is possible that mean attention is not informative as a measurement in itself. Rather, the frequency of changes in gaze patterns, or the way test takers break and reestablish gaze to convey interactional information depending on the underlying social context of interaction, may instead be more informative for raters. Both of these issues will be explored later in this discussion.

The correlational analysis thus partly confirmed hypothesis 2.1.1. Expressiveness (iMotions engagement) correlated with comprehensibility, as hypothesized, but the correlation was quite small (.11). Correlations between attention and all outcomes (-.03 to -.01), on the other hand, were non-significant, negating this aspect of the hypothesis. Not hypothesized was the role that valence would play in the correlations; valence exerted the strongest effect on all measures (.07–.18), with comprehensibility being the highest (.18). In the regression analyses, however, the main effects of the mean behavioral indices did not emerge as significant predictors of the outcomes. This indicates that despite positive correlations, the main effects did not predict language proficiency outcomes across ability levels. Only interaction terms in the interaction models showed significance. Thus, hypothesis 2.1.2, that attention and expressiveness would predict proficiency outcomes across all test takers, was refuted. This finding did not support studies such as Kim et al. (2023) and Tsunemoto et al. (2022), in which a greater number of smiles (positive valence) and eyebrow behavior (expressiveness) enhanced fluency scores, and averted gaze positively predicted comprehensibility.

RQ2.2: Do nonverbal behaviors impact outcomes differentially depending on the base proficiency levels of test takers?

H2.2: Significant interaction coefficients of base language proficiency with attention and base proficiency with expressiveness will indicate that the effect of nonverbal behavior on rated outcomes depends on the base proficiency of the test taker.

For fluency, grammar, and vocabulary, the best fitting models were the ones with interaction terms with base proficiency (the scaled IELTS scores) included. None of the interaction terms, however, emerged as significant; only the main effect of base proficiency was significant. The fact that base proficiency predicted the final outcomes as a main effect is positive, as it provides evidence that the raters in this study awarded scores largely in line with those of the initial proficiency assessment. This suggests that even though the raters were novice and untrained, they shared a common understanding of language proficiency with the underlying construct of the test.
The non-significant interaction coefficients indicated that for fluency, grammar, and vocabulary, even though valence correlated with the outcome, it did not emerge as a significant predictor of the raters’ scores, and test takers with varying proficiency levels were not impacted differentially. This finding is somewhat surprising, as the literature suggests that less proficient test takers may benefit from being more expressive or positive than stronger test takers (Jenkins & Parra, 2003). It may be the case, though, that the effects of overall behavior on judgements of language proficiency are generally so small that they are rendered non-significant by the overall effect of language.

Comprehensibility showed a different trend, however. The best fitting model was again the interaction model, but in contrast with the models of fluency, vocabulary, and grammar, there was a significant interaction between base proficiency and valence. In order to explore this relationship graphically and also through statistical approaches, I dichotomized base proficiency into low and high groups by using a B2 cutoff score. Results showed that lower proficiency (< B2) test takers benefitted from positive valence in their comprehensibility ratings, and there was little change in the higher proficiency group (≥ B2). This aligns well with Jenkins and Parra (2003), whose raters found the test takers easier to communicate with when they were more generally expressive, perhaps through positive affect. This ease of understanding in the original study may have caused raters to have a more positive impression of the test takers’ overall communicative ability, thus raising scores. These findings may also align somewhat with Trofimovich et al. (2021), who found a partial positive correlation of .12 between positive affective behavior and comprehensibility in one group of dyads. Somewhat more difficult to explain is the apparent negative impact of higher valence on the more proficient test takers’ comprehensibility ratings. However, the range of valence measures in the more proficient group was markedly more restricted, which may have attenuated correlations or even muddled inferences as a statistical artifact. It is likely the case that more proficient speakers are perceived as more comprehensible regardless of their behavior, and it is unlikely that increased valence would relate to a more proficient person being less comprehensible. This can only be tested if a greater range of behaviors is measured in the two groups.

Relationship between variance in behavior and language proficiency

After drafting and preregistering the study, but prior to running the analyses (Burton, 2021b), I decided to analyze the previous research questions with predictors of standard deviations of the iMotions variables. I wanted to test whether behavioral variance instead of mean values could exert an effect on the model. Thus, I ran the same models and analyses with these alternative indices. Similar to the models using mean predictors, the exploratory models using standard deviations also exhibited two patterns, one impacting fluency, vocabulary, and grammar scores, and the other comprehensibility measurements. The fluency, vocabulary, and grammar outcomes were each impacted similarly by the variance of the predictors, with small negative correlations with the standard deviation of valence, small positive correlations with the standard deviation of attention, and non-significant correlations with engagement.
These correlations indicated a relationship between shifting between emotive states (positive, negative, neutral) and lower scores, while shifting attentional patterns corresponded with higher scores. Regression modeling of these effects showed that, again, the model with interaction effects best fit the data. As opposed to the models with mean predictors, however, the interaction term between attention and base proficiency (the scaled IELTS scores) was significant for fluency, vocabulary, and grammar. Only for fluency was the main effect of the standard deviation of attention significant. Other interactions and main effects were not significant. These findings suggest that the impact of varied attention differed according to base proficiency level. In a follow up analysis, I found that less proficient speakers tended to be negatively impacted by variance in attention, while more proficient speakers were positively impacted by this variance.

While this finding may not be immediately intuitive, it does align with findings from past research. Gaze, a major component of attentional focus, is a complex behavior that may change according to various cognitive states, social cues, and pragmatic needs. Shifting between mutual and averted gaze is an important aspect of interactional moves in speech, with speakers showing uptake (that is, integration) of information and initiation of turns with the breaking of gaze, and gaze returning to the interactant when turns are complete (Goodwin, 1980; Rossano, 2012). Speakers may also return gaze to their interactant to create a gaze window, offering listeners the chance to backchannel to show intersubjectivity (Bavelas et al., 2002). Speakers may also break gaze when questions are more difficult (and perhaps not understood) as a way to free up additional cognitive resources (Burton, 2023; Doherty-Sneddon & Phelps, 2005). In language testing contexts, less proficient speakers have been found to use ensembles of behavior (in particular, gestures) that are more frequent, more irrelevant to the content of their utterances, and narrower in range, while more proficient speakers may use more integrated verbal-nonverbal utterances (Gan & Davison, 2011). Irrelevant, non-target-like gaze patterns may also become salient and informative to raters.

Together, the attention data in this study tell essentially the same story. Speakers who varied their attention in a way that was integrated with their utterances were more likely to be perceived as producing more fluid speech with stronger vocabulary and more accurate grammar. Attention, then, added to the overall impression of their language ability, as they were able to use attention as a tool at their disposal to manage the interaction with the examiner. On the other hand, when less proficient speakers varied their attention more, it may have signaled that the test takers were struggling to cope with the topics or questions. These attentional shifts would not have been integrated with speech as with the more proficient group, and the raters may have used these shifts as evidence of weaker language ability. This largely goes against previous findings that averted gaze has a negative impact on raters (Choi, 2022; Ducasse & Brown, 2009; Jenkins & Parra, 2003; May, 2009, 2011; Nakatsuhara et al., 2021a; Sato & McNamara, 2019). These findings suggest that breaks in attention have a far more complex relationship with L2 speaking test scores than hypothesized.
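The interaction models described in this section, whether built on mean indices or their standard deviations, might be specified along these lines. This is a minimal sketch in R with the ordinal package, assuming a hypothetical long-format data frame dat2 in which the per-sample behavioral indices have been merged into the ratings and the outcomes are stored as ordered factors; base_prof (the scaled IELTS score), b2_cutoff, and the other names are illustrative, not the original variables.

    library(ordinal)

    # Interaction model: does the effect of a behavioral index depend on
    # base proficiency? Here, the standard deviation of attention
    # interacting with the scaled IELTS score, as in the models above
    m_flu <- clmm(
      fluency ~ attention_sd * base_prof + valence_sd + engagement_sd +
        (1 | rater) + (1 | sample),
      data = dat2
    )
    summary(m_flu)

    # Follow-up probing of the valence-by-proficiency interaction found
    # for comprehensibility: dichotomize base proficiency at the B2 cutoff
    dat2$prof_group <- ifelse(dat2$base_prof >= b2_cutoff, "high", "low")

Dichotomizing at B2 sacrifices information, but it allows the crossing interaction to be inspected graphically and tested within each proficiency group, which is how the differential effects reported above were probed.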
For comprehensibility, patterns with the standard deviations were markedly different. While attention also showed a small positive correlation (.11), valence did not (-.03). Engagement in this case, as opposed to the previous model, had a small positive correlation (.06). Nonetheless, while the interaction model was again the model that fit the data best, none of the predictors explained the variance in the model. This indicates that varying attention did not have the same impact on comprehensibility as it did on fluency, vocabulary, and grammar, despite the positive correlation with this measure. Given that mean valence impacted comprehensibility but varying attention did not, it may be the case that comprehensibility is a construct more impacted by positive affect and engagement. This would align well with the findings of Nagle et al. (2022), who found that lower anxiety and greater collaborativeness served as predictors of comprehensibility scores. Likewise, Trofimovich et al. (2021) found that behaviors such as nodding, which is evidence of interactional engagement, related to positive outcomes in comprehensibility (r = .03–.34 in contrasting sets of pairs), with some indication that smiles may have played a role as well (r = -.28 to .12 in contrasting sets of pairs). Tsunemoto et al. (2022), on the other hand, found that looking away predicted comprehensibility, which may have been a sign of thinking about content and thus related to engagement, and would appear to align well with the positive correlation above. Likewise, they found that eyebrow movement predicted lower accentedness. Nonetheless, these authors did not provide standardized effect sizes for comparison across studies. Positive affect, task collaborativeness and engagement, and lower anxiety may then stimulate a willingness to listen in the interactant, which would likely cause test takers to be more easily understood.

The findings from study 2 are overall positive. The effects found, though significant, were all quite small and explained very small amounts of variance in the models (2–3%). It is unlikely, then, that attention, valence, or engagement alone would cause large score differences. While many aspects of nonverbal behavior are construct-relevant aspects of communication, behavior probably plays much less of a role in conveying grammar and vocabulary ability, though it is certainly relevant with certain gestural forms that encode, for example, path motion (Kita & Özyürek, 2003; Slobin, 1996, 2006; Talmy, 2000). Growing evidence, especially from the past two studies, shows that nonverbal behavior may be highly useful for raters when understanding aspects of fluency (e.g., sources of breakdowns) or being able to comprehend someone well. The features that play the greatest role in impacting these proficiency outcomes need to be outlined, as any inclusion in a speaking test construct would need evidence that they indeed reveal something about language development. The raters in the stimulated recall study helped to triangulate the findings of the first two studies, providing additional evidence towards this broad goal.

Study 3: Stimulated Verbal Recall

The stimulated recall sessions conducted in Chapter 7 served as the explanatory sequential mixed methods (Creswell & Plano Clark, 2017) component of the dissertation. It was explanatory in that it served to explain further the reasons for score changes in the larger dataset by highlighting raters’ thought processes.
It was sequential because the MFRM analysis served to identify the samples that deviated the most from the base proficiency ratings. Twenty raters took part in this phase of the study, and each rater watched 10 videos that they had previously scored online. These 200 recall sessions revealed patterns related to the two research questions pertaining to this chapter.

In this analysis, I was first interested in validating the inferences from the stimulated recall by investigating whether the raters were aware of the general aims of the study. This was an important point, because some of the survey questions at the end of the rating study contained some, albeit minimal, reference to nonverbal behavior. Also, the raters could have investigated my research interests online, drawing their own conclusions about the research questions underlying this study. Fortunately, the categories the raters commented on did not skew towards nonverbal behavior. Raters commented most frequently on aspects of language, which was anticipated as the rating study primarily focused on language, and the language scales were first in order. After language, the raters focused on aspects of affect. This is not common in past studies on rater cognition (e.g., May, 2011; Sato & McNamara, 2019), but the explicit inclusion of rating categories of affect drove the raters’ focus to this area. Test interaction, including a focus on the content of utterances, was the third most frequent category, followed by a much smaller focus on nonverbal behavior. The fact that nonverbal behavior made up 11% of the comments suggests that it forms an important part of people’s decision-making processes. Similar amounts of focus on the visible realm during rating have also been found in speaking tests by Sato and McNamara (2019) and May (2011). The fact that these studies converge in the percentage of focus on behavior when the target of the recall is language suggests that 1) raters indeed orient to the target construct and focus on language itself, but 2) they reinforce their decisions using all evidence available to them, including the behavior of test takers. I considered the alignment of this percentage with May (2011) and Sato and McNamara (2019) to be evidence that any focus on nonverbal behavior was natural and not influenced by prior knowledge of the study. Furthermore, an analysis of the percentages of coded comments by raters showed idiosyncratic but acceptable trends that largely aligned with the results from the group as a whole. Raters’ reliance on a range of behaviors, including gesture, facial expressions, and other postural and head movements, may be due to an implicit understanding that language proficiency and development can only be determined using multiple lines of verbal and nonverbal evidence (Stam, 2006).

RQ3.1: Which nonverbal behaviors are most salient and informative to raters when scoring?

H3.1: Gaze aversion, eyebrow raises, smiling, head tilts, and inexpressiveness will be mentioned more times by raters, as noted by higher relative frequencies of comments. Gesture and posture will be mentioned fewer times due to the online format of the speech stimuli.

The raters in this study observed a wide range of behaviors across the speaking samples during the stimulated recall sessions. Some behaviors were mentioned quite frequently across multiple individuals, while others were noted only once. The most frequently and extensively discussed behaviors were gaze and mouth movements.
The finding that gaze and mouth movements were most discussed aligns with Lansing and McConkie (2003), who used eye-tracking to identify where participants directed their attention in video-based samples. They found that viewers looked most frequently at the speaker's eyes and shifted to looking at the mouth area when comprehensibility was lower (such as with unclear audio or with L2 speech). This finding also aligns somewhat with those of Batty (2021) and Suvorov (2018), who analyzed the attentional focus of L2 test takers during video-based listening tests. Batty (2021) found that the examinees' focus was "mostly on the face of whomever is speaking, with only small departures from this to directly look at gestures, objects, the setting, and so on. Participants appeared to largely split their time between watching the speaker's eyes or mouth" (p. 527). Likewise, Suvorov (2018) noted that the vast majority of test takers watched the "speaker's mouth, face, head, hands, [and] eyes" (p. 150). The most frequently mentioned behaviors in this study were those of the eyes. Raters in this study used gaze behaviors, such as shifting gaze and averted gaze, to understand test takers' cognitive processes, in particular listening comprehension and thinking. When the examinees' eyes were marked by movement, such as eyes darting around, this was perceived as a sign of struggle either to produce language or to comprehend the test question. This also led to the attributional judgement that the test taker was anxious and unconfident. Averted gaze was also used to understand similar processes. Thus, many of the negative features of averted gaze or shifting direction aligned with past research showing that these behaviors were perceived negatively (Choi, 2022; Ducasse & Brown, 2009; Jenkins & Parra, 2003; May, 2009, 2011; Nakatsuhara et al., 2021a; Sato & McNamara, 2019). Nonetheless, this was not always the case. Averted gaze was sometimes seen as positive if the examinee was perceived as having understood the question and was simply preparing the content of their speech. Thus, gaze away from the rater was often used as critical evidence for fluency judgements, as raters pieced together an understanding of whether the examinee was struggling or preparing utterance content. This finding somewhat aligns with Tsunemoto et al. (2022), who found that averted gaze was a predictor of comprehensibility judgements. Overall, these gaze findings are an important deviation from past research, as they suggest that context mediates the impact of averted gaze. Mutual gaze (when it was not staring), while also possibly used as a marker of comprehension, was often discussed in relation to engagement and confidence. Especially during repair sequences, when test takers maintained mutual gaze with the camera/interlocutor, this behavior led raters to understand that the examinee was making an effort to communicate and show attention, somewhat ameliorating the negative impact of the comprehension breakdowns. Mouth movements were discussed almost entirely within the context of the test taker's affective state, aligning with Batty (2021) and Coniam (2001) in that individuals largely used facial expressions to determine attitudes and affect. The most frequent of these behaviors was smiling, which led to examinees being perceived as happy, positive, warm, and often confident and at ease. Likewise, observations about a lack of smiling or lip biting were often associated with anxiety or being unconfident.
Mouth movements did not appear to relate directly to any judgements of fluency, vocabulary, or grammar, but they did appear to have an impact, or perhaps an indirect impact, on comprehensibility. Apart from one case of an individual who was mumbling, which led a rater to try to read the examinee's lips, raters did not discuss lip reading as a tool to enhance comprehensibility, although this has been documented with L2 speakers listening to others (Hardison, 2018; Inceoglu, 2016; Lansing & McConkie, 2003; Suvorov, 2018). Instead, perceptions of positive affect, often derived from smiling and other behaviors, created a sense of approachability and led raters to want to listen more carefully and comprehend the test taker, thus enhancing the test taker's overall comprehensibility. Paralinguistic features also made up a sizable portion of the comments in this dataset, relating to both fluency and overall affect. Raters frequently discussed filled pauses and slow speed as evidence of lower fluency, both of which are well documented in the literature (Suzuki et al., 2021). In some cases, filled pauses were also mentioned as a technique for pausing to think, or even as evidence of anxiety. However, most comments relating paralinguistic features to affective states dealt with laughing and prosodic features such as tone. Raters frequently interpreted these aspects of paralinguistics in light of positive affect, which likewise added to comprehensibility. Posture was a feature mentioned by most but not all raters, and it rarely coincided with judgements of language proficiency. Posture was largely linked to affect, as past research has attested (Coulson, 2004; Dael et al., 2012). Some postural behaviors, such as rocking, shaking, or leaning back, were seen as evidence of anxiety and a lack of confidence. These postures did not always lead to a negative impression of language ability but were sometimes seen as distracting, and overall negative. Leaning back in at least one case was seen as a sign of confidence and being at ease, as was attested in Neu (1990). Leaning forward, an important postural behavior in this category, was largely seen as positive. This behavior was seen as an attempt to remain engaged with the examiner and to show attentiveness and interactiveness, especially during breakdowns in comprehension. In Jenkins and Parra (2003), leaning forward was perceived similarly as a sign of engagement and listening comprehension. Forward leaning may also indicate rapport with an interlocutor, presence, and involvement, while backward leaning can convey a lack of presence, distance, and detachment (Burgoon et al., 1984; Mehrabian & Williams, 1969). Although movement in general was seen as positive, quick, erratic movements (especially with the hands) were associated with struggling and breakdowns in fluency. A rigid, unmoving posture was generally seen as negative, aligning with Gan and Davison (2011), May (2009, 2011), and Neu (1990). Gestures made up only a small portion of the overall comments. This was at least partly because of the limited range of motion visible in Zoom recordings, as the laptop was placed quite close to the test taker, as seen in the cartoonized images in Chapter 4. Nonetheless, although infrequently mentioned, self-adaptors (e.g., head scratching, adjusting glasses) were highly salient to the raters and were strongly indicative of problems with underlying cognitive fluency (Segalowitz, 2010) or anxiety.
The relationship between self-adaptors and coping mechanisms for stressful situations is attested in the psychology literature (Ekman & Friesen, 1974; Kikuchi & Noriuchi, 2019) and in the literature with L2 speakers (Gregersen, 2005; Lindberg, 2021, 2022). These non-target-like, erratic gestures have also been noted as creating negative impressions on raters in the testing literature (Gan & Davison, 2011; Sato & McNamara, 2019; Thompson, 2016). Other gestures, such as iconic and beat gestures, were infrequent and isolated to one or two samples, but were unanimously seen as positive. Head movements were also mentioned relatively infrequently, but head nods stood out as the most frequently mentioned behavior. These were nearly unanimously seen as positive. Head nods showed raters that test takers were engaged with the test discourse, following the questions asked of them, and showing active listening comprehension. Head nods were then a critical aspect of interactional competence, allowing the test takers to show uptake, hold the floor, and close their turns effectively. The positive impact of nodding is well-attested in the L2 literature (Jenkins & Parra, 2003; May, 2009, 2011; Neu, 1990; Trofimovich et al., 2021), and it is an important mechanism for displaying a range of information (continuers, backchannels, etc.) in online teleconferencing (Mark et al., 2023). The only case in which a head nod was perceived negatively was when it was too fast and showed a degree of anxiety. Eyebrow movements, head tilts, and an overall lack of expressiveness were all mentioned in this dataset, but by fewer than half the raters. A whole range of other small behaviors, such as visible swallowing and shoulder shrugging, were also mentioned, but in isolated cases. These behaviors did not make up a sizeable portion of the comments, and it is difficult to generalize about their function from a small number of comments. Thus, hypothesis H3.1 was only partially supported. Gaze behavior, not just gaze aversion, and mouth movements were indeed the behaviors most frequently commented on. Likewise, as hypothesized, gesture did not feature in a large number of comments, partially due to limited visibility in the online format. However, contrary to the hypothesis, eyebrow movements, head tilts, and inexpressiveness did not feature frequently in the comments. Likewise, posture was hypothesized to be a less commonly commented-on element, but in reality, it was mentioned by most raters. Not hypothesized were the overall importance of paralinguistic features and a general focus on the face.

RQ3.2: Relationship between nonverbal behavior and language proficiency

Nonverbal behaviors in this study were found to exert varying influences on evaluations of speakers. In general, these related to the interpretation of affect, which is in line with past research (Batty, 2021; Coniam, 2001; Kappas et al., 2013; Richmond & McCroskey, 2004; Sato & McNamara, 2019; Singelis, 1994). Raters also used nonverbal behavior to understand semantic aspects of speech (e.g., head nods and representational and beat gestures for emphasis and additional meaning), cognitive aspects (gaze shifting indicating listening comprehension and speech processing), and interactional moves (e.g., mutual gaze, head nods, and paralinguistics to mark turn-taking and holding the floor). Ultimately, these were used together with speech to better understand language proficiency outcomes, but they were not used in isolation.
Raters always made holistic judgements using the entire nonverbal ensemble and speech as evidence. There was never a case, for example, when someone evaluated fluency lower solely because of sustained averted gaze.

Comprehension and adaptability. Various patterns arose in the qualitative data that suggested that certain ensembles of behavior were useful to raters when evaluating proficiency. The most extensive and frequently mentioned of these was a multimodal assessment of listening comprehension. The raters drew primarily on nonverbal behavior, often prior to the answering of a test question, to make inferences about the cognitive processes test takers were undergoing. The reliance on nonverbal cues to convey backchanneling and receipt tokens may be the result of testing in a Zoom-based format, as these exchanges have been shown to elicit far fewer verbal backchannels than face-to-face conversation (Mark et al., 2023). The raters thereafter often used this information in their judgements of overall language ability, and occasionally as it related to vocabulary or fluency when breakdowns occurred. The raters frequently described these judgements as ones of competence, one of the scale categories that closely related to language in Chapter 5. Competence was often described as the ability both to understand and to successfully realize task demands. Neither listening comprehension nor task success is generally an aspect of rating scales (e.g., IELTS), but both regularly appear in the literature connected to raters' views of communicative competence (Brown et al., 2005; Ducasse & Brown, 2009; May, 2011; Orr, 2002; Sato & McNamara, 2019). If raters regularly find that comprehension and task completion are part of the construct of communicative competence, this could indicate areas of construct underrepresentation in speaking tests. This could also indicate that speaking should be treated more as an integrated skill than it has been to date. Given the recent resurgence of research on detecting nonunderstanding in interpersonal L2 encounters, especially from a multimodal perspective (McDonough et al., 2019, 2022b, 2023), there is growing evidence of the visual signature of non-understanding. McDonough et al. (2023), for example, found examples of facial expressions that were largely used to determine affect that aligned with episodes of understanding (e.g., engagement, attention, confidence, relaxation) or nonunderstanding (e.g., inexpressiveness, disinterest, confusion, stress). The raters in their study also discussed specific behaviors that related to these episodes. They found that gaze behavior (e.g., blank stare, looking off into space), eyebrow movements (e.g., raised), posture (e.g., slouching), and the use of self-adaptors were related to nonunderstanding, while mutual gaze and the use of representational gestures related to understanding. These behaviors and affective stances largely align with the behaviors discussed in this dissertation. With careful operationalization, one could imagine richer descriptions of listening comprehension added to rating scales for speaking tests. The raters in this study noted that the impact of comprehension breakdowns was consistently negative. Nonetheless, that impression could be moderated by the test taker's adaptability when managing breakdown sequences. Adaptability was marked by both affective responses and nonverbal behaviors.
In terms of affect, adaptable candidates remained engaged, attentive, confident, and willing to communicate throughout the breakdown sequence. They used multimodal backchannels (e.g., nodding), natural gaze patterns showing attentiveness (either towards the interlocutor or towards thinking about content, but not specifically mutual gaze), and postural stances that conveyed engagement, such as leaning forward. They deployed multimodal resources of interactional competence to engage in active listening and to repair their breakdowns. On the other hand, unadaptable test takers were disengaged, less attentive, less confident, and appeared less willing to communicate. Their behaviors tended to be rather expressionless and distant. Their gaze was removed or shifting, they nodded very little, and they showed more of a slouching posture. They also displayed fewer resources of interactional competence to manage the breakdown sequences, such as merely repeating the trouble source (e.g., "Ceremonies?") or using an open class initiator (e.g., "What?") rather than asking a clarification question (e.g., "What does ceremony mean?"). When raters noted that someone showed a tendency to adapt to breakdowns in comprehension, they were more likely to have an overall positive evaluation of the test taker. When fewer adaptable sequences were shown, the evaluations remained negative. In other words, test takers could overcome the negative impact of breakdowns by deploying a range of behaviors, affect, and strategies to create a more positive impression of their language abilities. This attenuating nature of adaptability could perhaps explain the relatively weaker relationship between repair fluency and perceived L2 fluency when compared with other linguistic aspects of fluency, such as articulation rate and pausing, in recent meta-analyses on fluency (Saito et al., 2018; Suzuki & Kormos, 2020). The adaptability shown by test takers successfully managing breakdown sequences has a relevant connection to Hymes' (1972) concept of ability for use. In Hymes' model, ability for use related to a person's linguistic competences that enabled L2 communication and also to the cognitive, social, and affective capacity to apply their communicative skills to differing contexts. For Hymes, communicative success was then mediated by other psychosocial elements the test taker employed, which would change depending on the characteristics and needs of each situational context. A similar sentiment was echoed by Morrow (1979) regarding interactional success: "The apparently trivial observation that the development of an interaction is unpredictable is in fact extremely significant for the language user. The processing of unpredictable data in real time is a vital aspect of using a language" (p. 16). When the test takers in these samples encountered unfamiliar or otherwise problematic utterances, they were encountering an unpredictable situation. Their adaptability to that unfamiliarity is what enabled more or less successful interactions with the examiners.
Harding (2014) argued that adaptability in the sense of Hymes and Morrow might form a missing core element in models of communicative competence:

The notion of adaptability is the common denominator in a test taker's need to deal with different varieties of English, to use and understand appropriate pragmatics, to cope with the fluid communication practices of digital environments, and to notice and adapt to the formulaic linguistic patterns associated with different domains of language use (and the need to move between these domains with ease). (Harding, 2014, p. 194)

In line with these arguments, the raters showed an implicit understanding of the importance of adapting to unpredictable and new contexts and considered this as evidence of communicative competence when scoring samples with breakdowns. An illustration of how I have theorized the relationship between listening comprehension and adaptability is presented in Figure 8.1. In this model, there are two unseen individuals: a rater and a test taker, or an interlocutor and a speaker. This interaction represents, in essence, the end result of a hypothetical adjacency pair: an interlocutor produces a statement or asks a question, and the test taker or speaker is expected to respond. The model begins once the rater or interlocutor hears the speaker's response. First, the rater forms an initial assessment of listening comprehension using multimodal resources (column 1). The resources listed in the model have been drawn from the literature and the findings from the raters in this study. Simultaneously, raters observe the adaptability of the test taker, with the features of adaptability drawn from the rater reports in this study (column 2). When there are no comprehension difficulties, there is no general impact of adaptability, and the proficiency outcome may be positive if there is a display of linguistic competence in the test sample as well (column 3). When comprehension problems surface, however, adaptability moderates the outcome impression of language proficiency. The outcome can be positive if the test taker displays an adaptable affect using pragmalinguistic tools (elements of language that convey social meaning), verbal and nonverbal interactional skills to manage the conversation, and nonverbal behavior to show an engaged and attentive affect. The outcome is negative when, after the breakdown, no adaptability is present: the test taker does not repair the breakdown and displays behavior showing disengagement.
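This moderation pathway can also be summarized schematically. The sketch below renders the logic of Figure 8.1 as a small Python function; the binary inputs are a simplification introduced here for illustration only, since in the data both comprehension and adaptability were gradient judgements inferred from ensembles of cues.

def rating_impact(breakdown: bool, adaptable: bool, strong_language: bool) -> str:
    """Hypothesized direction of impact on a proficiency rating (cf. Figure 8.1)."""
    if not breakdown:
        # No comprehension trouble: adaptability exerts no general effect,
        # and the outcome rests on the verbal evidence in the sample.
        return "positive" if strong_language else "depends on verbal evidence"
    if adaptable:
        # Engagement, active listening, and repair attenuate the breakdown;
        # the rater combines this with other verbal aspects of language.
        return "attenuated, possibly positive"
    # Unadaptable response: near-unanimous negative impact, regardless of verbal skills.
    return "negative"

print(rating_impact(breakdown=True, adaptable=True, strong_language=False))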
Figure 8.1
Impact of Affect on Assessment of Listening Comprehension

Column 1: Assessment of listening comprehension (breakdown in comprehension). Cues suggesting comprehension: multimodal backchannels; mutual gaze; positive or neutral affect; less frequent or noticeable pausing; use of representational gestures; gaze suggesting thinking patterns. Cues suggesting a breakdown: shifting or averted gaze; gestural holds and self-adaptors; frequent filled pauses; mouth held open, raised or frowning eyebrows; clarification requests.

Column 2: Affective response. Adaptable: demonstrates engagement (e.g., multimodal backchannels, leaning forward) and confidence; shows a desire to communicate; displays interactional competence through active listening and repair. Unadaptable: disengaged (e.g., limited backchanneling, slouching posture) and unconfident; gaze averted or shifting; demonstrates limited desire to communicate; limited interactional competence, with few attempts to repair.

Column 3: Impact on rating. Positive: the rater combines the evidence with other verbal aspects of language; the effect may be attenuated by the severity/frequency of the comprehension breakdown. Negative: near-unanimous negative impact on test scores, regardless of verbal skills.

The raters' intuitions about dealing with unpredictability using adaptable affective stances appear to have some precedent in the literature. The boundary between B1 and B2 on the CEFR is mainly characterized by unpredictable language use situations in which learners are able to communicate effectively (Council of Europe, 2020). The skills involved in being adaptable, which are partly pragmalinguistic in nature and partly nonverbal/affective, may be an implicit signal to raters that a speaker has acquired certain high-level skills with the language to cope with unpredictable situations, even though their core psycholinguistic structural and cognitive competences resulted in breakdowns in communication. Perhaps, then, raters prioritize these pragmatic and affective displays of competence over linguistic displays. Roever (2021) offered some support for this argument:

High-proficiency learners at B2 and above need a strong focus on pragmatics as they have the linguistic tools needed for successful communication in a wide variety of settings, but they do not necessarily know how to deploy these tools for optimum effect… In face-to-face communication, [problems in written communication] can be compensated through tone of voice and facial expression.

If an individual does know how to deploy their pragmatic and affective tools for optimum effect but lacks the linguistic tools in a certain situation, this could perhaps explain the positive effect of adaptability in the model.

Comprehensibility and approachability. Another pattern in the dataset related to evaluations of comprehensibility. The raters in this study drew on a much wider range of speech features than just pronunciation, noting that grammar, vocabulary, and fluency features also impacted the ease of understanding the test takers in the samples. This finding is in line with previous findings on comprehensible speech (Crowther et al., 2022; Saito et al., 2016; Trofimovich & Isaacs, 2012). Also of interest is that the raters often explicitly mentioned that L2 accents did not play a large role in how easy or difficult they found speech to understand, which is also in line with past research (Munro & Derwing, 1995; Trofimovich & Isaacs, 2012).
However, the raters in this dataset also drew extensively on nonverbal behavior and affect when describing speech that was more or less comprehensible. The raters broadly described a positive effect of approachability (conceived as personability and presence) as enhancing comprehensible speech. A lack of approachability often inhibited the raters' ability to understand speech as easily. Approachability included elements that conveyed positive affect, such as smiling, laughing, and overall facial expressiveness. Positive affect helped create a sense of personability and rapport with the raters, which the raters reported as making them want to communicate with the test takers. These positive behaviors provided an important point of connection between the two individuals, even though the samples were recorded and not live. Engagement was also a critical element of approachability, as the constituent behaviors conveyed a sense of presence and a desire to communicate on the part of the test taker. Being attentive by displaying mutual gaze to the examiner during test questions was one behavioral aspect of this, as was using interactional behaviors such as nodding. In these cases, comprehensibility was enhanced even when the raters noticed weaker vocabulary, grammar, fluency, or pronunciation. On the other hand, inexpressiveness in both paralinguistics (e.g., monotone, flat prosody) and embodied behavior (e.g., rigidity), neutral affect, erratic gaze patterns (shifting, gaze darting around), and evidence of anxiety all contributed to lower comprehensibility in the test takers. The dataset suggested that an overall absence of movement resulted in raters feeling detached from such responses. The finding that behaviors influenced comprehensibility aligns with recent research in this area. Nagle et al. (2022), in a study of affect, found that greater collaborativeness (a measure of engagement) and lower anxiety were associated with higher comprehensibility scores. Tsunemoto et al. (2022) considered specific nonverbal behaviors and their impact on comprehensibility, finding that averted gaze was associated with higher comprehensibility scores, which may have been due to the participants engaging in thinking rather than showing a lack of attention. They also found that greater eyebrow expressiveness corresponded with lower accentedness and higher fluency scores. Trofimovich et al. (2021) also found that nods, a nonverbal backchannel associated with engagement, led to higher comprehensibility scores, and positive affect played a minor role in enhancing comprehensibility for a subset of the raters. Finally, Kim et al. (2023) studied the impact of behaviors on fluency ratings (which may be considered a constituent aspect of comprehensibility), finding that smiling and eyebrow movements led to greater fluency scores. As a whole, these elements of engagement and positive affect align with the elements of approachability that the raters in this dissertation study reported.

Overall proficiency and assuredness. The final pattern related to the relationship between assuredness and holistic impressions of language ability. Assuredness was mostly conceived of as confidence, but it was closely related to anxiety as well. The raters' decisions in this area were based both on an overall perception of affect and on affect as perceived through nonverbal behavior.
In particular, confidence was often perceived as the absence of behaviors that showed signs of struggle, such as shifting or averted gaze or apparent searching for linguistic resources. The presence of such behaviors led to direct judgements of low confidence. Inexpressiveness in facial expressions or otherwise non-target-like, "awkward" movements (e.g., fidgeting, self-adaptors) also contributed to these judgements. Averted gaze, if perceived as a sign of thinking about content and preparing an upcoming utterance, was not perceived negatively. Being perceived as confident and at ease was almost always associated with positive impressions of language ability, often described as fluency in the broad sense of the term. Nonetheless, though evidence was limited, it did not appear that being anxious or lacking confidence necessarily caused lower proficiency scores; some raters mentioned that a lack of ability would lead to more anxiety and lower confidence in a testing situation. Thus, raters viewed proficiency and assuredness as closely interrelated aspects of the same phenomenon. However, given that many of the comments regarding assuredness also related to competence, and given that competence was also at least partially related to successful interactions, assuredness may have provided raters with additional evidence of overall successful performance outcomes. Supposing this is the case, these findings suggest that not only is confidence an integral predictor of educational achievement (Ahammer et al., 2019; Cobb-Clark, 2015; Heckman et al., 2006; Judge & Hurst, 2007; Stankov et al., 2012), but the perception of confidence also becomes part of the perception of competence. In relation to overall proficiency, the close relationship between assuredness and overall score impressions relates to theorizations in the L2 literature that confidence and low anxiety associate closely with gains in language acquisition (Botes et al., 2020; Clément et al., 1980; Clément et al., 1994; Clément & Kruidenier, 1985; Dewaele & Alfawzan, 2018; Dewaele & Li, 2022; Dewaele & MacIntyre, 2014; Dewaele et al., 2019; Doqaruni, 2015; Jiang & Dewaele, 2019; Jin et al., 2017; Labrie & Clément, 1986; Li et al., 2020; MacIntyre et al., 1997; Noels & Clément, 1996; Noels et al., 1996; Teimouri et al., 2019).

Triangulated findings

The final stage of explanatory sequential mixed methods designs is to triangulate the findings from the various components (Creswell & Plano Clark, 2017). A triangulation analysis gives information about the overlap of key findings, with more overlap indicating greater confidence in the findings. The three studies brought forward evidence that largely aligned and coalesced into three general findings.

Comprehensibility, engagement, and positive affect

The first triangulated finding was that comprehensibility was judged not only by linguistic elements of language (fluency, vocabulary, and grammar), but also by affective stances that were determined by seeing nonverbal behavior. Study 1 showed that involvement was a key predictor of changes in comprehensibility scores. Involvement was made up of engagement, attention, and interactiveness, and as such can be seen as a measure of presence in the test interaction. Study 2 showed that mean valence predicted comprehensibility scores when base proficiency level was taken into account. Lower-level candidates were more likely to benefit from higher valence than candidates with higher proficiency.
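This conditional pattern implies a valence-by-proficiency interaction term in the score models. As a minimal sketch of that model structure only, the example below fits a mixed-effects regression to simulated data; the variable names, the simulated values, and the random intercept for raters are all assumptions introduced here for illustration and do not reproduce the dissertation's actual specification or estimates.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "valence": rng.normal(0, 1, n),                     # mean facial valence per sample
    "proficiency": rng.choice(["lower", "higher"], n),  # base proficiency group
    "rater": rng.choice([f"R{i:02d}" for i in range(20)], n),
})
# Simulate a steeper valence effect for lower-proficiency candidates.
slope = np.where(df["proficiency"] == "lower", 0.30, 0.05)
df["comprehensibility"] = 5 + slope * df["valence"] + rng.normal(0, 1, n)

# Mixed-effects model: valence-by-proficiency interaction,
# with random intercepts for raters.
model = smf.mixedlm("comprehensibility ~ valence * proficiency",
                    data=df, groups=df["rater"])
print(model.fit().summary())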
The study 2 finding showed that iMotions measurements of positive affect (e.g., smiling) corresponded to greater comprehensibility. Finally, study 3 demonstrated that approachability was a main factor that raters took into account when using evidence to make decisions about comprehensibility. Approachability was made up of elements of engagement, confidence, low anxiety, and positive affect, with constituent behaviors such as smiling, attentive gaze, and a forward-leaning posture playing important roles. This study provides ample support for previous research in the area, in which authors have argued that affect (Nagle et al., 2022; Trofimovich et al., 2021) as well as discrete behaviors (Tsunemoto et al., 2022) add to a greater ease of understanding L2 speakers. Behaviors showing engagement (displaying attention, leaning forward, smiling) were consistently elements that the raters in this dissertation used when rating and mentioned as making speech easier to follow. Although positive affect was not a key predictor of comprehensibility in the first study, it was in the second and third, showing that a friendly, personable demeanor was important in facilitating communication, especially for speakers with lower proficiency. Positive psychology would support these findings, given that positive emotions have a broadening and building effect, showing that happy and warm individuals are interested and have a desire to get involved (Fredrickson, 2001, 2003; Seligman, 2011). This was not always fully supported in the literature (Chong & Aryadoust, 2023; Trofimovich et al., 2021), where positive affect had a somewhat inconclusive effect on language outcomes. Indeed, as noted by the extremely small R2 value in Chapter 6, it may be the case that positive affect has an impact, but a quite small one. Nevertheless, the evidence suggested that both engagement and positive affect quite possibly allowed test takers to establish rapport with raters, even through remote recordings, which may have affected the raters through affective contagion (Elfenbein, 2014; Hatfield et al., 1994), leading the raters to correspondingly show engagement or positive affect. Having "contracted" the test takers' positive affect, the raters may have then wanted to listen and communicate with the test takers. With greater attention being paid to the test takers, the resulting speech would likely have been understood more easily, resulting in a higher comprehensibility score. A question that remains is whether the narrative from positive psychology translates across cultural boundaries, and whether the findings from these raters would be generalizable to other cultural contexts. All the raters in this study were American by birth, and thus shared similar (though certainly not the same) cultural backgrounds. Wierzbicka (1994) noted that for middle-class (and generally, white) Americans, interaction may be characterized by "great emphasis being placed on being liked and approved of, on being perceived as friendly and cheerful" (p. 182). Happiness may communicate success and achievement, pride, superiority, and self-esteem to American participants (Kitayama et al., 2006; Shaver et al., 1987; Uchida & Kitayama, 2009). Thus, test takers smiling, laughing, and otherwise being perceived as exhibiting positive affect may have unconsciously communicated to the American raters a certain degree of comfort and success during the speaking test.
While fluency, grammar, and vocabulary had features that stood out and could be detected (relatively) easily, comprehensibility may have served as a catch-all category that indicated overall success in these circumstances, thus being more influenced by the approachability of the test takers. Whether these affective stances would have communicated the same type of information to raters from other backgrounds is thus unknown at this point.

Proficiency outcomes and confidence

Confidence also emerged as a strong predictor of proficiency outcomes, in particular fluency, vocabulary, and grammar, across at least two of the three studies. In study 1, assuredness, a measure of confidence and low anxiety, predicted changes in these three proficiency measures but not in comprehensibility. In study 3, confidence was a frequent topic of the raters' thought processes and related quite closely to perceptions of language ability. Both of these studies suggest a bi-directional relationship for confidence (Edwards & Roger, 2015) in which greater confidence may make someone appear more proficient, and greater proficiency may make someone appear more confident. The raters used confidence as cognitive evidence of ability, as it related to the latent trait of proficiency, and also as a psychological, affective measure, as it conveyed social information during the testing context. The various roles of confidence align with arguments made by Stankov et al. (2012) that confidence plays various cognitive, psychological, and social roles in interaction. In study 2, proficiency outcomes were positively predicted by variance in eye gaze in the group with higher proficiency, while greater variance in gaze was associated with lower scores in the lower proficiency group. In the higher proficiency group, gaze may have appeared more target-like, shifting back and forth naturally, similar to proficient communication between L1 speakers (Goodwin, 1980; Rossano, 2012). More natural shifts in gaze may have conveyed attention and also a sense of confidence to the examiners, as a key function of gaze for the rater group was to deduce confidence and anxiety. Indeed, in Tsunemoto et al. (2022), the frequency counts of averted gaze were associated with greater comprehensibility, which may have also been a measure of confidence in a similar vein. In the lower proficiency group, higher variance in gaze patterns may have been due to shifting gaze during moments of breakdown and struggle (such as in Burton, 2023). As noted in the recalls, shifting gaze was unanimously perceived as negative, as it was seen as a sign of anxiety. It is possible that this finding from study 2 regarding attention shifts indirectly supported the findings about confidence here. The close relationship between assuredness and proficiency outcomes is widely attested in the literature in both L2 studies and studies of general achievement (Ahammer et al., 2019; Botes et al., 2020; Clément & Kruidenier, 1985; Clément et al., 1980; Clément et al., 1994; Cobb-Clark, 2015; Dewaele & Alfawzan, 2018; Dewaele & Li, 2022; Dewaele & MacIntyre, 2014; Dewaele et al., 2019; Doqaruni, 2015; Heckman et al., 2006; Jiang & Dewaele, 2019; Jin et al., 2017; Judge & Hurst, 2007; Labrie & Clément, 1986; Li et al., 2020; MacIntyre et al., 1997; Noels & Clément, 1996; Noels et al., 1996; Stankov et al., 2012; Teimouri et al., 2019). This study thus adds to these findings in that the affective stance of showing confidence and low anxiety can impact how someone is perceived.
This finding may also help explain the results of Jenkins and Parra (2003) and Neu (1990), case studies in which more confident test takers were able to overcome weaker linguistic skills by displaying a greater amount of assuredness.

Listening comprehension and competence

There was also some convergence in the importance of listening comprehension and how it factored into scores. Study 3 found that raters paid close attention to listening comprehension, frequently remarking on the presence of nonunderstanding and repair. They then observed whether test takers were able to repair and reestablish intersubjectivity, or mutual understanding (Burch & Kley, 2020). Being able to realize successful interactions was thus an important criterion for the raters. Raters often discussed comprehension, repair, and success in terms of competence, one of the scales used in the rating study. Interestingly, study 1 reported that competence correlated with all four proficiency outcomes (mean r = .76), an association strong enough to support its inclusion in a language factor along with the four proficiency outcomes. This shows that the raters implicitly viewed listening comprehension as a core aspect of spoken communicative competence, even though it was not included explicitly in the rating scales. A core role for input comprehension is also supported by literature on psycholinguistic models of communicative competence (e.g., de Jong, 2023) and work on speech processing (Levelt, 1993), which may relate quite closely to semantic prediction (Levinson, 2016; Pickering & Garrod, 2013). In studies on the relationship between different constituent elements of fluency (speed, breakdown, and repair; Tavakoli & Skehan, 2005), it is notable that repair fluency rarely correlates highly with fluency measures when taking into account speed and pausing (Saito et al., 2018; Suzuki & Kormos, 2020). If raters truly consider listening comprehension integral to L2 fluency, one would expect repair fluency to form a stronger relationship with fluency measures. One possible explanation for this attenuated relationship may lie in how test takers manage breakdown sequences. Findings from study 3 suggested that more adaptable test takers may leverage their affect and nonverbal behavior to manage repair sequences more effectively than less adaptable candidates. Adaptability would then moderate the relative impact of comprehension breakdowns and repair and could explain why repair fluency may play less of a role in how fluency is perceived.

Overall impact of nonverbal behavior

The three studies showed that nonverbal behavior, whether viewed through its semantic, cognitive, affective, or social roles or as a discrete phenomenon, had an impact on all four proficiency outcomes in different ways. What is of interest is the overall size of the impact. In study 1, the relationship between subjective, observed predictors and outcomes explained about 20% of the variance in scores. In study 2, the relationships between externally measured, objective predictors and outcomes were much weaker, explaining around 2% of the variance in scores. In study 3, nonverbal behaviors made up 11% of the total number of comments. The literature shows similar types of relationships. For comments in stimulated recall designs, Sato and McNamara (2019) and May (2011) found a similar number of comments about nonverbal behavior in relation to communicative success and interactional competence, respectively.
Choi (2022), Nakatsuhara et al. (2021a), and Nambiar and Goon (1993) all showed that the visual realm exerted a positive impact on test scores, approximately enough to raise scores by half a band in the context of IELTS. Trofimovich et al. (2021) reported that adding a predictor of nodding resulted in an explanatory gain of about 13% in their model of comprehensibility. Chong and Aryadoust (2023) did not report variance explained, but instead noted that it was too negligible to support claims that the models were meaningful. Unfortunately, Tsunemoto et al. (2022) did not report R2 values or standardized effect sizes in their modeling of behavior. In any case, although these various values are not directly comparable, they appear to converge in that nonverbal behavior is able to explain variance in each of these studies, with score gains being noted by all authors. However, the size of those gains is not always enough to change scores substantially. Regardless, in a high-stakes situation, any variance explained, even if it is 3%, may be enough to shift outcomes for particular test takers: an R2 of .03 corresponds to a correlation of roughly .17, and thus to an expected shift of about a sixth of a score standard deviation, which can be decisive near a cut score. This alone is reason to investigate the phenomenon, consider construct revision, and make efforts in rater training. I now turn to issues with the construct in the next section.

A revised model of L2 communication

The findings from this dissertation suggest that our understanding of communicative competence, which informs models of language proficiency, may only partially explain how L2 speakers use language in real world settings. A tiered visualization of these elements was presented in Figure 2.1, organized in layers ranging from those located within the individual at the center to those co-constructed with others in the outer layers. At the core of this model are cognitive elements, which represent the inner workings of language comprehension, processing, prediction, and formulation. Structural elements are the traditional linguistic competences of grammar, vocabulary, and pronunciation. Discursive elements are those cohesive and organizational features that help utterances create complex meaning. Structural and discursive features are learned, and cognitive features represent the automatization of their acquisition. Co-constructed elements are those that the user creates with others depending on situational needs, including strategic, pragmatic, and interactional elements. These are leveraged according to context and allow the speaker to make meaning with others. These competences are situated in and react to social context in order to fulfill communicative aims. Missing, however, are the effects of nonverbal behavior and affect. If positive affect, engagement, and attention can indeed play major roles in meaning making with others, and if constituent nonverbal behavior reveals something about the cognitive, structural, and organizational features of language, a revision is necessary to better encompass these elements. These elements are critical for ability for use, a core aspect of Hymes' (1972) communicative competence, which included these psychosocial aspects of communication. Any model eschewing the visual realm would be logocentric (Mondada, 2016), failing to represent the true complexity of human communication. As reviewed in the literature, nonverbal behavior was described as playing a primarily social-interactional role in prior models of L2 communicative competence.
Canale and Swain (1980) and Canale (1983) described how nonverbal behavior played a compensatory role in strategic competence, and Celce-Murcia (2004) and Galaczi and Taylor (2018) more explicitly included a range of nonverbal behavior in their discussions of interactional competence. However, I reviewed studies from psychology, applied linguistics, and human communication showing that nonverbal behavior can also convey cognitive, semantic, social-interactional, and affective information. L2 communication can combine verbal and nonverbal elements to provide information about the language development of the speaker and their ideas and intentions. Nonverbal and verbal strategies and interactional moves help speakers navigate breakdowns, compensate for gaps in their lexical knowledge, and manage conversation with others. Conveying affective information allows speakers to demonstrate their inner stances, motivations, desires, and feelings about events, which can provide important cues for interactants navigating dynamic and fluid contexts. The deployment of pragmatic competence, interactional competence, strategic competence, and skillful affect is the hallmark of highly competent L2 speakers (Roever, 2021), and when raters view this, they may perceive an individual as stronger than their actual inner psycholinguistic structural and cognitive competences would suggest. Affect and emotion are furthermore co-constructed amongst interlocutors, and the affective behavior of one person can impact the responses of another in different ways. These responses can also color the judgements people make about their interlocutors. In the studies discussed in this dissertation, I have presented additional evidence that affect and nonverbal behavior can further color perceptions of the language abilities of speakers. Perceived assuredness provides clues as to speakers' underlying proficiency (cognitive traits, such as fluency, grammar, and vocabulary). Being perceived as engaged (through nodding and leaning forward, for example) may engage others in closer listening behaviors, thus improving speakers' comprehensibility, perhaps even compensating for developing phonemic control. Eye gaze patterns marked by a greater number of shifts may indicate struggle in lower-proficiency speakers and provide evidence of lower cognitive fluency, while positive behaviors such as smiling aid in comprehensibility. A full range of other behaviors that raters observed provided evidence for other traits relating to language proficiency as well. For this reason, there is strong evidence to revise models of L2 communication to incorporate nonverbal behavior, affect, and the ever-dynamic role of context. Figure 8.2 displays an extension of de Jong's (2023) conceptualization of language proficiency, presented earlier in Figure 2.1, adding an affective dimension that closely interacts with both language abilities and social context.

Figure 8.2
An Extended Model of Language Proficiency

The abilities of individuals to convey affective and emotional meaning in the outside tier of this model are culturally bound and largely innate. As such, I do not classify them as competences. Rather, together with strategic, interactional, and pragmatic competences, they are socially co-constructed aspects of meaning making that contribute to overall messages, may reveal information about a person's language ability, and add to their capacities to communicate.
These external layers allow users to adapt to various social contexts with varying task demands, goals, and purposes. They can mediate language when there are breakdowns in understanding or ability, allowing speakers to succeed even in the face of challenges. Alterations in the valence and strength of affect have been shown to moderate listeners' interpretations of speech, and as such, their inclusion in the model is warranted. My interpretation of the relationship between nonverbal and verbal communication as it relates to the model above is presented in Figure 8.3. Verbal communication generally conveys semantic, ideational meaning, and it is a core component of cognitive and structural aspects of speech. It can, however, also convey meaning about people's affective responses, stances, and orientations. Nonverbal communication is generally non-propositional, and thus the information it conveys is largely not semantic in nature. Instead, nonverbal behavior conveys large amounts of socially oriented information. It conveys affect and orientations to others, as well as aiding in social situations by managing interactions and providing strategic resources for speakers. That being said, nonverbal behavior may be used to infer information about speech fluency and comprehension (cognitive components), as well as lexicosyntactic and phonological information (structural components). The two modes of communication are linked, and meaning is often combined from both modes to convey meaning at all levels. This is conveyed in Figure 8.3 by the overlapping triangles, which show the strength of the associations with each aspect of proficiency.

Figure 8.3
Relationship Between Verbal and Nonverbal Communication

Past models and theorizations of communicative competence have often described target-like "effective" communication, but they have not made a strong case for including aspects of language development in their theorizations. Current models describe competence within the confines of an idealized, often L1 "native" speaker. These can be problematic, as L1-like attainment can be unrealistic for learners and may be an undesired goal, since bi- or multilingualism can be an important part of one's identity (De Costa, 2016; Norton, 2000, 2013; Pavlenko & Norton, 2007). Learners may be competent in a second language and communicate effectively at many different levels of development. What is necessary, then, is a model which is flexible enough to allow descriptions of the cognitive, structural, discursive, and social-interactional patterns of development that have been discussed in second language acquisition research. The visualization presented above, with the integration of psycholinguistic core aspects of proficiency with social-interactional and affective external aspects, provides a useful metaphor for understanding how profiles of language learners with differing learning trajectories interact with social context. These are presented in Figure 8.4 and described below.

Figure 8.4
Profiles of Learners with Different Competences and Skills

In Figure 8.4, hypothetical person A demonstrates an L2 learner with a balanced profile of the various competences and skills. Each tier is approximately the same size, and the ability overall covers nearly all of the needs of the social context. This speaker would be able to fulfill the requirements of social interaction relying on a mix of all of their competences and skills, thus being perceived as an effective speaker.
Person B is also an effective speaker and can fulfill the needs required by the social context. This profile, however, shows a stronger core of psycholinguistic aspects of language learning. This person is able to rely on their knowledge of grammar, vocabulary, and phonological features, because these are automatized and available for use. In this particular social context, the individual does not need to leverage interactional or affective aspects of communication as much to meet the demands of the context. Person C, however, is quite different. They are also able to successfully communicate, but they do so with a much more reduced set of inner linguistic competences for the context. Instead, they are able to leverage their pragmalinguistic tools, strategic competence, interactional abilities, and affective stances to communicate effectively despite their relatively weaker linguistic skills. The differences between individuals B and C may represent why differing profiles of test takers in Jenkins and Parra (2003) and Neu (1990) were perceived differently in their test discourse. Finally, person D has a much more reduced set of competences and skills, and these are not sufficient to meet the needs of the social context. This individual is not able to leverage interactional and affective stances in the same way as person C. In my theorization in this model, none of these individuals has a static set of competences and skills. These change and morph according to the needs of the social context. Thus, these profiles can also be thought of as the same individual in different contexts, as each situation will have a different set of needs that requires experience and learning to accomplish. Some situations, such as telephone conversations, will not accommodate the affordances of affective stances that are primarily visual. Likewise, situations with different levels of difficulty or power dynamics will reshape and modify the competences and skills a speaker will be able to display. This type of metaphor of language ability may also be able to extend to interactions themselves. A foundational argument regarding interaction is that speakers co-construct discourse (Galaczi & Taylor, 2018; Roever & Dai, 2021; Roever & Kasper, 2018; Plough et al., 2018; Young, 2011), with each speaker exerting an influence over the other. Disentangling the language abilities of more than one speaker, especially interactional competence, can therefore prove complex. The above metaphors may offer a useful way to visualize these effects as well. Figure 8.5 shows how interaction-oriented skills may appear in a dyadic encounter. Here the person on the left has a somewhat larger language profile in terms of core elements for the particular social context, while the person on the right has a somewhat smaller profile. These are separate because they reside within each individual. The social-interactional and affective elements of these two individuals, on the other hand, are shared. The dyad co-constructs their encounter, making meaning using their interactional competence and impacting each other through their affective stances and orientations. In this particular example, the interactional and affective domains are approximately equal in size, and together the two individuals cover the needs of the social context. They are thus able to communicate successfully in this social encounter. In other encounters, the interactional and affective contributions from both speakers may be somewhat different in size, but they will still overlap.
They may not cover all of the contextual requirements of the situation, and communication may not be as effective. This metaphor could also be useful for understanding sociocognitive theories of language learning (Lantolf et al., 2020; Vygotsky, 1978). In these theories, a stronger interactant may scaffold language, affect, and elements of the social context to enable less proficient learners to achieve communicative goals that are normally beyond what they can do alone or with a weaker interactant. The stronger interactant can, as one might say, "bring out the best" in the test taker.

Figure 8.5
Language Ability in a Dyadic Encounter

The importance of the socially oriented layers of competence in the discussed models has been attested especially with regard to advanced learners. Roever (2021) made the case earlier that pragmatic knowledge is especially important for learners at and above the B2 level, as the nuances of communication become much more than conveying the meaning of particular words and grammatical structures. Importantly, Roever (2021) also noted that nonverbal affective information can help compensate for any gaps in a speaker's pragmatic knowledge, helping to convey meaning the verbal channel cannot. Lantolf (2006) made a similar argument regarding the unification of language and culture into a broadly termed "languaculture", which becomes increasingly important for learners to tap into as they grow in language proficiency. He used a similar circle metaphor, referring to the outer circle in terms of cultural elements corresponding with language:

Outside of the circle, the domain of languaculture, meaning becomes much more interesting and complex because it entails knowledge of different concepts and how these are encoded in such features as conceptual metaphors, lexical networks, lexicogrammatical structures, schemas and the like that represent different ways of organizing the world and our experiences in it. (Lantolf, 2006, p. 79)

Essentially, Lantolf is describing similar elements to those in the models above, referencing pragmatic competences and interaction. He goes on to emphasize nonverbal behavior as well, referencing Slobin's (1996, 2006) and Talmy's (2000) work on path gestures to highlight how aspects of culture, visible through gestural depictions of path and manner, are critical to conveying meaning yet lie beyond the inner psycholinguistic core of language proficiency. Interaction is likely the primary driver of perceived advanced language proficiency. The fundamental importance of interaction is well attested in the SLA literature (Gass & Mackey, 2020; Long, 1996), as interaction drives acquisitional processes when individuals need to use language in a social context. Interaction is fundamental as well to sociocultural theories of language acquisition (Lantolf et al., 2020; Vygotsky, 1978), as interactants can scaffold the social context, creating opportunities for learners to use language beyond what they would normally be able to produce without support. Interaction creates opportunities for speakers to demonstrate the widest range of their abilities, including pragmatic, strategic, and interactional skills (Roever, 2021). Interaction, when built into language assessment, can also create an environment where learners feel more at ease and demonstrate a wider range of language (Thompson et al., 2016).
As nonverbal behavior is critical to interaction management and to conveying culture (Celce-Murcia, 2007; Galaczi & Taylor, 2018; Lantolf, 2006; Plough et al., 2018), narrowing language performance to audio-only or non-interactive formats would limit the ability of test takers to display their full range of language proficiency. This would further explain the score differences found in Choi (2022), Nakatsuhara et al. (2021a), and Nambiar and Goon (1993), where the visual realm was stripped from audio-only rating, and test takers were not able to show their full L2 proficiency.
Implications
Implications for language testing and SLA research
Researchers have argued for decades that nonverbal behavior and affect may exert an effect on performances or scores in speaking test settings (Harding, 2014; Kellerman, 1992; Pennycook, 1985; Plough, 2021; Plough et al., 2018; Young, 2002). These non-linguistic elements have regularly been the "elephant in the room," as practitioners witness these effects in operational settings, but the score impact has largely been unknown or only posited through small-scale studies. This dissertation has therefore provided some initial evidence of impact across a larger sample of individuals in a language testing context. It has shown that novice raters who are not trained on specific rating scales take nonverbal behavior and the information it conveys into account when formulating impressions of communicative competence, and these impressions become part of score variance, albeit to a somewhat small degree. It thus confirms many of the findings of Gan and Davison (2011), Jenkins and Parra (2003), and Neu (1990) that nonverbal behaviors can have an impact on score outcomes. It also explains some of the variance in test scores due to modality differences (e.g., Choi, 2022; Nakatsuhara et al., 2021a; Nambiar & Goon, 1993). Theory to date has built a solid argument that affect and nonverbal behavior are integral parts of human communication (Hall et al., 2019; Hall & Knapp, 2013; Matsumoto et al., 2016). They have even been speculated to belong to an expanded version of communicative competence (e.g., Canale & Swain, 1980; Hymes, 1972). Test constructs that draw from logocentric theorizations of L2 communication (Mondada, 2016)—those that do not include nonverbal behavior—may suffer from construct underrepresentation (Messick, 1989) by not including critical facets of the test construct that exist in the target language use domain. Indeed, apart from a few accounts (such as the test revision project described in Jenkins & Parra, 2003), large-scale tests (like IELTS or TOEFL) rarely include descriptors of nonverbal behavior beyond paralinguistic cues (e.g., prosody and pauses). In some testing contexts, developers have chosen audio-only recordings as the basis for rating speech, thus entirely removing the visual world from the speaking test. These performances often result in scores that are lower than performances in which the test taker can also be seen (Choi, 2022; Nakatsuhara et al., 2021a; Nambiar & Goon, 1993). Removing the visual world can be problematic if it disadvantages test takers by not accounting for repertoires of communication that go beyond the verbal. One way to strengthen the construct would be to include video recordings of test takers as the basis of rating (or live, face-to-face rating), and to include a broader range of criteria that reflect this expanded construct.
As Plough (2021) argued in regard to the inclusion of nonverbal behavior in test constructs, "we are obligated to create rubrics that, to the extent possible, account for the full range of performance (on which candidates are evaluated)" (p. 62). She went on to issue a word of caution given the challenges of individual variation, idiosyncrasy in interpretations, and the contextual fluidity of nonverbal behavior and the meaning it conveys. More research is needed in this area to determine which aspects of behavior give reliable insight into language ability across cultures and contexts, and whether these are meaningful in terms of second language development. The results of this study have implications for discrete-skills and integrated-skills assessment as well. In the tradition of discrete testing, reading, writing, listening, and speaking are tested separately, and efforts are made to minimize the impact each has on the others. In productive-skills testing, however, input is necessary to elicit language from test takers (e.g., a prompt, a question from an interlocutor), and decoding that input always requires the receptive skills of listening or reading. In the case of speaking, input is generally in the form of some aural stimulus, such as a conversation partner, though it can also be in the form of written instructions, or mixed (multimodal). This format thus requires test takers to both listen and speak, but they are scored only on the basis of their spoken performance. The paradigm of integrated testing, on the other hand, treats skills as interrelated and inseparable. Listening and speaking form part of a unified construct, and scores provide inferences about both skills. The results from this study suggest that raters naturally find listening to be a core part of the speaking construct, aligning with past findings (Brown et al., 2005; Jenkins & Parra, 2003; May, 2011; Orr, 2002; Sato & McNamara, 2019). It may be necessary to represent listening comprehension in rating scales to avoid construct underrepresentation, thus more broadly adopting a form of integrated assessment. Given the wide range of verbal and nonverbal features the raters mentioned, and the prevalence of these in the literature, it may be possible to begin devising more meaningful scale categories for listening comprehension in speaking tests. Other aspects of nonverbal behavior and affect may merit inclusion in other scales as well, particularly fluency. For example, the presence of shifting gaze, self-adaptors, and relative inexpressiveness corresponded frequently with speakers with more limited language skills. A more attentive gaze (not purely defined by mutual or averted gaze), the use of co-speech representational gestures, and a more skillful use of head nods and eyebrow movements corresponded with more proficient speakers, as well as with speakers who conveyed a greater level of comfort and confidence. I do not think these behaviors should be assessed discretely or separately from language subskills (see Jungheim [2001] and Pan [2016] for examples of the discrete assessment of nonverbal behavior that are particularly problematic).
As O'Sullivan (1996) argued after discussing his own efforts to devise such an assessment of nonverbal competence for L2 speakers,
[t]hough the possibility of developing tests which will indirectly test such competence is certainly appealing, it is as inappropriate to separate the non-verbal channel from its natural context of communication as it is to separate the verbal channel. Therefore, in as much as previous tests can be argued to lack validity for ignoring one important aspect of communication, such indirect tests will lack validity for the same reason. (p. 319)
The results in this dissertation align with the above comment in that raters take a holistic view of nonverbal behavior and integrate evidence from verbalizations when making their decisions. Scales that include both verbal and nonverbal behavior may therefore be more informative and useful for raters, providing a broader source of evidence about skill development. Fluency scales, for example, could include information about gaze and gesticulation. Vocabulary or grammar scales that include descriptors pertaining to the use of representational gestures may also be useful for raters when scoring these areas. Comprehensibility scales could include descriptors about behaviors that convey engagement and positive affect. If used, the wording of these descriptors should emphasize the skillful use of these behaviors when conveying meaning effectively together with language elements. Descriptors describing the mere presence or absence of behaviors would poorly represent the construct, as raters do not consider behavior in a binary, yes/no manner. Another important takeaway from this study is the need to strengthen rater training programs with discussions of behavior and affect. Unfortunately, the content of these trainings is rarely discussed in the literature beyond methodological aspects such as the frequency and duration of sessions (e.g., Weigle, 1994, 1998; Yan & Chuang, 2023). It is unknown how large-scale or local language testing organizations address behavior and affect in the context of rating speaking tests. In my own experience having worked as a rater for multiple large-scale testing organizations, I can attest that this type of training was extremely limited or non-existent. What may be needed, then, is ethics and sensitivity training. These sessions could focus on building empathy with test takers, as the testing situation can induce anxiety in many individuals and perhaps change the way these test takers appear visually. Such trainings would be especially useful for working with test takers with varying neurodiverse profiles whose physical behaviors, such as rocking, repetitive movements, or differing patterns of gaze, are symptoms of those profiles (American Psychiatric Association, 2013). It would be critical for raters to understand that individuals may exhibit differing patterns of openness and attention if they are, for example, autistic, and that these behaviors should not impact the rating of their language. This would require testing organizations to recognize the needs of these underserved groups to build equity and reduce bias (Randez & Cornell, in press). Nonetheless, the accommodations needed for many groups of test takers are still an active area of inquiry in the field (Taylor & Banerjee, in press), and much more research is needed to uncover how best to serve these varying groups.
In terms of SLA research, this study provides some evidence that there may be developmental trends in how speakers use gaze and gestural behaviors as they develop in particular areas such as fluency and comprehensibility. As discussed previously, it has provided some confirmatory evidence for studies in this realm such as Kim et al. (2023), McDonough et al. (2019, 2023), Nagle et al. (2022), Trofimovich et al. (2021), and Tsunemoto et al. (2022). Nonetheless, much more research is needed to understand whether these behaviors and the cognitive, social, and affective information they convey can be reliably separated into stages of growth. Given the greater amount of research targeting gestures in SLA and the somewhat contested findings about patterns of gestural development (Aziz & Nicoladis, 2018; Benazzo & Morgenstern, 2014; Gullberg, 1998, 2006, 2012; Krauss & Hadar, 1999; Laurant & Nicoladis, 2015; Nicoladis, 2007; Nicoladis et al., 1999, 2007; Sherman & Nicoladis, 2004), it may very well be the case that hard-coded or linear patterns of development do not exist, much as is the case with many aspects of language itself. In this case, understanding the various ways behaviors and behavioral ensembles are used in context to display effective communication skills may be the best route to describing patterns of development. Also in terms of development, tracking growth in high-level speakers (upper intermediate to advanced, B2–C2 on the CEFR) may not be entirely appropriate using traditional methods in SLA. Currently, language development in SLA studies is often tracked using complexity, accuracy, and fluency (CAF) measures (Ellis, 2003; Ellis & Barkhuizen, 2005; Skehan, 1998). These measures provide discrete snapshots of development during educational interventions so that researchers can see gains in one or all three of the measures, thus justifying certain methods of language learning. They have been used extensively in the field, and they have provided vast insight for understanding growth in lower-proficiency learners. Tracking growth using these measures, however, does not take into account critical aspects of functional adequacy, or whether communication is effective and sufficient to complete certain tasks (Pallotti, 2021). The raters in this dissertation were especially attuned to communicative adequacy, which they often scored as competence. Taking into account features of language that lead to functional adequacy, such as pragmatic competence and interactional competence, may be especially important for more advanced speakers (Roever, 2021). This dissertation has also shown that effective use of affect management (adaptability, approachability) in the face of unpredictability can lead to differences in how learners may be perceived, with more capable learners leveraging multimodal resources to accomplish tasks. If CAF measures fail to show growth in higher-proficiency learners, it may not be due to the intervention but rather to the measures themselves.
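To make the contrast with functional adequacy concrete, the short sketch below computes three common CAF indices from basic counts. The operationalizations (speech rate, clauses per AS-unit, error-free clause ratio) follow typical choices in the SLA literature, but the function, the input values, and the names are illustrative assumptions on my part rather than measures used in this dissertation.

```python
# A minimal sketch of three common CAF (complexity, accuracy, fluency)
# indices; counts and values are illustrative only.

from dataclasses import dataclass


@dataclass
class SpeechSample:
    syllables: int           # total syllables produced
    seconds: float           # total performance time
    as_units: int            # number of AS-units in the sample
    clauses: int             # total clauses
    error_free_clauses: int  # clauses containing no errors


def caf_indices(s: SpeechSample) -> dict:
    return {
        # Fluency: speech rate in syllables per minute
        "speech_rate": s.syllables / (s.seconds / 60),
        # Complexity: mean number of clauses per AS-unit
        "clauses_per_as_unit": s.clauses / s.as_units,
        # Accuracy: proportion of error-free clauses
        "error_free_clause_ratio": s.error_free_clauses / s.clauses,
    }


sample = SpeechSample(syllables=310, seconds=120.0, as_units=14,
                      clauses=25, error_free_clauses=19)
print(caf_indices(sample))
# {'speech_rate': 155.0, 'clauses_per_as_unit': 1.785..., 'error_free_clause_ratio': 0.76}
```

Notably, nothing in these ratios registers whether the speaker actually accomplished the communicative task, which is precisely the functional adequacy gap identified by Pallotti (2021).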
Methodological implications
This study has a number of methodological implications as well. Firstly, to my knowledge, it is the first of its kind to use iMotions facial analysis software to extract measures of nonverbal behavior in a study of L2 communication. Although Chong and Aryadoust (2022) used FaceReader, a very similar application that extracts base emotions, their study did not focus on nonverbal behaviors but rather on how emotional transfer may have impacted test scores. The employment of iMotions in this dissertation was useful because its measures broadly aligned with the findings from the rating scales and the stimulated recalls, showing that cutting-edge facial recognition technology may be used in meaningful ways in L2 research. Although the software is expensive, it dramatically reduces the workload necessary to measure nonverbal behavior. I did not report statistical information from the human-annotated ELAN transcripts in this dissertation, but as a point of contrast, the research assistants took many months to transcribe 30 two-minute speech samples fully, while iMotions took less than an hour. The speed of these tools is paramount for the study of larger samples of data. This comes with an important word of caution. As noted in Chapter 6, the correlations between the iMotions variables and the perceived rating-scale variables were medium to low, even in the case of the very similar measures of positivity and valence. One of the issues in using these algorithms is that developers do not always fully disclose a) how the technology works, b) the specific facial movements it measures in its emotional indices, or c) how accurate the classification system is. It is unknown whether the software can accurately detect facial movements on the faces of individuals from varying cultural and ethnic backgrounds; that is, the demographic information about the individuals the software was trained on is not disclosed. If classification accuracy is low, this will mathematically attenuate any correlations with outcome measures, possibly resulting in skewed results or Type II error. Thus, the use of these systems in L2 research will require more validation work to understand the underlying features being measured and the systems' accuracy in doing so.
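To make the attenuation concern concrete, the classical psychometric correction for attenuation (a standard result, not an analysis from this dissertation) describes how unreliability in either measure caps the correlation one can observe:

$$
r_{xy}^{\mathrm{obs}} = r_{xy}^{\mathrm{true}} \sqrt{r_{xx}\, r_{yy}}
$$

where $r_{xx}$ and $r_{yy}$ are the reliabilities of the two measures. Under purely hypothetical values, if an automated valence index were only 60% reliable and a proficiency rating 80% reliable, a true correlation of .50 would surface as roughly $.50 \times \sqrt{.60 \times .80} \approx .35$, small enough to be dismissed as a weak effect or missed entirely (the Type II error described above).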
On a related topic, one theme that arose from this dissertation was the contrast between perceived and observed measures. Perceived measures in this study were the scores awarded by the undergraduate student raters on the affect rating scales, while observed measures were those annotated by iMotions and the human annotations reported in Chapter 7. Gullberg (1998) found clear differences between perceived and observed gestures and how these related to score differences. In her study, raters' perceptions of gestural use varied from manual annotations of gestures; observation did not always align with perception. Regarding the impact on scores, apart from one category of annotated iconic gestures, only perceived gestures impacted the scores her raters awarded. In this dissertation, Study 1 showed that the perceived measures of affect predicted changes in the four proficiency outcomes with a certain degree of strength. Namely, assuredness predicted changes in fluency, vocabulary, and grammar, while involvement predicted vocabulary and comprehensibility. Study 2, on the other hand, showed that objectively observed behavior extracted through machine learning could also be used to predict outcomes. In this case, variance in attention predicted changes in fluency, vocabulary, and grammar, and overall valence predicted comprehensibility. However, the models with observed variables explained far less variance (2–3%) than those with the perceived measures (15–25%). This aligned with the stronger impact of perceived affect measured in Nagle et al. (2022) and the weaker impact of observed features in Trofimovich et al. (2021), Tsunemoto et al. (2022), and Chong and Aryadoust (2022). To some degree, it appears logical that perceived features would relate more strongly to language outcomes: raters observe both, and there is probably some degree of overlap in what they perceive. As far as I can tell, in all studies that used perceived variables, these were measured simultaneously with language in the same rating session. A methodological implication here is that studies need to be carried out where affect or behavior and language proficiency are measured at different yet counterbalanced moments in time. Only by designing a study in this way can the relative impact be teased apart. Finally, I also believe that this study has shown the value and strength of using emic, rater- and test taker-focused methods when studying nonverbal behavior. An ethnomethodological approach, including stimulated recall and multimodal conversation analysis, "explores the ways social actions are built by the participants, contingent on and indexical for the specifics in any situation" (Kasper & Wagner, 2018, p. 82). Plough (2021) advocated for these approaches in research and test validation studies given that nonverbal behavior "is not a static behavior that can be categorized; rather, it is part of a dynamic interactional process" (p. 62). Multiple examples of these dynamic processes were seen in Chapter 7 on nonverbal behavior and rater cognition. Currently, modeling techniques in statistics may be incapable of capturing such dynamic behavior that shifts according to context. While there is some promise in the use of machine learning and AI to study complex, dynamic phenomena, these require massive datasets that are impractical to build for such quasi-exploratory research. Thus, while the empirical stance I took in Chapter 5 (on affect and language proficiency) and Chapter 6 (on nonverbal behavior and language proficiency) was justified and has indeed shown patterns useful for the study of this phenomenon, these findings would be far less meaningful without the insights from the raters and the transcripts of the speech samples presented in Chapter 7. Given that nonverbal behavior co-occurs with language and conveys cognitive, affective, and social information, the mixed-methods design using an ethnomethodological approach in this dissertation is especially appropriate for studying this phenomenon of L2 use (Hulstijn et al., 2014).
CHAPTER 9: CONCLUSION
This study has presented a comprehensive analysis of the impact of nonverbal behavior and interpersonal affect on L2 proficiency outcomes. It employed a three-tiered, mixed-methods design to triangulate findings, thus leading to more stable inferences. The study has broadly found that nonverbal behaviors and affect can impact proficiency outcomes in different ways. Desirable, communication-forward behaviors such as mutual gaze, nodding, a forward-leaning posture, and representational gestures can convey confidence, engagement, and positive affect, which lead to differential outcomes in ratings of fluency, vocabulary, grammar, and comprehensibility. The variance explained by these phenomena was rather small, as one would expect. However, even a small amount of variance could prove important when a test taker is near a meaningful cut-point on a high-stakes test.
Thus, I have concluded that the results of this study and the broader literature on nonverbal behavior and affect point to a need to revise models of communicative competence, and I have presented one such alternative model. I have argued that the results here could also be operationalized in speaking test constructs, with scores providing a much more valid inference about language ability.
Limitations
There are a number of limitations in this study. For one, any results must be interpreted within the context of the participants: young L1 English speakers in America observing the speech and behavior of L2 English speakers from China. These effects may not be universal or generalizable to other cultural contexts. For example, global contexts that do not place the same cultural capital in appearing happy, positive, and confident as the United States does may not show the same correlations with proficiency judgments or improvements in comprehensibility. Without studies that extend this research to other groups, it is unknown whether these effects generalize more broadly. Likewise, it is difficult to disentangle the effect of the test takers' nationality from the findings, as all test takers were from the same cultural group. More research is needed to understand whether the background of the L2 speakers influenced perceptions of their nonverbal behavior.
A second limitation concerns the speech samples. While the pool of raters was sufficiently large to detect a number of rather small effects, the sample of 30 test takers was small. Although generally representative in terms of the spread of ability levels, this sample lacked extremely weak or strong speakers. The set of samples also may have lacked variance in the expressiveness of nonverbal behavior. Nonetheless, the sample size was both a result of what the test developer could provide and a result of the power analysis. In the future, larger samples with broader ranges of behavior can be used to observe the impact of these behaviors on language ratings. Instead of drawing from testing contexts, which may attenuate the expression of strong emotions, it may be more desirable to have learners produce authentic, real-world recorded language that can then be rated. This would also enhance the ecological validity of the methodological design, relating more strongly to target language use in the real world. Although the test takers had a shared cultural background, the samples varied in both interlocutors and topics. It is known that the verbal and nonverbal behavior of examiners can impact test takers' performances (Briegel-Jones, 2014; Brown, 2003; Plough & Bogart, 2008), so it is unknown to what extent the interlocutors in these samples impacted the test takers' multimodal discourse. Likewise, the tasks varied in these samples, with some topics appearing somewhat more difficult than others. Breakdowns in comprehension did not appear in all samples, and these breakdowns appeared to have an effect on the scores raters awarded. It would be desirable to control for this and only include samples without breakdowns to reduce the variance due to incomprehension. The perceived difficulty of the test tasks (even though these were validated and found to exert no effect on scores in Nakatsuhara et al., 2021a) may have caused raters to give the benefit of the doubt on more difficult tasks, or to focus more on the lack of comprehension rather than on language production.
Each of these factors could then skew the effects of nonverbal behavior on language scores. More comparable samples would be desirable for future research. The design of the rating instrument was also a limiting factor in this study. Having raters judge language and affect simultaneously likely led to a halo effect across rating categories. This certainly appeared to be the case in the correlation tables presented in Chapter 5. A study design where raters scored language on one occasion and affective states on a separate occasion may have resulted in differences in the size of the correlations. However, because the qualitative data triangulated with these findings, I believe the associations were not completely artificial. Another limitation related to the scale was the lack of definitions and of more extensive practice when assigning scores for language. In the stimulated verbal recall sessions, raters quite often applied the broad definition of fluency (Lennon, 1990)—that of overall language ability—when scoring this category. Thus, fluency served as a "catch-all" category that may have consumed interesting variance attributable to the narrow definition of fluency and overlapped excessively with grammar and vocabulary. The strong correlations between these three categories suggest this was the case. Comprehensibility, however, appeared to have variance distinct from that of proficiency. For future research, better benchmarked samples, more extensive practice, and specific definitions of the language categories could offer important insight into any differential effects across these categories. Another limitation of the rating design was the choice to have raters complete the study remotely. This made it impossible for me to control for distractions in their environment and to make sure they paid attention to the screen during the rating. It also made it impossible to ensure that multiple people were not involved in the ratings or that the participants were alert and attentive, among other concerns. However, this methodological decision was partly made because of health concerns at the time the data were collected in early 2022. Although labs were able to open at that time, many students were cautious and avoided physical campus spaces. Our university was still experiencing short, temporary shutdowns at that time, when surges in COVID-19 infections occurred. Hosting the rating instrument online avoided health concerns and made the study accessible to far more participants than would have been possible with an in-person study. I put measures into place to reduce the above limitations as much as possible. I included scales of affect to essentially force raters to watch the video, as emotion and affect are largely detected through nonverbal behavior. I wrote JavaScript code in Qualtrics so that videos would be viewed in a large format, could not be paused, and could only be seen once, again encouraging raters to attend to the videos while they could. Instructions for the study were also complete and repeated throughout (when signing up, when being provided the link, in the study itself, etc.), and thus the participants were well aware of my expectations.
Regarding the indices extracted from iMotions, although the measures used were objective and computer-derived, there is some doubt as to the reliability and veridicality of the measures.
Though studies support the generally high reliability of iMotions for use in the social sciences (e.g., Dupré et al., 2020; Flynn et al., 2020; Kulke et al., 2020; Stöckli et al., 2018), if the training sets did not include individuals from a range of backgrounds, cultures, and contexts, the results may be biased. Likewise, the indices of engagement, valence, and attention, while somewhat documented, are each an amalgamation of various facial features. These clusters may then mask the effects of individual behaviors. It is likely that behaviors that are more salient in an online context, such as gaze aversion and smiling, would have a greater effect than their clustered counterparts. However, this can only be explored with more nuanced correlational studies of discrete behaviors. Finally, there are always limitations to the use of stimulated recall in mixed-methods designs. Although the method purports to look into raters' memories of their cognitive processes, memories decay with time, which may prevent raters from truly accessing them. This was certainly the case with one participant, who, despite providing sufficient recalls, continuously reported that they "didn't remember anything" about their ratings. Observations can also be contaminated by new observations in the second viewing, thus calling into question the veridicality of the reports. However, I implemented a procedure to ensure that raters had seen the videos within 24 hours of the recall session in order to strengthen their memories of rating the samples. I conducted the stimulated recall sessions in person and piloted the instructions and prompts, so raters were aware of the focus on memories during the rating process. As detailed in Chapter 7, I also concluded that the participants had not been exposed to the research questions ahead of time, as they did not focus excessively on the topic of nonverbal behavior during their sessions. Thus, I am reasonably confident that, despite the limitations, the stimulated recall method provided insights that represented the rater participants' true rating processes.
Future research
There are a number of directions for future research. For one, the study design could be extended to work with a larger number of samples from a more diverse background of test takers, as well as a more diverse pool of interlocutors and raters. The current design was restricted to Chinese test takers, mostly British interlocutors, and American raters, and as such the findings do not necessarily generalize to other groups without further analysis. In particular, a study with a larger number of test taker samples would be most beneficial for examining a greater range of performances at a wider range of score levels. Likewise, different modalities, including dyadic tests in a Zoom setting, could be an interesting format for exploring these effects further, as well as their relationships to interactional competence. Another area of future research is to extend the current study with data that I have already collected. The ELAN files I annotated in this study were used to produce illustrative examples for the 10 samples used in the stimulated recall design. However, all 30 of the files were annotated as part of the work with the research assistants I hired. Work from Kim et al. (2023), Trofimovich et al. (2021), and Tsunemoto et al. (2022) could be extended to include these human-annotated statistics in models of the four language outcomes.
This would be similar, for example, to Tsunemoto et al.'s (2022) design, in which the outcome variables were fluency, accentedness, and comprehensibility. In contrast to these studies, however, I have a much richer dataset, including not only frequency counts but also durations, as well as a larger number of rated outcomes. Using a wider range of observed phenomena may reveal contrasting findings. Likewise, using the full dataset of annotated transcripts presents an opportunity to investigate various phenomena occurring alongside talk in interaction. For example, McDonough et al. (2019, 2023) analyzed instances of dyadic communication for various features of nonverbal behavior that correlated with instances of nonunderstanding. I also conducted a small-scale multimodal conversation-analytic study of four sequences of nonunderstanding, finding idiosyncratic patterns of behavior that characterized the onset of repair sequences and their resolution (Burton, 2021a). With a dataset as rich as this one, this phenomenon and others could be investigated in a robust ethnomethodological design, as there are 30 transcripts of individuals with a span of proficiency levels. Another possibility using existing data would be a methodological analysis of the accuracy and interpretability of iMotions for applied linguistics research. The iMotions dataset that was produced ultimately contained 34 variables (the three reported in this study, base emotions, and discrete behavioral indices) and 51 Cartesian coordinate variables that represented points on the face. It may be possible to run correlations or side-by-side analyses of the ELAN data and the iMotions data to support inferences from iMotions. A study of this type could examine the different issues to consider when extracting iMotions variables, including benchmarking and thresholding.
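As a sketch of what such a side-by-side analysis might look like, the code below aligns frame-level iMotions output with a human-coded ELAN tier in common time bins and correlates the two. The file names, column names, and the 100 ms bin size are hypothetical stand-ins, as both tools export tabular data in several configurable formats.

```python
# A sketch of a side-by-side validation of an iMotions index against a
# human-coded ELAN tier; all file and column names are hypothetical.

import pandas as pd
from scipy.stats import pearsonr

# Frame-level iMotions export: one row per video frame, with a timestamp
# in milliseconds and an emotion index such as valence.
imotions = pd.read_csv("imotions_sample01.csv")    # timestamp_ms, valence

# ELAN annotations exported as tab-delimited text: one row per annotation,
# with onset/offset times and a coded value (e.g., smile present = 1).
elan = pd.read_csv("elan_sample01.txt", sep="\t")  # onset_ms, offset_ms, smile

# Pool both streams into common 100 ms bins so they can be compared.
edges = list(range(0, int(imotions["timestamp_ms"].max()) + 200, 100))
imotions["bin"] = pd.cut(imotions["timestamp_ms"], bins=edges,
                         labels=False, include_lowest=True)
valence_binned = imotions.groupby("bin")["valence"].mean()

# Mark each bin as smiling (1) or not (0) according to the ELAN intervals.
smile_binned = pd.Series(0.0, index=valence_binned.index)
for _, row in elan.iterrows():
    first, last = int(row["onset_ms"] // 100), int(row["offset_ms"] // 100)
    smile_binned.loc[first:last] = float(row["smile"])

# Correlate the machine-derived index with the human-coded tier.
r, p = pearsonr(valence_binned, smile_binned)
print(f"valence vs. smile: r = {r:.2f}, p = {p:.3f}")
```

Low agreement at this level would help localize where the automated indices diverge from human judgment, which is precisely the benchmarking and thresholding question raised above.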
In terms of interlocutors, one of the outstanding questions in speaking assessment is the impact that examiners have on the speaking test performances of test takers. Past research has found that examiners can have a substantial impact on the language that test takers produce, resulting in score results that provide conflicting interpretations of their ability (Brown, 2003; see also Thompson et al., 2016). The examiner's affect and nonverbal behavior are also salient to test takers and can impact the test taker's experience, making them feel more relaxed and possibly altering the examiner-candidate power dynamic (Briegel-Jones, 2014; Plough & Bogart, 2008). No research exists, to my knowledge, that has analyzed the score impact of examiners with different affective stances. Through observations of my own dataset, I noticed that when examiners smiled, the test taker often returned the smile. This type of behavior matching, alignment, or perhaps affective contagion has been found in verbal language and hypothesized in L2 nonverbal behavior (Pickering & Garrod, 2004). Recent research has found that when behavioral alignment occurs in L2 speaking dyads, it predicts increases in participant motivation (McDonough et al., 2022a). If this is indeed the case, examiner behavior may impact not only the comfort and power dynamics in an oral proficiency test, but also the performances and resulting scores. By leveraging automated analyses of behavior, this type of study may be more feasible to carry out than before. Finally, there is a range of other ideas to investigate in the future. After more research is conducted into the behaviors that appear at different stages of fluency or vocabulary development, scales could be constructed, trialed, and compared with language-only scales to determine whether adding nonverbal behavior and affect to these scales is meaningful and effective. Some research in this area has already been conducted (e.g., Jenkins & Parra, 2003). Likewise, as detailed in the discussion section, models of L2 communicative competence need revision and extension. Eventually, work to incorporate the topics covered in this dissertation would be valuable for applied linguists and others who use these frameworks when developing assessment instruments.
Final word
This dissertation represents the culmination of a large body of research on an understudied topic within the field of language testing. Though certainly not the first study of its kind, it has added substantially to our understanding of the relationship among nonverbal behavior, affect, and language proficiency. Its limitations are diverse, as is the case with all research, but despite these limitations, the triangulated findings are interpretable and generalizable to at least the populations sampled. Much more needs to be done to extend this work in various directions to confirm the findings and determine which aspects of this area are most applicable. I sincerely hope that my efforts here can have a positive impact on language testing practice. For one, this work can benefit learners by taking into account the much wider realm of visual communication beyond linguistic resources, thus recognizing learners' full repertoires of abilities when communicating in their second language. It can also benefit raters, as without descriptors fully representing targeted test constructs, they may rate using their own set of internal criteria, drawing from the visual realm when it is not represented. Drafting rating scales that better represent the construct would also benefit score users, as these individuals would have a better representation of the test taker's abilities to communicate.
REFERENCES
Affectiva. (n.d.). Affectiva media analytics. https://go.affectiva.com/affdex-for-market-research
Afifi, T. D., & Denes, A. (2013). Feedback processes and physiological responding. In J. A. Hall & M. L. Knapp (Eds.), Nonverbal communication (pp. 333–368). De Gruyter. https://doi.org/10.1515/9783110238150.333
Ahammer, A., Lackner, M., & Voigt, J. (2019). Does confidence enhance performance? Causal evidence from the field. Managerial and Decision Economics, 40(6), 704–717. https://doi.org/10.1002/mde.3038
Alibali, M. W., Kita, S., & Young, A. J. (2000). Gesture and the process of speech production: We think, therefore we gesture. Language and Cognitive Processes, 15(6), 593–613. https://doi.org/10.1080/016909600750040571
Allen, L. Q. (1995). The effect of emblematic gestures on the development and access of mental representations of French expressions. Modern Language Journal, 79(4), 521–529. https://doi.org/10.1111/j.1540-4781.1995.tb05454.x
Ambady, N. (2010). The perils of pondering: Intuition and thin slice judgments. Psychological Inquiry, 21(4), 271–278. https://doi.org/10.1080/1047840X.2010.524882
American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). https://doi.org/10.1176/appi.books.9780890425596
Argyle, M. (1988). Bodily communication (2nd ed.). Methuen.
Argyle, M., & Dean, J. (1965). Eye-contact, distance and affiliation. Sociometry, 28, 289–304.
https://doi.org/10.2307/2786027
Arnold, M. B. (1960). Emotion and personality: Vol. 1. Psychological aspects. Columbia University Press.
Aryadoust, V., Ng, L. Y., & Sayama, H. (2021). A comprehensive review of Rasch measurement in language assessment: Recommendations and guidelines for research. Language Testing, 38(1), 6–40. https://doi.org/10.1177/0265532220927487
Attali, Y. (2016). A comparison of newly-trained and experienced raters on a standardized writing assessment. Language Testing, 33(1), 99–115. https://doi.org/10.1177/0265532215582283
Aziz, J. R., & Nicoladis, E. (2019). "My French is rusty": Proficiency and bilingual gesture use in a majority English community. Bilingualism: Language and Cognition, 22(4), 826–835. https://doi.org/10.1017/S1366728918000639
Bachman, L. F., & Palmer, A. S. (1982). The construct validation of some components of communicative proficiency. TESOL Quarterly, 16, 449–465. https://doi.org/10.2307/3586464
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford University Press.
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford University Press.
Back, M. D., Schmukle, S. C., & Egloff, B. (2011). A closer look at first sight: Social relations lens model analysis of personality and interpersonal attraction at zero acquaintance. European Journal of Personality, 25(3), 225–238. https://doi.org/10.1002/per.790
Baralt, M., Gurzynski-Weiss, L., & Kim, Y. (2016). Engagement with the language: How examining learners' affective and social engagement explains successful learner-generated attention to form. In M. Sato & S. Ballinger (Eds.), Peer interaction and second language learning: Pedagogical potential and research agenda (pp. 209–239). John Benjamins.
Barkaoui, K. (2010). Variability in ESL essay rating processes: The role of the rating scale and rater experience. Language Assessment Quarterly, 7(1), 54–74. https://doi.org/10.1080/15434300903464418
Barnwell, D. (1989). 'Naive' native speakers and judgements of oral proficiency in Spanish. Language Testing, 6(2), 152–163. https://doi.org/10.1177/026553228900600203
Barrett, L. F. (2017). How emotions are made: The secret life of the brain. Macmillan.
Batty, A. O. (2021). An eye-tracking study of attention to visual cues in L2 listening tests. Language Testing, 38(4), 511–535. https://doi.org/10.1177/0265532220951504
Bavelas, J. B., Black, A., Lemery, C. R., & Mullett, J. (1986). "I show how you feel": Motor mimicry as a communicative act. Journal of Personality and Social Psychology, 50(2), 322–329. https://doi.org/10.1037/0022-3514.50.2.322
Bavelas, J. B., Coates, L., & Johnson, T. (2002). Listener responses as a collaborative process: The role of gaze. Journal of Communication, 52(3), 566–580. https://doi.org/10.1111/j.1460-2466.2002.tb02562.x
Beattie, G., & Shovelton, H. (1999). Mapping the range of information contained in the iconic hand gestures that accompany spontaneous speech. Journal of Language and Social Psychology, 18(4), 438–462. https://doi.org/10.1177/0261927X99018004005
Belío-Apaolaza, H. S., & Hernández Muñoz, N. (2021). Emblematic gestures learning in Spanish as L2/FL: Interactions between types of gestures and tasks. Language Teaching Research. Advance online publication. https://doi.org/10.1177/13621688211006880
Beltrán, J. (2016). The effects of visual input on scoring a speaking achievement test. Working Papers in TESOL & Applied Linguistics, 16(2), 1–24.
https://doi.org/10.7916/D8795GKM
Benazzo, S., & Morgenstern, A. (2014). A bilingual child's multimodal path into negation. Gesture, 14(2), 171–202. https://doi.org/10.1075/gest.14.2.03ben
Berry, V. (2007). Personality differences and oral test performance. Peter Lang.
Birdwhistell, R. (1970). Kinesics and context. University of Pennsylvania Press. https://doi.org/10.9783/9780812201284
Blairy, S., Herrera, P., & Hess, U. (1999). Mimicry and the judgement of emotional facial expressions. Journal of Nonverbal Behavior, 23, 5–41. https://doi.org/10.1023/A:1021370825283
Bloom, B. (1953). Thought-processes in lectures and discussions. Journal of General Education, 7(3), 160–169. https://www.jstor.org/stable/27795429
Boiger, M., & Mesquita, B. (2012). The construction of emotion in interactions, relationships, and cultures. Emotion Review, 4(3), 221–229. https://doi.org/10.1177/1754073912439765
Boiten, F. (1996). Autonomic response patterns during voluntary facial action. Psychophysiology, 33(2), 123–131. https://doi.org/10.1111/j.1469-8986.1996.tb02116.x
Bond, T., Yan, Z., & Heene, M. (2020). Applying the Rasch model: Fundamental measurement in the human sciences. Routledge. https://doi.org/10.4324/9780429030499
Borkenau, P., Brecke, S., Möttig, C., & Paelecke, M. (2009). Extraversion is accurately perceived after a 50-ms exposure to a face. Journal of Research in Personality, 43(4), 703–706. https://doi.org/10.1016/j.jrp.2009.03.007
Botes, E., Dewaele, J.-M., & Greiff, S. (2020). The power to improve: Effects of multilingualism and perceived proficiency on enjoyment and anxiety in foreign language learning. European Journal of Applied Linguistics, 8(2), 1–28. http://doi.org/10.1515/eujal-2020-0003
Bourdieu, P. (1977). Outline of a theory of practice. Cambridge University Press. https://doi.org/10.1017/CBO9780511812507
Bourgeois, P., & Hess, U. (2008). The impact of social context on mimicry. Biological Psychology, 77(3), 343–352. https://doi.org/10.1016/j.biopsycho.2007.11.008
Bowles, M. (2018). Introspective verbal reports: Think-alouds and stimulated recall. In A. Phakiti, P. De Costa, L. Plonsky, & S. Starfield (Eds.), The Palgrave handbook of applied linguistics research methodology (pp. 339–357). Palgrave Macmillan. https://doi.org/10.1057/978-1-137-59900-1
Brant, R. (1990). Assessing proportionality in the proportional odds model for ordinal logistic regression. Biometrics, 46(4), 1171–1178. https://doi.org/10.2307/2532457
Briegel-Jones, L. (2014). An investigation into the nonverbal behavior in the oral proficiency interview: Perceptions of interview variability and the impact on candidates [Unpublished MA thesis]. Newcastle University, United Kingdom.
Briner, R. B., & Kiefer, T. (2005). Psychological research into the experience of emotion at work: Definitely older, but are we any wiser? In N. M. Ashkanasy, C. E. J. Hartel, & W. J. Zerbe (Eds.), Research on emotion in organizations: The effect of affect in organizational settings (pp. 281–307). Emerald Group Publishing. https://doi.org/10.1016/S1746-9791(05)01112-0
Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing, 20(1), 1–25. https://doi.org/10.1191/0265532203lt242oa
Brown, A., Iwashita, N., & McNamara, T. (2005). An examination of rater orientations and test-taker performance on English-for-academic-purposes speaking tasks. ETS Research Report Series, 2005(1), i–157. https://doi.org/10.1002/j.2333-8504.2005.tb01982.x
Brown, P., & Levinson, S. C. (1987).
Politeness: Some universals in language usage. Cambridge University Press. https://doi.org/10.1017/CBO9780511813085
Brunswik, E. (1956). Perception and the representative design of psychological experiments. University of California Press. https://doi.org/10.1525/9780520350519
Buck, R. (1984). The communication of emotion. Guilford.
Buck, R., & Powers, S. R. (2006). The biological foundations of social organization: The dynamic emergence of social structure through nonverbal communication. In V. Manusov & M. L. Patterson (Eds.), The Sage handbook of nonverbal communication (pp. 119–138). Sage Publications. https://doi.org/10.4135/9781412976152.n7
Buck, R., & VanLear, C. A. (2002). Verbal and nonverbal communication: Distinguishing symbolic, spontaneous, and pseudo-spontaneous nonverbal behavior. Journal of Communication, 52(3), 522–541. https://doi.org/10.1111/j.1460-2466.2002.tb02560.x
Burch, A. R., & Kley, K. (2020). Assessing interactional competence: The role of intersubjectivity in a paired-speaking task. Papers in Language Testing and Assessment, 9(1), 25–63. http://www.altaanz.org/uploads/5/9/0/8/5908292/2020_9_1__2_burch_kley.pdf
Burgoon, J. K. (1978). A communication model of personal space violations: Explication and an initial test. Human Communication Research, 4(2), 129–142. https://doi.org/10.1111/j.1468-2958.1978.tb00603.x
Burgoon, J. K., Buller, D. B., Hale, J. L., & de Turck, M. A. (1984). Relational messages associated with nonverbal behaviors. Human Communication Research, 10(3), 351–378. https://doi.org/10.1111/j.1468-2958.1984.tb00023.x
Burgoon, J. K., Guerrero, L. K., & Floyd, K. (2016). Nonverbal communication. Routledge. https://doi.org/10.4324/9781315663425
Burton, J. D. (2020). "How scripted is this going to be?" Raters' views of authenticity in speaking-performance tests. Language Assessment Quarterly, 17(3), 244–261. https://doi.org/10.1080/15434303.2020.1754829
Burton, J. D. (2021a). The face of communication breakdown: Multimodal repair in L2 oral proficiency interviews. Papers in Language Testing and Assessment, 10(2), 30–61. http://www.altaanz.org/uploads/5/9/0/8/5908292/3._plta_10_2__burton.pdf
Burton, J. D. (2021b). The impact of nonverbal behavior on second language proficiency [Project]. https://osf.io/u6243
Burton, J. D. (2023). Gazing into cognition: Eye behavior in online L2 speaking tests. Language Assessment Quarterly, 23(2), 190–214. https://doi.org/10.1080/15434303.2022.2143680
Byram, M. (2021). Teaching and assessing intercultural communicative competence: Revisited (2nd ed.). Multilingual Matters. https://doi.org/10.21832/9781800410251
Cacioppo, J. T., Berntson, G. G., Larsen, J. L., Poehlmann, K. M., & Ito, T. A. (2000). The physiology of emotion. In M. Lewis & J. M. Haviland-Jones (Eds.), Handbook of emotions (pp. 173–191). The Guilford Press.
Cadierno, T. (2004). Expressing motion events in a second language: A cognitive typological perspective. In M. Achard & S. Niemeir (Eds.), Cognitive linguistics, second language acquisition, and foreign language teaching (pp. 13–43). De Gruyter. https://doi.org/10.1515/9783110199857.13
Campbell, J., Quincy, C., Osserman, J., & Pedersen, O. (2013). Coding in-depth semi-structured interviews: Problems of unitization and intercoder reliability and agreement. Sociological Methods & Research, 42(3), 294–320. https://doi.org/10.1177/0049124113500475
Canagarajah, S. (2006). Changing communicative needs, revised assessment objectives: Testing English as an international language.
Language Assessment Quarterly, 3(3), 229–242. https://doi.org/10.1207/s15434311laq0303_1
Canale, M. (1983). From communicative competence to communicative performance. In J. C. Richards & R. W. Schmidt (Eds.), Language and communication (pp. 2–27). Longman.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1–47. https://doi.org/10.1093/applin/I.1.1
Cappella, J. N., & Greene, J. O. (1982). A discrepancy-arousal explanation of mutual influence in expressive behavior for adult and infant-adult interaction. Communications Monographs, 49(2), 89–114. https://doi.org/10.1080/03637758209376074
Cassell, J., McNeill, D., & McCullough, K. E. (1999). Speech-gesture mismatches: Evidence for one underlying representation of linguistic and nonlinguistic information. Pragmatics & Cognition, 7(1), 1–33. https://doi.org/10.1075/pc.7.1.03cas
Celce-Murcia, M. (1995). The elaboration of sociolinguistic competence: Implications for teacher education. In J. E. Alatis, C. A. Straehle, & M. Ronkin (Eds.), Linguistics and the education of language teachers: Ethnolinguistic, psycholinguistic, and sociolinguistic aspects (pp. 699–710). Georgetown University Press.
Celce-Murcia, M. (2007). Rethinking the role of communicative competence in language teaching. In E. Alcón Soler & M. P. Safont Jordà (Eds.), Intercultural language use and language learning (pp. 41–57). Springer. https://doi.org/10.1007/978-1-4020-5639-0_3
Celce-Murcia, M., Dörnyei, Z., & Thurrell, S. (1995). A pedagogical framework for communicative competence: A pedagogically motivated model with content. Issues in Applied Linguistics, 6(2), 5–35. https://doi.org/10.5070/L462005216
Chartrand, T. L., & Lakin, J. L. (2013). The antecedents and consequences of human behavioral mimicry. Annual Review of Psychology, 64, 285–308. https://doi.org/10.1146/annurev-psych-113011-143754
Choi, J. S. (2022). Investigating test delivery modes within video-conferenced English speaking proficiency assessment [Unpublished doctoral dissertation]. Michigan State University.
Choi, S., & Bowerman, M. (1991). Learning to express motion events in English and Korean: The influence of language-specific lexicalization patterns. Cognition, 41(1–3), 83–121. https://doi.org/10.1016/0010-0277(91)90033-Z
Choi, S., & Lantolf, J. P. (2008). The representation and embodiment of meaning in L2 communication: Motion events in the speech and gesture of advanced L2 Korean and L2 English speakers. Studies in Second Language Acquisition, 30(2), 191–224. https://doi.org/10.1017/S0272263108080315
Chomsky, N. (1965). Aspects of the theory of syntax. MIT Press. https://doi.org/10.21236/AD0616323
Chong, J. J. Q., & Aryadoust, V. (2022). Investigating the effect of multimodality and sentiments on speaking assessments: A facial emotional analysis. Education and Information Technologies, 28, 7413–7436. https://doi.org/10.1007/s10639-022-11478-7
Cienki, A. J. (2012). Usage events of spoken language and the symbolic units we (may) abstract from them. In J. Badio & K. Kosecki (Eds.), Cognitive processes in language (pp. 149–158). Peter Lang.
Clark, H. H. (1996). Using language. Cambridge University Press. https://doi.org/10.1017/CBO9780511620539
Clark, H. H. (2002). Speaking in time. Speech Communication, 36(1–2), 5–13. https://doi.org/10.1016/S0167-6393(01)00022-X
Clément, R. (1980). Ethnicity, contact and communicative competence in a second language. In H. Giles & W. P.
Robinson (Eds.), Language: Social psychological perspectives (pp. 147–154). Pergamon. https://doi.org/10.1016/B978-0-08-024696-3.50027-2
Clément, R. (1986). Second language proficiency and acculturation: An investigation of the effects of language status and individual characteristics. Journal of Language and Social Psychology, 5(4), 271–290. https://doi.org/10.1177/0261927X8600500403
Clément, R., & Kruidenier, B. (1983). Orientations in second language acquisition: I. The effects of ethnicity, milieu and target language on their emergence. Language Learning, 33(3), 273–291. https://doi.org/10.1111/j.1467-1770.1983.tb00542.x
Clément, R., & Kruidenier, B. G. (1985). Aptitude, attitude and motivation in second language proficiency: A test of Clément's model. Journal of Language and Social Psychology, 4(1), 21–37. https://doi.org/10.1177/0261927X8500400102
Clément, R., Dörnyei, Z., & Noels, K. (1994). Motivation, self-confidence, and group cohesion in the foreign language classroom. Language Learning, 44(3), 417–448. https://doi.org/10.1111/j.1467-1770.1994.tb01113.x
Cobb-Clark, D. A. (2015). Locus of control and the labor market. IZA Journal of Labor Economics, 4, Article 3. https://doi.org/10.1186/s40172-014-0017-x
Cohen, R. L., & Otterbein, N. (1992). The mnemonic effect of speech gestures: Pantomimic and non-pantomimic gestures compared. European Journal of Cognitive Psychology, 4(2), 113–139. https://doi.org/10.1080/09541449208406246
Coniam, D. (2001). The use of audio or video comprehension as an assessment instrument in the certification of English language teachers: A case study. System, 29(1), 1–14. https://doi.org/10.1016/S0346-251X(00)00057-9
Conlan, C. J., Bardsley, W. N., & Martinson, S. H. (1994). Study of intra-rater reliability of assessments of live versus audio-recorded interviews in the IELTS Speaking component [Unpublished study]. International Editing Committee of IELTS.
Cook, S. W., Yip, T. K., & Goldin-Meadow, S. (2012). Gestures, but not meaningless movements, lighten working memory load when explaining math. Language and Cognitive Processes, 27(4), 594–610. https://doi.org/10.1080/01690965.2011.567074
Cope, B., & Kalantzis, M. (2020). Making sense: Reference, agency, and structure in a grammar of multimodal meaning. Cambridge University Press. https://doi.org/10.1017/9781316459645
Corbin, J., & Strauss, A. (2015). Basics of qualitative research: Techniques and procedures for developing grounded theory (4th ed.). Sage.
Coulson, M. (2004). Attributing emotion to static body postures: Recognition accuracy, confusions, and viewpoint dependence. Journal of Nonverbal Behavior, 28, 117–139. https://doi.org/10.1023/B:JONB.0000023655.25550.be
Council of Europe. (2020). Common European framework of reference for languages: Learning, teaching, assessment. Companion volume. Council of Europe. https://rm.coe.int/cefr-companion-volume-with-new-descriptors-2018/1680787989
Cox, T. L., Brown, A. V., & Thompson, G. L. (2022). Temporal fluency and floor/ceiling scoring of intermediate and advanced speech on the ACTFL Spanish Oral Proficiency Interview–computer. Language Testing. Advance online publication. https://doi.org/10.1177/02655322221114614
Creswell, J. W., & Clark, V. L. P. (2017). Designing and conducting mixed methods research. Sage Publications.
Crivelli, C., & Fridlund, A. J. (2018). Facial displays are tools for social influence. Trends in Cognitive Sciences, 22(5), 388–399. https://doi.org/10.1016/j.tics.2018.02.006
Crivelli, C., Jarillo, S., Russell, J.
A., & Fernández-Dols, J. M. (2016). Reading emotions from faces in two indigenous societies. Journal of Experimental Psychology: General, 145(7), 830–843. https://psycnet.apa.org/doi/10.1037/xge0000172
Crowther, D., Holden, D., & Urada, K. (2022). Second language speech comprehensibility. Language Teaching, 55(4), 470–489. https://doi.org/10.1017/S0261444821000537
Cuddy, A. J. C., Fiske, S. T., & Glick, P. (2007). The BIAS map: Behaviors from intergroup affect and stereotypes. Journal of Personality and Social Psychology, 92(4), 631–648. https://doi.org/10.1037/0022-3514.92.4.631
Cuddy, A. J. C., Glick, P., & Beninger, A. (2011). The dynamics of warmth and competence judgments, and their outcomes in organizations. Research in Organizational Behavior, 31, 73–98. https://doi.org/10.1016/j.riob.2011.10.004
Cuddy, A. J., Fiske, S. T., & Glick, P. (2008). Warmth and competence as universal dimensions of social perception: The stereotype content model and the BIAS map. Advances in Experimental Social Psychology, 40, 61–149. https://doi.org/10.1016/S0065-2601(07)00002-0
Cuddy, A. J., Wilmuth, C. A., Yap, A. J., & Carney, D. R. (2015). Preparatory power posing affects nonverbal presence and job interview performance. Journal of Applied Psychology, 100(4), 1286–1295. https://doi.org/10.1037/a0038543
Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7(1), 31–51. https://doi.org/10.1177/026553229000700104
Cutrone, P. (2005). A case study examining backchannels in conversations between Japanese–British dyads. Multilingua, 24(3), 237–274. https://doi.org/10.1515/mult.2005.24.3.237
Dael, N., Mortillaro, M., & Scherer, K. R. (2012). Emotion expression in body action and posture. Emotion, 12(5), 1085–1101. https://doi.org/10.1037/a0025737
Dahl, T. I., & Ludvigsen, S. (2014). How I see what you're saying: The role of gestures in native and foreign language listening comprehension. The Modern Language Journal, 98(3), 813–833. https://doi.org/10.1111/modl.12124
Dai, D. W. (2023). What do second language speakers really need for real-world interaction? A needs analysis of L2 Chinese interactional competence. Language Teaching Research. Advance online publication. https://doi.org/10.1177/13621688221144836
Dao, P., & McDonough, K. (2018). Effect of proficiency on Vietnamese EFL learners' engagement in peer interaction. International Journal of Educational Research, 88, 60–72. https://doi.org/10.1016/j.ijer.2018.01.008
Davies, L. (2009). The influence of interlocutor proficiency in a paired oral assessment. Language Testing, 26(3), 367–396. https://doi.org/10.1177/0265532209104667
De Costa, P. I. (2016). The power of identity and ideology in language learning: Designer immigrants learning English in Singapore. Springer.
de Jong, N. H. (2023). Assessing second language speaking proficiency. Annual Review of Linguistics, 9, 541–560. https://doi.org/10.1146/annurev-linguistics-030521-052114
De Rivera, J., & Grinkis, C. (1986). Emotions as social relationships. Motivation and Emotion, 10, 351–369. https://doi.org/10.1007/BF00992109
De Rivera, J. (1977). A structural theory of emotions. International Universities Press.
DeKeyser, R. (2020). Skill acquisition theory. In B. VanPatten, G. D. Keating, & S. Wulff (Eds.), Theories in second language acquisition (pp. 83–104). Routledge.
DeKeyser, R. M. (1997). Beyond explicit rule learning: Automatizing second language morphosyntax. Studies in Second Language Acquisition, 19(2), 195–221. https://doi.org/10.1017/S0272263197002040
DePaulo, B. M., & Friedman, H. S. (1998). Nonverbal communication. In D. T. Gilbert, S. T. Fiske, & G. Lindzey (Eds.), The handbook of social psychology (4th ed., Vol. 2, pp. 3–40). McGraw-Hill.
Derrida, J. (1967). De la grammatologie. Les Éditions de Minuit.
Dewaele, J.-M. (2010). Multilingualism and affordances: Variation in self-perceived communicative competence and communicative anxiety in French L1, L2, L3 and L4. International Review of Applied Linguistics, 48(2–3), 105–129. https://doi.org/10.1515/iral.2010.006
Dewaele, J.-M., & Alfawzan, M. (2018). Does the effect of enjoyment outweigh that of anxiety in foreign language performance? Studies in Second Language Learning and Teaching, 8(1), 21–45. https://doi.org/10.14746/ssllt.2018.8.1.2
Dewaele, J.-M., & Li, C. (2020). Emotions in second language acquisition: A critical review and research agenda. Foreign Language World, 196(1), 34–49.
Dewaele, J.-M., & Li, C. (2022). Foreign language enjoyment and anxiety: Associations with general and domain specific English achievement. Chinese Journal of Applied Linguistics, 45(1), 23–48. https://doi.org/10.1515/cjal-2022-0104
Dewaele, J.-M., & MacIntyre, P. D. (2014). The two faces of Janus? Anxiety and enjoyment in the foreign language classroom. Studies in Second Language Learning and Teaching, 4(2), 237–274. https://doi.org/10.14746/ssllt.2014.4.2.5
Dewaele, J.-M., Özdemir, C., Karci, D., Uysal, S., Özdemir, E. D., & Balta, N. (2019). How distinctive is the foreign language enjoyment and foreign language classroom anxiety of Kazakh learners of Turkish? Applied Linguistics Review, 13(2), 243–265. https://doi.org/10.1515/applirev-2019-0021
Diener, E., Smith, H., & Fujita, F. (1995). The personality structure of affect. Journal of Personality and Social Psychology, 69(1), 130–141. https://doi.org/10.1037/0022-3514.69.1.130
Dimberg, U., Thunberg, M., & Elmehed, K. (2000). Unconscious facial reactions to emotional facial expressions. Psychological Science, 11(1), 86–89. https://doi.org/10.1111/1467-9280.00221
Dingemanse, M., Roberts, S. G., Baranova, J., Blythe, J., Drew, P., Floyd, S., Gisladottir, R. S., Kendrick, K. H., Levinson, S. C., Manrique, E., Rossi, G., & Enfield, N. J. (2015). Universal principles in the repair of communication problems. PLoS ONE, 10(9), Article e0136100. https://doi.org/10.1371/journal.pone.0136100
Doherty-Sneddon, G., & Phelps, F. G. (2005). Gaze aversion: A response to cognitive or social difficulty? Memory and Cognition, 33(4), 727–733. https://doi.org/10.3758/BF03195338
Doherty-Sneddon, G., Bruce, V., Bonner, L., Longbotham, S., & Doyle, C. (2002). Development of gaze aversion as disengagement from visual information. Developmental Psychology, 38(3), 438–445. https://doi.org/10.1037/0012-1649.38.3.438
Doqaruni, V. (2015). Increasing confidence to decrease reticence: A qualitative action research in second language education. Canadian Journal of Action Research, 16(3), 42–60. https://doi.org/10.33524/cjar.v16i3.227
Douglas, D. (1994). Quantity and quality in speaking test performance. Language Testing, 11(2), 125–144. https://doi.org/10.1177/026553229401100203
Drijvers, L., & Özyürek, A. (2017). Visual context enhanced: The joint contribution of iconic gestures and visible speech to degraded speech comprehension. Journal of Speech, Language, and Hearing Research, 60(1), 212–222. https://doi.org/10.1044/2016_JSLHR-H-16-0101
Ducasse, A. (2010). Interaction in paired oral proficiency assessment in Spanish. Peter Lang. https://doi.org/10.3726/978-3-653-05393-7
Ducasse, A. M., & Brown, A. (2009). Assessing paired orals: Raters’ orientation to interaction. Language Testing, 26(3), 423–443. https://doi.org/10.1177/0265532209104669
Duncan, S., & Fiske, D. W. (2015). Face-to-face interaction: Research, methods, and theory. Routledge. https://doi.org/10.4324/9781315660998
Dupré, D., Krumhuber, E. G., Küster, D., & McKeown, G. J. (2020). A performance comparison of eight commercially available automatic classifiers for facial affect recognition. PLoS ONE, 15(4), Article e0231968. https://doi.org/10.1371/journal.pone.0231968
Educational Testing Service (ETS). (n.d.). TOEFL iBT integrated speaking rubrics. Educational Testing Service. https://www.ets.org/pdfs/toefl/toefl-ibt-speaking-rubrics.pdf
Edwards, E., & Roger, P. S. (2015). Seeking out challenges to develop L2 self-confidence: A language learner's journey to proficiency. TESL-EJ, 18(4), 1–24. http://tesl-ej.org/pdf/ej72/a3.pdf
Egi, T. (2008). Investigating stimulated recall as a cognitive measure: Reactivity and verbal reports in SLA research methodology. Language Awareness, 17(3), 212–228. https://doi.org/10.1080/09658410802146859
Ekman, P. (1972). Universals and cultural differences in facial expressions of emotion. In J. Cole (Ed.), Nebraska symposium on motivation (pp. 207–283). University of Nebraska Press.
Ekman, P., & Friesen, W. V. (1974). Nonverbal behavior and psychopathology. In R. J. Friedman & M. M. Katz (Eds.), The psychology of depression: Contemporary theory and research (pp. 203–232). John Wiley & Sons.
Ekman, P., & Friesen, W. V. (1969). The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1(1), 49–98. https://doi.org/10.1515/9783110880021.57
Ekman, P., & Rosenberg, E. (Eds.). (2005). What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS) (2nd ed.). Oxford University Press. https://doi.org/10.1093/acprof:oso/9780195179644.001.0001
Ekman, P., Friesen, W. V., & Hager, J. C. (2002). Facial action coding system. Research Nexus, Network Research Information.
Ekman, P., Sorenson, E. R., & Friesen, W. V. (1969). Pan-cultural elements in facial displays of emotion. Science, 164(3875), 86–88. https://doi.org/10.1126/science.164.3875.86
Elfenbein, H. A. (2014). The many faces of emotional contagion: An affective process theory of affective linkage. Organizational Psychology Review, 4(4), 326–362. https://doi.org/10.1177/2041386614542889
Elfenbein, H. A., & Ambady, N. (2002). On the universality and cultural specificity of emotion recognition: A meta-analysis. Psychological Bulletin, 128(2), 203–235. https://doi.org/10.1037/0033-2909.128.2.203
Elfenbein, H. A., Beaupré, M., Lévesque, M., & Hess, U. (2007). Toward a dialect theory: Cultural differences in the expression and recognition of posed facial expressions. Emotion, 7(1), 131–146. https://doi.org/10.1037/1528-3542.7.1.131
Ellis, R. (2003). Task-based language learning and teaching. Oxford University Press.
Ellis, R., & Barkhuizen, G. (2005). Analysing learner language. Oxford University Press.
Engle, R. A. (1998). Not channels but composite signals: Speech, gesture, diagrams and object demonstrations are integrated in multimodal explanations. In M. A. Gernsbacher & S. J. Derry (Eds.), Proceedings of the twentieth annual conference of the Cognitive Science Society (pp. 321–326). Routledge. https://doi.org/10.4324/9781315782416-65
Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data. MIT Press. https://doi.org/10.7551/mitpress/5657.001.0001
FaceReader. (n.d.). Noldus FaceReader. https://www.noldus.com/facereader
Fan, J., & Yan, X. (2020). Assessing speaking proficiency: A narrative review of speaking assessment research within the argument-based validation framework. Frontiers in Psychology, 11, Article 330. https://doi.org/10.3389/fpsyg.2020.00330
Firth, A., & Wagner, J. (1997). On discourse, communication, and (some) fundamental concepts in SLA research. Modern Language Journal, 81(3), 285–300. https://doi.org/10.1111/j.1540-4781.2007.00667.x
Fiske, S. T., Cuddy, A. J., & Glick, P. (2007). Universal dimensions of social cognition: Warmth and competence. Trends in Cognitive Sciences, 11(2), 77–83. https://doi.org/10.1016/j.tics.2006.11.005
Floyd, S., Manrique, E., Rossi, G., & Torreira, F. (2016). Timing of visual bodily behavior in repair sequences: Evidence from three languages. Discourse Processes, 53(3), 175–204. https://doi.org/10.1080/0163853X.2014.992680
Flynn, M., Effraimidis, D., Angelopoulou, A., Kapatanios, E., Williams, D., Hemanth, J., & Towell, T. (2020). Assessing the effectiveness of automated emotion recognition in adults and children for clinical investigation. Frontiers in Human Neuroscience, 14, Article 70. https://doi.org/10.3389/fnhum.2020.00070
Fredrickson, B. L. (2001). The role of positive emotions in positive psychology: The broaden-and-build theory of positive emotions. American Psychologist, 56(3), 218–226. https://doi.org/10.1037/0003-066X.56.3.218
Fredrickson, B. L. (2003). The value of positive emotions: The emerging science of positive psychology is coming to understand why it’s good to feel good. American Scientist, 91(4), 330–335. https://doi.org/10.1511/2003.26.330
Frick-Horbury, D., & Guttentag, R. E. (1998). The effects of restricting hand gesture production on lexical retrieval and free recall. The American Journal of Psychology, 111(1), 43–62. https://doi.org/10.2307/1423536
Fridlund, A. J. (1994). Human facial expression: An evolutionary view. Academic Press.
Frijda, N. (2005). Emotion experience. Cognition & Emotion, 19(4), 473–497. https://doi.org/10.1080/02699930441000346
Frijda, N. H. (1986). The emotions. Cambridge University Press.
Frijda, N. H. (1994). Varieties of affect: Emotions and episodes, moods, and sentiments. In R. J. Davidson (Ed.), The nature of emotion: Fundamental questions (pp. 59–67). Oxford University Press.
Frijda, N. H., & Mesquita, B. (1994). The social roles and functions of emotions. In S. Kitayama & H. R. Markus (Eds.), Emotion and culture: Empirical studies of mutual influence (pp. 51–87). American Psychological Association. https://doi.org/10.1037/10152-002
Fulcher, G. (2003). Testing second language speaking. Pearson Longman.
Galaczi, E., & Taylor, L. (2018). Interactional competence: Conceptualisations, operationalisations, and outstanding questions. Language Assessment Quarterly, 15(3), 219–236. https://doi.org/10.1080/15434303.2018.1453816
Gan, Z., & Davison, C. (2011). Gestural behavior in group oral assessment: A case study of higher- and lower-scoring students. International Journal of Applied Linguistics, 21(1), 95–120. https://doi.org/10.1111/j.1473-4192.2010.00264.x
Gardner, R. C., & MacIntyre, P. D. (1993). On the measurement of affective variables in second language learning. Language Learning, 43(2), 157–194. https://doi.org/10.1111/j.1467-1770.1992.tb00714.x
Gass, S. M., & Mackey, A. (2016). Stimulated recall methodology in applied linguistics and L2 research (2nd ed.). Routledge. https://doi.org/10.4324/9781315813349
Gass, S. M., & Mackey, A. (2020). Input, interaction, and output in L2 acquisition. In B. VanPatten, G. D. Keating, & S. Wulff (Eds.), Theories in second language acquisition (pp. 192–222). Routledge. https://doi.org/10.4324/9780429503986-9
Gifford, R. (2013). Personality is encoded in, and decoded from, nonverbal behavior. In J. A. Hall & M. L. Knapp (Eds.), Nonverbal communication (pp. 369–402). De Gruyter Mouton. https://doi.org/10.1515/9783110238150.369
Givens, D. B., & White, J. (2021). The Routledge dictionary of nonverbal communication. Routledge. https://doi.org/10.4324/9780429293665
Gkonou, C., Daubney, M., & Dewaele, J.-M. (Eds.). (2017). New insights into language anxiety: Theory, research, and educational implications. Multilingual Matters. https://doi.org/10.21832/9781783097722
Glenberg, A. M., Schroeder, J. L., & Robertson, D. A. (1998). Averting the gaze disengages the environment and facilitates remembering. Memory & Cognition, 26(4), 651–658. https://doi.org/10.3758/BF03211385
Godfroid, A. (2019). Eye tracking in second language acquisition and bilingualism: A research synthesis and methodological guide. Routledge. https://doi.org/10.4324/9781315775616
Goldin-Meadow, S. (2003). Hearing gesture: How our hands help us think. The Belknap Press. https://doi.org/10.1037/e413812005-377
Goldin-Meadow, S., & Alibali, M. W. (2013). Gesture’s role in speaking, learning, and creating language. Annual Review of Psychology, 64, 257–283. https://doi.org/10.1146/annurev-psych-113011-143802
Goldin-Meadow, S., & Brentari, D. (2017). Gesture, sign, and language: The coming of age of sign language and gesture studies. Behavioral and Brain Sciences, 40, Article E46. https://doi.org/10.1017/S0140525X15001247
Goldin-Meadow, S., Nusbaum, H., Kelly, S. D., & Wagner, S. (2001). Explaining math: Gesturing lightens the load. Psychological Science, 12(6), 516–522. https://doi.org/10.1111/1467-9280.00395
Goldin-Meadow, S., Wein, D., & Chang, C. (1992). Assessing knowledge through gesture: Using children’s hands to read their minds. Cognition and Instruction, 9(3), 201–219. https://doi.org/10.1207/s1532690xci0903_2
Goodwin, C. (1980). Restarts, pauses, and the achievement of a state of mutual gaze at turn-beginning. Sociological Inquiry, 50(3–4), 272–302. https://doi.org/10.1111/j.1475-682X.1980.tb00023.x
Goodwin, C. (2000). Action and embodiment within situated human interaction. Journal of Pragmatics, 32(10), 1489–1522. https://doi.org/10.1016/S0378-2166(99)00096-X
Goodwin, C. (2018). Co-operative action. Cambridge University Press. https://doi.org/10.1017/9781139016735
Gokturk, N., & Chukharev-Hudilainen, E. (2023). Strategy use in a spoken dialog system-delivered paired discussion task: A stimulated recall study. Language Testing. Advance online publication. https://doi.org/10.1177/02655322231152620
Graham, J. A., & Argyle, M. (1975). A cross-cultural study of the communication of extra-verbal meaning by gestures. International Journal of Psychology, 10(1), 57–67. https://doi.org/10.1080/00207597508247319
Graham, J. A., & Heywood, S. (1975). The effects of elimination of hand gestures and of verbal codability on speech performance. European Journal of Social Psychology, 5(2), 189–195. https://doi.org/10.1002/ejsp.2420050204
Gray, H. M., & Ambady, N. (2006). Methods for the study of nonverbal communication. In V. Manusov & M. L. Patterson (Eds.), The Sage handbook of nonverbal communication (pp. 41–58). Sage Publications. https://doi.org/10.4135/9781412976152.n3
Graziano, M., & Gullberg, M. (2018). When speech stops, gesture stops: Evidence from developmental and crosslinguistic comparisons. Frontiers in Psychology, 9, Article 879. https://doi.org/10.3389/fpsyg.2018.00879
Greer, T., & Potter, H. (2008). Turn-taking practices in multi-party EFL oral proficiency tests. Journal of Applied Linguistics, 5(3), 295–318. https://doi.org/10.1558/japl.v5i3.297
Gregersen, T. S. (2005). Nonverbal cues: Clues to the detection of foreign language anxiety. Foreign Language Annals, 38(3), 388–400. https://doi.org/10.1111/j.1944-9720.2005.tb02225.x
Gregersen, T., MacIntyre, P. D., & Meza, M. D. (2014). The motion of emotion: Idiodynamic case studies of learners’ foreign language anxiety. The Modern Language Journal, 98(2), 574–588. https://doi.org/10.1111/modl.12084
Gregersen, T., Olivares-Cuhat, G., & Storm, J. (2009). An examination of L1 and L2 gesture use: What role does proficiency play? Modern Language Journal, 93(2), 195–208. https://doi.org/10.1111/j.1540-4781.2009.00856.x
Groeber, S., & Pochon-Berger, E. (2014). Turns and turn-taking in sign language interaction: A study of turn-final holds. Journal of Pragmatics, 65, 121–136. https://doi.org/10.1016/j.pragma.2013.08.012
Guerin, B. (1986). Mere presence effects in humans: A review. Journal of Experimental Social Psychology, 22(1), 38–77. https://doi.org/10.1016/0022-1031(86)90040-5
Guerrero, L. K., & Wiedmaier, B. (2013). Nonverbal intimacy: Affectionate communication, positive involvement behavior, and flirtation. In J. A. Hall & M. L. Knapp (Eds.), Nonverbal communication (pp. 577–612). De Gruyter. https://doi.org/10.1515/9783110238150.577
Gullberg, M. (1998). Gesture as a communication strategy in second language discourse: A study of learners of French and Swedish. Lund University Press. https://lup.lub.lu.se/search/files/4825091/3912717.pdf
Gullberg, M. (2006). Some reasons for studying gesture and second language acquisition (Hommage à Adam Kendon). IRAL – International Review of Applied Linguistics in Language Teaching, 44(2), 103–124. https://doi.org/10.1515/IRAL.2006.004
Gullberg, M. (2009a). Gestures and the development of semantic representations in first and second language acquisition. Acquisition et interaction en langue étrangère, 28(1), 117–139. https://doi.org/10.4000/aile.4514
Gullberg, M. (2009b). Reconstructing verb meaning in a second language: How English speakers of L2 Dutch talk and gesture about placement. Annual Review of Cognitive Linguistics, 7(1), 221–244. https://doi.org/10.1075/arcl.7.09gul
Gullberg, M. (2012). Gesture analysis in second language acquisition. In C. Chapelle (Ed.), The encyclopedia of applied linguistics. Wiley. https://doi.org/10.1002/9781405198431
Gullberg, M. (2022). Bimodal convergence: How languages interact in multicompetent language users’ speech and gestures. In A. Morgenstern & S. Goldin-Meadow (Eds.), Gesture in language: Development across the lifespan (pp. 317–333). American Psychological Association. https://doi.org/10.1037/0000269-013
Gullberg, M., & De Bot, K. (Eds.). (2010). Gestures in language development (Vol. 28). John Benjamins. https://doi.org/10.1075/bct.28
Halberstadt, A. G., Parker, A. E., & Castro, V. L. (2013). Nonverbal communication: Developmental perspectives. In J. A. Hall & M. L. Knapp (Eds.), Nonverbal communication (pp. 93–127). De Gruyter. https://doi.org/10.1515/9783110238150.93
Hall, E. T. (1963). A system for the notation of proxemic behavior. American Anthropologist, 65(5), 1003–1026. https://doi.org/10.1525/aa.1963.65.5.02a00020
Hall, J. A. (1984). Nonverbal sex differences: Communication accuracy and expressive style. The Johns Hopkins University Press. https://doi.org/10.56021/9780801824401
Hall, J. A. (2006). Women’s and men’s nonverbal communication. In V. Manusov & M. L. Patterson (Eds.), The Sage handbook of nonverbal communication (pp. 201–218). Sage Publications. https://doi.org/10.4135/9781412976152.n11
Hall, J. A., & Gunnery, S. D. (2013). Gender differences in nonverbal communication. In J. A. Hall & M. L. Knapp (Eds.), Nonverbal communication (pp. 639–696). De Gruyter Mouton. https://doi.org/10.1515/9783110238150.639
Hall, J. A., & Knapp, M. L. (Eds.). (2013). Nonverbal communication. De Gruyter. https://doi.org/10.1515/9783110238150
Hall, J. A., Horgan, T. G., & Murphy, N. A. (2019). Nonverbal communication. Annual Review of Psychology, 70, 271–294. https://doi.org/10.1146/annurev-psych-010418-103145
Halleck, G. B. (1992). The oral proficiency interview: Discrete point test or a measure of communicative language ability? Foreign Language Annals, 25(3), 227–231. https://doi.org/10.1111/j.1944-9720.1992.tb00532.x
Halliday, M. A. K. (1985). An introduction to functional grammar. Edward Arnold.
Harding, L. (2011). Accent and listening assessment: A validation study of the use of speakers with L2 accents on an academic English listening test. Peter Lang.
Harding, L. (2014). Communicative language testing: Current issues and future research. Language Assessment Quarterly, 11(2), 186–197. https://doi.org/10.1080/15434303.2014.895829
Hardison, D. M. (2018). Effects of contextual and visual cues on spoken language processing: Enhancing L2 perceptual salience through focused training. In S. M. Gass, P. Spinner, & J. Behney (Eds.), Salience in second language acquisition (pp. 201–220). Routledge. https://doi.org/10.4324/9781315399027-11
Hardison, D., & Pennington, M. C. (2021). Multimodal second-language communication: Research findings and pedagogical implications. RELC Journal, 52(1), 62–76. https://doi.org/10.1177/0033688220966635
Harrell, F. (2020, September 20). Violation of proportional odds is not fatal. Statistical Thinking. https://www.fharrell.com/post/po
Harrigan, J. A. (2013). Methodology: Coding and studying nonverbal behavior. In J. A. Hall & M. L. Knapp (Eds.), Nonverbal communication (pp. 35–68). De Gruyter. https://doi.org/10.1515/9783110238150.35
Hashemi, M. (2011). Language stress and anxiety among the English language learners. Procedia: Social and Behavioral Sciences, 30, 1811–1816. https://doi.org/10.1016/j.sbspro.2011.10.349
Hatfield, E., Cacioppo, J. T., & Rapson, R. L. (1994). Emotional contagion. Cambridge University Press. https://doi.org/10.1017/CBO9781139174138
Heaver, B., & Hutton, S. B. (2011). Keeping an eye on the truth? Pupil size changes associated with recognition memory. Memory, 19(4), 398–405. https://doi.org/10.1080/09658211.2011.575788
Heckman, J. J., Stixrud, J., & Urzua, S. (2006). The effects of cognitive and noncognitive abilities on labor market outcomes and social behavior. Journal of Labor Economics, 24(3), 411–482. https://doi.org/10.1086/504455
Heidari, K. (2019). Willingness to communicate: A predictor of pushing vocabulary knowledge from receptive to productive. Journal of Psycholinguistic Research, 48, 903–920. https://doi.org/10.1007/s10936-019-09639-w
Hellermann, J. (2008). Social actions for classroom language learning. Multilingual Matters. https://doi.org/10.21832/9781847690272
Heng, C. S., Abdullah, A. N., & Yusof, N. B. (2012). Investigating the construct of anxiety in relation to speaking skills among ESL tertiary learners. 3L: The Southeast Asian Journal of English Studies, 18(3), 155–166.
Heritage, J. (1984). A change-of-state token and aspects of its sequential placement. In J. M. Atkinson & J. Heritage (Eds.), Structures of social action: Studies in conversation analysis (pp. 299–345). Cambridge University Press.
Heritage, J. (1998). Oh-prefaced responses to inquiry. Language in Society, 27(3), 291–334. https://doi.org/10.1017/S0047404500019990
Hess, U., & Fischer, A. (2013). Emotional mimicry as social regulation. Personality and Social Psychology Review, 17(2), 142–157. https://doi.org/10.1177/1088868312472607
Hirschmüller, S., Schmukle, S. C., Krause, S., Back, M. D., & Egloff, B. (2018). Accuracy of self-esteem judgments at zero acquaintance. Journal of Personality, 86(2), 308–319. https://doi.org/10.1111/jopy.12316
Hirt, F., Werlen, E., Moser, I., & Bergamin, P. (2019). Measuring emotions during learning: Lack of coherence between automated facial emotion recognition and emotional experience. Open Computer Science, 9(1), 308–317. https://doi.org/10.1515/comp-2019-0020
Hırçın Çoban, M., & Sert, O. (2020). Resolving interactional troubles and maintaining progressivity in paired speaking assessment in an EFL context. Papers in Language Testing and Assessment, 9(1), 64–94. http://www.altaanz.org/uploads/5/9/0/8/5908292/2020_9_1__3_hircin-soban_sert.pdf
Holler, J., Kendrick, K. H., & Levinson, S. C. (2018). Processing language in face-to-face conversation: Questions with gestures get faster responses. Psychonomic Bulletin & Review, 25(5), 1900–1908. https://doi.org/10.3758/s13423-017-1363-z
Horwitz, E. K., Horwitz, M. B., & Cope, J. (1986). Foreign language classroom anxiety. The Modern Language Journal, 70(2), 125–132. https://doi.org/10.2307/327317
Hox, J., Moerbeek, M., & van de Schoot, R. (2018). Multilevel analysis: Techniques and applications (3rd ed.). Routledge. https://doi.org/10.4324/9781315650982
Hulstijn, J. H. (2015). Language proficiency in native and non-native speakers: Theory and research. John Benjamins. https://doi.org/10.1075/lllt.41
Hulstijn, J. H., Young, R. F., Ortega, L., Bigelow, M., DeKeyser, R., Ellis, N. C., Lantolf, J. P., Mackey, A., & Talmy, S. (2014). Bridging the gap: Cognitive and social approaches to research in second language learning and teaching. Studies in Second Language Acquisition, 36(3), 361–421. https://doi.org/10.1017/S0272263114000035
Hymes, D. (1972). On communicative competence. In A. Duranti (Ed.), Linguistic anthropology: A reader (pp. 53–73). Blackwell.
iMotions. (2017). Facial expression analysis. https://imotions.com/biosensor/fea-facial-expression-analysis
Inceoglu, S. (2016). Effects of perceptual training on second language vowel perception and production. Applied Psycholinguistics, 37(5), 1175–1199. https://doi.org/10.1017/S0142716415000533
Interagency Language Roundtable. (n.d.). Interagency Language Roundtable language skill level descriptions – Speaking. https://www.govtilr.org/Skills/ILRscale2.htm
International English Language Testing System (IELTS). (n.d.). Speaking: Band descriptors (public). IELTS. https://www.ielts.org/-/media/pdfs/speaking-band-descriptors.ashx
Iwaniec, J. (2019). Questionnaires: Implications for effective implementation. In J. McKinley & H. Rose (Eds.), The Routledge handbook of research methods in applied linguistics (pp. 324–335). Routledge. https://doi.org/10.4324/9780367824471-28
Jackson, J. C., Watts, J., Henry, T. R., List, J. M., Forkel, R., Mucha, P. J., Greenhill, S. J., Gray, R. D., & Lindquist, K. A. (2019). Emotion semantics show both cultural variation and universal structure. Science, 366(6472), 1517–1522. https://doi.org/10.1126/science.aaw8160
Jakobovits, L. A. (1970). Foreign language learning. Newbury House.
James, W. (1884). What is an emotion? Mind, 9(34), 188–205. https://doi.org/10.1093/mind/os-IX.34.188
Jenkins, S., & Parra, I. (2003). Multiple layers of meaning in an oral proficiency test: The complementary roles of nonverbal, paralinguistic, and verbal behaviors in assessment decisions. The Modern Language Journal, 87(1), 90–107. https://doi.org/10.1111/1540-4781.00180
Jiang, Y., & Dewaele, J.-M. (2019). How unique is the foreign language classroom enjoyment and anxiety of Chinese EFL learners? System, 82, 13–25. https://doi.org/10.1016/j.system.2019.02.017
Jin, Y., De Bot, K., & Keijzer, M. (2017). Affective and situational correlates of foreign language proficiency: A study of Chinese university learners of English and Japanese. Studies in Second Language Learning and Teaching, 7(1), 105–125. https://doi.org/10.14746/ssllt.2017.7.1.6
Judge, T. A., & Hurst, C. (2007). Capitalizing on one's advantages: Role of core self-evaluations. Journal of Applied Psychology, 92(5), 1212–1227. https://doi.org/10.1037/0021-9010.92.5.1212
Jungheim, N. O. (2001). The unspoken element of communicative competence: Evaluating language learners’ nonverbal behavior. In T. Hudson & J. Brown (Eds.), A focus on language test development: Expanding the language proficiency construct across a variety of tests (pp. 1–35). University of Hawaii, Second Language Teaching and Curriculum Center.
Kalantzis, M., & Cope, B. (2020). Adding sense: Context and interest in a grammar of multimodal meaning. Cambridge University Press. https://doi.org/10.1017/9781108862059
Kappas, A., Krumhuber, E., & Küster, D. (2013). Facial behavior. In J. A. Hall & M. L. Knapp (Eds.), Nonverbal communication (pp. 131–165). De Gruyter. https://doi.org/10.1515/9783110238150.131
Kasper, G., & Wagner, J. (2018). Epistemological reorientations and L2 interactional settings: A postscript to the special issue. The Modern Language Journal, 102(S1), 82–90. https://doi.org/10.1111/modl.12463
Keevallik, L. (2014). Turn organization and bodily-vocal demonstrations. Journal of Pragmatics, 65, 103–120. https://doi.org/10.1016/j.pragma.2014.01.008
Kellerman, S. (1992). ‘I see what you mean’: The role of kinesic behaviour in listening and implications for foreign and second language learning. Applied Linguistics, 13(3), 239–258. https://doi.org/10.1093/applin/13.3.239
Kelly, S. D., Barr, D. J., Church, R. B., & Lynch, K. (1999). Offering a hand to pragmatic understanding: The role of speech and gesture in comprehension and memory. Journal of Memory and Language, 40(4), 577–592. https://doi.org/10.1006/jmla.1999.2634
Kelly, S. D., Manning, S. M., & Rodak, S. (2008). Gesture gives a hand to language and learning: Perspectives from cognitive neuroscience, developmental psychology and education. Language and Linguistics Compass, 2, 569–588. https://doi.org/10.1111/j.1749-818X.2008.00067.x
Kendon, A. (1980). Gesticulation and speech: Two aspects of the process of utterance. In M. R. Key (Ed.), The relationship of verbal and nonverbal communication (pp. 207–227). Mouton Publishers. https://doi.org/10.1515/9783110813098.207
Kendon, A. (2004). Gesture: Visible action as utterance. Cambridge University Press. https://doi.org/10.1017/CBO9780511807572
Kendon, A. (2007). Some topics in gesture studies. In A. Esposito, M. Bratanic, E. Keller, & M. Marinaro (Eds.), Fundamentals of verbal and nonverbal communication and the biometric issue (pp. 3–19). IOS Press.
Kikuchi, Y., & Noriuchi, M. (2019). Power of self-touch: Its neural mechanism as a coping strategy. In S. Fukuda (Ed.), Emotional engineering, vol. 7: The age of communication (pp. 33–47). Springer. https://doi.org/10.1007/978-3-030-02209-9_
Kim, Y. L., Liu, C., Trofimovich, P., & McDonough, K. (2023). Do visual cues matter for perceived fluency during L2 conversations? [Paper presentation]. American Association for Applied Linguistics, Portland, Oregon, United States.
Kita, S. (2009). Cross-cultural variation of speech-accompanying gesture: A review. Language and Cognitive Processes, 24(2), 145–167. https://doi.org/10.1080/01690960802586188
Kita, S., & Özyürek, A. (2003). What does cross-linguistic variation in semantic coordination of speech and gesture reveal? Evidence for an interface representation of spatial thinking and speaking. Journal of Memory and Language, 48(1), 16–32. https://doi.org/10.1016/S0749-596X(02)00505-3
Kitayama, S., Mesquita, B., & Karasawa, M. (2006). Cultural affordances and emotional experience: Socially engaging and disengaging emotions in Japan and the United States. Journal of Personality and Social Psychology, 91(5), 890–903. https://doi.org/10.1037/0022-3514.91.5.890
Knapp, M. L., Hall, J. A., & Horgan, T. G. (2014). Nonverbal communication in human interaction (8th ed.). Cengage Learning.
Knoch, U., Deygers, B., & Khamboonruang, A. (2021). Revisiting rating scale development for rater-mediated language performance assessments: Modelling construct and contextual choices made by scale developers. Language Testing, 38(4), 602–626. https://doi.org/10.1177/0265532221994052
Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163. https://doi.org/10.1016/j.jcm.2016.02.012
Kormos, J. (2006). Speech production and second language acquisition. Lawrence Erlbaum.
Kramsch, C. (1986). From language proficiency to interactional competence. The Modern Language Journal, 70(4), 366–372. https://doi.org/10.1111/j.1540-4781.1986.tb05291.x
Krason, A., Fenton, R., Varley, R., & Vigliocco, G. (2022). The role of iconic gestures and mouth movements in face-to-face communication. Psychonomic Bulletin & Review, 29, 600–612. https://doi.org/10.3758/s13423-021-02009-5
Krauss, R. M., Chen, Y., & Gottesman, R. F. (2000). Lexical gestures and lexical access: A process model. In D. McNeill (Ed.), Language and gesture (pp. 261–283). Cambridge University Press. https://doi.org/10.1017/CBO9780511620850.017
Krauss, R. M., & Hadar, U. (1999). The role of speech-related arm/hand gestures in word retrieval. In R. Campbell & L. Messing (Eds.), Gesture, speech, and sign (pp. 93–116). Oxford University Press. https://doi.org/10.1093/acprof:oso/9780198524519.003.0006
Kreibig, S. D. (2010). Autonomic nervous system activity in emotion: A review. Biological Psychology, 84(3), 394–421. https://doi.org/10.1016/j.biopsycho.2010.03.010
Kuiken, F., & Vedder, I. (2014). Rating written performance: What do raters do and why? Language Testing, 31(3), 329–348. https://doi.org/10.1177/0265532214526174
Kulke, L., Feyerabend, D., & Schacht, A. (2020). A comparison of the Affectiva iMotions facial expression analysis software with EMG for identifying facial expressions of emotion. Frontiers in Psychology, 11, Article 329. https://doi.org/10.3389/fpsyg.2020.00329
Kunnan, A. (2018). Evaluating language assessments. Routledge. https://doi.org/10.4324/9780203803554
Labrie, N., & Clément, R. (1986). Ethnolinguistic vitality, self-confidence and second language proficiency: An investigation. Journal of Multilingual & Multicultural Development, 7(4), 269–282. https://doi.org/10.1080/01434632.1986.9994244
LaFrance, M., & Mayo, C. (1976). Racial differences in gaze behavior during conversations: Two systematic observational studies. Journal of Personality and Social Psychology, 33(5), 547–552. https://doi.org/10.1037/0022-3514.33.5.547
Lakin, J. L. (2006). Automatic cognitive processes and nonverbal communication. In V. Manusov & M. L. Patterson (Eds.), The Sage handbook of nonverbal communication (pp. 59–77). Sage Publications. https://doi.org/10.4135/9781412976152.n4
Lam, D. M. K. (2015). Contriving authentic interaction: Task implementation and engagement in school-based speaking assessment in Hong Kong. In G. Yu & Y. Jin (Eds.), Assessing Chinese learners of English: Language constructs, consequences and conundrums (pp. 38–60). Palgrave Macmillan. https://doi.org/10.1057/9781137449788_3
Lam, D. M. K. (2021). Don’t turn a deaf ear: A case for assessing interactive listening. Applied Linguistics, 42(4), 740–764. https://doi.org/10.1093/applin/amaa064
Lambert, C., Philp, J., & Nakamura, S. (2017). Learner-generated content and engagement in second language task performance. Language Teaching Research, 21(6), 665–680. https://doi.org/10.1177/1362168816683559
Lansing, C. R., & McConkie, G. W. (2003). Word identification and eye fixation locations in visual and visual-plus-auditory presentations of spoken sentences. Perception & Psychophysics, 65(4), 536–552. https://doi.org/10.3758/BF03194581
Lantolf, J. P. (2006). Re(de)fining language proficiency in light of the concept of ‘languaculture.’ In H. Byrnes (Ed.), Advanced language learning: The contribution of Halliday and Vygotsky (pp. 72–91). Bloomsbury.
Lantolf, J. P., Thorne, S. L., & Poehner, M. E. (2020). Sociocultural theory and L2 development. In B. VanPatten, G. D. Keating, & S. Wulff (Eds.), Theories in second language acquisition (pp. 223–247). Routledge. https://doi.org/10.4324/9780429503986-10
Larson, J. W. (1984). Testing speaking ability in the classroom: The semi-direct alternative. Foreign Language Annals, 17(5), 499–507. https://doi.org/10.1111/j.1944-9720.1984.tb01738.x
Laurent, A., & Nicoladis, E. (2015). Gesture restriction affects French–English bilinguals’ speech only in French. Bilingualism: Language and Cognition, 18(2), 340–349. https://doi.org/10.1017/S1366728914000042
Lavolette, E. (2013). Effects of technology modes on ratings of learner recordings. The IALLT Journal, 43(2), 1–27. https://doi.org/10.17161/iallt.v43i2.8524
Lazaraton, A. (1996). Interlocutor support in oral proficiency interviews: The case of CASE. Language Testing, 13(2), 151–172. https://doi.org/10.1177/026553229601300202
Lazarus, R. S. (1991). Emotion and adaptation. Oxford University Press.
Lennon, P. (1990). Investigating fluency in EFL: A quantitative approach. Language Learning, 40(3), 387–417. https://doi.org/10.1111/j.1467-1770.1990.tb00669.x
Levelt, W. J. (1993). Speaking: From intention to articulation. MIT Press. https://doi.org/10.7551/mitpress/6393.001.0001
Levelt, W. J., Roelofs, A., & Meyer, A. S. (1999). A theory of lexical access in speech production. Behavioral and Brain Sciences, 22(1), 1–38. https://doi.org/10.1017/S0140525X99001776
Levenson, R. W., & Ekman, P. (2002). Difficulty does not account for emotion-specific heart rate changes in the directed facial action task. Psychophysiology, 39(3), 397–405. https://doi.org/10.1017/S0048577201393150
Levenson, R. W., Ekman, P., & Friesen, W. V. (1990). Voluntary facial action generates emotion-specific autonomic nervous system activity. Psychophysiology, 27(4), 363–384. https://doi.org/10.1111/j.1469-8986.1990.tb02330.x
Levinson, S. C. (2016). Turn-taking in human communication – origins and implications for language processing. Trends in Cognitive Sciences, 20(1), 6–14. https://doi.org/10.1016/j.tics.2015.10.010
Levy, R. (1973). Tahitians. University of Chicago Press.
Li, C., Dewaele, J.-M., & Jiang, G. (2020). The complex relationship between classroom emotions and EFL achievement in China. Applied Linguistics Review, 11(3), 485–510. https://doi.org/10.1515/applirev-2018-0043
Li, H. Z. (2006). Backchannel responses as misleading feedback in intercultural discourse. Journal of Intercultural Communication Research, 35(2), 99–116. https://doi.org/10.1080/17475750600909253
Lim, G. (2011). The development and maintenance of rating quality in performance writing assessments: A longitudinal study of new and experienced raters. Language Testing, 28(4), 543–560. https://doi.org/10.1177/0265532211406422
Lim, G. S., Geranpayeh, A., Khalifa, H., & Buckendahl, C. W. (2013). Standard setting to an international reference framework: Implications for theory and practice. International Journal of Testing, 13(1), 32–49. https://doi.org/10.1080/15305058.2012.678526
Lin, Y. (2022). Speech-accompanying gestures in L1 and L2 conversational interaction by speakers of different proficiency levels. International Review of Applied Linguistics in Language Teaching, 60(2), 123–142. https://doi.org/10.1515/iral-2017-0043
Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16(2), 878. https://www.rasch.org/rmt/rmt162.pdf
Linacre, J. M. (2021). Facets Rasch measurement computer program (Version 3.83.6) [Computer software]. https://www.winsteps.com/facets.htm
Linacre, J. M. (n.d.). Inter-rater and intra-rater reliability. https://www.winsteps.com/facetman/inter-rater-reliability.htm
Lindberg, R., McDonough, K., & Trofimovich, P. (2021). Investigating verbal and nonverbal indicators of physiological response during second language interaction. Applied Psycholinguistics, 42(6), 1403–1425. https://doi.org/10.1017/S014271642100028X
Lindberg, R., McDonough, K., & Trofimovich, P. (2022). Second language anxiety in conversation and its relationship with speakers’ perceptions of the interaction and their social networks. Studies in Second Language Acquisition. Advance online publication. https://doi.org/10.1017/S0272263122000523
Lischetzke, T., & Eid, M. (2003). Is attention to feelings beneficial or detrimental to affective well-being? Mood regulation as a moderator variable. Emotion, 3(4), 361–377. https://doi.org/10.1037/1528-3542.3.4.361
Lochner, K. (2016). Successful emotions: How emotions drive cognitive performance. Springer. https://doi.org/10.1007/978-3-658-12231-7
Long, M. (1996). The role of the linguistic environment in second language acquisition. In W. C. Ritchie & T. K. Bhatia (Eds.), Handbook of second language acquisition (pp. 413–468). Academic Press.
Luk, J. (2010). Talking to score: Impression management in L2 oral assessment and the co-construction of a test discourse genre. Language Assessment Quarterly, 7(1), 25–53. https://doi.org/10.1080/15434300903473997
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19(3), 246–276. https://doi.org/10.1191/0265532202lt230oa
MacIntyre, P. D., & Gardner, R. C. (1994). The subtle effects of language anxiety on cognitive processing in the second language. Language Learning, 44(2), 283–305. https://doi.org/10.1111/j.1467-1770.1994.tb01103.x
MacIntyre, P. D., & Gregersen, T. (2012). Emotions that facilitate language learning: The positive-broadening power of the imagination. Studies in Second Language Learning and Teaching, 2(2), 193–213. https://doi.org/10.14746/ssllt.2012.2.2.4
MacIntyre, P. D., Gregersen, T., & Mercer, S. (2019). Setting an agenda for positive psychology in SLA: Theory, practice, and research. The Modern Language Journal, 103(1), 262–274. https://doi.org/10.1111/modl.12544
MacIntyre, P., Clément, R., Dörnyei, Z., & Noels, K. (1998). Conceptualising willingness to communicate in a L2: A situational model of L2 confidence and affiliation. Modern Language Journal, 82(4), 545–562. https://doi.org/10.1111/j.1540-4781.1998.tb05543.x
MacIntyre, P., Noels, K., & Clément, R. (1997). Biases in self-ratings of second language proficiency: The role of language anxiety. Language Learning, 47(2), 265–287. https://doi.org/10.1111/0023-8333.81997008
Manstead, A. S., & Wagner, H. L. (1981). Arousal, cognition and emotion: An appraisal of two-factor theory. Current Psychological Reviews, 1(1), 35–54. https://doi.org/10.1007/BF02979253
Marefat, F., & Heydari, M. (2016). Native and Iranian teachers’ perceptions and evaluation of Iranian students’ English essays. Assessing Writing, 27, 24–36. https://doi.org/10.1016/j.asw.2015.10.001
Mark, G., Knight, D., O’Keeffe, A., & Fitzgerald, C. (2023). Interactional Variation Online (IVO): Corpus approaches and applications to analyzing multi-modal collaboration in virtual meetings [Paper presentation]. American Association for Applied Linguistics, Portland, Oregon, United States.
Marsh, A. A., Elfenbein, H. A., & Ambady, N. (2003). Nonverbal “accents”: Cultural differences in facial expressions of emotion. Psychological Science, 14(4), 373–376. https://doi.org/10.1111/1467-9280.24461
Masuda, T., Ellsworth, P. C., Mesquita, B., Leu, J., Tanida, S., & Van de Veerdonk, E. (2008). Placing the face in context: Cultural differences in the perception of facial emotion. Journal of Personality and Social Psychology, 94(3), 365–381. https://doi.org/10.1037/0022-3514.94.3.365
Masuda, T., Wang, H., Ishii, K., & Ito, K. (2012). Do surrounding figures' emotions affect judgment of the target figure's emotion? Comparing the eye-movement patterns of European Canadians, Asian Canadians, Asian international students, and Japanese. Frontiers in Integrative Neuroscience, 6, Article 72. https://doi.org/10.3389/fnint.2012.00072
Matsumoto, D. (2001). Culture and emotion. In D. Matsumoto (Ed.), The handbook of culture and psychology (pp. 171–194). Oxford University Press.
Matsumoto, D., & Hwang, H. C. (2016). The cultural bases of nonverbal communication. In D. Matsumoto, H. C. Hwang, & M. G. Frank (Eds.), APA handbook of nonverbal communication (pp. 77–101). American Psychological Association. https://doi.org/10.1037/14669-004
Matsumoto, D., & Hwang, H. S. (2012). Culture and emotion: The integration of biological and cultural contributions. Journal of Cross-Cultural Psychology, 43(1), 91–118. https://doi.org/10.1177/0022022111420147
Matsumoto, D., Hwang, H. C., & Frank, M. G. (Eds.). (2016). APA handbook of nonverbal communication. American Psychological Association. https://doi.org/10.1037/14669-000
Matsumoto, D., Olide, A., Schug, J., Willingham, B., & Callan, M. (2009). Cross-cultural judgments of spontaneous facial expressions of emotion. Journal of Nonverbal Behavior, 33, 213–238. https://doi.org/10.1007/s10919-009-0071-4
Matsumoto, Y. (2018). Functions of laughter in English-as-a-lingua-franca classroom interactions: A multimodal ensemble of verbal and nonverbal interactional resources at miscommunication moments. Journal of English as a Lingua Franca, 7(2), 229–260. https://doi.org/10.1515/jelf-2018-0013
Max Planck Institute. (2020). ELAN (Version 5.9) [Computer software]. The Language Archive.
May, L. (2009). Co-constructed interaction in a paired speaking test: The raters’ perspective. Language Testing, 26(3), 397–421. https://doi.org/10.1177/0265532209104668
May, L. (2011). Interactional competence in a paired speaking test: Features salient to raters. Language Assessment Quarterly, 8(2), 127–145. https://doi.org/10.1080/15434303.2011.565845
McCafferty, S. G. (2006). Gesture and the materialization of second language prosody. International Review of Applied Linguistics, 44(2), 195–207. https://doi.org/10.1515/IRAL.2006.008
McCafferty, S. G., & Ahmed, M. K. (2000). The appropriation of gestures of the abstract by L2 learners. In J. P. Lantolf (Ed.), Sociocultural theory and second language learning (pp. 199–218). Oxford University Press.
McDonough, K., Crowther, D., Kielstra, P., & Trofimovich, P. (2015). Exploring the potential relationship between eye gaze and English L2 speakers’ responses to recasts. Second Language Research, 31(4), 563–575. https://doi.org/10.1177/0267658315589656
McDonough, K., Kim, Y. L., Uludag, P., Liu, C., & Trofimovich, P. (2022a). Exploring the relationship between behavior matching and interlocutor perceptions in L2 interaction. System, 109, Article 102865. https://doi.org/10.1016/j.system.2022.102865
McDonough, K., Lindberg, R., & Trofimovich, P. (2022b). Examining rater perception of holds as a visual cue of listener nonunderstanding. Studies in Second Language Acquisition, 44(5), 1240–1259. https://doi.org/10.1017/S0272263122000018
McDonough, K., Lindberg, R., Trofimovich, P., & Tekin, O. (2023). The visual signature of non-understanding: A systematic replication of McDonough, Trofimovich, Lu, and Abashidze (2019). Language Teaching, 56(1), 113–127. https://doi.org/10.1017/S0261444821000197
McDonough, K., Trofimovich, P., Lu, L., & Abashidze, D. (2019). The occurrence and perception of listener visual cues during nonunderstanding episodes. Studies in Second Language Acquisition, 41(5), 1151–1165. https://doi.org/10.1017/S0272263119000238
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748. https://doi.org/10.1038/264746a0
McNamara, T. (1996). Measuring second language performance. Longman.
McNamara, T., Knoch, U., & Fan, J. (2019). Fairness, justice and language assessment. Oxford University Press.
McNeill, D. (1985). So you think gestures are nonverbal? Psychological Review, 92(3), 350–371. https://doi.org/10.1037/0033-295X.92.3.350
McNeill, D. (1992). Hand and mind: What the hands reveal about thought. University of Chicago Press.
McNeill, D. (2005). Gesture & thought. The University of Chicago Press. https://doi.org/10.7208/chicago/9780226514642.001.0001
McNeill, D., & Duncan, S. (2000). Growth points in thinking-for-speaking. In D. McNeill (Ed.), Language and gesture (pp. 141–161). Cambridge University Press. https://doi.org/10.1017/CBO9780511620850.010
Mehrabian, A. (1972). Nonverbal communication. Aldine-Atherton.
Mehrabian, A. (1981). Silent messages: Implicit communication of emotions and attitudes (2nd ed.). Wadsworth.
Mehrabian, A. (1996). Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament. Current Psychology, 14, 261–292. https://doi.org/10.1007/BF02686918
Mehrabian, A., & Williams, M. (1969). Nonverbal concomitants of perceived and intended persuasiveness. Journal of Personality and Social Psychology, 13(1), 37–58. https://doi.org/10.1037/h0027993
Meltzoff, A. N., & Moore, M. K. (1997). Explaining facial imitation: A theoretical model. Early Development and Parenting, 6(3–4), 179–192. https://doi.org/10.1002/(SICI)1099-0917(199709/12)6:3/4%3C179::AID-EDP157%3E3.0.CO;2-R
Mesquita, B. (2022). Between us: How cultures create emotions. W. W. Norton & Company.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Macmillan.
Mondada, L. (2006). Participants’ online analysis and multimodal practices: Projecting the end of the turn and the closing of the sequence. Discourse Studies, 8(1), 117–129. https://doi.org/10.1177/1461445606059561
Mondada, L. (2007). Multimodal resources for turn-taking: Pointing and the emergence of possible next speakers. Discourse Studies, 9(2), 194–225. https://doi.org/10.1177/1461445607075346
Mondada, L. (2014). The local constitution of multimodal resources for social interaction. Journal of Pragmatics, 65, 137–156. https://doi.org/10.1016/j.pragma.2014.04.004
Mondada, L. (2016). Challenges of multimodality: Language and the body in social interaction. Journal of Sociolinguistics, 20(3), 336–366. https://doi.org/10.1111/josl.1_12177
Morett, L. M. (2014). When hands speak louder than words: The role of gesture in the communication, encoding, and recall of words in a novel second language. The Modern Language Journal, 98(3), 834–853. https://doi.org/10.1111/modl.12125
Morgenstern, A., & Goldin-Meadow, S. (2022). Introduction to gesture in language. In A. Morgenstern & S. Goldin-Meadow (Eds.), Gesture in language: Development across the lifespan (pp. 3–17). De Gruyter Mouton. https://doi.org/10.1037/0000269-001
Morreale, S. P., Spitzberg, B. H., & Barge, J. K. (2013). Communication: Motivation, knowledge, skills (3rd ed.). Peter Lang. https://doi.org/10.3726/978-1-4539-0257-8
Morris, D., Collett, P., Marsh, P., & O’Shaughnessy, M. (1979). Gestures: Their origins and distribution. Stein and Day.
Morris, J. S., Frith, C. D., Perrett, D. I., Rowland, D., Young, A. W., Calder, A. J., & Dolan, R. J. (1996). A differential neural response in the human amygdala to fearful and happy facial expressions. Nature, 383(6603), 812–815. https://doi.org/10.1038/383812a0
Morrow, K. (1979). Communicative language testing: Revolution or evolution? In C. J. Brumfit & K. Johnson (Eds.), The communicative approach to language teaching (pp. 143–159). Oxford University Press.
Morsbach, H., & Tyler, W. J. (1986). A Japanese emotion: Amae. In R. Harré (Ed.), The social construction of emotions (pp. 289–308). Blackwell.
Munro, M. J., & Derwing, T. M. (1995). Processing time, accent, and comprehensibility in the perception of native and foreign-accented speech. Language and Speech, 38(3), 289–306. https://doi.org/10.1177/002383099503800305
Nagle, C. L., Trofimovich, P., O’Brien, M. G., & Kennedy, S. (2022). Beyond linguistic features: Exploring behavioral and affective correlates of comprehensible second language speech. Studies in Second Language Acquisition, 44(1), 255–270. https://doi.org/10.1017/S0272263121000073
Nakatsuhara, F. (2011). Effects of test taker characteristics and the number of participants in group oral tests. Language Testing, 28(4), 483–508. https://doi.org/10.1177/0265532211398110
Nakatsuhara, F., Inoue, C., & Taylor, L. (2021a). Comparing rating modes: Analyzing live, audio, and video ratings of IELTS speaking test performances. Language Assessment Quarterly, 18(2), 83–106. https://doi.org/10.1080/15434303.2020.1799222
Nakatsuhara, F., Inoue, C., Berry, V., & Galaczi, E. (2017). Exploring the use of video-conferencing technology in the assessment of spoken language: A mixed-methods study. Language Assessment Quarterly, 14(1), 1–18. https://doi.org/10.1080/15434303.2016.1263637
Nakatsuhara, F., Inoue, C., Berry, V., & Galaczi, E. (2021b). Video-conferencing speaking tests: Do they measure the same construct as face-to-face tests? Assessment in Education: Principles, Policy & Practice, 28(4), 369–388. https://doi.org/10.1080/0969594X.2021.1951163
Nakatsukasa, K. (2016). Efficacy of recasts and gestures on the acquisition of locative prepositions. Studies in Second Language Acquisition, 38(4), 771–799. https://doi.org/10.1017/S0272263115000467
Nambiar, M. K., & Goon, C. (1993). Assessment of oral skills: A comparison of scores obtained through audio recordings to those obtained through face-to-face evaluation. RELC Journal, 24(1), 15–31. https://doi.org/10.1177/003368829302400102
Naumann, L. P., Vazire, S., Rentfrow, P. J., & Gosling, S. D. (2009). Personality judgments based on physical appearance. Personality and Social Psychology Bulletin, 35(12), 1661–1671. https://doi.org/10.1177/0146167209346309
Negueruela, E., Lantolf, J. P., Jordan, S. R., & Gelabert, J. (2004). The “private function” of gesture in second language speaking activity: A study of motion verbs and gesturing in English and Spanish. International Journal of Applied Linguistics, 14(1), 113–147. https://doi.org/10.1111/j.1473-4192.2004.00056.x
Nestler, S., & Back, M. D. (2013). Applications and extensions of the lens model to understand interpersonal judgments at zero acquaintance. Current Directions in Psychological Science, 22(5), 374–379. https://doi.org/10.1177/0963721413486148
Neu, J. (1990). Assessing the role of nonverbal communication in the acquisition of communicative competence in L2. In R. C. Scarcella, E. S. Andersen, & S. D. Krashen (Eds.), Developing communicative competence in a second language (pp. 121–138). Newbury House.
Nicoladis, E. (2007). The effect of bilingualism on the use of manual gestures. Applied Psycholinguistics, 28(3), 441–454. https://doi.org/10.1017/S0142716407070245
Nicoladis, E., Mayberry, R. I., & Genesee, F. (1999). Gesture and early bilingual development. Developmental Psychology, 35(2), 514–526. https://doi.org/10.1037/0012-1649.35.2.514
Nicoladis, E., Pika, S., Yin, H. U., & Marentette, P. (2007). Gesture use in story recall by Chinese–English bilinguals. Applied Psycholinguistics, 28(4), 721–735. https://doi.org/10.1017/S0142716407070385
Noels, K. A., & Clément, R. (1996). Communicating across cultures: Social determinants and acculturative consequences. Canadian Journal of Behavioural Science/Revue canadienne des sciences du comportement, 28(3), 214–228. https://doi.org/10.1037/0008-400X.28.3.214
Noels, K. A., Pon, G., & Clément, R. (1996). Language, identity, and adjustment: The role of linguistic self-confidence in the acculturation process. Journal of Language and Social Psychology, 15(3), 246–264. https://doi.org/10.1177/0261927X960153003
Norton, B. (2000). Identity and language learning: Gender, ethnicity and educational change. Pearson.
Norton, B. (2013). Identity and language learning: Extending the conversation. Multilingual Matters. https://doi.org/10.21832/9781783090563
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory. McGraw-Hill.
O’Sullivan, B. (1996). The evaluation of gestures in non-verbal communication. In G. van Troyer, S. Cornwell, & H. Morikawa (Eds.), Proceedings of the JALT 1995 conference (pp. 316–320). The Japan Association for Language Teaching. https://files.eric.ed.gov/fulltext/ED402769.pdf
O’Sullivan, B. (2004). Modelling factors affecting oral language test performance: A large scale empirical study. In M. Milanovic & C. Weir (Eds.), European language testing in a global context: Studies in Language Testing 18 (pp. 129–142). Cambridge University Press.
Ockey, G. J. (2007). Construct implications of including still image or video in computer-based listening tests. Language Testing, 24(4), 517–537. https://doi.org/10.1177/0265532207080771
Ockey, G. J. (2009). The effects of group members’ personalities on a test taker’s L2 group oral discussion test scores. Language Testing, 26(2), 161–186. https://doi.org/10.1177/0265532208101005
Oloff, F. (2018). “Sorry?”/“Como?”/“Was?” – Open class and embodied repair initiators in international workplace interactions. Journal of Pragmatics, 126, 29–51. https://doi.org/10.1016/j.pragma.2017.11.002
Orr, M. (2002). The FCE speaking test: Using rater reports to help interpret test scores. System, 30(2), 143–154. https://doi.org/10.1016/S0346-251X(02)00002-7
Osgood, C. E., Suci, G. J., & Tannenbaum, P. (1957). The measurement of meaning. University of Illinois Press.
Otero, S. C., Weekes, B. S., & Hutton, S. B. (2011). Pupil size changes during recognition memory. Psychophysiology, 48(10), 1346–1353. https://doi.org/10.1111/j.1469-8986.2011.01217.x
Oxford, R. L. (2016). Toward a psychology of well-being for language learners: The “EMPATHICS” vision. In T. Gregersen, P. D. MacIntyre, & S. Mercer (Eds.), Positive psychology in SLA (pp. 10–87). Multilingual Matters. https://doi.org/10.21832/9781783095360-003
Özyürek, A. (2014). Hearing and seeing meaning in speech and gesture: Insights from brain and behaviour. Philosophical Transactions of the Royal Society B, 369(1651), Article 20130296. https://doi.org/10.1098/rstb.2013.0296
Özyürek, A., & Kita, S. (1999). Expressing manner and path in English and Turkish: Differences in speech, gesture, and conceptualization. In M. Hahn & S. C. Stoness (Eds.), Proceedings of the 21st cognitive science meeting (pp. 507–512). Lawrence Erlbaum. https://doi.org/10.4324/9781410603494-94
Özyürek, A., & Kelly, S. D. (2007). Gesture, language, and brain. Brain and Language, 101(3), 181–185. https://doi.org/10.1016/j.bandl.2007.03.006
Pallotti, G. (2021). Measuring complexity, accuracy, and fluency (CAF). In P. Winke & T. Brunfaut (Eds.), The Routledge handbook of second language acquisition and language testing (pp. 201–210). Routledge. https://doi.org/10.4324/9781351034784-23
Pan, M. (2016). Nonverbal delivery in speaking assessment: From an argument to a rating scale formulation and validation. Springer. https://doi.org/10.1007/978-981-10-0170-3
Parkinson, B. (1996). Emotions are social. British Journal of Psychology, 87(4), 663–683. https://doi.org/10.1111/j.2044-8295.1996.tb02615.x
Parkinson, B. (2011). Interpersonal emotion transfer: Contagion and social appraisal. Social and Personality Psychology Compass, 5(7), 428–439. https://doi.org/10.1111/j.1751-9004.2011.00365.x
Parkinson, B. (2019). Heart to heart: How your emotions affect other people. Cambridge University Press. https://doi.org/10.1017/9781108696234
Parkinson, B., & Simons, G. (2009). Affecting others: Social appraisal and emotion contagion in everyday decision making. Personality and Social Psychology Bulletin, 35(8), 1071–1084. https://doi.org/10.1177/0146167209336611
Patterson, M. L. (1973). Compensation in nonverbal immediacy behaviors: A review. Sociometry, 36(2), 237–252. https://doi.org/10.2307/2786569
Patterson, M. L. (1983). Nonverbal behavior: A functional perspective. Springer-Verlag. https://doi.org/10.1007/978-1-4612-5564-2
Patterson, M. L., & Ritts, V. (1997). Social and communicative anxiety: A review and meta-analysis. Annals of the International Communication Association, 20(1), 263–303. https://doi.org/10.1080/23808985.1997.11678944
Pavlenko, A. (2006). Bilingual minds: Emotional experience, expression, and representation (Vol. 56). Multilingual Matters. https://doi.org/10.21832/9781853598746
Pavlenko, A. (2014). The bilingual mind: And what it tells us about language and thought. Cambridge University Press. https://doi.org/10.1017/CBO9781139021456
Pavlenko, A., & Norton, B. (2007). Imagined communities, identity, and English language learning. In J. Cummins & C. Davison (Eds.), International handbook of English language teaching (pp. 669–680). Springer. https://doi.org/10.1007/978-0-387-46301-8_43
Pekarek Doehler, S., & Berger, E. (2018). L2 interactional competence as increased ability for context-sensitive conduct: A longitudinal study of story-openings. Applied Linguistics, 39(4), 555–578. https://doi.org/10.1093/applin/amw021
Pekarek Doehler, S., & Pochon-Berger, E. (2015). The development of L2 interactional competence: Evidence from turn-taking organization, sequence organization, repair organization and preference organization. In T. Cadierno & S. W. Eskildsen (Eds.), Usage-based perspectives on second language learning (pp. 233–268). De Gruyter Mouton. https://doi.org/10.1515/9783110378528-012
Pekarek Doehler, S., & Skogmyr Marian, K. (2022). Functional diversification and progressive routinization of a multiword expression in and for social interaction: A longitudinal L2 study. The Modern Language Journal, 106(S1), 23–45. https://doi.org/10.1111/modl.12758
Pennycook, A. (1985). Actions speak louder than words: Paralanguage, communication, and education. TESOL Quarterly, 19(2), 259–282. https://doi.org/10.2307/3586829
Pérez Castillejo, S. (2019). The role of foreign language anxiety on L2 utterance fluency during a final exam. Language Testing, 36(3), 327–345. https://doi.org/10.1177/0265532218777783
Philp, J., & Duchesne, S. (2016). Exploring engagement in tasks in the language classroom. Annual Review of Applied Linguistics, 36, 50–72. https://doi.org/10.1017/S0267190515000094
Pickering, M. J., & Garrod, S. (2013). An integrated theory of language production and comprehension. Behavioral and Brain Sciences, 36(4), 329–347. https://doi.org/10.1017/S0140525X12001495
Pickering, M., & Garrod, S. (2004). Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences, 27(2), 169–226. https://doi.org/10.1017/S0140525X04000056
Pika, S., Nicoladis, E., & Marentette, P. (2009). How to order a beer: Cultural differences in the use of conventional gestures for numbers. Journal of Cross-Cultural Psychology, 40(1), 70–80. https://doi.org/10.1177/0022022108326197
Pine, K. J., Gurney, D. J., & Fletcher, B. (2010). The semantic specificity hypothesis: When gestures do not depend upon the presence of a listener. Journal of Nonverbal Behavior, 34, 169–178. https://doi.org/10.1007/s10919-010-0089-7
Pitzl, M. L. (2010). English as a Lingua Franca in international business: Resolving miscommunication and reaching shared understanding. VDM-Verlag Müller.
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64(4), 878–912. https://doi.org/10.1111/lang.12079
Plough, I. (2021). A case for nonverbal behavior: Implications for construct, performance, and assessment. In M. R. Salaberry & A. R. Burch (Eds.), Assessing speaking in context—Expanding the construct and its applications (pp. 50–72). Multilingual Matters. https://doi.org/10.21832/9781788923828-004
Plough, I. C., & Bogart, P. S. (2008). Perceptions of examiner behavior modulate power relations in oral performance testing. Language Assessment Quarterly, 5(3), 195–217. https://doi.org/10.1080/15434300802229375
Plough, I., Banerjee, J., & Iwashita, N. (2018). Interactional competence: Genie out of the bottle. Language Testing, 35(3), 427–455. https://doi.org/10.1177/0265532218772325
Plusquellec, P., & Denault, V. (2018). The 1000 most cited papers on visible nonverbal behavior: A bibliometric analysis. Journal of Nonverbal Behavior, 42(3), 347–377. https://doi.org/10.1007/s10919-018-0280-9
Plutchik, R. (2001). The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. American Scientist, 89(4), 344–350. https://doi.org/10.1511/2001.28.344
Prior, M. T. (2019). Elephants in the room: An “affective turn,” or just feeling our way? The Modern Language Journal, 103(2), 516–527. https://doi.org/10.1111/modl.12573
Qiu, X., & Lo, Y. Y. (2017). Content familiarity, task repetition and Chinese EFL learners’ engagement in second language use. Language Teaching Research, 21(6), 681–698. https://doi.org/10.1177/1362168816684368
R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Randez, R. A., & Cornell, C. (in press). Advancing equity in language assessment for learners with disabilities. Language Testing.
Rasmussen, G. (2014). Inclined to better understanding—the coordination of talk and ‘leaning forward’ in doing repair. Journal of Pragmatics, 65, 30–45. https://doi.org/10.1016/j.pragma.2013.10.001
Rauscher, F. H., Krauss, R. M., & Chen, Y. (1996). Gesture, speech, and lexical access: The role of lexical movements in speech production. Psychological Science, 7(4), 226–231. https://doi.org/10.1111/j.1467-9280.1996.tb00364.x
Reynolds, D. A. J., Jr., & Gifford, R. (2001). The sounds and sights of intelligence: A lens model channel analysis. Personality and Social Psychology Bulletin, 27(2), 187–200. https://doi.org/10.1177/0146167201272005
Richmond, V., & McCroskey, J. (2004). Nonverbal behavior in interpersonal relationships. Allyn and Bacon.
Rimé, B., Mesquita, B., Boca, S., & Philippot, P. (1991). Beyond the emotional event: Six studies on the social sharing of emotion. Cognition & Emotion, 5(5–6), 435–465. https://doi.org/10.1080/02699939108411052
Rizzolatti, G., & Craighero, L. (2004). The mirror-neuron system. Annual Review of Neuroscience, 27, 169–192. https://doi.org/10.1146/annurev.neuro.27.070203.144230
Roever, C. (2021). Teaching and testing second language pragmatics and interaction. Routledge. https://doi.org/10.4324/9780429260766
Roever, C., & Dai, D. W. (2021). Reconceptualizing interactional competence for language testing. In M. R. Salaberry & A. R. Burch (Eds.), Assessing speaking in context—Expanding the construct and its applications (pp. 23–40). Multilingual Matters. https://doi.org/10.21832/9781788923828-003
Roever, C., & Kasper, G. (2018). Speaking in turns and sequences: Interactional competence as a target construct in testing speaking. Language Testing, 35(3), 331–355. https://doi.org/10.1177/0265532218758128
Ross, S. (1992). Accommodative questions in oral proficiency interviews. Language Testing, 9(2), 173–185. https://doi.org/10.1177/026553229200900205
Rossano, F. (2012). Gaze behavior in face-to-face interaction [Unpublished doctoral dissertation]. Radboud University. http://hdl.handle.net/2066/99151
Rost, M. (2016). Teaching and researching listening (3rd ed.). Taylor and Francis. https://doi.org/10.4324/9781315732862
Rule, N. O., & Alaei, R. (2016). “Gaydar”: The perception of sexual orientation from subtle cues. Current Directions in Psychological Science, 25(6), 444–448. https://doi.org/10.1177/0963721416664403
Russell, J. A. (1994). Is there universal recognition of emotion from facial expression? A review of the cross-cultural studies. Psychological Bulletin, 115(1), 102–141. https://doi.org/10.1037/0033-2909.115.1.102
Russell, J. A. (2012). Introduction to special section: On defining emotion. Emotion Review, 4(4), 337. https://doi.org/10.1177/1754073912445857
Russell, J. A., & Fehr, B. (1987). Relativity in the perception of emotion in facial expressions. Journal of Experimental Psychology: General, 116(3), 223–237. https://doi.org/10.1037/0096-3445.116.3.223
Rylander, J., Clark, P., & Derrah, R. (2013). A video-based method of assessing pragmatic awareness. In S. J. Ross & G. Kasper (Eds.), Assessing second language pragmatics (pp. 65–97). Palgrave Macmillan. https://doi.org/10.1057/9781137003522_3
Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language, 50(4), 696–735. https://doi.org/10.2307/412243
Saito, K., Ilkan, M., Magne, V., Tran, M. N., & Suzuki, S. (2018). Acoustic characteristics and learner profiles of low-, mid-, and high-level second language fluency. Applied Psycholinguistics, 39(3), 593–617. https://doi.org/10.1017/S0142716417000571
Saito, K., Webb, S., Trofimovich, P., & Isaacs, T. (2016). Lexical correlates of comprehensibility versus accentedness in second language speech. Bilingualism: Language and Cognition, 19(3), 597–609. https://doi.org/10.1017/S1366728915000255
Sato, T., & McNamara, T. (2019). What counts in second language oral communication ability? The perspective of linguistic laypersons. Applied Linguistics, 40(6), 894–916. https://doi.org/10.1093/applin/amy032
Sauter, D. A., & Eimer, M. (2010). Rapid detection of emotion from human vocalizations. Journal of Cognitive Neuroscience, 22(3), 474–481. https://doi.org/10.1162/jocn.2009.21215
Sauter, D. A., Eisner, F., Ekman, P., & Scott, S. K. (2010). Cross-cultural recognition of basic emotions through nonverbal emotional vocalizations. Proceedings of the National Academy of Sciences, 107(6), 2408–2412. https://doi.org/10.1073/pnas.0908239106
Savignon, S. J. (1972). Communicative competence: An experiment in foreign-language teaching. Center for Curriculum Development.
Scarcella, R. C., Andersen, E. S., & Krashen, S. D. (1990). Developing communicative competence in a second language. Newbury House Publishers.
Schachter, S. (1964). The interaction of cognitive and physiological determinants of emotional state. Advances in Experimental Social Psychology, 1, 49–80. https://doi.org/10.1016/S0065-2601(08)60048-9
Schegloff, E. A. (2006). Interaction: The infrastructure for social institutions, the natural ecological niche for language, and the arena in which culture is enacted. In N. J. Enfield & S. C. Levinson (Eds.), Roots of human sociality (pp. 70–96). Berg. https://doi.org/10.4324/9781003135517-4
Scherer, K. R. (2005). What are emotions? And how can they be measured? Social Science Information, 44(4), 695–729. https://doi.org/10.1177/0539018405058216
Schieffelin, B. B., & Ochs, E. (Eds.). (1979). Developmental pragmatics. Academic Press.
Schmid Mast, M., & Cousin, G. (2013). Power, dominance, and persuasion. In J. A. Hall & M. L. Knapp (Eds.), Nonverbal communication (pp. 613–635). De Gruyter. https://doi.org/10.1515/9783110238150.613
Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation modeling. Language Testing, 22(1), 1–30. https://doi.org/10.1191/0265532205lt295oa
Scovel, T. (1978). The effect of affect on foreign language learning: A review of the anxiety research. Language Learning, 28(1), 129–142. https://doi.org/10.1111/j.1467-1770.1978.tb00309.x
Segalowitz, N. (2010). Cognitive bases of second language fluency. Routledge. https://doi.org/10.4324/9780203851357
Seligman, M. E. P. (2011). Flourish: A visionary new understanding of happiness and well-being. Free Press.
Şen, M., & Oz, H. (2021). Vocabulary size as a predictor of willingness to communicate inside the classroom. In N. Zarrinabadi & M. Pawlak (Eds.), New perspectives on willingness to communicate in a second language (pp. 235–259). Springer. https://doi.org/10.1007/978-3-030-67634-6_12
Seo, M. S., & Koshik, I. (2010). A conversation analytic study of gestures that engender repair in ESL conversational tutoring. Journal of Pragmatics, 42(8), 2219–2239. https://doi.org/10.1016/j.pragma.2010.01.021
Sevinç, Y. (2018). Language anxiety in the immigrant context: Sweaty palms? International Journal of Bilingualism, 22(6), 717–739. https://doi.org/10.1177/1367006917690914
Seyfeddinipur, M. (2006). Disfluency: Interrupting speech and gesture [Unpublished doctoral dissertation]. Radboud University.
Shaver, P., Schwartz, J., Kirson, D., & O'Connor, C. (1987). Emotion knowledge: Further exploration of a prototype approach. Journal of Personality and Social Psychology, 52(6), 1061–1086. https://doi.org/10.1037/0022-3514.52.6.1061
Shaw, S. (2002). The effect of training and standardisation on rater judgement and inter-rater reliability. Cambridge Research Notes, 8(5), 13–17. https://www.cambridgeenglish.org/Images/23120-research-notes-08.pdf
Sherman, J., & Nicoladis, E. (2004). Gestures by advanced Spanish–English second-language learners. Gesture, 4(2), 143–156. https://doi.org/10.1075/gest.4.2.03she
Shi, L. (2001). Native- and nonnative-speaking EFL teachers’ evaluations of Chinese students’ English writing. Language Testing, 18(3), 303–325. https://doi.org/10.1177/026553220101800303
Shirvan, M. E., & Talebzadeh, N. (2017). English as a foreign language learners’ anxiety and interlocutors’ status and familiarity: An idiodynamic perspective. Polish Psychological Bulletin, 48(4), 489–503. https://doi.org/10.1515/ppb-2017-0056
Shohamy, E. (1994). The validity of direct versus semi-direct oral tests. Language Testing, 11(2), 99–123. https://doi.org/10.1177/026553229401100202
Siegel, E. H., Sands, M. K., Van den Noortgate, W., Condon, P., Chang, Y., Dy, J., Quigley, K. S., & Barrett, L. F. (2018). Emotion fingerprints or emotion populations? A meta-analytic investigation of autonomic features of emotion categories. Psychological Bulletin, 144(4), 343–393. https://doi.org/10.1037/bul0000128
Simms, L. J., Zelazny, K., Williams, T. F., & Bernstein, L. (2019). Does the number of response options matter? Psychometric perspectives using personality questionnaire data. Psychological Assessment, 31(4), 557–566. https://doi.org/10.1037/pas0000648
Singelis, T. (1994). Nonverbal communication in intercultural interactions. In R. Brislin & T. Yoshida (Eds.), Improving intercultural interactions (pp. 268–294). Sage. https://doi.org/10.4135/9781452204857.n14
Skehan, P. (1998). A cognitive approach to language learning. Oxford University Press. https://doi.org/10.1177/003368829802900209
Skogmyr Marian, K. (2023). The development of L2 interactional competence: A multimodal study of complaining in French interactions. Routledge. https://doi.org/10.4324/9781003271215
Slobin, D. I. (1996). Two ways to travel: Verbs of motion in English and Spanish. In M. Shibatani & S. A. Thompson (Eds.), Grammatical constructions: Their form and meaning (pp. 195–220). Clarendon Press.
Slobin, D. I. (2006). What makes manner of motion salient? Explorations in linguistic typology, discourse, and cognition. In M. Hickmann & S. Robert (Eds.), Space in languages: Linguistic systems and cognitive categories (pp. 59–81). John Benjamins. https://doi.org/10.1075/tsl.66.05slo
Smith, C. A., & Lazarus, R. S. (1993). Appraisal components, core relational themes, and the emotions. Cognition & Emotion, 7(3–4), 233–269. https://doi.org/10.1080/02699939308409189
Snider, J. G., & Osgood, C. E. (Eds.). (1969). Semantic differential technique: A sourcebook. Aldine.
So, W. C., Kita, S., & Goldin-Meadow, S. (2013). When do speakers use gestures to specify who does what to whom? The role of language proficiency and type of gestures in narratives. Journal of Psycholinguistic Research, 42, 581–594. https://doi.org/10.1007/s10936-012-9230-6
Spielberger, C. D. (1983). Manual for the State-Trait Anxiety Inventory: STAI (Form Y). Consulting Psychologists Press.
Stam, G. (1998). Changes in patterns of thinking about motion with L2 acquisition. In S. Santi, I. Guaïtella, C. Cavé, & G. Konopczynski (Eds.), Oralité et gestualité: Communication multimodale, interaction (pp. 615–619). L'Harmattan.
Stam, G. (2006). Thinking for speaking about motion: L1 and L2 speech and gesture. IRAL, 44(2), 145–171. https://doi.org/10.1515/IRAL.2006.006
Stam, G. (2008). What gestures reveal about second language acquisition. In S. G. McCafferty & G. Stam (Eds.), Gesture: Second language acquisition and classroom research (pp. 231–256). Routledge.
Stam, G. (2010). Can an L2 speaker's patterns of thinking for speaking change? In Z. H. Han & T. Cadierno (Eds.), Linguistic relativity in SLA: Thinking for speaking (pp. 59–83). Multilingual Matters. https://doi.org/10.21832/9781847692788-005
Stam, G. (2017). Verb-framed, satellite-framed, or in between? A L2 learner's thinking for speaking in her L1 and L2 over 14 years. In I. Ibarretxe-Antuñano (Ed.), Motion and space across languages: Theory and applications (pp. 329–366). John Benjamins. https://doi.org/10.1075/hcp.59.14sta
Stam, G., & Tellier, M. (2022). Gesture helps second and foreign language learning and teaching. In A. Morgenstern & S. Goldin-Meadow (Eds.), Gesture in language: Development across the lifespan (pp. 335–363). American Psychological Association. https://doi.org/10.1037/0000269-014
Stankov, L., Lee, J., Luo, W., & Hogan, D. J. (2012). Confidence: A better predictor of academic achievement than self-efficacy, self-concept and anxiety? Learning and Individual Differences, 22(6), 747–758. https://doi.org/10.1016/j.lindif.2012.05.013
Stivers, T., Enfield, N. J., Brown, P., Englert, C., Hayashi, M., Heinemann, T., Hoymann, G., Rossano, F., de Ruiter, J. P., Yoon, K.-E., & Levinson, S. C. (2009). Universals and cultural variation in turn-taking in conversation. Proceedings of the National Academy of Sciences, 106(26), 10587–10592. https://doi.org/10.1073/pnas.0903616106
Stöckli, S., Schulte-Mecklenbeck, M., Borer, S., & Samson, A. C. (2018). Facial expression analysis with AFFDEX and FACET: A validation study. Behavior Research Methods, 50, 1446–1460. https://doi.org/10.3758/s13428-017-0996-1
Streeck, J. (2009). Forward-gesturing. Discourse Processes, 46(2–3), 161–179. https://doi.org/10.1080/01638530902728793
Streeck, J., & Hartge, U. (1992). Previews: Gestures at the transition place. In P. Auer & A. di Luzio (Eds.), The contextualization of language (pp. 135–157). Benjamins. https://doi.org/10.1075/pbns.22.10str
Styles, P. (1993). Inter- and intra-rater reliability of assessments of “live” versus audio- and video-recorded interviews in the IELTS Speaking test. British Council.
Sueyoshi, A., & Hardison, D. (2005). The role of gestures and facial cues in second language listening comprehension. Language Learning, 55(4), 661–699. https://doi.org/10.1111/j.0023-8333.2005.00320.x
Suslow, T., Ohrmann, P., Bauer, J., Rauch, A. V., Schwindt, W., Arolt, V., Heindel, W., & Kugel, H. (2006). Amygdala activation during masked presentation of emotional faces predicts conscious detection of threat-related faces. Brain and Cognition, 61(3), 243–248. https://doi.org/10.1016/j.bandc.2006.01.005
Suvorov, R. (2015). The use of eye tracking in research on video-based second language (L2) listening assessment: A comparison of context videos and content videos. Language Testing, 32(4), 463–483. https://doi.org/10.1177/0265532214562099
Suvorov, R. (2018). Test takers’ use of visual information in an L2 video-mediated listening test: Evidence from cued retrospective reporting. In E. Wagner & G. J. Ockey (Eds.), Assessing L2 listening: Moving towards authenticity (pp. 145–160). John Benjamins. https://doi.org/10.1075/lllt.50.10suv
Suzuki, S., & Kormos, J. (2020). Linguistic dimensions of comprehensibility and perceived fluency: An investigation of complexity, accuracy, and fluency in second language argumentative speech. Studies in Second Language Acquisition, 42(1), 143–167. https://doi.org/10.1017/S0272263119000421
Suzuki, S., Kormos, J., & Uchihara, T. (2021). The relationship between utterance and perceived fluency: A meta-analysis of correlational studies. The Modern Language Journal, 105(2), 435–463. https://doi.org/10.1111/modl.12706
Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). Pearson.
Talmy, L. (1985). Lexicalization patterns: Semantic structure in lexical forms. In T. Shopen (Ed.), Language typology and syntactic description: Vol. 3. Grammatical categories and the lexicon (pp. 57–149). Cambridge University Press.
Talmy, L. (2000). Toward a cognitive semantics: Vol. II. Typology and process in concept structuring. MIT Press. https://doi.org/10.7551/mitpress/6848.001.0001
Tavakoli, P., & Skehan, P. (2005). Strategic planning, task structure and performance testing. In R. Ellis (Ed.), Planning and task performance in a second language (pp. 239–273). John Benjamins. https://doi.org/10.1075/lllt.11.15tav
Taylor, L. B., & Banerjee, J. (in press). Accommodations in language testing and assessment: Safeguarding equity, access, and inclusion. Language Testing.
Teimouri, Y., Goetze, J., & Plonsky, L. (2019). Second language anxiety and achievement: A meta-analysis. Studies in Second Language Acquisition, 41(2), 363–387. https://doi.org/10.1017/S0272263118000311
Tellier, M. (2008). The effect of gestures on second language memorisation by young children. Gesture, 8(2), 219–235. https://doi.org/10.1075/gest.8.2.06tel
The National Standards Collaborative Board. (2015). World-readiness standards for learning languages (4th ed.). The National Standards Collaborative Board.
Thompson, C. P. (2016). Preliminary study of the role of eye contact, gestures, and smiles produced by Chinese-as-a-first-language test takers on ratings assigned by English-as-a-first-language examiners during IELTS speaking tests [Unpublished MA thesis]. University of Victoria, Canada. http://hdl.handle.net/1828/7724
Thompson, G. L., Cox, T. L., & Knapp, N. (2016). Comparing the OPI and the OPIc: The effect of test method on oral proficiency scores and student preference. Foreign Language Annals, 49(1), 75–92. https://doi.org/10.1111/flan.12178
Tiedens, L. Z., & Fragale, A. R. (2003). Power moves: Complementarity in dominant and submissive nonverbal behavior. Journal of Personality and Social Psychology, 84(3), 558–568. https://doi.org/10.1037/0022-3514.84.3.558
Todorov, A. (2017). Face value: The irresistible influence of first impressions. Princeton University Press. https://doi.org/10.1515/9781400885725
Tominaga, W. (2013). The development of extended turns and storytelling in the Japanese oral proficiency interview. In S. J. Ross & G. Kasper (Eds.), Assessing second language pragmatics (pp. 220–257). Palgrave Macmillan. https://doi.org/10.1057/9781137003522_9
Tomkins, S. S., & McCarter, R. (1964). What and where are the primary affects? Some evidence for a theory. Perceptual and Motor Skills, 18(1), 119–158. https://doi.org/10.2466/pms.1964.18.1.119
Tran, V. (2007). The use, overuse, and misuse of affect, mood, and emotion in organizational research. In C. E. J. Härtel, N. M. Ashkanasy, & W. J. Zerbe (Eds.), Functionality, intentionality and morality (pp. 31–53). Elsevier. https://doi.org/10.1016/S1746-9791(07)03002-7
Trofimovich, P., & Isaacs, T. (2012). Disentangling accent from comprehensibility. Bilingualism: Language and Cognition, 15(4), 905–916. https://doi.org/10.1017/S1366728912000168
Trofimovich, P., Tekin, O., & McDonough, K. (2021). Task engagement and comprehensibility in interaction: Moving from what second language speakers say to what they do. Journal of Second Language Pronunciation, 7(3), 435–461. https://doi.org/10.1075/jslp.21006.tro
Tsunemoto, A., Lindberg, R., Trofimovich, P., & McDonough, K. (2022). Visual cues and rater perceptions of second language comprehensibility, accentedness, and fluency. Studies in Second Language Acquisition, 44(3), 659–684. https://doi.org/10.1017/S0272263121000425
Uchida, Y., & Kitayama, S. (2009). Happiness and unhappiness in east and west: Themes and variations. Emotion, 9(4), 441–456. https://doi.org/10.1037/a0015634
Uchida, Y., Townsend, S. S., Rose Markus, H., & Bergsieker, H. B. (2009). Emotions as within or between people? Cultural variation in lay theories of emotion expression and inference. Personality and Social Psychology Bulletin, 35(11), 1427–1439. https://doi.org/10.1177/0146167209347322
Uludag, P., McDonough, K., & Trofimovich, P. (2022). Exploring shared and individual assessment of paired oral interactions. Studies in Language Assessment, 11(2), 1–24. https://www.altaanz.org/uploads/5/9/0/8/5908292/1._sila_11_2__uludag_et_al..pdf
van Compernolle, R. A. (2013). Interactional competence and the dynamic assessment of L2 pragmatic abilities. In S. J. Ross & G. Kasper (Eds.), Assessing second language pragmatics (pp. 327–353). Palgrave Macmillan. https://doi.org/10.1057/9781137003522_13
van der Wel, P., & van Steenbergen, H. (2018). Pupil dilation as an index of effort in cognitive control tasks: A review. Psychonomic Bulletin & Review, 25, 2005–2015. https://doi.org/10.3758/s13423-018-1432-y
Van Ek, J. A. (1986). Objectives for foreign language learning. Volume I: Scope. Council of Europe.
Vygotsky, L. (1978). Mind in society: The development of higher psychological processes. Harvard University Press.
Walther, J. B., & Tidwell, L. C. (1995). Nonverbal cues in computer-mediated communication, and the effect of chronemics on relational communication. Journal of Organizational Computing, 5(4), 355–378. https://doi.org/10.1080/10919399509540258
Watson, D., Clark, L. A., & Tellegen, A. (1988). Development and validation of brief measures of positive and negative affect: The PANAS scales. Journal of Personality and Social Psychology, 54(6), 1063–1070. https://doi.org/10.1037/0022-3514.54.6.1063
Watson, O. M., & Graves, T. D. (1966). Quantitative research in proxemic behavior. American Anthropologist, 68(4), 971–985. https://doi.org/10.1525/aa.1966.68.4.02a00070
Wei, J., & Llosa, L. (2015). Investigating differences between American and Indian raters in assessing TOEFL iBT speaking tasks. Language Assessment Quarterly, 12(3), 283–304. https://doi.org/10.1080/15434303.2015.1037446
Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197–223. https://doi.org/10.1177/026553229401100206
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263–287. https://doi.org/10.1177/026553229801500205
Westfall, J., Kenny, D. A., & Judd, C. M. (2014). Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General, 143(5), 2020–2045. https://doi.org/10.1037/xge0000014
WIDA. (2020). WIDA English language development standards framework, 2020 edition: Kindergarten–grade 12. Board of Regents of the University of Wisconsin System. https://wida.wisc.edu/sites/default/files/resource/WIDA-ELD-Standards-Framework-2020.pdf
Wierzbicka, A. (1992). Talking about emotions: Semantics, culture, and cognition. Cognition & Emotion, 6(3–4), 285–319. https://doi.org/10.1080/02699939208411073
Wierzbicka, A. (1994). Emotion, language, and cultural scripts. In S. Kitayama & H. R. Markus (Eds.), Emotion and culture: Empirical studies of mutual influence (pp. 133–196). American Psychological Association. https://doi.org/10.1037/10152-004
Willis, J., & Todorov, A. (2006). First impressions: Making up your mind after a 100-ms exposure to a face. Psychological Science, 17(7), 592–598. https://doi.org/10.1111/j.1467-9280.2006.01750.x
Winke, P., Zhang, X., & Pierce, S. (2022). A closer look at a marginalized test method: Self-assessment as a measure of speaking proficiency. Studies in Second Language Acquisition. Advance online publication. https://doi.org/10.1017/S0272263122000079
Wojciszke, B., Bazinska, R., & Jaworski, M. (1998). On the dominance of moral categories in impression formation. Personality and Social Psychology Bulletin, 24(12), 1251–1263. https://doi.org/10.1177/01461672982412001
Worster, E., Pimperton, H., Ralph-Lewis, A., Monroy, L., Hulme, C., & MacSweeney, M. (2018). Eye movements during visual speech perception in deaf and hearing children. Language Learning, 68(S1), 159–179. https://doi.org/10.1111/lang.12264
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370. https://www.rasch.org/rmt/rmt83b.htm
Xi, X., & Mollaun, P. (2011). Using raters from India to score a large-scale speaking test. Language Learning, 61(4), 1222–1255. https://doi.org/10.1111/j.1467-9922.2011.00667.x
Yan, X., & Chuang, P.-L. (2023). How do raters learn to rate? Many-facet Rasch modeling of rater performance over the course of a rater certification program. Language Testing, 40(1), 153–179. https://doi.org/10.1177/02655322221074913
Yentes, R. D., & Wilhelm, F. (2018). careless: Procedures for computing indices of careless responding (R package version 1.2.0). https://github.com/ryentes/careless
Young, R. (2002). Discourse approaches to oral language assessment. Annual Review of Applied Linguistics, 22, 243–262. https://doi.org/10.1017/S0267190502000132
Young, R. F. (2011). Interactional competence in language learning, teaching, and testing. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning: Volume II (pp. 426–443). Routledge.
Young, R., & He, A. W. (Eds.). (1998). Talking and testing: Discourse approaches to the assessment of oral proficiency. Benjamins. https://doi.org/10.1075/sibil.14
Zahn, C. J., & Hopper, R. (1985). Measuring language attitudes: The speech evaluation instrument. Journal of Language and Social Psychology, 4(2), 113–123. https://doi.org/10.1177/0261927X8500400203
Zhang, B. (2010). Assessing the accuracy and consistency of language proficiency classification under competing measurement models. Language Testing, 27(1), 119–140. https://doi.org/10.1177/0265532209347363
Zhang, Y., & Elder, C. (2011). Judgements of oral proficiency by non-native and native English speaking teacher raters: Competing or complementary constructs? Language Testing, 28(1), 31–50. https://doi.org/10.1177/0265532209360671
Zhu, X., Liao, X., & Cheong, C. M. (2019). Strategy use in oral communication with competent synthesis and complex interaction. Journal of Psycholinguistic Research, 48, 1163–1183. https://doi.org/10.1007/s10936-019-09651-0

APPENDIX A: INFORMATION, CONSENT, AND NON-DISCLOSURE AGREEMENT

Research Participant Information and Consent Form

Study Title: Research on Online Language Tests of Speaking
Researcher and Title: John Dylan Burton, PhD Student
Department and Institution: Second Language Studies, Michigan State University
Contact Information: burtonjd@msu.edu, 517-604-1486

BRIEF SUMMARY
I am a PhD student at Michigan State University, currently studying in the Second Language Studies program, and I would like to invite you to take part in a research study about some aspects of speaking tests in an online environment. Please take time to read the following information carefully before you decide whether or not you wish to take part. Researchers are required to provide a consent form to inform you about the research study, to convey that participation is voluntary, to explain risks and benefits of participation including why you might or might not want to participate, and to empower you to make an informed decision. You should feel free to discuss and ask the researchers any questions you may have.
You are being asked to participate in a research study of speaking tests in an online environment. Because most speaking tests take place face-to-face, we would like to understand more about how people perceive online tests. Your participation in this study will take about two hours total. You will be asked to learn how to use a rating scale and apply it to 10 speaking tests online. Each test will last about 5 minutes. After this, you will be asked a few final follow-up questions. There is no risk in taking part in this study. Your personal details will not be made public, and your anonymity will be maintained. You will receive the compensation outlined below for your participation. Your input will greatly help us understand how raters perceive online tests of speaking and how they assign scores. The information you provide may be used in the future to create better and more interesting tests.

PURPOSE OF RESEARCH
I have approached you because you have been identified as an experienced rater and you have expressed interest in this study. You and other experienced raters will be able to provide important insight for the development of speaking tests, both in the online format and in the scales used to rate. Your experience is greatly valued, and your knowledge can contribute to our understanding of this aspect of test development. I would be very grateful if you would agree to take part in this study.

WHAT YOU WILL BE ASKED TO DO
You will be asked to rate 5-minute speaking tests that have taken place via ZOOM. You will be able to rate these from the comfort of your home or office. Although it is preferable for you to rate the session in one sitting, you may choose to take breaks as you wish. After rating, there will be a series of questions about yourself and your experience rating. I estimate that the session will take 2 hours. You will receive compensation for your participation in this study.

POTENTIAL RISKS
There are no risks in taking part in this study.
Your privacy will be kept by ensuring that no personal information is connected to your rating data or interview data, and all records of your participation will be kept separate from any research data.

PRIVACY AND CONFIDENTIALITY
The data for this project are being collected anonymously. Neither the researchers nor anyone else will be able to link data to you, such as your name, IP address, e-mail address, etc. The data will be kept in an online repository. Your personal information will not be available in this repository. Only your ratings and transcribed interview data will be available in this repository and may be used by myself and/or other researchers for analysis. Researchers at MSU, the Institutional Review Board (IRB), and academics with access to the data repository will have access only to your ratings and spoken data. Your personal information will be kept safe and secure in an alternate location from any files available for analysis. Your coworkers and students will not be able to access any of your personally identifiable data. Although we will make every effort to keep your data confidential, there are certain times, such as under a court order, when we may have to disclose your data. The results of this study may be published or presented at professional meetings, but the identities of all research participants will remain anonymous.

YOUR RIGHTS TO PARTICIPATE, SAY NO, OR WITHDRAW
Participation is voluntary. Refusal to participate will involve no penalty or loss of benefits to which you are otherwise entitled. You may discontinue participation at any time without penalty or loss of benefits to which you are otherwise entitled. You have the right to say no. You may change your mind at any time and withdraw. You may choose not to answer specific questions or to stop participating at any time. Choosing not to participate or withdrawing from this study will not impact you. If you change your mind, you are free to withdraw at any time during your participation in this study. If you want to withdraw, please let me know, and I will attempt to extract any ideas or information (data) you contributed to the study and destroy them. However, this may be impossible as the questionnaires and recorded tests will be anonymous and randomized, so please tell me as early as possible.

COSTS AND COMPENSATION FOR BEING IN THE STUDY
You will receive $50 in compensation for your participation in this study ($25 per hour) in the form of an Amazon gift card. You will be compensated upon completion of the study.

FUTURE RESEARCH
Information that identifies you will be removed from all data files. The data could be used for future research studies or distributed to another investigator for future research studies without additional informed consent from you.
CONTACT INFORMATION
If you have concerns or questions about this study, such as scientific issues, how to do any part of it, or to report an injury, please contact the researcher:
Name: Dylan Burton
Mailing Address: 619 Red Cedar Road, Michigan State University, East Lansing, MI 48824
E-mail address: burtonjd@msu.edu
Phone number: 517-604-1486
If you have questions or concerns about your role and rights as a research participant, would like to obtain information or offer input, or would like to register a complaint about this study, you may contact, anonymously if you wish, Michigan State University's Human Research Protection Program at 517-355-2180, fax 517-432-4503, e-mail irb@msu.edu, or regular mail at 4000 Collins Rd, Suite 136, Lansing, MI 48910.

[RATING STUDY ONLY] Documentation of Informed Consent
I agree to allow quotes from the written comments, but not my personal information, to be disclosed in reports and presentations. Yes No Initials____________
Your signature below means that you voluntarily agree to participate in this research study.
Signature _____________________________ Date________________________________
If you agree to participate in this study, please tick the above boxes, type your initials, insert an electronic signature in the space above, and write the date. Please save a copy of this consent form on your computer for your own records.

[STIMULATED RECALL ONLY] Documentation of Informed Consent
Participation in this study requires that the interviews be audio recorded. Audio will be transcribed. The audio will not be available to external researchers, but transcriptions will be.
I agree to allow audiotaping of the interview. Yes No Initials____________
I agree to allow quotes from the audio transcript, but not my personal information, to be disclosed in reports and presentations. Yes No Initials____________
Your signature below means that you voluntarily agree to participate in this research study.
Signature _____________________________ Date________________________________

Non-disclosure agreement (NDA)
This Nondisclosure Agreement ("Agreement") has been entered into on December 1, 2021, and is by and between:
Party Disclosing Information: John Dylan Burton, with a mailing address of 619 Red Cedar Road, Michigan State University, East Lansing, MI 48824 ("Disclosing Party").
Party Receiving Information: {Examiner's name automatically populated}, with a contact address of {populate e-mail address} ("Receiving Party").
For the purpose of preventing the unauthorized disclosure of Confidential Information as defined below, the parties agree to enter into a confidential relationship concerning the disclosure of certain proprietary and confidential information ("Confidential Information").
1. Definition of Confidential Information. For purposes of this Agreement, "Confidential Information" shall include all information or material that has or could have commercial value or other utility in the business in which Disclosing Party is engaged. All audiovisual content (videos, test questions, and responses) constitutes Confidential Information in the context of this research.
2. Exclusions from Confidential Information.
Receiving Party's obligations under this Agreement do not extend to information that is: (a) publicly known at the time of disclosure or subsequently becomes publicly known through no fault of the Receiving Party; (b) discovered or created by the Receiving Party before disclosure by Disclosing Party; (c) learned by the Receiving Party through legitimate means other than from the Disclosing Party or Disclosing Party's representatives; or (d) disclosed by Receiving Party with Disclosing Party's prior written approval.
3. Obligations of Receiving Party. Receiving Party shall hold and maintain the Confidential Information in strictest confidence for the sole and exclusive benefit of the Disclosing Party. Receiving Party shall not allow access to Confidential Information to any other individuals. Receiving Party shall not, without the prior written approval of Disclosing Party, use for Receiving Party's benefit, publish, copy, or otherwise disclose to others, or permit the use by others for their benefit or to the detriment of Disclosing Party, any Confidential Information. Receiving Party shall return to Disclosing Party any and all records, notes, and other written, printed, or tangible materials in its possession pertaining to Confidential Information immediately if Disclosing Party requests it in writing.
4. Time Periods. The nondisclosure provisions of this Agreement shall survive the termination of this Agreement, and Receiving Party's duty to hold Confidential Information in confidence shall remain in effect until the Confidential Information no longer qualifies as a trade secret or until Disclosing Party sends Receiving Party written notice releasing Receiving Party from this Agreement, whichever occurs first.
5. Relationships. Nothing contained in this Agreement shall be deemed to constitute either party a partner, joint venturer, or employee of the other party for any purpose.
6. Severability. If a court finds any provision of this Agreement invalid or unenforceable, the remainder of this Agreement shall be interpreted so as best to effect the intent of the parties.
7. Integration. This Agreement expresses the complete understanding of the parties with respect to the subject matter and supersedes all prior proposals, agreements, representations, and understandings. This Agreement may not be amended except in writing signed by both parties.
8. Waiver. The failure to exercise any right provided in this Agreement shall not be a waiver of prior or subsequent rights.
9. Notice of Immunity. Receiving Party is provided notice that an individual shall not be held criminally or civilly liable under any federal or state trade secret law for the disclosure of a trade secret that is made (i) in confidence to a federal, state, or local government official, either directly or indirectly, or to an attorney, and (ii) solely for the purpose of reporting or investigating a suspected violation of law; or that is made in a complaint or other document filed in a lawsuit or other proceeding, if such filing is made under seal. An individual who files a lawsuit for retaliation by an employer for reporting a suspected violation of law may disclose the trade secret to the attorney of the individual and use the trade secret information in the court proceeding, if the individual (i) files any document containing the trade secret under seal and (ii) does not disclose the trade secret, except pursuant to court order.
This Agreement and each party's obligations shall be binding on the representatives, assigns, and successors of such party. Each party has signed this Agreement through its authorized representative.

DISCLOSING PARTY
Signature:
Typed or Printed Name:
Date:

RECEIVING PARTY
I hereby agree not to disclose any confidential information as outlined in this agreement, and agree that my initials will be taken as a digital signature. Yes No Initials____________
Full Name:
Date: _______________

APPENDIX B: RATING STUDY SIGN-UP, INSTRUCTIONS, AND PRACTICE

Sign-up questionnaire

Thank you for your interest in taking part in this study on foreign language testing. In order to take part in this study, you must:
• Be an undergraduate at MSU
• Speak English as your first language
• Use a laptop or desktop computer to do the study (no mobile devices)
If you are selected for the study, you will watch several videos and answer questions about them. If you choose to do the study in our lab, participation will take about two hours on one day. If you choose to do the study online, participation in the project will take place on two days, and each session will last about one hour (two hours total). You may only take part in this study once.
You will need:
• two hours
• a quiet place, free from distractions
• headphones to listen to the audio, if you can
You can choose when and where you do the study, but you must make sure that you have enough time to complete each session of the study in one sitting before you begin. You can take breaks while doing the study. As compensation, after you complete both sessions, you will receive a $30 e-gift certificate from Amazon.
Click here to read more information about the study. This document also contains the non-disclosure agreement you must agree to prior to beginning the study. If you have any questions about the study, you may contact me at burtonjd@msu.edu.

First, I would like to know a little about you
First name: _____________________________
Last name: _____________________________
E-mail address: _____________________________ (This must be an @msu.edu address, used to send you the survey and gift card at the end.)
Year of birth: _____________________________
Gender (choose): ○Male ○Female ○Other ○I would prefer not to say
Country of Origin/Nationality: _____________________________
Do you speak more than one language? ○Yes ○No
What do you consider your first language? ○English ○Spanish ○Chinese ○Other (This refers to the language that you grew up speaking and that you use every day. If you are bilingual, it is the language you feel is most dominant.)
What year are you in at MSU? ○Freshman ○Sophomore ○Junior ○Senior ○Other: __________ (please specify)
What is your major? _____________________________
Do you have access to a quiet, distraction-free space where you can complete this study? ○Yes ○No
After completing the study, a small number of participants may be asked to take part in a face-to-face interview to discuss their scores. Interviewees will be compensated for their time. Would you be willing to meet with the researcher in a lab in Wells Hall to discuss your scores? ○Yes ○No
Thank you, ${q://QID255/ChoiceTextEntryValue}! I will be in touch with you shortly if you are a good fit for the study. If you have any questions, you can write to me at burtonjd@msu.edu.
Figure B.1 Instructions and Practice
Figure B.2 Practice Set 1 (*Proprietary speech sample; participant signed release)
Figure B.3 Practice Set Responses
Figure B.4 Practice Set 2 (*Image removed to protect test taker's identity, but was visible to participants who signed NDAs) [ABOVE SCALES ARE PRESENTED AGAIN FOR THIS TEST TAKER]
Figure B.5 Feedback Example

APPENDIX C: FOLLOW-UP SURVEY

Thank you, ${e://Field/First%20Name}! You have now completed rating all samples. I would like to ask you just a few more questions before we finish.
Please indicate your agreement with the following statements (Fully agree / Agree / Not sure / Disagree / Fully disagree):
• The instructions were clear and effective
• The practice sessions were useful
• Rating in the online system was comfortable
• The audio and video were clear
• The rating scales were simple to use
• I feel confident about the ratings I awarded
If you experienced any technical problems while rating, please let me know here: ____________________
If you experienced any doubts about your scores, what made you feel this way? How did you resolve these doubts? ______________________________________________________________________________
Which rating criterion was the hardest to apply? Why? ________________________________________
Please rank the following features for how important they were when making a decision about language scores (fluency, grammar, vocabulary, pronunciation). Click and drag the elements; 1 is the most important, 10 is the least.
o The speakers’ environment
o The quality of the video and audio
o Mistakes speakers made
o The things speakers talked about (content)
o The speakers’ facial reactions during the test
o The speakers’ appearance (clothing, makeup)
o The examiner (questions and speaking style)
o The speakers’ eye gaze and attention
o The speakers’ accent
o The words and expressions speakers used
Please rank the following features for how important they were when making a decision about affect scores (confidence, anxiety, engagement, etc.). Click and drag the elements; 1 is the most important, 10 is the least.
o The speakers’ environment
o The quality of the video and audio
o Mistakes speakers made
o The things speakers talked about (content)
o The speakers’ facial reactions during the test
o The speakers’ appearance (clothing, makeup)
o The examiner (questions and speaking style)
o The speakers’ eye gaze and attention
o The speakers’ accent
o The words and expressions speakers used
Thank you, ${q://QID255/ChoiceTextEntryValue}! I will be in touch soon with your Amazon e-gift certificate.

APPENDIX D: STIMULATED RECALL MATERIALS

Set-up Instructions
Prior to coming to the interview, each participant will be invited to the interview and will select a day and time on the main researcher's schedule using Calendly. Three days prior to the session, each participant will receive an e-mail providing instructions about how to take part in the study. Participants will be told that they will complete the survey online within 24 hours of the stimulated recall. On the day prior to the interview, participants will be e-mailed a link to the survey and instructed to complete it before the interview session at a time convenient to them. The main researcher will carry out the stimulated recall alone, without research assistants. The recall will be done with one participant at a time. Each participant will view 10 videos, provide their recalls, and then take part in a wrap-up interview.
The entire session will take no longer than 90 minutes, and participants will be compensated for their time.

Recording devices
The main researcher will record sessions using both QuickTime and a digital recording device. QuickTime will ensure that video information about stopping points is available for analysis.

Initiation
• Welcome participant to the lab
• Invite them to sit at a computer terminal to the left of mine
• Provide a bottle of water
• Engage in small talk before getting started
• Have the participant sign a consent form

Instructions for research participants
What we're going to do now is re-watch 10 of the videos from the testing survey. I am interested in what you were thinking at the time you were watching and making decisions. So what I'd like you to do is tell me what you were thinking, what was in your mind at that time. Any thoughts you can remember are important. We are going to watch the videos on this computer screen. You can pause the video any time you like by pressing the space bar. So if you remember something that you thought when you saw the video, you can push pause and tell me. I might want to ask about what you were thinking at a particular point, in which case I will pause the video and ask you to talk about your thought processes at that moment.
Now we are going to practice. Here is one of the sample videos that we will watch. To play the video, you will press the space bar.
Demonstrate pressing the space bar. Allow the video to progress about 10 seconds.
Now imagine here you remember thinking something. You would hit the space bar again to pause the video.
Press the space bar to pause the video.
At this point you can tell me what you remember thinking when you watched the video the first time. Take as much time as you need. When you are finished, you can hit the space bar to continue.
Press the space bar to continue. Stop the video.
I may also choose to ask you a question at a particular moment. In that case, I will pause the video myself and ask you a question. Do you have any questions about how this method will work? If you need to take a break at any time, just let me know.
Ask if the participant is ready to begin. If the participant is ready, begin recording the session, and allow the participant to start the first video when they are ready.

Probe questions
What were you thinking here/at this point/right then?
What were your first impressions at this point?
What about here/this point? What were you thinking?
Do you remember thinking anything at this point in the video?
Can you tell me what you were thinking at that point?
Can you tell me what you thought when she said that?
How did you arrive at this decision?
Could you elaborate a bit more on that/this/this particular point?
Can you talk about this point a bit more?
Do you have any other comments on this file?
Do you remember anything else you thought about this video?
What makes you think that?

Final interview
After all samples have been finished, conclude with some final questions.
I was wondering if I could ask you something now that the videos are done? I noticed that sometimes you mentioned how the speaker behaved when you were making your decisions about language.
• How did the person's overall behavior influence your scores? Did anything specific give you more information?
• What about eye gaze/facial behaviors/posture?
• How do you balance how they behave with the things that they say when you make decisions?
• Do you think any specific aspect of language is impacted by nonverbal behavior more than others?

APPENDIX E: COMMUNICATIONS TO PARTICIPANTS

Sign-up letter
Dear MSU students,
I am excited to invite you to participate in my research study on online foreign language speaking tests. This project is part of my dissertation. The goal of the study is to understand how people perceive second language English speech of individuals taking an online speaking test. By participating, you will help us better understand how the online environment impacts language processing.
Who can participate? Up to 60 undergraduates at Michigan State University who are first language English speakers
What will you do? You will watch and listen to a set of speech samples online and rate them using a set of rating scales.
How long will it take? Approximately 2 hours, spread out over two different days, online
What will you get for participating? An Amazon gift card for $30 USD
Why should you participate? To contribute to our understanding of second language speech perception and to help improve the online formats of speaking tests
If you are interested in participating, please go to https://msu.co1.qualtrics.com/jfe/form/SV_8faVZhqlS8FzGB0 to fill out a screener. Participation is first come, first served, and will be subject to meeting the requirements stipulated above. Please do not hesitate to contact me with any questions regarding your participation.
Best,
Dylan Burton

Invitation letter for day 1
Dear ${e://Field/First%20Name},
Thank you for your recent expression of interest in participating in a research study on foreign language testing. I have reviewed your survey response and believe you would be a good fit for the study.
Before starting:
• Please remember to review the study's information sheet and non-disclosure agreement. You will be asked to digitally agree to these before beginning the study.
• You will also need to choose a day and a time at your own convenience when you can complete the study in one sitting.
• Each of the two sessions will take about an hour, and as you will be watching and listening to videos, you will need a quiet space free from distractions. I can arrange time in a computer lab in Wells Hall if you do not have access to a quiet space. Just reply to this e-mail and let me know.
• The study can only be completed on a laptop or desktop computer. No mobile devices are allowed.
• Please try to make time to complete the study by ${date://OtherDate/FL/+1%20week}. If you would like to drop out of the study, please let me know by replying to this e-mail address. There are few places remaining in the study, and once these have been filled, access to the study will be closed.
• Once you complete both days of the study, you will receive a $30 Amazon e-gift card.
When you are ready to begin, you may click the link below. If for whatever reason your window closes while doing the study, you can return to it by using the same link.
Start the survey!
If the above link asks you for "ExternalReferenceID", please enter the code: ${e://Field/id}
If you have any questions, please let me know.
Best wishes,
Dylan Burton
PhD Candidate, Second Language Studies
${l://OptOutLink} ${l://SurveyLink?d=Take%20the%20survey}

Invitation letter for day 2
Dear ${e://Field/First%20Name},
Thank you for completing Day 1 of the research study on foreign language tests. You may now complete the second set of speech samples by clicking the link below.
You can do this at the day/time of your choosing, just like Day 1 of the study. Remember to set aside one hour to complete the study, and make sure you have a quiet space free from distractions before you start. You can access Day 2 of the study below. If your window closes during the study, you can return to it using the same link.
https://msu.co1.qualtrics.com/jfe/form/SV_eXmSBnjjkXnlgwK?id=${e://Field/id}
If the link above asks you for a code, please enter ${e://Field/id}
Please let me know if you have any problems or questions.
Best wishes,
Dylan Burton

Invitation letter for study and stimulated recall
Dear [Name],
Thank you for expressing interest in taking part in the research study on foreign language testing. I have looked over your response, and I believe you would be a good fit for the study. In the survey, you indicated that you would be willing to be interviewed face-to-face following completion of the study in our lab in Wells Hall. If this is still the case, I would like to offer you the chance to take part in the study. I will in turn give you an Amazon e-gift card worth $50 for participating: $30 for the study itself, and $20 for the interview.
In order to take part, first choose a 90-minute slot for the face-to-face interview using the following link: https://calendly.com/burtonjd/foreign-language-testing-research-study. 24 hours before the interview, I will send you a link to complete the study in your home or other quiet place. The study will take about two hours to complete. You can do this on the same day as the interview, but you must have enough time to finish before the interview starts. If you wish to do the study in our lab prior to the interview, this can also be arranged. Just let me know.
If you are unable to participate, or if you are no longer interested, please just let me know so that I can invite someone else in your place. Thank you so much. I look forward to hearing from you.
Best,
Dylan Burton
PhD Candidate, Second Language Studies

Instructions e-mail for stimulated recall
Dear [Name],
Thank you for agreeing to participate in the research study on foreign language testing! This e-mail contains instructions on how the study will proceed.
Instructions:
• On [Date] at [Time], I will send you a link to the survey. You will need to choose a time within 24 hours of our interview start time to complete this study online. It must be fully complete by the time our interview starts, so make sure you plan accordingly.
• The survey will take about two hours, and it includes a break. As you will be watching and listening to videos, you will need a quiet space free from distractions during that time.
• You can use our lab in Wells Hall if you do not have access to a quiet space. If you would like, this can be just before our interview. Just reply to this e-mail and let me know, and I can schedule this for you.
• The study can only be completed on a laptop or desktop computer. No mobile devices are allowed.
• We will meet in Wells Hall in room B417 (B-wing) on [Date] at [Time] for a 90-minute interview. You do not need to prepare anything for the interview, but please arrive on time.
• Once you complete the study and the interview, you will receive a $50 Amazon e-gift card.
If you cannot complete both the survey and the interview, or if you have any other last-minute changes or requests, please let me know by replying to this e-mail address or contacting me at 517-604-1486.
Thank you!
Dylan Burton
PhD Candidate, Second Language Studies

Invitation letter for one-day rating study

Dear ${e://Field/First%20Name},

Thank you for agreeing to participate in the research study on foreign language testing. This e-mail contains the link to the survey; following the survey, we will meet in Wells Hall for a face-to-face interview.

Before starting:
• Please remember to review the study's information sheet and non-disclosure agreement. You will be asked to digitally agree to these before beginning the study.
• You will need to choose a time within 24 hours of our interview start time to complete this study. It must be fully complete by the time our interview starts, so make sure you plan at least two hours to complete the survey beforehand.
• The survey will take about two hours, and it includes a break. As you will be watching and listening to videos, you will need a quiet space free from distractions. You can use our lab in Wells Hall if you do not have access to a quiet space. If you would like, this can be just before our interview. Just reply to this e-mail to let me know, and I can schedule this for you.
• The study can only be completed on a laptop or desktop computer. No mobile devices are allowed.
• If you cannot complete both the survey and the interview, or if you have any other last-minute changes or requests, please let me know by replying to this e-mail address or contacting me at 517-604-1486.
• Once you complete the study and the interview, you will receive a $50 Amazon e-gift card.

When you are ready to begin, you may click the link below. If for whatever reason your window closes while doing the study, you can return to it by using the same link.

${l://SurveyLink?d=Take%20the%20survey}

If the above link asks you for "ExternalReferenceID", please enter the code: ${e://Field/id}

If you have any questions, please let me know.

Best wishes,
Dylan Burton
PhD Candidate, Second Language Studies

${l://OptOutLink}

APPENDIX F: ELAN TIER DESCRIPTIONS (ADAPTED FROM BURTON, 2021)

Table F.1 ELAN Tier Descriptions

Tiers 1-2, Discourse: The examiner's and the participants' speech were transcribed separately into two tiers. In most cases these annotations represented full turn-constructional units (TCUs). The units were segmented through a frame-by-frame analysis of the audio and an inspection of the waveform output, and were transcribed orthographically with breathing, filled pauses, and laughing indicated.

Tiers 3-4, Words: The examiner's and the participants' speech were likewise transcribed separately, word by word, using orthographic transcription. Word boundaries were segmented using the waveform and a reduced-speed audio recording. Breathing was transcribed as <.hhh> for inhaling and <hhh> for exhaling. Filled pauses were marked with "uh" or "um" as spoken by the test taker. Laughing was annotated with its own marker.

Tier 5, Gaze: Gaze was annotated as averted from the moment of the first shift in eye direction until gaze was reconnected with the examiner. When blinking occurred at gaze-shift boundaries, the blink was included with averted gaze.

Tier 6, Blinks: Blinks were segmented separately from gaze. Blink segments were annotated from the first moment the participant's eyelid began to fall and ended when the eyeball again became visible.

Tier 7, Mouth: Five mouth behaviors were annotated. Pursed lips were annotated when the participant's mouth was tightly closed, generally with the cheek muscles tight on each side of the lips. Smiling was annotated without distinguishing Duchenne and non-Duchenne types.
Open (non-speaking) was a category that appeared between speech segments, where the participant held her mouth open without speaking. Laughing was annotated only when this behavior was both seen and heard. Tongue touching lips was annotated when the mouth was closed with the tongue visible.

Tier 8, Eyebrow: Two eyebrow movements were found in the dataset. Furrowed brows were contracted, often with visible skin folds between the eyebrows. Raised indicated eyebrows lifted vertically away from the eyes.

Tier 9, Head Turn: Head movements were classified as head turn left, head turn right, head tilt left, head tilt right, head raise, and head lower. These categories described movement of the head to the left or right (turn), in a diagonal direction (tilt), or in the specified vertical direction (raise, lower). Turns, raises, and lowers generally accompanied averted gaze.

Tier 10, Head Gesture: Head gestures included nods, shakes, and pokes. Nods included nonverbal backchannels and were annotated throughout their duration. Shakes were annotated as moments when an individual disagreed with or negated a statement by turning the head quickly from side to side. Pokes were annotated as a quick forward head movement that may convey non-understanding (Seo & Koshik, 2010).

Tier 11, Posture: Posture referred broadly to the relationship between the participant's body and the camera. Tilt forward occurred when the participant leaned from the neutral position toward the camera. Tilt back referred to movement away from the camera from the neutral position. Rocking was annotated when the test taker was seen moving backward and forward in relation to the camera. Shift was annotated when an individual quickly moved right or left in the chair to readjust.

Tier 12, Gesture: Gestures were segmented broadly as general movements of the hand. Representational gestures, which occur with speech and carry a non-emblematic visual or metaphorical referential meaning (Kendon, 2004), were annotated by describing them as closely as possible. Deictics were annotated when an individual pointed. Beats were annotated when gestures were used to emphasize speech at prosodic boundaries. Self-adaptors, which are movements, generally of the hands, that may not co-occur with speech and are generally not representational in meaning (Ekman & Friesen, 1969), were annotated as a description of the action taking place (e.g., scratches head).

Tier 13, Other: This category was left for occasional movements that were rarely observed in the dataset. Swallowing and shoulder shrugging were annotated when visible.
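The tier scheme above also lends itself to programmatic summary. Below is a minimal sketch of how annotation counts and durations could be pulled from an .eaf file, assuming the pympi-ling library and assuming that the tier IDs in the files match the labels in Table F.1; the file name is hypothetical.

    # Minimal sketch: summarizing ELAN annotations with pympi-ling
    # (pip install pympi-ling). The file name and tier IDs are assumptions
    # for illustration; actual project files may use different identifiers.
    import pympi

    eaf = pympi.Elan.Eaf("test_taker_01.eaf")  # hypothetical file name

    for tier in ("Gaze", "Blinks", "Gesture"):  # tier IDs assumed from Table F.1
        data = eaf.get_annotation_data_for_tier(tier)  # [(start_ms, end_ms, value), ...]
        total_s = sum(end - start for start, end, _ in data) / 1000
        print(f"{tier}: {len(data)} annotations, {total_s:.1f} s annotated")

Per-tier summaries of this kind can then be merged with the rating data for analysis.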
APPENDIX G: STUDY VARIABLES

Table G.1 Summary of Variables (Variable: Definition; Type; Format)

Background variables (not modeled)
Age: The rater's age at time of participation; Continuous/Ratio; Integer
Gender: The gender that a rater identifies with; Categorical; Four categories: Male/Female/Other/Prefer not to say
Nationality: The rater's country of origin; Categorical; Open-ended, rater may specify
L1: The rater's first language, or the language considered most dominant; Categorical; Open-ended, rater may specify
L2: Yes/no question of whether the participant spoke an L2; Binary choice; Yes/No
Major: Main focus of study in college education; Categorical; Open-ended, rater may specify

Predictor variables
Valence: Overall positivity/negativity of emotional response; Ratio; Mean score from -100 to 100
Attention: Directedness of eye gaze and head turns toward the webcam; Ratio; Mean score from 0 to 100
Engagement: A measure of overall facial muscle activation (expressiveness); Ratio; Mean score from 0 to 100
IELTS Test Scores: Test scores originally reported on the 1-9 band scale, rescaled to a 1-7 scale for comparability with other measures in this study; Ranked categorical; Score of 1-7
Affect scales (10): Semantic differential scales of engagement, anxiety, confidence, warmth, attention, expressiveness, happiness, competence, interactiveness, and attitude; Ranked categorical; Score of 1-7

Outcome variables
Language scales (4): Semantic differential scales of fluency, grammar, vocabulary, and comprehensibility; Ranked categorical; Score of 1-7
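Table G.1 notes that the IELTS scores were rescaled from the 1-9 band scale to a 1-7 scale but does not spell out the transformation. As a sketch only, assuming a simple linear min-max mapping:

    # Hedged sketch: one plausible way to map a 1-9 IELTS band score onto a
    # 1-7 scale. The appendix does not state the exact transformation used;
    # a linear min-max mapping is assumed here purely for illustration.
    def rescale_band(score: float) -> float:
        """Linearly map a score from the 1-9 range onto the 1-7 range."""
        return 1 + (score - 1) * (7 - 1) / (9 - 1)

    # rescale_band(1.0) -> 1.0, rescale_band(5.0) -> 4.0, rescale_band(9.0) -> 7.0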
APPENDIX H: CATEGORY STATISTICS

Figure H.1 Rasch Category Statistics [figure]
Figure H.2 Rasch Item Characteristic Curves [figure]

APPENDIX I: DESCRIPTIVE DATA FOR RATERS

Figure I.1 Rasch Rater Data Organized by Fit Statistics [figure]
Figure I.2 Rasch Rater Data Organized by Severity Measure [figure]

APPENDIX J: INTERSECTIONS OF NONVERBAL BEHAVIOR

Table J.1 Percentages of Behavior Across Categories of Affect
Columns, in order: Anxiety; Attentive; Attitude; Competence; Confidence; Desire to Communicate; Engagement; Expressiveness; Happiness; Humor; Interactiveness; Warmth. Each row's percentages sum to approximately 100; the final value after the semicolon is the raw count.

Body Language (General): 32, 5, 5, 5, 26, 0, 11, 5, 5, 0, 0, 5; 19
Eyebrows: 14, 14, 5, 0, 10, 0, 14, 19, 5, 0, 14, 5; 19
Face (General): 19, 8, 7, 6, 13, 3, 5, 19, 8, 0, 4, 7; 108
Averted gaze: 26, 13, 4, 4, 21, 0, 17, 2, 2, 0, 6, 4; 47
Blinking: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0; 0
Eyes grow wide: 33, 17, 0, 0, 0, 0, 0, 17, 17, 0, 0, 17; 6
Mutual gaze: 10, 17, 5, 6, 10, 3, 25, 11, 3, 0, 6, 3; 63
Shifting gaze: 35, 4, 0, 4, 20, 4, 11, 12, 1, 1, 7, 1; 82
Staring: 20, 13, 7, 0, 7, 0, 7, 13, 13, 0, 7, 13; 15
Unfocused gaze: 0, 0, 0, 100, 0, 0, 0, 0, 0, 0, 0, 0; 1
Lack of hand movement: 16, 5, 11, 11, 11, 0, 5, 11, 11, 0, 21, 0; 19
Random movement: 50, 0, 0, 0, 17, 0, 0, 17, 0, 0, 17, 0; 6
Representational: 0, 0, 0, 6, 0, 6, 13, 31, 19, 0, 13, 13; 16
Self-adaptor: 55, 5, 0, 0, 14, 0, 14, 5, 5, 0, 0, 5; 22
Head turn: 21, 7, 7, 0, 7, 14, 21, 14, 0, 0, 0, 7; 14
Nodding: 13, 18, 7, 2, 13, 0, 18, 9, 4, 0, 11, 4; 45
Laughing: 21, 4, 11, 3, 6, 4, 1, 14, 11, 6, 3, 15; 71
Frowning: 20, 20, 0, 20, 20, 0, 20, 0, 0, 0, 0, 0; 5
Lack of smile: 3, 3, 22, 13, 6, 6, 0, 3, 25, 0, 3, 16; 32
Lip Movements: 33, 0, 17, 0, 17, 0, 8, 17, 8, 0, 0, 0; 12
Mouth barely open: 33, 0, 0, 0, 33, 0, 0, 33, 0, 0, 0, 0; 3
Nervous smile: 33, 0, 6, 6, 22, 6, 0, 0, 17, 0, 6, 6; 18
Smile: 13, 6, 13, 3, 10, 2, 6, 10, 20, 1, 3, 11; 174
Swallowing: 31, 0, 15, 0, 0, 0, 8, 8, 8, 8, 0, 23; 13
Audible breathing: 100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0; 2
Backchannel: 0, 20, 20, 0, 0, 0, 20, 0, 20, 0, 0, 20; 5
Filled pauses: 34, 9, 0, 11, 23, 0, 14, 3, 0, 0, 3, 3; 35
Tone-Prosody: 13, 4, 6, 4, 9, 4, 6, 21, 13, 6, 2, 13; 53
Volume: 0, 33, 0, 0, 33, 0, 0, 0, 0, 0, 0, 33; 3
Adjusting posture: 38, 0, 13, 0, 25, 0, 0, 0, 0, 13, 0, 13; 8
Leaning back-Slouching: 33, 0, 0, 0, 33, 0, 33, 0, 0, 0, 0, 0; 9
Leaning forward: 6, 24, 0, 0, 18, 6, 29, 6, 0, 0, 6, 6; 17
Moving around: 25, 0, 5, 0, 20, 5, 15, 15, 0, 5, 5, 5; 20
Rigid/Straight: 21, 5, 5, 0, 16, 0, 11, 21, 11, 0, 5, 5; 19
Rocking-Shaking: 65, 5, 0, 0, 10, 0, 10, 5, 5, 0, 0, 0; 20
Shoulders: 50, 0, 17, 0, 17, 0, 0, 17, 0, 0, 0, 0; 6

Table J.2 Percentages of Behavior Within Categories of Affect
Columns, in order: Anxiety; Attentive; Attitude; Competence; Confidence; Desire to Communicate; Engagement; Expressiveness; Happiness; Humor; Interactiveness; Warmth. Each column's percentages sum to approximately 100; the final row gives the raw counts.

Body Language (General): 3, 1, 1, 2, 4, 0, 2, 1, 1, 0, 0, 1
Eyebrows: 1, 4, 1, 0, 2, 0, 3, 3, 1, 0, 6, 1
Face (General): 10, 12, 11, 17, 11, 13, 5, 17, 10, 0, 8, 10
Averted gaze: 6, 8, 3, 5, 8, 0, 8, 1, 1, 0, 6, 3
Blinking: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Eyes grow wide: 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1
Mutual gaze: 3, 14, 4, 10, 5, 8, 16, 6, 2, 0, 8, 3
Shifting gaze: 13, 4, 0, 7, 12, 13, 9, 9, 1, 8, 13, 1
Staring: 1, 3, 1, 0, 1, 0, 1, 2, 2, 0, 2, 3
Unfocused gaze: 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0
Lack of hand movement: 1, 1, 3, 5, 2, 0, 1, 2, 2, 0, 8, 0
Random movement: 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 2, 0
Representational: 0, 0, 0, 2, 0, 4, 2, 4, 3, 0, 4, 3
Self-adaptor: 6, 1, 0, 0, 2, 0, 3, 1, 1, 0, 0, 1
Head turn: 1, 1, 1, 0, 1, 8, 3, 2, 0, 0, 0, 1
Nodding: 3, 11, 4, 2, 5, 0, 8, 3, 2, 0, 10, 3
Laughing: 7, 4, 11, 5, 3, 13, 1, 9, 9, 33, 4, 14
Frowning: 0, 1, 0, 2, 1, 0, 1, 0, 0, 0, 0, 0
Lack of smile: 0, 1, 10, 10, 2, 8, 0, 1, 9, 0, 2, 6
Lip Movements: 2, 0, 3, 0, 2, 0, 1, 2, 1, 0, 0, 0
Mouth barely open: 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0
Nervous smile: 3, 0, 1, 2, 3, 4, 0, 0, 3, 0, 2, 1
Smile: 11, 14, 32, 14, 13, 13, 11, 16, 38, 8, 13, 26
Swallowing: 2, 0, 3, 0, 0, 0, 1, 1, 1, 8, 0, 4
Audible breathing: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Backchannel: 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1
Filled pauses: 6, 4, 0, 10, 6, 0, 5, 1, 0, 0, 2, 1
Tone-Prosody: 3, 3, 4, 5, 4, 8, 3, 9, 8, 25, 2, 9
Volume: 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1
Adjusting posture: 1, 0, 1, 0, 2, 0, 0, 0, 0, 8, 0, 1
Leaning back-Slouching: 1, 0, 0, 0, 2, 0, 3, 0, 0, 0, 0, 0
Leaning forward: 0, 5, 0, 0, 2, 4, 5, 1, 0, 0, 2, 1
Moving around: 2, 0, 1, 0, 3, 4, 3, 3, 0, 8, 2, 1
Rigid/Straight: 2, 1, 1, 0, 2, 0, 2, 3, 2, 0, 2, 1
Rocking-Shaking: 6, 1, 0, 0, 2, 0, 2, 1, 1, 0, 0, 0
Shoulders: 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0
Raw Total: 217, 76, 73, 42, 129, 24, 100, 116, 92, 12, 48, 78

Table J.3 Percentages of Behavior Across Categories of Language
Columns, in order: Comprehensibility; Comprehension; Fluency; Grammar; Pronunciation; Vocabulary. The final value after the semicolon is the raw count.

Body Language (General): 25, 0, 25, 0, 25, 25; 4
Eyebrows: 0, 50, 0, 25, 25, 0; 4
Face (General): 10, 46, 22, 5, 5, 10; 41
Averted gaze: 22, 28, 39, 6, 0, 22; 18
Blinking: 0, 0, 0, 0, 0, 0; 0
Eyes grow wide: 0, 100, 0, 0, 0, 0; 2
Mutual gaze: 12, 31, 19, 15, 8, 12; 26
Shifting gaze: 13, 21, 40, 4, 6, 13; 47
Staring: 0, 50, 50, 0, 0, 0; 2
Unfocused gaze: 0, 0, 0, 0, 0, 0; 0
Lack of hand movement: 17, 17, 33, 17, 0, 17; 6
Random movement: 33, 0, 67, 0, 0, 33; 3
Representational: 0, 100, 0, 0, 0, 0; 1
Self-adaptor: 33, 0, 33, 0, 0, 33; 6
Head turn: 22, 22, 33, 11, 11, 22; 9
Nodding: 0, 73, 9, 0, 9, 0; 11
Laughing: 15, 15, 40, 10, 5, 15; 20
Frowning: 0, 100, 0, 0, 0, 0; 1
Lack of smile: 0, 0, 25, 25, 50, 0; 4
Lip Movements: 0, 0, 43, 14, 14, 0; 7
Mouth barely open: 25, 0, 0, 0, 50, 25; 4
Nervous smile: 0, 25, 0, 0, 25, 0; 4
Smile: 9, 34, 14, 11, 9, 9; 44
Swallowing: 0, 0, 50, 0, 0, 0; 2
Audible breathing: 0, 0, 100, 0, 0, 0; 1
Backchannel: 0, 0, 0, 0, 0, 0; 0
Filled pauses: 11, 11, 56, 14, 0, 11; 57
Tone-Prosody: 7, 11, 22, 7, 44, 7; 27
Volume: 0, 0, 0, 0, 100, 0; 1
Adjusting posture: 0, 50, 50, 0, 0, 0; 4
Leaning back-Slouching: 0, 0, 0, 0, 0, 0; 0
Leaning forward: 0, 50, 33, 0, 0, 0; 6
Moving around: 36, 0, 9, 0, 36, 36; 11
Rigid/Straight: 0, 0, 0, 0, 100, 0; 1
Rocking-Shaking: 13, 13, 50, 13, 0, 13; 8
Shoulders: 25, 25, 0, 0, 25, 25; 4

Table J.4 Percentages of Behavior Within Categories of Language
Columns, in order: Comprehensibility; Comprehension; Fluency; Grammar; Pronunciation; Vocabulary. The final row gives the raw counts.

Body Language (General): 25, 0, 25, 0, 25, 25
Eyebrows: 0, 50, 0, 25, 25, 0
Face (General): 10, 46, 22, 5, 5, 10
Averted gaze: 22, 28, 39, 6, 0, 22
Blinking: 0, 0, 0, 0, 0, 0
Eyes grow wide: 0, 100, 0, 0, 0, 0
Mutual gaze: 12, 31, 19, 15, 8, 12
Shifting gaze: 13, 21, 40, 4, 6, 13
Staring: 0, 50, 50, 0, 0, 0
Unfocused gaze: 0, 0, 0, 0, 0, 0
Lack of hand movement: 17, 17, 33, 17, 0, 17
Random movement: 33, 0, 67, 0, 0, 33
Representational: 0, 100, 0, 0, 0, 0
Self-adaptor: 33, 0, 33, 0, 0, 33
Head turn: 22, 22, 33, 11, 11, 22
Nodding: 0, 73, 9, 0, 9, 0
Laughing: 15, 15, 40, 10, 5, 15
Frowning: 0, 100, 0, 0, 0, 0
Lack of smile: 0, 0, 25, 25, 50, 0
Lip Movements: 0, 0, 43, 14, 14, 0
Mouth barely open: 25, 0, 0, 0, 50, 25
Nervous smile: 0, 25, 0, 0, 25, 0
Smile: 9, 34, 14, 11, 9, 9
Swallowing: 0, 0, 50, 0, 0, 0
Audible breathing: 0, 0, 100, 0, 0, 0
Backchannel: 0, 0, 0, 0, 0, 0
Filled pauses: 11, 11, 56, 14, 0, 11
Tone-Prosody: 7, 11, 22, 7, 44, 7
Volume: 0, 0, 0, 0, 100, 0
Adjusting posture: 0, 50, 50, 0, 0, 0
Leaning back-Slouching: 0, 0, 0, 0, 0, 0
Leaning forward: 0, 50, 33, 0, 0, 0
Moving around: 36, 0, 9, 0, 36, 36
Rigid/Straight: 0, 0, 0, 0, 100, 0
Rocking-Shaking: 13, 13, 50, 13, 0, 13
Shoulders: 25, 25, 0, 0, 25, 25
Raw Total: 46, 95, 119, 32, 41, 53
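Read together, the "across" and "within" tables are two normalizations of the same raw counts: in Tables J.1 and J.3, each behavior's mentions are distributed across categories, so rows sum to roughly 100 percent, while in Tables J.2 and J.4, each category's mentions are distributed across behaviors, so columns sum to roughly 100 percent. The sketch below reproduces that arithmetic on an invented two-by-two count matrix; the numbers are hypothetical, not the study's data.

    # Sketch of the row-wise ("across") and column-wise ("within") percentage
    # normalizations behind Tables J.1-J.4, using pandas on invented counts.
    import pandas as pd

    counts = pd.DataFrame(
        {"Anxiety": [7, 2], "Confidence": [3, 8]},  # hypothetical raw counts
        index=["Averted gaze", "Mutual gaze"],
    )

    # Across (as in J.1/J.3): divide each row by its row total; rows sum to 100.
    across = counts.div(counts.sum(axis=1), axis=0).mul(100).round()

    # Within (as in J.2/J.4): divide each column by its column total; columns sum to 100.
    within = counts.div(counts.sum(axis=0), axis=1).mul(100).round()

    print(across, within, sep="\n\n")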