LARGER SONORITY DIFFERENCE, LARGER LAG: GESTURAL COORDINATION IN SPEECH PRODUCTION

By Yunting Gu

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Linguistics—Doctor of Philosophy

2025

ABSTRACT

Sonority has been one of the most debated concepts in phonetics and phonology. Constraints involving sonority such as the Sonority Sequencing Principle (SSP), the Sonority Dispersion Principle (SDP), or the Syllable Contact Law have long been used by phonologists to understand syllable structure (Sievers, 1881, 1901; Steriade, 1982; Selkirk, 1984; Clements, 1990; Kenstowicz, 1994; Parker, 2002, 2011). However, there is no consensus on the phonetic basis of sonority, either in the articulation or the perception of speech (Albert, 2023). This dissertation explores sonority in speech production. Current speech production theories do not predict variation in gestural coordination related to sonority. However, sonority has been observed to correlate with systematic variation in gestural coordination, based on CC clusters in Georgian (Crouch, 2022). In addition, some observations (Gao, 2008; Shaw and Chen, 2019) suggest that sonority is a factor that systematically correlates with variation in CV gestural coordination. In my dissertation, I followed up on these previous studies and explored whether there is a positive correlation between sonority difference and CV lag (the gestural lag between a consonant and a vowel) in English and Mandarin. Based on corpus data of English, as well as Electromagnetic Articulography (EMA) experiments with English-speaking and Mandarin-speaking participants, I found that CV lag positively correlates with CV sonority difference in both languages. In experiment 1, 32 English stimuli from the Wisconsin X-ray Microbeam Database (Westbury et al., 1990) were used to test the main claim. Analysis of the corpus data showed a significant positive correlation between CV lag and sonority difference. To address the limitations of using an existing corpus and to provide a cross-linguistic comparison, EMA data for 24 English stimuli (experiment 2) and 26 Mandarin tone 4 stimuli (experiment 3) were collected and analyzed. Each set of stimuli in the EMA experiments was read 15 times in different randomized lists. When collecting EMA data, sensors were glued to the tongue tip, tongue blade, tongue dorsum, upper lip, and lower lip of each participant. All the kinematic data were annotated in Matlab using the lp_findgest algorithm of the mview package (Tiede, 2005), where the landmarks were labeled at 20 percent thresholds of peak velocity. The CV lag was computed by subtracting the target onset (onset of gestural plateau) timestamp of the consonant from the target onset timestamp of the vowel (Zhang et al., 2019; Durvasula and Wang, 2023). The sonority difference was quantified by subtracting the C sonority from the V sonority using the sonority scale in Parker (2011). Plots were generated and mixed-effects models were fit in R (R Core Team, 2017), with CV lag modeled as a function of sonority difference, and with participant, stimulus, and C duration as random intercepts. Across all the data, CV lag correlates positively and significantly with sonority difference. Sub-groups of the stimuli controlled for consonant place of articulation or vowel height mostly exhibited the expected correlations.
I also used consonant displacement and vowel displacement as estimates of jaw movement, and these findings suggest that jaw movement is unlikely to provide a valid alternative account of the finding. The dissertation found a positive correlation between sonority and CV gestural coordination in English and Mandarin. If we assume that larger lags are preferred within a syllable, the finding forms a basis for explaining universal constraints such as the SSP and the SDP.

Copyright by YUNTING GU 2025

ACKNOWLEDGEMENTS

This dissertation could not have been completed without the help and support of many people. First, I would like to express my sincere gratitude to all the professors, staff, and fellow students who have supported me along the way. I would like to express my gratitude to Dr. Karthik Durvasula. Thank you for always being there and supporting me during my PhD career. I appreciate your guidance, encouragement, understanding, and help along the way. I would also like to thank Dr. Yen-Hwei Lin. I appreciate your patience, guidance, and help throughout my PhD career! I would like to express my gratitude to Dr. Suzanne Wagner. You made the challenging processes of my PhD journey smoother, and I am grateful for all you have done. Many thanks to Dr. Silvina Bongiovanni, who has always been supportive and encouraging along the way. I also would like to thank all the linguistics professors, such as Dr. Brian Buccola, Dr. Alan Munn, Dr. Betsy Sneller, Dr. Scott Borgeson, Dr. Alan Hezao Ke, and Dr. Cristina Schmitt. You have educated, encouraged, and inspired me over the years. Second, I would like to say thank you to my parents. They have been understanding and supportive! 谢谢爸妈! I would also like to thank my family members who have encouraged me and helped me with some of my non-dissertation projects as participants or recruiters. Third, many thanks to my friends, in the MSU linguistics program and beyond. You have shared many moments with me. I especially would like to thank the people in my beginner's tennis club, such as Wang, Chen, and Wu. I have shared unforgettable moments with you. Being part of the tennis club made my PhD journey special!

TABLE OF CONTENTS

LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 Sonority and sonority-related constraints
1.2 Controversy on the basis of sonority
1.3 Quantifying sonority difference
1.4 Theories of gestural coordination in speech production
1.5 Gestural coordination variations related to sonority
1.6 Other observations of gestural coordination variations
1.7 Claims to be tested
1.8 Measuring articulatory gestures
1.9 Recap of the introduction
CHAPTER 2 EXPERIMENT 1: ENGLISH CORPUS STUDY
2.1 Methods
2.2 Results
2.3 Conclusion
CHAPTER 3 EXPERIMENT 2: ENGLISH EMA STUDY
3.1 Methods
3.2 Stimuli
3.3 Results
3.4 Conclusion
CHAPTER 4 EXPERIMENT 3: MANDARIN EMA STUDY
4.1 Methods
4.2 Stimuli
4.3 Results
4.4 Conclusion
CHAPTER 5 DISCUSSION
5.1 Potential explanations and theory for the finding
5.2 Providing a basis for some phonological universals
5.3 A sonority-driven speech production model
5.4 Caveats and directions for future studies
CHAPTER 6 CONCLUSION
BIBLIOGRAPHY
APPENDIX A ENGLISH RECRUITMENT EMAIL
APPENDIX B ENGLISH PRE-SCREENING SURVEY
APPENDIX C MANDARIN RECRUITMENT MESSAGE
APPENDIX D MANDARIN PRE-SCREENING SURVEY
APPENDIX E ANNOTATION LABELS AND THEIR MEANINGS IN EXPERIMENT 2
APPENDIX F ANNOTATION LABELS AND THEIR MEANINGS IN EXPERIMENT 3
APPENDIX G ENGLISH EXPERIMENT RESULTS WITH VOWEL DISPLACEMENT AS FIXED EFFECT
APPENDIX H ENGLISH EXPERIMENT RESULTS WITH CONSONANT DISPLACEMENT AS FIXED EFFECT
APPENDIX I MANDARIN RESULTS FOR PAIRWISE COMPARISONS DIFFERING IN C VOICING
LIST OF ABBREVIATIONS

SSP: Sonority Sequencing Principle
SDP: Sonority Dispersion Principle
C: Consonant
V: Vowel
CV lag: The gestural lag or timing difference between a consonant and a vowel
AP: Articulatory Phonology
EMA: Electromagnetic Articulography
T1: Tongue tip
T2: Tongue blade
T3: Tongue dorsum
T4: Tongue root
TT: Tongue tip
TB: Tongue blade
TD: Tongue dorsum
UL: Upper lip
LL: Lower lip
LA: Lip aperture
Lip: Lip aperture
PVEL: The peak velocity point
MAXC: The maximum constriction point
GON: Gestural onset
TON: Target onset
TOF: Target offset or release
GOF: Gestural offset or release
C-center: The midpoint between target onset and target offset
Lag𝐺𝑂𝑁: CV lag based on gestural onset
Lag𝑇𝑂𝑁: CV lag based on target onset

CHAPTER 1 INTRODUCTION

1.1 Sonority and sonority-related constraints

The current study explores sonority in speech production. In this section, I first introduce the concept of sonority. Then, I describe phonological constraints that are related to sonority. Sonority has been one of the most debated concepts in phonetics and phonology.
It is defined as a unique type of relative, non-binary, feature-like phonological concept that potentially categorizes all speech sounds into a hierarchical scale (Parker, 2011). Even though there are different sonority scales in the literature, most phonologists identify the sonority scale in (1), where > means more sonorous than (Kenstowicz, 1994; Wright, 2004; Pons-Moll, 2008).

(1) vowels > glides > liquids > nasals > obstruents

As an abstract concept, sonority is a primitive that has long been used by phonologists to understand syllable structure (Sievers, 1881, 1901; Steriade, 1982; Selkirk, 1984; Clements, 1990; Kenstowicz, 1994; Parker, 2002, 2011). There are several generalizations or sonority-related phonological constraints that have been based on, and motivate, the abstract concept of sonority.

First, the Sonority Sequencing Principle (SSP) requires that each syllable should exhibit one peak of sonority in the nucleus, and that, cross-linguistically, a sonority rise (such as [pl]) is preferred in onsets over a sonority plateau (such as [pt]), which in turn is preferred over a sonority fall (such as [lp]) (Sievers, 1881, 1901; Greenberg, 1965; Pike, 1972; Hooper and Bybee, 1976; Steriade, 1982; Selkirk, 1984; Clements, 1990; Kenstowicz, 1994; Blevins, 1995; Parker, 2002, 2011). A schematic showing the optimal sonority contour can be found in Figure 1.1.

Figure 1.1 Optimal organization of sonority in a syllable (onset, nucleus, coda). There should be one peak of sonority in the nucleus.

In Table 1.1, there are three sample syllables and their sonority contours at onsets. Sonority increases from bottom to top in this table, and the * symbol denotes the relevant sonority category of each sound. The trajectory of the * symbols shows that [pl] has rising sonority, while [pt] has level sonority (a sonority plateau) and [lp] has falling sonority. Note that a sonority rise is preferred over a sonority plateau, which is preferred over a sonority fall. In this case, [pl] is preferred over [pt], which is preferred over [lp] at onsets.

             [p l]     [p t]     [l p]
Vowels
Glides
Liquids          *                *
Nasals
Obstruents    *         *   *          *

Table 1.1 Three sample syllables with rising ([pl]), plateau ([pt]), and falling ([lp]) sonority at onsets. Sonority increases from bottom to top, and the * symbol denotes the relevant sonority category of each sound. The trajectory of * shows that [pl] has an increasing sonority, while [pt] has a level sonority or sonority plateau and [lp] has a falling sonority.

Cross-linguistically, there appear to be violations of the SSP, meaning there are sonority falls at what appear to be onsets or sonority rises at what appear to be codas (Yin et al., 2023). For instance, nasal-stop and sibilant-stop onset clusters (such as skill [sk], speak [sp] in English) have been found in many languages. However, these apparent violations may not count as evidence against the SSP, since some segments, such as [s], can be analyzed as extra-syllabic, which means that [s] is not part of the onset (Cho and King, 2003; Parker, 2011). In Figure 1.2, [sk] in skill [skIl] is an onset cluster, violating the SSP. In Figure 1.3, [s] is not part of the onset but rather an appendix, so the syllable structure does not violate the SSP (Vaux and Wolfe, 2009). Furthermore, some apparent violations may not be true violations if alternative sonority scales are assumed. Regarding the sonority hierarchy of obstruents, while in Berent et al.
(2007), fricatives are assumed to be more sonorous than stops, Parker (2002, 2008, 2011) assumed that voiced obstruents are more sonorous than voiceless obstruents, as in Table 1.3. In Table 1.3, there is a partial sonority scale where a more sonorous natural class of sounds is indicated by a larger value. Jespersen (1904) also provided a sonority hierarchy, as in Table 1.2, where voiceless stops and fricatives are similar in terms of sonority.

Figure 1.2 Onset cluster [sk] in skill [skIl].

Figure 1.3 Extra-syllabic [s] in skill [skIl].

Sonority index: Natural class
1: voiceless stops, voiceless fricatives
2: voiced stops
3: voiced fricatives

Table 1.2 Partial sonority hierarchy in Jespersen (1904). The sonority index uses a larger value for more sonorous classes.

Natural class: Sonority index
voiced fricatives: 6
voiced stops: 4
voiceless fricatives (including [h]): 3
voiceless stops (including [ʔ]): 1

Table 1.3 Partial hierarchy of relative sonority (Parker, 2002, 2008, 2011).

When we analyze the apparent violations of the SSP, for example, in the sibilant-stop onset violation cases, if we assume sibilants are more sonorous than stops, then a sibilant-stop onset is a violation of the SSP. However, if we assume that voiced fricatives and stops are more sonorous than voiceless fricatives and stops, as in Parker (2008), then some sibilant-stop onsets — those with voiceless sibilants and voiced stops — are not true violations of the SSP. Even though voicing assimilation in onset clusters is common, there are languages such as Georgian, Khasi, and Bilaan that have mixed-voicing onset clusters (Kreitman, 2010). In Modern Hebrew, [sd] and [sg] are licit onset clusters according to Kreitman (2010), and they do not violate the SSP if the sonority scale in Parker (2008) is assumed. Indeed, considering alternative assumptions does not account for all the apparent violations of the SSP. If both extra-syllabicity and alternative sonority scales are considered, the SSP can potentially still be maintained as a cross-linguistic generalization, as long as one is clear that it refers to syllable-internal sequences, and not just any sequences. This is similar to the opinion mentioned in Parker (2011).

Second, the Sonority Dispersion Principle (SDP) specifies that in a syllable, from onset to nucleus the sonority difference should be maximized and that from nucleus to coda the sonority difference should be minimized (Steriade, 1982; Clements, 1990; Parker, 2011; Xhaferaj et al., 2022). This predicts that [ta] is better formed than [na], which is better formed than [la]. Also, the SDP favors syllables that end in a vowel. Clements (2009) observed that at codas, few languages prefer obstruents over sonorants. The languages that allow syllable-final stops and fricatives usually place restrictions on them. Regarding CCV syllables, a preferred syllable has a maximal sum of all the sonority differences for the onset-to-nucleus part. This study explores the relationship between sonority difference and gestural timing in CV syllables. The main claim can potentially provide a reason why a larger sonority difference is preferred for CV syllables.

Third, the Syllable Contact Law specifies that at syllable boundaries, a larger sonority decrease is preferable (Hooper and Bybee, 1976; Murray and Vennemann, 1983). In other words, at syllable boundaries, the coda A and the onset B of the following syllable have a and b as their sonority values.
Structure A.B would be preferable if a - b is larger (Hooper and Bybee, 1976; Murray and Vennemann, 1983). If three consonants A, B, and C have sonority values such that A < B < C, then A.C,1 as compared to B.C, is the preferred sequence at the syllable boundary. For example, the Syllable Contact Law requires that [Al.tA], with falling sonority, is preferred over [At.lA], with rising sonority (Seo, 2011). Another set of specific language examples comes from Korean, where Davis and Shin (1999) used the Syllable Contact Law to analyze Korean phonological processes such as obstruent-nasalization, n-lateralization, l-nasalization, and nasalization of (non-coronal) obstruent-liquid sequences, as in Table 1.4.

1 "." is used to refer to a syllable boundary.

obstruent-nasalization: /sip-ny@n/ → [sim.ny@n] 'ten years'
n-lateralization: /non-li/ → [nol.li] 'logic'
l-nasalization: /kam-li/ → [kam.ni] 'supervision'
nasalization of obstruent-liquid sequences: /p@p-li/ → [p@m.ni] 'principle of law'

Table 1.4 Korean examples showing the Syllable Contact Law from Davis and Shin (1999).

Lastly, Steriade (1982) and Selkirk (1984) pointed out that there are language-specific requirements for segments of a tautosyllabic consonant cluster to be separated by minimum sonority differences. Parker (2008) interpreted Steriade (1982) and Selkirk (1984) as claiming that there is some restriction on the sonority distance between tautomarginal consonant clusters in some languages. For instance, in Spanish /pl/ is possible but not /pn/, since the sonority distance between /p/ and /n/ is not large enough.

As pointed out by Clements (2005), not all principles about sonority are strictly independent. For instance, the Syllable Contact Law is closely related to the Sonority Dispersion Principle (SDP) and may partially derive from it (Clements, 2005). Since the SDP requires that a syllable coda prefer high sonority and a syllable onset prefer low sonority, a preceding syllable ending in high sonority will form a decrease in sonority when the following syllable onset is of low sonority. Therefore, individual syllables conforming to the Sonority Dispersion Principle are likely to obey the Syllable Contact Law as well (Clements, 2005). Clements (2005) mentioned that it is necessary to have separate principles or laws because some languages obey one sonority principle but not the other.

Examining the SSP, SDP, Syllable Contact Law, and the restriction on sonority distance mentioned above suggests that there is some optimal relative sonority value required or expected for each part of the syllable. The general requirements regarding the sonority of syllable structures focus on the sonority difference of adjacent sounds, within one syllable or across syllable boundaries. The four principles suggest an optimal syllable should have peak sonority in the nucleus, with onset sonority rising to the peak and coda sonority falling. Furthermore, the optimal syllable should have a larger sonority difference at its beginning, either CC or CV, and its coda should avoid serving as the beginning of the rising-sonority onset of the following syllable.

1.2 Controversy on the basis of sonority

There is no consensus on the phonetic basis of sonority, either in the articulation or the perception of speech (Albert, 2023). I have discussed some phonological generalizations involving sonority. However, there is far less clarity and understanding of the generalizations discussed above that use sonority as a primitive.
There have been debates about whether sonority is a primitive or is derivable from other phonetic factors. Some deny the existence of sonority as a primitive, and instead opt to derive it from phonetic properties as (a) a complex function of the acoustics (Ohala, 1990; Ohala and Kawasaki, 1997), (b) a correlate of intensity (Parker, 2008, 2011; Gordon et al., 2012; Ladefoged and Johnson, 2014), (c) a correlate of pitch intelligibility (Albert, 2023), (d) a perceptual cue (Henke et al., 2012), (e) articulatory openness (Mattingly, 1981), and (f) articulatory timing (Chitoran, 2016). Here, I lay out a few arguments about the different ways to derive sonority.

For instance, Ohala and Kawasaki (1997) do not believe sonority exists as a primitive. Instead, they argued that the degree of modulation should be used to account for phonological universals. This degree of modulation is measured by various acoustic parameters such as amplitude, periodicity, spectral shape, and F0.

Another way of deriving sonority comes from Henke et al. (2012), who argued that a perceptual cue approach could explain the SSP, the Syllable Contact Law, as well as the unmarked status of CV syllables. According to Henke et al. (2012), each natural class of sounds has internal cues of different robustness levels, in terms of manner cues, voicing cues, and place cues. A partial summary can be found in Table 1.5. Moreover, the natural classes also differ in their ability to carry the cues of their adjacent sounds. The internal and carrier cues of each natural class determine the preference for organizing sounds. Henke et al. (2012) argued that this way of accounting for phonological universals has advantages over the SSP in that it can also account for violations of the SSP such as the sibilant-stop cluster at onsets. For instance, they argued that fricatives have internal cues, and therefore, they can bear more gestural overlap than sounds that rely on transitions. Specifically, sibilant fricatives are the least dependent on formant transitions and therefore are expected to be surrounded by obstruents. They also argued that vowels have robust internal cues in terms of manner cues, voicing cues, and place cues, and they are also good carriers of all three kinds of cues. Therefore, vowels are optimal as the nuclei of syllables.

class: manner cues (internal / carrier), voicing cues (internal / carrier), place cues (internal / carrier)
vowels: robust / good, robust / good, robust / good
sibilant fricative: robust / medium, medium / poor, robust / poor
stops: poor / poor, poor / poor, poor / poor

Table 1.5 Partial summary of cue robustness in Henke et al. (2012).

Related to the current speech production study, Chitoran (2016) claimed that "the sonority hierarchy can be best understood in its relation to articulatory timing" (p. 46). In the dissertation, I follow up on the primary intuition laid out in Chitoran (2016) that there is a relationship between the sonority hierarchy and articulatory gestural timing.

1.3 Quantifying sonority difference

As mentioned earlier, the phonological constraints related to sonority, including the SSP, SDP, Syllable Contact Law, and the restriction on sonority distance, suggest that there is some optimal relative sonority value expected for each part of the syllable. The requirements regarding the sonority of syllable structures focus on the sonority difference of adjacent sounds, within one syllable or across syllable boundaries.
In operationalizing the intuition in Chitoran (2016) that there is a relationship between the sonority hierarchy and articulatory gestural timing, it is worth noting that while sonority relates pairs of sound classes in a scale, gestural timing relates two proximal gestures. Therefore, in order for sonority to be reducible to gestural timing, one needs to talk about the sonority of proximal gestures. To put it another way, generalizations regarding sonority and syllable structure can be boiled down to a requirement for sonority difference between adjacent segments. For this reason, I specifically quantified the sonority difference between two adjacent segments and explored its relation to gestural timing.

To quantify sonority difference, I considered many proposed sonority scales that are subtly different in the literature (Clements, 1990; Kenstowicz, 1994; Mielke, 2008; Parker, 2008; Kang et al., 2011). There were controversies on whether rhotics are more sonorous than laterals (Hall, 2002; Parker, 2002), or laterals are more sonorous than rhotics (Hankamer and Aissen, 1974). Also, while in Berent et al. (2007), fricatives are assumed to be more sonorous than stops, Parker (2002, 2008, 2011) argued, based on intensity measurements, that voiced obstruents are more sonorous than voiceless obstruents. Ultimately, I chose to implement the sonority scale developed by Parker (2002, 2008, 2011) that is shown in Table 1.6.

Natural class: Sonority index
low vowels: 17
mid peripheral vowels (not [@]): 16
high peripheral vowels (not [1]): 15
mid interior vowels ([@]): 14
high interior vowels ([1]): 13
glides: 12
rhotic approximants: 11
flaps: 10
laterals: 9
trills: 8
nasals: 7
voiced fricatives: 6
voiced affricates: 5
voiced stops: 4
voiceless fricatives (including [h]): 3
voiceless affricates: 2
voiceless stops (including [ʔ]): 1

Table 1.6 The hierarchy of relative sonority (Parker, 2002, 2008, 2011).

The further nuance provided by the sonority scale in Parker (2002, 2008, 2011) is argued to be necessary to account for more complex syllabification patterns observed in some languages. For example, in Imdlawn Tashlhiyt Berber, syllabification is staged, and a fine-grained sonority scale separating vowel height, as well as voicelessness in fricatives and stops, is necessary (Dell and Elmedlaoui, 1985). Table 1.7 shows that the distinction between low and high vowels is necessary in a sonority scale to yield the right syllabification in a stage-wise manner (top-to-bottom). Basically, when formulating syllabification, one associates a core onset-nucleus syllable with any sequence (Y)Z, where Z is a low vowel, a high vowel, a liquid, a nasal, a fricative, or a stop. In the example in Table 1.7, the syllable with a low vowel is syllabified first, then the one with a high vowel. Note that "I" stands for [+son, -cons, +high, -back, -round], and U stands for [+son, -cons, +high, +back, +round]. Without considering the stages and the sonority of different segments, the right syllabification cannot be accounted for.

Input: [t-IzrUal-In]
low vowel stage: t-Izr(wa)l-In
high vowel stage: (t-i)zr(wa)(l-i)n
liquid stage: (t-i)(zr˚)(wa)(l-i)n

Table 1.7 Imdlawn Tashlhiyt Berber example showing staged syllabification (Dell and Elmedlaoui, 1985). The stages of syllabification are presented top-to-bottom. Sensitivity to vowel height differences is observed above. I stands for [+son, -cons, +high, -back, -round], and U stands for [+son, -cons, +high, +back, +round].
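To make the scale concrete, Table 1.6 can be encoded as a simple lookup table from which the C-to-V sonority differences discussed below are computed. The following is a minimal illustrative sketch, not code from the dissertation; the class labels are shorthand introduced here, not part of Parker's original presentation.

```python
# Minimal sketch: encoding the Parker (2002, 2008, 2011) scale of Table 1.6
# and computing a C-to-V sonority difference. Class labels are shorthand
# chosen here for illustration.
PARKER_SONORITY = {
    "low_vowel": 17, "mid_peripheral_vowel": 16, "high_peripheral_vowel": 15,
    "mid_interior_vowel": 14, "high_interior_vowel": 13, "glide": 12,
    "rhotic_approximant": 11, "flap": 10, "lateral": 9, "trill": 8, "nasal": 7,
    "voiced_fricative": 6, "voiced_affricate": 5, "voiced_stop": 4,
    "voiceless_fricative": 3, "voiceless_affricate": 2, "voiceless_stop": 1,
}

def sonority_difference(c_class: str, v_class: str) -> int:
    """Sonority difference = V index minus C index (Table 1.6)."""
    return PARKER_SONORITY[v_class] - PARKER_SONORITY[c_class]

# [ba]: low vowel (17) minus voiced stop (4) = 13
print(sonority_difference("voiced_stop", "low_vowel"))  # 13
# [ma]: low vowel (17) minus nasal (7) = 10
print(sonority_difference("nasal", "low_vowel"))         # 10
```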
Above, I showed that some languages like Imdlawn Tashlhiyt Berber need a more nuanced sonority scale for syllabification. Since ultimately I expect the main claim in the current study to generalize to all languages, a more nuanced scale is preferred. To sum up, I used the specific scale in Parker (2002, 2008, 2011) because: (a) it is phonetically grounded, based on the estimated average intensity of the relevant sound class; (b) it is intended to cover all speech sound categories; (c) the scale is quantified in a clear way; (d) while providing a much more nuanced sonority scale, its relative ranking of major classes is consistent with (1), which accords with other sonority scales (Clements, 1990; Kenstowicz, 1994; Smolensky, 1995; Clements, 2005); (e) it has the potential to be used in cross-linguistic contexts, since some languages require more nuanced sonority hierarchies.

Using the sonority index in Table 1.6, I am able to operationalize the independent variable, the sonority difference between a consonant and the subsequent vowel, as follows: I subtracted the sonority index of the specific C from that of the following V, according to the scale in Table 1.6. For instance, the sonority difference of [ba] is 13, since sonority(V) - sonority(C) = sonority(low vowel) - sonority(voiced stop) = 17 - 4 = 13. As will be elaborated below, this sonority difference forms the independent variable in the dissertation. The dependent variable is articulatory gestural lag, and it will be discussed later in this chapter.

It is important to note that while the use of the scale from Parker (2002, 2008, 2011) has the above advantages, it does imply that the sonority hierarchy is a linear scale. Furthermore, I recognize that many researchers consider sonority to be a relative notion. I therefore follow up on the main omnibus analyses with a set of more nuanced comparisons specifically meant to address these concerns.

1.4 Theories of gestural coordination in speech production

This dissertation is about the relationship between sonority and gestural timing in CV syllables. In this section, I lay out claims about gestures and different claims related to gestural coordination in speech production.

Phonological representations are characterized in terms of gestures and the relations of gestures, where the gesture is a basic unit and a relatively abstract concept (Browman and Goldstein, 1989, 1992).2 Gestures are events that unfold during speech production, and these events consist of the formation and release of constrictions in the vocal tract. The consequences of gestures can be observed in the movement of speech articulators. A schematic illustration of a gesture in Gafos (2002) is shown in Figure 1.4, where gestural onset (GON), target onset (TON), C-center (the midpoint between target onset and target offset), target offset or release (TOF), and gestural offset or release (GOF) are denoted from left to right. Unlike Gafos (2002), many phonologists working on speech production do not identify the C-center point.

Figure 1.4 A sample gesture according to the view of Gafos (2002), with GON, TON, C-center, TOF, and GOF from left to right.

Within the AP framework, the coupled oscillator model of syllable structure argues that syllabic structure is expressed articulatorily in differential timing relations (Browman and Goldstein, 2000; Hermes et al., 2013; Iskarous and Pouplier, 2022). It is commonly assumed that the gestural coordination pattern is based on the type (consonant or vowel) and position of the gesture.

2 This is the claim of Articulatory Phonology (AP). Even though the current study does not necessarily operate under the AP assumptions, the AP framework is reviewed here since a) it is widely assumed in speech production studies and b) it provides a basis for understanding the timing relationships between articulations that are ultimately the focus of this dissertation. The two reasons why AP is not assumed will be mentioned later in this section when relevant.
For a consonant-vowel (CV) syllable, AP claims that the consonantal and vowel gestures are timed synchronously (Saltzman and Munhall, 1989; Browman and Goldstein, 2000; Goldstein, 2011; Pouplier, 2020; Krivokapić, 2020; Liu et al., 2022), as in Figure 1.5.3

Figure 1.5 Synchronous coordination.

3 One reason why the current study does not operate under AP assumptions is that timing differences between C and V in CV syllables were observed. Therefore, C and V are not strictly coordinated synchronously.

There are other opinions on CV timing as well. For instance, Shaw et al. (2021) viewed the distinction between synchronous and sequential coordination slightly differently from the classic AP view of Browman and Goldstein (1989, 1992). Specifically, they argued that if two gestures start at the same time, they have a synchronous relationship; if two gestures have a sequential relationship, they are coordinated by end-gestural timing, as in Figure 1.6. Shaw et al. (2021) pointed out that an observed lag could come from the positive lag of synchronous coordination or a negative lag of end-gestural timing. One interpretation of their claims is that synchronous or sequential relationships are not pre-determined by the segmental types of consonants or vowels. Rather, the phasal relationships can be used to describe gestural coordination that fulfills the specific timing requirements. Essentially, if one were to generalize the observation of consonant sequences to all segments, then one would subsequently claim that CV timing could be sequential.

Figure 1.6 Offset-onset timing, sequential timing, or end-gestural timing.

Furthermore, Durvasula and Wang (2023) generalized the findings of Shaw et al. (2021) and argued that if two gestures belong to a pair of adjacent segments, then they have a sequential timing relationship, wherein the second gesture is timed to the end of the first gesture. According to them, CV timing should have offset-onset alignment or a sequential timing relationship, as in Figure 1.6.

Moreover, Nam (2007) proposed the split-gesture hypothesis, which suggests that a stop consonant can be split into two sub-gestures,4 one as a closure sub-gesture and another as a release sub-gesture. Each sub-gesture has a synchronous timing relationship with the following vowel while maintaining a sequential relationship (as in Figure 1.6) with the other. The sequential timing specifies that a second gesture starts at the extreme displacement point of the abstract oscillatory cycle of the first gesture. The split-gesture hypothesis predicts that the vowel onset is timed to roughly the 12.5%-16.7% point in the stop articulation (Durvasula and Wang, 2023). This is because gestures typically require 240°-360° of their internal clock cycle to reach their target configuration (Browman et al., 1990). Nam (2007) assumes that the vowel gesture has a 60° phase difference with the closure sub-gesture and a -60° phase difference with the release sub-gesture.
If the closure sub-gesture (CLO in Figures 1.7 and 1.8) and the release sub-gesture (REL in Figures 1.7 and 1.8) both require 240°-360° of their internal clock cycle to reach their target configuration, then, to satisfy the vowel gesture coordination condition, the two boundary cases for CV coordination according to the split-gesture hypothesis are as in Figures 1.7 and 1.8. Specifically, if we assume that the release sub-gesture (REL) and the closure sub-gesture (CLO) each require 240° of their internal clock cycle to reach their target configuration, the vowel gesture is coordinated to the 60°/(60°+60°+240°) = 16.7% point of the whole C gesture, as in Figure 1.7. On the other hand, in the other boundary case, if we assume that each gesture requires 360° of its internal clock cycle to reach its target configuration, the vowel gesture is coordinated to the 60°/(60°+60°+360°) = 12.5% point of the whole C gesture, as in Figure 1.8.

4 While the term sub-gesture is not a technical term, we use it here for expository convenience to highlight the fact that both sub-gestures are there to model different aspects of a single stop articulation.

Figure 1.7 Sample CV alignment according to the split-gesture model. Here it is assumed that each gesture requires 240° of its internal clock cycle to reach its target configuration (Browman et al., 1990). The vowel gesture has a 60° phase difference with the closure sub-gesture and a -60° phase difference with the release sub-gesture. Therefore, the vowel gesture is coordinated to the 60°/(60°+60°+240°) = 16.7% point of the whole C gesture.

Figure 1.8 Sample CV alignment according to the split-gesture model. In this figure, it is assumed that each gesture requires 360° of its internal clock cycle to reach its target configuration (Browman et al., 1990). The vowel gesture has a 60° phase difference with the closure sub-gesture and a -60° phase difference with the release sub-gesture. Therefore, the vowel gesture is coordinated to the 60°/(60°+60°+360°) = 12.5% point of the whole C gesture.

In contrast to the above claims, Tilsen (2020) suggested that CV coordination is eccentric. Specifically, Tilsen (2020) found that "there is category-related information in speech signals well before initiation of the articulatory gestures associated with those categories" (p. 20). Liu et al. (2022) interpreted this to mean that the vowel onset initiates before the consonant gesture offset and after the consonant gesture onset. Also, Öhman (1966) argued that vowels begin during the consonant in CV sequences. Despite the different claims about CV coordination, the theories mentioned above predict consistent CV coordination. Besides CV sequences, for syllables with consonant clusters (CC) at onset, AP specifies that the two onset consonants have a consistent sequential relationship with each other.5

5 Besides the reason mentioned earlier, another reason why AP is not assumed in the current study is that the sonority-driven speech production model proposed in subsection 5.3 assumes that CC onset, CV, and perhaps VC have similar coordination patterns.
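The 12.5%-16.7% prediction above is simple arithmetic, and it can be made explicit with a short sketch. The helper name below is an assumption introduced purely for illustration.

```python
# Sketch of the boundary-case arithmetic for the split-gesture hypothesis:
# the vowel is phased 60 degrees after the closure sub-gesture, and the whole
# C articulation spans 60 + 60 + (240 to 360) degrees of internal clock cycle
# (Browman et al., 1990; Nam, 2007).
def vowel_alignment_fraction(settling_degrees: float) -> float:
    """Fraction of the C articulation at which the V gesture is initiated."""
    return 60.0 / (60.0 + 60.0 + settling_degrees)

print(round(vowel_alignment_fraction(240.0), 3))  # 0.167, i.e., 16.7%
print(round(vowel_alignment_fraction(360.0), 3))  # 0.125, i.e., 12.5%
```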
In general, even though the theoretical claims mentioned above suggest different specifications regarding CV coordination, namely synchronous (Browman and Goldstein, 2000; Nam and Saltzman, 2003; Goldstein et al., 2006; Xu et al., 2006; Hermes et al., 2013; Liu et al., 2020; Durvasula et al., 2021; Liu et al., 2022; Iskarous and Pouplier, 2022), sequential (Shaw et al., 2021), and C-center (Nam, 2007), none of these previous theoretical claims predicts systematic variation in gestural coordination that correlates with sonority.

1.5 Gestural coordination variations related to sonority

As discussed in the previous subsection, current theories of gestural coordination do not predict variation in gestural coordination related to sonority. However, some previous studies observed gestural coordination variation, and some of that variation is potentially related to sonority. More specifically, the gestural coordination of CC onset clusters has been argued to be related to sonority (Crouch, 2022; Crouch et al., 2023). In addition, gestural coordination variation has been found to correlate with factors such as a) C voicing (Hoole et al., 2009; Gibson et al., 2019), b) C place (Byrd, 1994, 1996; Gafos et al., 2010; Bombien et al., 2013), c) C manner (Byrd, 1994; Wright, 1996; Byrd, 1996; Hoole et al., 2009; Gibson et al., 2017; Pouplier et al., 2022), d) vowel quality (Fowler and Saltzman, 1993), e) prosodic effects such as stress and domain position (Öhman, 1966; Hardcastle, 1985; Byrd, 1994, 1996; Byrd and Saltzman, 2003; Yanagawa, 2006; Gafos et al., 2010; Gu, 2023), and f) the stiffness parameter of the gestures (Du and Gafos, 2023). These factors are crucial for analyzing gestural coordination.

At least some of the observations above can be generalized as a relationship between sonority and gestural timing. As noted, some of the work explicitly pointed out the relevance of sonority to gestural timing (Crouch, 2022; Crouch et al., 2023). Furthermore, the effects of voicing and manner are also potentially related to sonority. In the following subsections, gestural coordination related to sonority will be discussed first. Then, other factors that lead to gestural coordination variation will be briefly discussed. Understanding non-sonority factors relevant to gestural coordination variation is necessary for implementing the present experiments and for interpreting the results of the dissertation.

Crouch (2022) and Crouch et al. (2023) explored the relationship between sonority and CC timing, and observed that there was a correlation between the sonority sequencing of consonant onset clusters and their gestural overlap in Georgian. Specifically, they used the 20% threshold algorithm of the mview package (Tiede, 2005).6 They considered two measurements. The first measurement was termed relative overlap, and it is the same as the onset lag measurement in Pouplier et al. (2022) given in (3a). The second measurement is termed constriction duration overlap, as in (2), and it is the opposite of the normalized plateau lag in Pouplier et al. (2022).

(2) constriction duration overlap = (C1 target offset - C2 target onset) / (C2 target offset - C1 target onset)

Crouch (2022) found that a sequence of two consonants in Georgian with a sonority rise exhibited less overlap than those with a sonority plateau, which in turn were less overlapped than those with a sonority fall. Crouch et al.
(2023) speculated that the observed relationship between sonority sequencing and consonant sequences was limited to Georgian consonant onset clusters. There is also prior work that has observed variation in CC gestural coordination related to C manner and C voicing (Hoole et al., 2009; Gibson et al., 2019; Pouplier et al., 2022). Even though sonority was not used to account for the variation, one could infer from these results that it is really sonority that is related to gestural coordination variation, since both voicing and manner relate to sonority. For example, Pouplier et al. (2022) analyzed CC overlap in 7 languages and argued that manner and voicing can condition CC overlap variation.7 For each CC onset cluster, they measured onset lag and normalized plateau lag using the formulas in (3). As schematized in Figure 1.9, onset lag was quantified by the difference between the C2 gestural onset and the C1 target onset (orange line) divided by the difference between the C1 target offset and the C1 target onset (red line). Moreover, the normalized plateau lag is quantified by the difference between the C2 target onset and the C1 target offset (green dashed line) divided by the difference between the C2 target offset and the C1 target onset (blue dashed line).

(3) a. onset lag = (C2 gestural onset - C1 target onset) / (C1 target offset - C1 target onset)
    b. normalized plateau lag = (C2 target onset - C1 target offset) / (C2 target offset - C1 target onset)

Figure 1.9 Lag measurement in Pouplier et al. (2022). Onset lag is measured by the orange line value divided by the red line value. Normalized plateau lag is measured by the green dashed line divided by the blue dashed line.

6 All the articulatory studies discussed in this subsection (1.5) used the algorithm.
7 The 7 languages are: English, French, Russian, Georgian, German, Polish, and Romanian.

While Pouplier et al. (2022) measured the degree of overlap rather than gestural lags, their results can be reinterpreted in terms of gestural lags. If we assume that the degree of overlap and gestural lag are inversely related, plugging in the numbers of the sonority index in Table 1.6 for the observed sequences shows a positive correlation between sonority difference and gestural lag. For instance, Pouplier et al. (2022) generally observed that consonant clusters that begin with voiceless stops (/pl/, /kl/, and /kn/) had less overlap than those that started with voiced stops (/bl/, /gl/, and /gn/, respectively). Since, according to sonority scales such as that of Parker (2002, 2008, 2011) in Table 1.6, voiceless stops are less sonorous than voiced stops, voiceless stops have a larger sonority difference with their subsequent C2. Therefore, this observation in Pouplier et al. (2022) is consistent with the generalization that there is a positive correlation between sonority difference and gestural lag. Pouplier et al. (2022) also found that /sk/ and /sp/ clusters are more likely to have a larger overlap than /Sm/ and /sm/. This observation suggests that when the first consonant is a fricative, clusters where the second C is a stop have a larger overlap than clusters where C2 is a nasal. Since, in terms of sonority, stops < fricatives < nasals, and according to Table 1.6 the difference between a stop and a fricative is smaller than that between a fricative and a nasal, /sk/ and /sp/ clusters are likely to have larger overlap than /Sm/ and /sm/ if we assume a positive relationship between sonority difference and gestural lag.
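Given landmark timestamps for C1 and C2, the two lag measures in (3) and the constriction duration overlap in (2) reduce to simple ratios. The sketch below is illustrative only; the dictionary layout and function names are assumptions introduced here, not the format of the mview output or of the scripts used in the studies cited above.

```python
# Sketch of the measures in (2) and (3), computed from gestural onset (GON)
# and target onset/offset (TON/TOF) timestamps (in seconds) of C1 and C2.
def onset_lag(c1, c2):
    """(C2 gestural onset - C1 target onset) / (C1 target offset - C1 target onset)."""
    return (c2["GON"] - c1["TON"]) / (c1["TOF"] - c1["TON"])

def normalized_plateau_lag(c1, c2):
    """(C2 target onset - C1 target offset) / (C2 target offset - C1 target onset)."""
    return (c2["TON"] - c1["TOF"]) / (c2["TOF"] - c1["TON"])

def constriction_duration_overlap(c1, c2):
    """(C1 target offset - C2 target onset) / (C2 target offset - C1 target onset)."""
    return (c1["TOF"] - c2["TON"]) / (c2["TOF"] - c1["TON"])

# Invented example timestamps for one hypothetical C1C2 token:
c1 = {"GON": 0.05, "TON": 0.10, "TOF": 0.18}
c2 = {"GON": 0.14, "TON": 0.20, "TOF": 0.28}
print(round(onset_lag(c1, c2), 3))                      # 0.5
print(round(normalized_plateau_lag(c1, c2), 3))         # 0.111
print(round(constriction_duration_overlap(c1, c2), 3))  # -0.111 (plateaus do not overlap here)
```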
Similarly, Pouplier et al. (2022) found that a) /bl/ and /gl/ have less overlap than /Sm/, /sm/, and /Sp/; b) /Sm/ and /sm/ have less overlap than /sp/ and /sk/; c) for stop-initial data, /gn/ has a shorter lag than /bl/, which has larger overlap than /kl/ and /pl/; d) /gl/ has larger overlap than /kl/ and /pl/; e) /sp/ has the lowest onset lag, and other languages' /sp/ extend into the negative range. If we plug in the numbers of the sonority index in Table 1.6 for the relevant sequences, a tendency for a positive correlation between sonority difference and gestural lag can predict those observations of gestural coordination variation, if we also assume that degree of overlap and lag are inversely related. Specifically, the sonority values of each onset cluster can be seen in Table 1.8, where the sonority difference is calculated by subtracting the C1 sonority index from the C2 sonority index. The relationship between gestural lag and sonority can be found in all observations. For instance, /sp/ has -2 as its sonority difference according to Table 1.6, which is the lowest among the target CC in the study.

(1) /Sp/ 1-3=-2 < /Sm/ 7-3=4, /sm/ 7-3=4 < /bl/ 9-4=5, /gl/ 9-4=5
(2) /sk/ 1-3=-2, /sp/ 1-3=-2 < /Sm/ 7-3=4, /sm/ 7-3=4
(3) /gn/ 7-4=3 < /bl/ 9-4=5 < /kl/ 9-1=8, /pl/ 9-1=8
(4) /gl/ 9-4=5 < /kl/ 9-1=8, /pl/ 9-1=8
(5) /sp/ 1-3=-2

Table 1.8 Calculating the sonority difference in syllables in Pouplier et al. (2022). The number in the first column refers to the specific observation regarding sonority and gestural timing in the paragraph. The calculation below each cluster shows the CC sonority difference by subtracting the C1 index from the C2 index.

Admittedly, in the above interpretation of Pouplier et al. (2022), the inverse relationship between gestural overlap and lag is assumed but not tested. Additionally, these are patterns that are inferred from k-means clustering results. Therefore, they should serve as suggestive rather than concrete evidence for a positive correlation between gestural lag and sonority difference. This caveat also holds for other studies on CC that used the measurement of overlap rather than lag.

There is also some acoustic data showing that C manner could relate to gestural coordination variation (Wright, 1996). Wright (1996) examined the acoustic data of Tsou, an Austronesian language rich in word-initial consonant clusters. They found that clusters where one or both consonants have internal cues showed a greater degree of overlap. Wright (1996) considered cues to consonant place and manner contrasts. In word-initial stop+stop clusters, for example, the overlap between consonants is minimized to maintain an audible C1 release burst. When C1 is a fricative, more overlap is permitted because fricatives have internal cues to their place and manner. This work is not analyzed further, unlike some other observations showing a relationship between C manner and articulatory timing, because it used acoustic rather than articulatory data, and I am reluctant to use acoustic results to infer articulatory gestural lags.

Besides Pouplier et al. (2022), some previous studies also observed that the voicing of consonants is relevant to the gestural coordination of consonant clusters at onsets (Hoole et al., 2009; Gibson et al., 2019). Using German EMA data, Hoole et al. (2009) observed that there is less articulatory overlap for voiceless than for voiced C1 in German onset clusters. The overlap of consonant gestures was measured by (2), the same measurement used in Crouch et al. (2023).
Since a voiceless C is less sonorous than a voiced C, the C1C2 sonority difference would be larger for a voiceless C1 than for a voiced C1. Therefore, clusters with a voiceless C1 would have a larger lag and less overlap than clusters with a voiced C1, given the positive correlation between gestural lag and sonority difference. Furthermore, Gibson et al. (2019) examined the gestural coordination in onset clusters in Spanish using EMA data. Gestural overlap was defined by subtracting the C1 target offset from the C2 target onset, as in (4). Gibson et al. (2019) found that two consonants that are both voiced show more articulatory overlap than when C1 is voiceless and C2 is voiced. Again, voiceless obstruents are less sonorous than voiced ones, and therefore, they would have a larger sonority difference as C1. This predicts that a voiceless C1 followed by a voiced C2 would have a larger gestural lag — which can be interpreted as less overlap — than a voiced C1 followed by a voiced C2.

(4) gestural overlap = C2 target onset - C1 target offset

If interpreted in terms of gestural lag, the above results are consistent with the claim that there is a positive correlation between sonority difference and gestural lag, given that voiceless obstruents are less sonorous than voiced obstruents in the sonority scale in Parker (2002, 2008, 2011).

Some studies showed a relationship between manner and CC gestural timing (Gibson et al., 2017), and again this relationship can be viewed as a link between sonority and gestural timing. Gibson et al. (2017) analyzed EMA data of Spanish onset clusters. C1C2 overlap was quantified by the timing difference between the C1 and C2 targets or plateaus. They found that clusters where C2 is a rhotic had a significantly larger lag than clusters where C2 is a lateral. Rhotics are more sonorous than laterals. Therefore, a cluster with a rhotic C2 would have a larger sonority difference and a larger gestural lag than one with a lateral C2.

The above findings, suggesting a positive correlation between sonority and gestural coordination in onset CC, are unexpected according to current speech production theories. According to AP, we also do not expect CV coordination to share a similar pattern with CC coordination. However, the following observations show that the positive correlation between sonority and gestural coordination is also likely applicable to CV sequences. First, Gao (2008) observed variation in the gestural lag between the onset of the consonant and vowel gestures (CV lag) when exploring Mandarin tone-to-segment alignment using kinematic data collected from Electromagnetic Articulography (EMA) experiments. The CV lag measurement, namely, CV lag based on gestural onset (Lag𝐺𝑂𝑁), is schematically shown in Figure 1.10 by the black dashed line. Specifically, Gao (2008) observed that the CV lags of [t]-onset syllables were slightly longer than those of [n]-onset syllables, for words with Tone 4.8 [t] is less sonorous than [n], so [t]-onset syllables have a larger sonority difference with the same following vowel than [n]-onset syllables.

8 Tone 4 refers to a falling tone as per the notation introduced by Chao (1930).

Figure 1.10 Schematic for CV lag computation. For each C or V gesture, the landmarks gestural onset (GON), target onset (TON), target offset (TOF), and gestural offset (GOF) are labeled for clarity from left to right. The black dashed line indicates CV lag based on gestural onset (Lag𝐺𝑂𝑁).
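As a minimal sketch, Lag𝐺𝑂𝑁 is just the difference between the two gestural onset timestamps; the function name and the example values below are illustrative assumptions, not values from Gao (2008) or Shaw and Chen (2019).

```python
# Lag_GON: the interval between consonant and vowel gestural onsets
# (black dashed line in Figure 1.10). Timestamps are in seconds.
def lag_gon(c_gestural_onset: float, v_gestural_onset: float) -> float:
    return round(v_gestural_onset - c_gestural_onset, 6)

print(lag_gon(0.050, 0.095))  # 0.045 s, i.e., 45 ms
```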
A second related observation was made by Shaw and Chen (2019), who conducted an EMA study of Mandarin speakers producing CV monosyllables, consisting of labial consonants and back vowels, in isolation. In their study, CV lag is the interval between the onset of the consonant gesture and the onset of the vowel gesture, that is, CV lag based on gestural onset (Lag𝐺𝑂𝑁), as indicated by the black dashed line in Figure 1.10. When testing whether the spatial position of the tongue influences CV coordination, Shaw and Chen (2019) found that the CV lag was significantly shorter in syllables beginning with the nasal stop than in syllables beginning with the oral stop. They observed that syllables beginning with [m] had a shorter CV lag than those that begin with [p]. Again, the nasal [m] is more sonorous than the oral stop [p], so there is a smaller sonority difference and therefore a shorter lag for the nasal [m].

Relatedly, some studies have claimed that vowel quality relates to gestural coordination variation. Fowler and Saltzman (1993) noted that for /bV/ syllables, as the jaw closes for /b/, the following vowel will oppose this motion; consequently, a following higher vowel such as /i/ will oppose the jaw-closing movement less and a following lower vowel /a/ will oppose it more. Even though this claim does not involve gestural timing, one might perhaps extrapolate it to suggest that /ba/ will have a larger lag than /bi/ since there is more opposing force in /ba/. Even though one could also have extrapolated it to suggest that /ba/ will have a smaller lag to counteract the opposition between the /b/ and the /a/, the variation of gestural coordination due to vowel quality difference was hinted at. Since vowel quality is related to sonority, a relationship between sonority and gestural timing is suggested.

To sum up, the results observed in a variety of previous studies are consistent with the claim that gestural timing variability relates to sonority. Specifically, sonority seems to have a positive correlation with gestural lag in CV and CC sequences. This serves as a promising entry point for figuring out the sonority correlate in speech production. The current study tests this generalization of a potential positive correlation between sonority and gestural lag on CV syllables.

1.6 Other observations of gestural coordination variations

In this subsection, I briefly discuss non-sonority or non-sonority-related factors that were argued to be related to articulatory gestural timing. The factors to be discussed are: a) C place (Byrd, 1994, 1996; Gafos et al., 2010; Bombien et al., 2013), b) prosodic effects (Hardcastle, 1985; Byrd, 1994, 1996; Byrd and Saltzman, 2003; Yanagawa, 2006; Gafos et al., 2010; Gu, 2023), and c) the stiffness parameter of the gestures (Du and Gafos, 2023).

First, consonant place leads to gestural coordination variation in German CC onset clusters (Bombien et al., 2013). Specifically, Bombien et al. (2013) found that /kl/ exhibited the highest degree of overlap, followed by /pl/, /ps/, /ks/, and finally /kn/. Liu et al. (2022) interpreted the variability as an effect of place: a CC cluster beginning with labials generally has a higher degree of overlap than a CC cluster that begins with velars. Gafos et al. (2010) also found that speaker-specific place order of Moroccan Arabic clusters is related to gestural overlap. Second, gestural lags are larger at prosodic boundaries (Byrd and Saltzman, 2003) and when stressed (Katsika, 2012; Gu, 2023).
Additionally, there are different findings on the impact of speech rate on gestural timing, though most find that articulatory overlap increases with speech rate (Hardcastle, 1985; Byrd, 1994; Luo, 2017).9 Moreover, cluster position in a syllable or word affects gestural timing (Byrd, 1994, 1996; Yanagawa, 2006; Gafos et al., 2010). Byrd (1994, 1996) showed that an onset cluster is less overlapped than coda clusters. Gafos et al. (2010) also found a speaker-specific word position effect in the gestural overlap of Moroccan Arabic clusters. Furthermore, Öhman (1966) showed that in VCV sequences, the first V is affected by the second V. This shows that segment position or syllable position affects gestural coordination. Considering the above analyses, when designing experiments and selecting stimuli for the dissertation, monosyllabic citation words are preferred.

9 Hardcastle (1985) observed that there is more co-articulation during faster speech rate conditions. Similarly, Byrd (1994) observed that articulatory overlap increases with speech rate in English consonant sequences. In contrast, Luo (2017) found no speech rate effect on gestural overlap.

Third, Du and Gafos (2023) argued that in onset clusters, C2 stiffness contributes to gestural overlap. They examined articulatory data from German, English, and Spanish participants. The relevance of stiffness to articulatory timing will be discussed later when interpreting results.

To sum up, there are various factors that are not captured under the concept of sonority that can contribute to gestural coordination variation. Therefore, when analyzing the effect of sonority on gestural coordination, we need to consider these factors in experimental design and in the interpretation of results. More generally, even though many researchers in speech production assume no gestural coordination variation based on segment makeup, some studies showed that gestural timing variability relates to various possible factors. Among these factors, sonority's potential positive correlation with gestural lag in CV sequences has not been seriously explored, even though there are many pieces of suggestive evidence that support hypothesizing this positive correlation.

1.7 Claims to be tested

This dissertation tested whether there is a positive correlation between sonority difference and gestural lag. This positive correlation is likely to hold true in CC onsets and CV syllables, and I evaluate it on CV syllables in the dissertation. The claim to be tested is that for a CV sequence within a syllable, the sonority difference between C and V positively correlates with the CV lag. This claim regarding CV coordination suggests two sub-claims to be tested, related to varying the C for the same V as in (5a) and varying the V for the same C as in (5b). Specifically, claim (5a) predicts that [ba] should have a larger CV lag than [ma] because the stop [b] is less sonorous than the nasal [m]. Claim (5b) predicts that, for instance, [ba] should have a larger CV lag than [bi] because the low vowel [a] is more sonorous than the high vowel [i].

(5) Claims to be tested related to CV timing:
a. For CV syllables with the same V, a less sonorous C leads to a larger CV lag.
b. For CV syllables with the same C, a more sonorous V leads to a larger CV lag.

In order to quantify the dependent variable, namely, CV lag, I measured the timing difference between the consonant gesture and the following vowel gesture (CV lag = V timestamp - C timestamp). This way of calculating lag by subtracting corresponding timestamps was used in Zhang et al. (2019). Specifically, the CV lag in the current study was computed by subtracting the target onset (onset of gestural plateau) timestamp of the consonant from the target onset timestamp of the vowel. Target onset rather than gestural onset is used, since target onset alignment has been argued to be more consistent than gestural onset alignment (Zhang et al., 2019; Durvasula and Wang, 2023). A visual illustration of the lag calculation can be found in Figure 1.11, where the timestamp of the target onset of a C is subtracted from that of the target onset of a V to get the CV lag based on target onset, as in the blue dashed line.10 There are three other measurements based on the method of subtracting corresponding timestamps: CV lag based on gestural onset (as in the black dashed line), CV lag based on gestural offset, and CV lag based on target offset.

Figure 1.11 Schematic for CV lag computation. For each C or V gesture, the landmarks gestural onset (GON), target onset (TON), target offset (TOF), and gestural offset (GOF) are labeled for clarity from left to right. The black dashed line indicates CV lag based on gestural onset (Lag𝐺𝑂𝑁), and the blue dashed line indicates CV lag based on target onset (Lag𝑇𝑂𝑁).

10 This figure is a repetition of Figure 1.10. I am presenting it here for the readers' convenience.
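Putting the two operationalizations together, each CV token reduces to one sonority difference (the independent variable) and one Lag𝑇𝑂𝑁 value (the dependent variable). The sketch below is a schematic illustration under assumed names and invented example timestamps, not the dissertation's actual analysis code.

```python
# Sketch of how one CV token is reduced to the two variables analyzed here:
# the Parker-scale sonority difference and Lag_TON (V target onset minus
# C target onset, in seconds). Class labels and timestamps are placeholders.
PARKER_SONORITY = {"voiceless_stop": 1, "voiced_stop": 4, "nasal": 7, "low_vowel": 17}

def cv_measures(c_class, v_class, c_target_onset, v_target_onset):
    """Return (sonority difference, Lag_TON) for one CV token."""
    son_diff = PARKER_SONORITY[v_class] - PARKER_SONORITY[c_class]
    lag_ton = round(v_target_onset - c_target_onset, 6)
    return son_diff, lag_ton

# A hypothetical [ba] token whose C plateau starts at 0.120 s and V plateau at 0.185 s:
print(cv_measures("voiced_stop", "low_vowel", 0.120, 0.185))  # (13, 0.065)
```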
This way of calculating lag by subtracting corresponding timestamps was used in Zhang et al. (2019). Specifically, the CV lag in the current study was computed by subtracting the target onset (onset of gestural plateau) timestamp of the consonant from the target onset timestamp of the vowel. Target onset instead of gestural onset is used since target onset alignment has been argued to be more consistent than gestural onset alignment (Zhang et al., 2019; Durvasula and Wang, 2023). A visual illustration of the lag calculation can be found in Figure 1.11, where the timestamp of the target onset of a C is subtracted from that of the target onset of a V to get the CV lag based on target onset, as in the blue dashed line.10 There are three other measurements based on the same method of subtracting corresponding timestamps — CV lag based on gestural onset (as in the black dashed line), CV lag based on gestural offset, and CV lag based on target offset.

Figure 1.11 Schematic for CV lag computation. For each C or V gesture, the landmarks gestural onset (GON), target onset (TON), target offset (TOF), and gestural offset (GOF) are labeled for clarity from left to right. The black dashed line indicates CV lag based on gestural onset (Lag_GON), and the blue dashed line indicates CV lag based on target onset (Lag_TON).

10 This figure is a repetition of Figure 1.10. I am presenting it here for the readers' convenience.

In the following chapters, I show the procedure and results of evaluating the claims on CV syllables of an English corpus (Chapter 2), an English EMA study (Chapter 3), and a Mandarin EMA study (Chapter 4). In the following section, before I detail the experimental procedures, I discuss one major methodological concern in speech production studies, which is how to parse articulatory gestures. In general, I discuss the default method, the threshold method, and its alternatives such as the comparative method and the ensemble method. I argue for using the threshold algorithm in the dissertation, since it has clear advantages over its alternatives.

1.8 Measuring articulatory gestures

There are various existing methods of parsing articulatory gestures in speech production, and I discuss several of them in this section. Specifically, the threshold technique will be presented in subsection 1.8.1, the minimal contrast technique will be discussed in subsection 1.8.2, and the ensemble technique will be briefly mentioned in subsection 1.8.3. I argue that the threshold technique has advantages over the other methods, so it is used in the current dissertation.

1.8.1 The threshold technique

The threshold technique for identifying articulatory gestures is widely used in the field, since it is the underlying algorithm of lp_findgest in mview (Tiede, 2005). In this section, I briefly discuss earlier uses of the technique (Hoole et al., 1994; Kroos et al., 1996). Then, I detail the algorithm lp_findgest, which is the default method of parsing articulatory gestures.

1.8.1.1 Earlier uses of the threshold technique

One earlier use of the threshold technique was in Hoole et al. (1994), a study that compared the articulation of tense and lax vowels in German. The authors used the threshold technique to identify the CV, nucleus, and VC portions of CVC syllables, as in Figure 1 from Hoole et al. (1994), which shows the segmentation procedure for the CVC utterance /pi:p/.
When pronouncing the utterance, the lips open and close, which is crucial for identifying the gestural timing patterns. The first step of the procedure is identifying the maximum vertical velocity point of the lower lip from the C1 target to the vowel target. Then, two points at 20% of this maximum velocity were identified as the CV onset and offset, one when moving upwards and another when moving downwards from this maximum point respectively. Similarly, the VC segment was identified. In Figure 1 from Hoole et al. (1994), the CV syllables were labeled such that the left edge of the label is the onset and the right edge the offset. According to Hoole et al. (1994), 20% is a high-velocity threshold, and it was chosen to avoid problems with identifying the nucleus stage. They suggested that they wanted to avoid overlapping CV and VC by using 20% as the threshold. Also, Hoole et al. (1994) stated that they decided that the nucleus should be a stage rather than a single point because this reflects the observation that tense vowels are longer in duration than lax vowels. They used tangential velocity, which is the velocity signal that incorporates movement in all three available dimensions (Shaw et al., 2023). The advantage of this measurement criterion is that it yields more stable results (Mooshammer and Fuchs, 2002). In a similar study, Kroos et al. (1996) also used the threshold method for German in similar cases for CV and VC, when they also looked at CVC stimuli. In Kroos et al. (1996), the threshold of 20% was also used, after experimenting with other thresholds. Note that these original studies used the threshold method for syllable identification, which differs from some later uses of the threshold technique, which apply it to a single articulatory gesture.

1.8.1.2 The lp_findgest algorithm of mview

The lp_findgest algorithm of the mview package (Tiede, 2005) is the default tool used to identify gestures in speech production studies, and the algorithm assumes the threshold technique, though it is not strictly the same as the technique in Hoole et al. (1994). The following describes the procedure for identifying gestures with the lp_findgest algorithm. To use the algorithm, the researcher first needs to click on a point in the relevant articulatory pellet's trajectory. The mouse click point is usually identified by checking the synchronous acoustic information, and it will roughly be the point of the gestural plateau. There is no strict requirement on where to make this click. After manually identifying the point, the algorithm finds the maximum constriction point (MAXC), which is the closest velocity minimum to the clicked point. Then, two peak velocity points are identified before and after the maximum constriction point, which are called PVEL and PVEL2 respectively. After that, the gestural onset point is marked by identifying the 20% peak velocity point between the minimum velocity point before PVEL and the peak velocity point (PVEL) itself. The nucleus onset is the 20% peak velocity point between PVEL and the maximum constriction point (MAXC), and the nucleus offset is identified as the 20% peak velocity point in the range between MAXC and the following peak velocity PVEL2. Similarly, the gestural offset is the 20% peak velocity point in the range between PVEL2 and the following velocity minimum.
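To make the procedure concrete, the following is a minimal sketch of this kind of threshold-based landmark identification in R. It is not the mview implementation; the function name find_gesture, the position matrix pos, the sampling rate fs, and the clicked index i_click are hypothetical placeholders, and edge cases (e.g., plateaus that never fall below the threshold) are ignored.

```r
# A minimal sketch of threshold-based landmark identification, in the spirit of
# the procedure described above. All names are hypothetical; this is not mview code.
find_gesture <- function(pos, fs, i_click, threshold = 0.20) {
  # pos: n-by-2 matrix of x/y positions for one sensor; fs: sampling rate in Hz
  vel <- sqrt(rowSums(diff(pos)^2)) * fs                  # tangential velocity
  n <- length(vel)
  mins <- which(diff(sign(diff(vel))) > 0) + 1            # local velocity minima
  maxc <- mins[which.min(abs(mins - i_click))]            # maximum constriction (MAXC)
  pvel <- which.max(vel[1:(maxc - 1)])                    # peak velocity before MAXC
  pvel2 <- maxc + which.max(vel[maxc:n]) - 1              # peak velocity after MAXC
  pre_min <- if (any(mins < pvel)) max(mins[mins < pvel]) else 1
  post_min <- if (any(mins > pvel2)) min(mins[mins > pvel2]) else n
  # Landmarks at 20% of the relevant peak velocity
  gon <- pre_min + which(vel[pre_min:pvel] >= threshold * vel[pvel])[1] - 1
  ton <- pvel + which(vel[pvel:maxc] <= threshold * vel[pvel])[1] - 1
  tof <- maxc + which(vel[maxc:pvel2] >= threshold * vel[pvel2])[1] - 1
  gof <- pvel2 + which(vel[pvel2:post_min] <= threshold * vel[pvel2])[1] - 1
  c(GON = gon, TON = ton, TOF = tof, GOF = gof) / fs      # timestamps in seconds
}
```

Applying such landmark parsing to the C and V sensor trajectories of the same token and subtracting the two TON timestamps then yields the CV lag based on target onset.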
The velocity in the lp_findgest algorithm of mview is computed either as tangential velocity (if multiple components are displayed) or as absolute magnitude (if one component is displayed). For instance, in Figure 6 from Shaw et al. (2023), the lp_findgest algorithm was used to identify the articulatory trajectory of a bilabial fricative. In this case, they used tangential velocities, incorporating movements in the vertical, longitudinal, and lateral dimensions. We can see that even though the vertical dimension, as in the second row from the top, has the largest degree of displacement, there is also movement in the other two dimensions. In Figure 6 from Shaw et al. (2023), the terms start, target, release, and end correspond to the gestural onset, target onset, target offset, and gestural offset used in this dissertation.

1.8.2 The minimal contrast technique

In this section, I lay out some studies that used the minimal contrast technique. First, Benguerel and Cowan (1974) analyzed French upper lip protrusion patterns. They found that French speakers started lip protrusion for a rounded vowel as early as six consonants before that vowel. This finding shows that considering the interaction of surrounding articulations is necessary. Second, Gelfer et al. (1989) advocated the minimal contrast technique in speech production. Specifically, they argued that in American English a migration of lip rounding back to the beginning of the consonant string in /iCu/ utterances may not support the look-ahead model, since a similar lip rounding pattern can also be observed in /iCi/ utterances. They also observed comparable correlation coefficients between electromyographic (EMG) onset time and consonant string duration for utterances with or without lip rounding. The study shows that some speakers produce alveolar consonants with significant lip rounding activity in both rounded and unrounded vowel environments. Crucially, the study advocates that studies of co-articulation should employ the minimal contrast technique.

Third, Liu et al. (2022) is a recent study that promotes the minimal contrast technique. They used a minimal triplet paradigm to analyze Mandarin coarticulation data and argued that syllables are coordinated synchronously. For instance, if the target syllable is C1V1, the triplets will be C2V1, C1V2, and C1V1. Participants produced the targeted utterances by embedding them into the carrier phrase bi ___ wei shan [bi ___ weɪ ʂan] 'more hypocritical than'. Note that this production with the carrier phrase is not a complete sentence. Also, since wei shan [weɪ ʂan] is a compound word meaning hypocritical, there is unlikely to be a pause between wei [weɪ] and shan [ʂan]. This means that there is significant coarticulation between wei [weɪ] and shan [ʂan], and it may not be easy to parse a boundary after wei [weɪ]. To analyze each minimal pair in each triplet (i.e., the vowel minimal pair C1V1-C1V2 and the consonant minimal pair C1V1-C2V1), the time points where two trajectories diverge significantly were identified by generalized additive mixed models (GAMMs). The onset times were determined by when the model indicated a statistically significant difference in the trajectories relevant to either C or V. An issue here is that the point of statistical significance in terms of difference does not necessarily indicate the onset of non-trivial co-articulation.
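For concreteness, a divergence point of this kind could be estimated with a difference smooth in mgcv, roughly as sketched below. This is a generic sketch under my own assumptions (a hypothetical data frame traj with columns time, pos, member, and token), not the model specification actually used by Liu et al. (2022).

```r
library(mgcv)

# traj: long-format trajectories for one minimal pair, with hypothetical columns
#   time (normalized time), pos (sensor position), member (factor: "C1V1"/"C2V1"),
#   token (factor identifying each repetition)
traj$member <- as.ordered(traj$member)
contrasts(traj$member) <- "contr.treatment"      # so the by-smooth is a difference smooth

fit <- bam(pos ~ member +
             s(time) +                            # shared trajectory shape
             s(time, by = member) +               # difference smooth (C2V1 minus C1V1)
             s(time, token, bs = "fs", m = 1),    # per-token deviations
           data = traj)

# A divergence onset can then be read off as the earliest time at which the
# pointwise confidence band of the difference smooth excludes zero.
summary(fit)
```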
Liu et al. (2022) contributed to the discussion of methods by showing the comparison of two triplets, as in Figure 14 and Figure 15 in Liu et al. (2022). While the onsets identified by the two methods are more different in Figure 14, the identifications are similar in Figure 15. As pointed out by Liu et al. (2022), the pair /maʊluweɪ/ and /maʊliweɪ/ shows more difference because /maʊ/ has a rounding gesture in its later part. They also mentioned that studies could avoid such obvious gestural confounds in the stimulus design process. Gelfer et al. (1989) and Liu et al. (2022) argued that the real advantage of the minimal contrast technique is that it can avoid covert confounds in cases where there are no predictably similar gestures in the previous syllable, as in Figure 15. However, just because there is a difference in the measurement outcome does not mean that it is a significant confound. In other words, there is no evidence that the inferences based on the different techniques differ, even if the techniques result in slightly different values. It seems that a larger absolute difference can be avoided through stimulus design in cases such as 'maolu' /maʊlu/ vs. 'maoli' /maʊli/ (Figure 14 in Liu et al. (2022)), and yet the covert "confound" does not result in much absolute difference in cases such as 'laili' /laɪli/ vs. 'lailu' /laɪlu/ (Figure 15 in Liu et al. (2022)). As mentioned by Durvasula and Wang (2023), an issue with the comparative technique is that it requires phonetic minimal pairs rather than phonological minimal pairs. There is a logically infinite number of phonetic parameters available, so it is difficult to determine phonetic minimal pairs. Another issue with the minimal contrast technique is inherent phonetics (Durvasula, 2024). For instance, even oral vowels have inherent nasality, and low vowels have more inherent nasality than mid or high vowels. Therefore, when looking at nasality, the results would differ depending on which baseline is chosen.

1.8.3 The ensemble technique

There are other techniques in use, such as the minimum velocity technique (Blackwood Ximenes et al., 2017) or the zero velocity technique (Mücke et al., 2012). Since there are pros and cons to each method, implementing several possible methods and adopting the advantages of each seems to be one solution to the lack of methodological consensus. One recent study, Svensson Lundmark et al. (2021), identified the problem that previous studies on tone gestures used a variety of methods, and addressed this inconsistency by including a comparison of 13 measurements. Specifically, they chose different measurements for different articulators such as the lips or the tongue. For lip aperture, they used zero velocity and maximum acceleration/deceleration. The temporal landmarks on lip aperture were automatically extracted in R (R Core Team, 2013) at the points where the lips had minimal movement, at the 20% threshold from zero to peak velocity, and where the movement accelerated or decelerated the most. For the tongue body, they used minimal tangential velocity, maximum tangential velocity, zero vertical velocity, and 20% zero vertical velocity. The conclusion in Svensson Lundmark et al. (2021) is that since consonantal gestures are more often characterized by larger changes in velocity, consonant gestures should be identified by peaks in the acceleration curve. Vowel gestures are made with a constantly and relatively slowly moving tongue body, and Svensson Lundmark et al. (2021) claimed that vowels may therefore call for different measurements.
It is a challenge to justify those arguments, since we need certain assumptions about the characteristics of gestures in the first place. While the suggestions regarding specific techniques are inconclusive, Svensson Lundmark et al. (2021) attempted to argue for an ensemble technique that builds on the advantages of different techniques. One potential issue with the ensemble technique is that it makes cross-stimuli comparison difficult. Also, since we are not sure what the right single method is, ensembling several techniques multiplies the scope of the problem.

1.8.4 Summary of measuring techniques

Articulatory trajectories are complex. The position of an articulator is affected not only by the intended movement for the abstract phonological features of a sound, but also by co-articulation and the inherent phonetics of the segments involved. Parsing articulatory gestures thus involves various decisions that may not always be straightforward. This section presented various available methods of identifying the onset and offset of articulatory gestures. Each method has pros and cons. The threshold method is the default one. One of its issues is that many decisions, such as the 20% threshold, are arbitrary. Another major issue is that it is unclear whether the articulatory movement in question comes from the intended production or from some unintended/inherent phonetics. The threshold technique only looks at the articulatory trajectory of one gesture, lacking the ability to parse out overlapping or confounding gestures (Liu et al., 2022). The comparative technique, also called the minimal contrast technique, has an advantage over the threshold technique in this regard. Using the comparative method, sets of stimuli are used as baselines for the target measure. However, this method is unlikely to solve the problem of inherent phonetics completely, because there is an infinite number of dimensions available, and it is practically difficult and logically impossible to find real phonetic minimal pairs. I also mentioned the ensemble technique, where the intention is to take advantage of various measurements. The major issues with the ensemble technique are: a) that it is time-consuming for human annotators; and b) that cross-stimuli comparison is difficult since different measures are used. The discussions in this section show that the comparative and ensemble techniques do not have significant advantages over the threshold technique. They also seem to have problems such as the need for more stimuli or measurements. To make the study's results comparable to most available research in the field, the lp_findgest algorithm was used. The nuances of choosing the right measure will be discussed in Chapter 5.

1.9 Recap of the introduction

In this chapter, I introduced sonority and phonological constraints related to sonority. Even though sonority's function in syllabification has been widely recognized, the phonetic correlate of sonority is controversial. If we take a closer look at sonority and its relationship with gestural timing, there is previous empirical evidence suggesting that there is likely to be a positive correlation between sonority difference and gestural timing. Specifically, I showed that this positive correlation has been found in CV and CC sequences.
Since Crouch (2022) has already specifically tested the relation between CC timing and sonority, I propose to test the claim that there is a positive correlation between CV lag and sonority difference. In the following three chapters, I test the claim on an English corpus (Chapter 2), an English EMA study (Chapter 3), and a Mandarin EMA study (Chapter 4). The methods and results of each experiment are discussed in the corresponding chapter. There will be a general discussion in Chapter 5 and a brief conclusion in Chapter 6. This chapter also included a brief review of various methods of parsing articulatory data. Since the threshold method lp_findgest has advantages over alternative methods, it will be used to annotate the articulatory data of the study.

CHAPTER 2
EXPERIMENT 1: ENGLISH CORPUS STUDY

2.1 Methods

2.1.1 The corpus

To test the claim that there is a positive correlation between CV lag and sonority difference, I analyzed kinematic data from the Wisconsin X-ray Microbeam Database (Westbury et al., 1990). The data were originally collected using the X-ray Microbeam method to digitally track the movements of gold pellets in each speaker's mouth. The X-ray Microbeam method relied on X-rays produced by an accelerated electron beam. A narrow beam of the incident X-rays passed through a pinhole aperture, where its path was determined by the location of the electron beam. The X-ray path was adjusted during the original data collection to make sure it produced X-ray scans that surrounded the expected pellet location, so that the system recorded the coordinates of the movements. The frequency of operation for each pellet was specified separately, at rates ranging between 20 and 180 Hz. The schematic positions of the pellets can be found in Figure 2.1. To obtain the reference points indicated by Ref in Figure 2.1, three pellets were attached to the speaker's head: one on the bridge of the nose, the second on the buccal surface of the maxillary incisors, and the third either on the nose bridge lower than the first or on an arm projecting from a snug-fitting pair of eyeglass frames. To extract information about tongue movement, four pellets, denoted by T1 to T4 in Figure 2.1, were attached along the longitudinal sulcus of each speaker's tongue. T1 was placed 10 mm posterior to the tongue tip, and T4 was placed about 60 mm posterior to the tongue tip, depending on each speaker's tolerance. The positions of T2 and T3 were chosen so that the four tongue pellets were equally spaced. As for labial articulation, one pellet each was attached to the upper lip (UL) and lower lip (LL).

The Wisconsin X-ray Microbeam Database has 118 speech production tasks of various types such as paragraph reading, sentence reading, citation word production, and number sequence production.

Figure 2.1 Approximate pellet placement locations. The x-axis represents the position with respect to the central mandibular incisor (CMI), and the y-axis shows the position with respect to the maxillary occlusal plane (MaxOP). All numbers are in millimeters (mm). The figure is recreated based on Figure 5.2 of the Wisconsin X-ray Microbeam Database manual (Westbury et al., 1990).

Speakers represented in the database were recruited from the University of Wisconsin-Madison as well as the surrounding city, and a majority of speakers spoke an Upper Midwest dialect of American English.
Altogether, speech production data of 57 different speakers (32 females and 25 males) were included in the database. The median age for the speaker sample was 21.1 years (female 21.3 years; male 20.8 years).

2.1.2 Stimuli

All stimuli in the experiment came from the citation word reading tasks of the Wisconsin X-ray Microbeam Database, where the speakers were instructed to "Read each item once, slowly and clearly, with a brief pause between items. Read in column order." The citation word list reading tasks had monosyllabic words, which ensured that factors like prosody or stress — which may induce gestural variation (Byrd and Saltzman, 2003; Katsika, 2012; Byrd and Krivokapić, 2021; Gu, 2023) — were controlled for. I considered two sets of stimuli, one for each claim in (6).1

(6) Claims to be tested related to CV timing:
a. For CV syllables with the same V, a less sonorous C leads to a larger CV lag.
b. For CV syllables with the same C, a more sonorous V leads to a larger CV lag.

1 This repeats the claim in (5) for the readers' convenience.

The list of nonce word stimuli used to test claim (6a) is shown in Table 2.1. These words are from task 16 of the corpus and have the template uhCa. The varying consonants and the constant vowel [ɑ] of the uhCa stimuli make them suitable for testing the claim (6a) that a less sonorous C leads to a larger CV lag for CV syllables with the same V. All words in task 16 were included in the dissertation except for uhga and uhka (marked by ★), since the velar stop and the low vowel both use the tongue dorsum as the measurement sensor, and it is difficult if not impossible to tease apart whether tongue dorsum movement comes from the consonant or the vowel. Each table of stimuli also includes columns labeled C-V Pellets, Sonority, and Difference; I return to these columns later in this section.

     Stimuli   IPA     C-V Pellets   Sonority   Difference
(1)  uhyA      [əjɑ]   T1 - T3       12-17      5
(2)  uhwA      [əwɑ]   Lip - T3      12-17      5
(3)  uhlA      [əlɑ]   T1 - T3       9-17       8
(4)  uhmA      [əmɑ]   Lip - T3      7-17       10
(5)  uhnA      [ənɑ]   T1 - T3       7-17       10
(6)  uhvA      [əvɑ]   LL - T3       6-17       11
(7)  uhzA      [əzɑ]   T1 - T3       6-17       11
(8)  uhzhA     [əʒɑ]   T1 - T3       6-17       11
(9)  uhdA      [ədɑ]   T1 - T3       4-17       13
(10) uhbA      [əbɑ]   Lip - T3      4-17       13
(11) uhfA      [əfɑ]   LL - T3       3-17       14
(12) uhshA     [əʃɑ]   T1 - T3       3-17       14
(13) uhsA      [əsɑ]   T1 - T3       3-17       14
(14) uhtA      [ətɑ]   T1 - T3       1-17       16
(15) uhpA      [əpɑ]   Lip - T3      1-17       16
(★)  uhgA      [əgɑ]   T4 - T2       4-17       13
(★)  uhkA      [əkɑ]   T4 - T2       1-17       16

Table 2.1 List of stimuli for the claim (6a). The stimuli are from task 16 of the Wisconsin X-ray Microbeam Database (Westbury et al., 1990). The target CV sequence is in boldface and underlined. The C-V Pellets column shows the pellet positions for C and V. The Sonority column indicates the C and V sonority respectively, and the Difference column lists the sonority difference.

The list of stimuli used to test claim (6b) is given in Tables 2.2 and 2.3. They are all real words except for the ones with an asterisk before them. For the real words, speakers saw the orthography in the column Stimuli; for each nonce word, speakers saw a real word that exemplified the vowel pronunciation of the nonce word, as in *sud (dud); *soid (Lloyd); *sowd (loud); *sood (wood); *sayed (bayed). The stimuli in Table 2.2 all have the template sVd, where the V varied, and the stimuli are from task 13 of the corpus.
I included the second set of stimuli in Table 2.3 to address a potential issue with the sVd stimuli, namely that the consonant articulation involves an articulator related to that of the following vowel. In contrast, in the bV words, the consonant and vowel articulations use different articulators, namely the lips and the tongue. The stimulus been is from task 9 and back is from task 100.2 All the stimuli in Table 2.2 or Table 2.3 have the same consonant with varying vowels within each set, which makes them suitable for testing the claim (6b) that a more sonorous V leads to a larger CV lag for CV syllables with the same C. Note that all CV syllables used to test the claim (6b) occur in a controlled immediate phonological environment (e.g., #s_d#) except for back and been, which have different coda consonants. The different coda consonants probably do not affect CV timing, based on a previous observation: Gao (2008) found that the CV lag of [ma] for Mandarin speakers was not significantly different from that of [man].

2 There are two repetitions of back for each speaker.

     Stimuli   IPA      C-V Pellets   Sonority   Difference
(1)  seed      [sid]    T1 - T3       3-15       12
(2)  sid       [sɪd]    T1 - T3       3-15       12
(3)  sued      [sud]    T1 - T3       3-15       12
(4)  *sood     [sʊd]    T1 - T3       3-15       12
(5)  *sayed    [sɛɪd]   T1 - T3       3-16       13
(6)  surd      [sɝd]    T1 - T3       3-16       13
(7)  said      [sɛd]    T1 - T3       3-16       13
(8)  *sud      [sʌd]    T1 - T3       3-16       13
(9)  sewed     [sod]    T1 - T3       3-16       13
(10) sawed     [sɔd]    T1 - T3       3-16       13
(11) *sowd     [saʊd]   T1 - T3       3-17       14
(12) side      [saɪd]   T1 - T3       3-17       14
(13) sod       [sɔd]    T1 - T3       3-16       13
(14) *soid     [sɔɪd]   T1 - T3       3-16       13
(15) sad       [sæd]    T1 - T3       3-17       14

Table 2.2 sVd stimuli for claim (6b). The stimuli came from task 13 of the Wisconsin X-ray Microbeam Database (Westbury et al., 1990). The target CV sequences are in boldface and underlined. The C-V Pellets column shows the pellet positions for C and V. The Sonority column indicates the C and V sonority respectively, and the Difference column lists the sonority difference.

     Stimuli   IPA      C-V Pellets   Sonority   Difference
(16) been      [bin]    Lip - T3      4-15       11
(17) back      [bæk]    Lip - T3      4-17       13

Table 2.3 bV stimuli for claim (6b). The stimulus been is from task 9, and back is from task 100 of the Wisconsin X-ray Microbeam Database (Westbury et al., 1990). The target CV sequences are in boldface and underlined.

The pellet positions corresponding to the consonant and vowel gestures of each stimulus are shown in the C-V Pellets column of the above tables, where the C and V measurements are separated by a hyphen. The legend for the pellet positions is in Table 2.4. The stimulus tables document that different pellet positions were used for vowels. During the data annotation process, the pellets were selected to ensure that the articulatory movement correctly reflected the acoustics of the relevant segment, and this selection was done before any analysis was performed on the data. While lip aperture was automatically computed by the mdp_LipAperture algorithm of the mview package (Tiede, 2005), the other pellets' information came directly from the data collection. For the current study, the relevant measurements used for C and V were chosen based on previous literature (Gao, 2008; Hall, 2010; Zhang et al., 2019) and an understanding of the articulatory events involved. For example, /n/ involves a tongue tip alveolar closure gesture, so T1 (tongue tip) was measured for the consonant closure of /n/. Since consonants such as /j/, /z/, and /s/ also involve tongue tip articulation, T1 was also measured for them.
Furthermore, the feature [labial] corresponds to the use of the lip tract variables, so lip aperture was measured for syllables with [w], [m], [b], or [p] (Gao, 2008; Hall, 2010; Zhang et al., 2019). Similarly, the gesture for the labiodental fricatives [f] and [v] was measured by the lower lip. As for the vowels, I evaluated the potential pros and cons of different measurements. We could use the same pellet to measure vowels of all qualities, leading to consistent measurement but a less precise estimate of each vowel. Alternatively, we could use different tongue pellets for each type of vowel — for instance, T2 for a front vowel and T4 for a back vowel. While our original analysis had pellets varying based on what the annotator thought best represented the acoustics of the vowel, on the recommendation of some anonymous experts, I chose to use a single pellet to represent the vowels. Therefore, in order to test the claims stated earlier, I chose to use the one sensor T3 consistently for the sVd stimuli. In general, using the above pellet choices did allow us to identify gestures that were consistent with the acoustic waveforms or spectrographic information for the relevant consonants and vowels.

Index   Gesture
T1      tongue tip
T2      tongue blade
T3      tongue dorsum
T4      tongue root
Lip     lip aperture
LL      lower lip

Table 2.4 Measure indexes and their correlated gestures.

The specific sonority difference for each stimulus is based on the sonority indexes of C and V, and the information can be found in the Difference column (abbreviated from Sonority Difference) of the stimulus tables (Tables 2.1-2.3). The sonority differences were calculated based on the C and V sonority indexes indicated in the Sonority column, where the first number and the second number indicate the sonority index of C and V respectively, based on the hierarchy in Parker (2012) shown previously in Table 1.6.3 Note that English has passive or weak voicing in its voiced stops (Iverson and Salmons, 1995), but they are still labeled as voiced stops.

3 In Table 2.2, if a syllable has a diphthong with two vowels of different sonority indexes, the sonority index of the first vowel of the diphthong is recorded as the sonority index for V, since Hsieh (2017) suggests that vowels in English diphthongs are coordinated sequentially. A second option for indexing diphthong sonority would have been to use the average sonority index of the two targets in the diphthong, which implies that the two vowels in a diphthong are coupled synchronously, as in Dutch and Romanian (Collier et al., 1982; Marin and Goldstein, 2012). In the dissertation, I did not consider the diphthongal realizations of tense vowels. Future research is necessary to probe the nuances of diphthong articulation.

2.1.3 Data annotation and analysis

The kinematic data were annotated in Matlab using the default settings of the lp_findgest algorithm of the mview package, where the gestural onset, gestural offset, nucleus onset, and nucleus offset used the 20% threshold of the velocity profile (Tiede, 2005). The tangential velocity over the x and y axes was considered by the algorithm. The procedure for identifying gestures with the lp_findgest algorithm can be found in Section 1.8.1 of Chapter 1. Basically, to use the algorithm, the annotator first clicked on a point in the relevant articulatory pellet's trajectory. The mouse click point was usually identified by checking the synchronous acoustic information, and it would roughly be the point of the gestural plateau. After manually identifying the point, the algorithm found the maximum constriction point (MAXC), which was the closest velocity minimum to the clicked point.
Then, there were two peak velocity points identified before and after the maximum constriction point, called PVEL and PVEL2 respectively. After that, the gestural onset point was marked by identifying the 20% peak velocity point between the minimum velocity point before PVEL and the peak velocity point (PVEL) itself. The nucleus onset was the 20% peak velocity point between PVEL and the maximum constriction point (MAXC), and the nucleus offset was identified as the 20% peak velocity point in the range between MAXC and the following peak velocity PVEL2. Similarly, the gestural offset was the 20% peak velocity point in the range between PVEL2 and the following velocity minimum. In the current study, the velocity in the lp_findgest algorithm of mview was computed as tangential velocity, since multiple components were displayed.

The lip aperture was automatically calculated by the mdp_LipAperture_old algorithm of the mview package. The mdp_LipAperture_old algorithm computed the Euclidean distance between the Lower Lip sensor and the Upper Lip sensor at each time point. The formula can be found in (7).

(7) $LA = \sqrt{(UL_x - LL_x)^2 + (UL_y - LL_y)^2 + (UL_z - LL_z)^2}$

Based on the information from the acoustics as well as the articulatory movement trajectories, the consonant gesture and the following vowel gesture of each token were annotated by the author. For instance, if it is a low vowel, we would expect the T3 gesture to be lower in the vertical dimension. Figure 2.2 shows a sample gestural annotation for [mɑ], where only the relevant rows LA (in white) and T3 (in red) are displayed for clarity. The white text was added to denote the gestural onset (GON), target onset (TON), target offset (TOF), and gestural offset (GOF) of the consonant gesture, and red text was added for those of the vowel. These landmarks of the articulation were provided automatically by the algorithm. At the bottom of Figure 2.2, there is an axis indicating time, and the timestamps of the gestural onsets, gestural offsets, target onsets, and target offsets were recorded for the consonant gesture and the vowel gesture of each token. Using the timestamp information, CV lags were computed for each CV sequence.

Figure 2.2 Sample annotation of [mɑ] in uhmA of Speaker JW11, Task 16. The white labels refer to the LA (lip aperture) gesture, and the red labels refer to the T3 (tongue dorsum) gesture. The curves show the displacement of sensors on the x and y axes for T3 and other non-LA rows. The y-axis has a lighter color in each row. For both LA and T3, the labels were added by the author to denote gestural onset (GON), target onset (TON), target offset (TOF), and gestural offset (GOF).

After collecting data from the corpus and computing CV lags, the relationship between the CV lags and the sonority difference was analyzed and plotted using the tidyverse package (Wickham et al., 2019). Subsequent mixed-effects modeling was done using the lme4 (Bates et al., 2014) and lmerTest (Kuznetsova et al., 2017) packages in R (R Core Team, 2017), where each CV lag was modeled as a function of sonority difference, with participant, consonant duration, and sometimes word as random intercepts. Consonant duration was included in the model to address the alternative explanation that a longer consonant duration is related to a larger CV lag.
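The following is a minimal sketch of this plotting and modeling pipeline in R; the data frame cv_data and its column names (cv_lag, son_diff, participant, word, c_dur) are hypothetical placeholders, not the actual analysis scripts of the dissertation.

```r
library(tidyverse)
library(lme4)
library(lmerTest)

# Hypothetical data frame `cv_data`: one row per token, with cv_lag in ms,
# son_diff (V sonority index minus C sonority index), and grouping variables.
ggplot(cv_data, aes(x = son_diff, y = cv_lag)) +
  geom_jitter(size = 0.9, width = 0.2, alpha = 0.4) +
  labs(x = "Sonority difference", y = "CV lag based on target onset (ms)")

# CV lag as a function of sonority difference, with participant, word, and
# consonant duration entered as random intercepts, as described above.
m <- lmer(cv_lag ~ son_diff + (1 | participant) + (1 | word) + (1 | c_dur),
          data = cv_data)
summary(m)  # lmerTest supplies Satterthwaite df and p-values
```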
Also, word was not used as a random intercept if, in a given subset, there was a one-to-one mapping between sonority difference and word.

2.2 Results

2.2.1 Overall analysis

All 57 speaker datasets in the Wisconsin X-ray Microbeam corpus were included in the current analysis, and altogether 3399 tokens were measured. Excluding [gɑ] and [kɑ], 3214 tokens entered the overall analysis of all the data, and the claim I proposed was supported. Specifically, I claimed that the CV lag positively correlates with the sonority difference between the C and V. As can be observed in Figure 2.3, there is indeed such a positive correlation between CV lag and the sonority difference.

Figure 2.3 CV lag increases with sonority difference for all data based on target onset. The geom_jitter option (size = 0.9) is used to spread out the overlapping dots for clarity.

Furthermore, the mixed effects model results, presented in Table 2.5, are also consistent with the visual inspection of the data above. Though I have not presented the data here in the interest of concision, CV lag based on other landmarks such as gestural onset, gestural offset, and target offset all exhibited the same expected pattern.

Target Onset          Estimate   Std. Error   df      t value   Pr(>|t|)
(Intercept)           -1.34      35.39        30.39   -0.04     0.97
Sonority difference   17.91      2.84         29.97   6.30      <0.0001

Table 2.5 Mixed effects model results for all data.

One possible explanation of the observation may be that there is a positive correlation between sonority difference and consonant duration, and that a longer consonant duration is related to a larger lag. To evaluate this alternative explanation, I measured the gesture duration of the consonant and included it in the statistical model. Adding consonant duration as one more random intercept in the mixed effects model still shows a positive correlation between sonority difference and CV lag based on target onset, as in Table 2.6. Henceforth, I will include consonant duration as a random intercept in all comparisons to control for this confound.

Target Onset          Estimate   Std. Error   df      t value   Pr(>|t|)
(Intercept)           25.95      27.84        30.33   0.93      0.34
Sonority difference   16.49      2.21         28.60   7.45      <0.0001

Table 2.6 Mixed effects model results for all data. Random intercepts: C duration, word, participant.

While an analysis with all the data and the use of a sonority difference score as a predictor is straightforward to model statistically and has higher statistical power, it does have some issues. First, the analysis assumes that the sonority scale is linear and not just relative, which is contrary to most phonologists' beliefs. Furthermore, it collapses across different articulators or gestures. For these reasons, I also analyzed more nuanced sub-groups of the data. Specifically, I looked at sets of stimuli that control for the place of articulation or gesture of the consonant. I also looked at comparisons where there is agreement on the predicted lag variation even if different sonority scales are considered. The results for the subsets for claim (5a) are presented in Section 2.2.2, and those for claim (5b) in Section 2.2.3. Analyzing the subgroups also allows us to address potential alternative explanations of the observation such as jaw movement or place of articulation.
41 2.2.2 Claim 5a: the same vowel with different consonants 2.2.2.1 Different consonants using lips as the primary articulator To eliminate gesture as a potential confounding variable, the results for stimuli with different consonants using lips as the primary articulators were separated into the lip aperture group (target CV sequences [wA], [mA], [bA], and [pA]) and lower lip group (target CV sequences [fA] and [vA]). Figure 2.4 shows the results for the lip aperture group ([wA], [mA], [bA], and [pA]), where the CV lag based on target onset clearly shows the expected pattern that CV lag increases with sonority difference. Additionally, the mixed effects model indicated a statistically significant positive slope as in Table 2.7.4 Figure 2.4 CV lag based on target onset for lip aperture consonants. This sub-analysis of the data only involves vowels of one quality, and the variation of gestural timing is found. Therefore, vowel quality alone cannot be used to account for the observation. Target Onset (Intercept) Sonority difference Estimate Std. Error -20.17 14.71 14.54 1.20 df 154.23 130.12 t value Pr(>|t|) -1.39 12.26 0.17 <0.0001 Table 2.7 Mixed effects model results for lip aperture consonants. Random intercepts: C duration, participant. Even though stimuli from the lip aperture group (target CV sequences [wA], [mA], [bA], and [pA]) all involve the same oral gesture for the consonant, they do not share the same manner of articulation or voicing; however, manner or voicing may affect gestural overlap (Du and Gafos, 2023). To control 4In fact, all other CV lag landmarks — CV lag based on target onset, target offset, and gestural offset — exhibited significant results in the expected direction. This is mostly true for other sub-analyses of the study, too. 42 for voicing, [mA] and [bA] were compared. Note, the comparison also controls for jaw movement. The descriptive plot for the CV lag comparison (Figure 2.5) and the corresponding mixed-effects model (Table 2.8) suggest that CV sequences with oral bilabial stops generally induced a larger CV lag than their counterparts with nasal consonants. This replicates the observation in Shaw and Chen (2019) mentioned above for Mandarin that CV lag for nasal stop is shorter than CV lag for oral stop. Furthermore, the major difference between the [m] and [b] articulations is the lowering or raising of velum, and there is no obvious articulatory reason that velum movement by itself should cause gestural lag variation between the lips and the tongue. The finding suggests that jaw movement may not be a valid alternative account of the observed gestural lag variation, since in this pair the two segments [m] and [b] involve similar degrees of jaw movement. Therefore, by comparing CV lag for [mA] and [bA], our claim is more strongly supported. Figure 2.5 CV lag based on target onset comparison for [mA], [bA]. Target Onset (Intercept) Sonority difference Estimate Std. Error -19.57 14.73 42.86 3.68 df 47.43 45.00 t value Pr(>|t|) -0.46 4.01 0.65 0.0002 Table 2.8 Mixed effects model results for [mA], [bA]. Random intercepts: C duration, participant. Adding or not adding C duration in the model as a random effect does not change the model results. Previous studies have found that consonant manner and place could lead to gestural coordination variation (Bombien et al., 2013; Wright, 1996; Pouplier et al., 2022). 
Comparing the CV lag for [pA] and [bA] can control for the potential confounding factors since the pair has the same manner and place of articulation, as well as jaw movement. The results for the [pA] and [bA] comparison can 43 be found in Figure 2.6 and Table 2.9. There is a significant positive correlation between sonority difference and CV lag for stimuli with voiced and voiceless bilabial stops and the same vowel. This shows that manner and place of articulation cannot account for the observation, which strengthens the claim supporting the link between sonority and gestural timing. Figure 2.6 CV lag based on target onset comparison for [pA], [bA]. Target Onset (Intercept) Sonority difference Estimate Std. Error -15.23 14.40 55.06 3.77 df 47.22 45.51 t value Pr(>|t|) -0.28 3.82 0.78 0.0004 Table 2.9 Mixed effects model results for [pA], [bA]. Random intercepts: C duration, participant. Figure 2.7 and Table 2.10 show the results for stimuli involving lower lip as the primary consonant articulator. The visual inspection of the plot suggested that CV lags based on target onsets increase with the rise in sonority difference, and correspondingly a positive correlation was observed in the statistical modeling though the effect is not statistically significant. Given that the estimate is in the same direction, I suggest that this might be a power issue, related to a limited amount of data. Figure 2.7 CV lag based on target onset comparison for [fA], [vA]. 44 Target Onset (Intercept) Sonority difference Estimate Std. Error 92.13 7.11 57.05 4.54 df 44.08 42.41 t value Pr(>|t|) 1.62 1.57 0.11 0.13 Table 2.10 Mixed effects model results for [fA], [vA]. Random intercepts: C duration, participant. 2.2.2.2 Different consonants using tongue tip as the primary articulator In this section, I present the results of my analysis for the stimuli where tongue tip was used as the primary articulator for the consonant. Within this group, there are nine uhCa nonce words with the target CV sequences [jA], [lA], [nA], [zA], [ZA], [dA], [SA], [sA], and [tA]. The results for stimuli involving the T1 pellet for the consonant can be seen in Figure 2.8 and Table 2.11. Both the visual inspection and the mixed effects model again suggest a positive correlation between CV lag and sonority difference. Figure 2.8 CV lag based on target onset for consonants using T1 gesture. The geom_jitter option (size = 0.9) is used to spread out the overlapping dots for clarity. Target Onset (Intercept) Sonority difference Estimate Std. Error 21.72 15.53 44.31 3.75 df 6.77 6.70 t value Pr(>|t|) 0.49 4.14 0.64 0.005 Table 2.11 Mixed effects model results for consonants using T1 gesture. Random intercepts: C duration, participant. 45 However, despite the clear positive relationship, the T1 group analyzed above involves different places of articulation for the consonant. Namely, six of them ([lA], [nA], [zA], [dA], [sA], and [tA]) are alveolar consonants, while [Z] and [S] are postalveolar and [j] is palatal. To make sure stimuli with the same place of consonant articulation are compared to each other, the stimuli with an alveolar consonant are analyzed as a whole, as in Figure 2.9 and Table 2.12. We still see a significant positive slope with roughly the same magnitude of difference. Figure 2.9 CV lag based on target onset for alveolar consonants. The geom_jitter option (size = 0.9) is used to spread out the overlapping dots for clarity. Target Onset (Intercept) Sonority difference Estimate Std. 
Error 23.36 14.24 45.29 3.68 df 3.71 3.68 t value Pr(>|t|) 0.52 3.87 0.64 0.02 Table 2.12 Mixed effects model results for alveolar consonants. Random intercepts: C duration, participant. Note that the group of stimuli with alveolar consonants ([lA], [nA], [zA], [dA], [sA], and [tA]) involve different manners of articulation and voicing for the consonant articulation. To exclude the account that jaw movement is the cause of the gestural lag variation, and to control for voicing, the voiced alveolar consonants [nA] and [dA] are compared in Figure 2.10 and Table 2.13. Again, 46 the comparison clearly shows that the larger sonority difference between C and V significantly correlates with larger CV lags for [nA] and [dA]. Figure 2.10 CV lag based on target onset comparison for [nA], [dA]. Target Onset (Intercept) Sonority difference Estimate Std. Error -80.88 22.09 49.13 4.24 df 90.00 90.00 t value Pr(>|t|) -1.65 5.21 0.10 <0.0001 Table 2.13 Mixed effects model results for [nA], [dA]. Random intercepts: C duration, participant. Adding or not adding C duration as a random effect in the model yielded the same results. To control for orality and jaw movement, I compared [tA] and [dA] as in Figure 2.11 and Table 2.14. The results show that there is a significant positive correlation between gestural lag and sonority difference for the two stimuli that differ in voicing. The above result shows that manner or place of articulation, or jaw movement, cannot account for the observation, since the stimuli are the same in the two aspects but still differ in gestural timing. Figure 2.11 CV lag based on target onset comparison for [tA], [dA]. 47 Target Onset (Intercept) Sonority difference Estimate Std. Error 134.72 5.84 37.42 2.53 df 43.12 40.82 t value Pr(>|t|) 3.60 2.31 0.00 0.03 Table 2.14 Mixed effects model results for [tA], [dA]. Random intercepts: C duration, participant. For a similar reason, I also looked at [zA] and [sA], which are different in voicing. The results are in Figure 2.12 and Table 2.15. The visual inspection and the mixed effects modeling generally show a positive correlation, but it is not significant. Given the expected direction of the estimate, the insignificant effect is likely due to the lack of statistical power.5 Figure 2.12 CV lag based on target onset comparison for [zA], [sA]. Target Onset (Intercept) Sonority difference Estimate Std. Error 105.92 9.74 72.41 5.77 df 50.04 47.96 t value Pr(>|t|) 1.46 1.69 0.15 0.10 Table 2.15 Mixed effects model results for [zA], [sA]. Random intercepts: C duration, participant. 2.2.3 Claim 5b: the same consonants with different vowels In general, I conclude that claim 5a — for CV syllables with the same V, a less sonorous C leads to a larger CV lag — has been supported by the Wisconsin Microbeam corpus data. I now turn to probing the second claim by keeping the consonant constant and varying the vowel. As mentioned before, fifteen nonce words with the template sVd, with the crucial vowel in between, were measured along with two real words back and been. I first present the results for the sVd words (Section 2.2.3.1) and then present the results for the bV real words (Section 2.2.3.2). 5I did not compare the postalveolar fricative [Z] and [S] since T1 is not a precise pellet position to measure post-alveolar consonants. 48 2.2.3.1 sVd words In this section, I looked at the 15 sVd stimuli in our stimulus set from the corpus. 
The results for the 15 sVd stimuli that are shown in Figure 2.13 do not suggest a clear positive relationship, though the estimate is in the expected direction. Moreover, the mixed effects models for target onsets indicate a positive correlation as in Table 2.16, though the positive correlation is not statistically significant. Figure 2.13 CV lag based on target onset for sVd words. The geom_jitter option (size = 0.9) is used to spread out the overlapping dots for clarity. Target Onset (Intercept) Sonority difference Estimate Std. Error 187.61 6.25 69.42 5.35 df 13.53 13.47 t value Pr(>|t|) 2.70 1.17 0.02 0.26 Table 2.16 Mixed effects model results for sVd words. Random intercepts: C duration, word, participant. The null result, however, should be interpreted with caution since there is a tradeoff with using the same sensor to measure all vowels, which are of different quality, as mentioned before. Given that the estimate was in the expected direction, the null result is potentially due to additional variance in the measurements. It is possible that the use of the same pellet for all vowels resulted in imprecise vowel measurements, and therefore led to more noise. Note, the issue is further 49 exarcerbated because the consonants are alveolar. To address the issue of compatibility and the issue of varying vowel quality, I present the following analysis which involves a subset of the original dataset. Originally, for each stimulus, either T2 or T3 was used for vowel measurement, depending on what I thought was most indicative of the vowel in the acoustics — this was done prior to any data analysis and was based on the judgment of the annotator. However, to address the worry of consistency in the use of pellets, I only analyzed the measurements with the pellet that was used for the majority of the tokens of a stimulus. For instance, for seed, there are 50 measurements using T2 and 6 measurements using T3, so the 50 CV syllables with T2 measurement were included and analyzed in the subset. Table 2.17 showed vowel measurement for each stimulus. It seems that overall T2 matches more with the acoustic information. (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) Stimuli seed sid sued *sood *sayed surd said *sud sewed sawed *sowd side sod *soid sad IPA [sid] [sIIId] [sud] [sUUUd] [sEEEId] [s3ô3ô3ôd] [sEEEd] [s222d] [sod] [sOOOd] [saud] [saId] [sOOOd] [sOOOId] [sæææd] V Pellet T2 T2 T2 T2 T3 T2 T2 T2 T2 T2 T2 T2 T2 T2 T2 Table 2.17 sVd stimuli for claim (5b). The stimuli came from task 13 of the Wisconsin X-ray Microbeam Database. The target CV sequences are in boldface and underlined. The C-V Pellets column shows the pellet positions for C and V. The results for the subset of sVd stimuli can be found in Figure 2.14 and Table 2.18. The estimate is much larger. Furthermore, if word is removed as a random intercept, I also have more confidence to reject the null hypothesis (estimate = 12.1; Pr(>|t|)=0.07). These are just speculative thoughts at this point. However, given the direction of the estimate, I believe the insignificant result 50 is due to insufficient statistical power and the complexity of vowel measurement. Analyzing more data may bring out the positive correlation more significantly. Figure 2.14 CV lag based on target onset for a subset of sVd words. The geom_jitter option (size = 0.9) is used to spread out the overlapping dots for clarity. Target Onset (Intercept) Sonority difference Estimate Std. 
Error 142.47 12.09 161.43 12.49 df 11.70 11.80 t value Pr(>|t|) 0.88 0.97 0.40 0.35 Table 2.18 Mixed effects model results for a subset of sVd words. Random intercepts: C duration, word, participant. 2.2.3.2 bV real words While there is some separability between the tongue tip gesture and the tongue body gesture to parse out the initial consonant and vowels in sVd stimuli, there is still a possibility of interference between the gestures. To resolve this issue, I also looked at bV words, which have different C and V articulators. The two bV real words in question have a bilabial consonant measured by lip aperture and a vowel measured by T3. The expected pattern cannot be clearly seen in target onsets as exemplified by Figure 2.15 and Table 2.19. 51 Figure 2.15 CV lag based on target onset for bV real words. Target Onset (Intercept) Sonority difference Estimate Std. Error 193.07 -1.49 83.07 6.88 df 32.38 31.25 t value Pr(>|t|) 2.32 -0.22 0.03 0.83 Table 2.19 Mixed effects model results for bV real words. Random intercepts: C duration, participant. Given the issue of the use of a uniform pellet discussed earlier, I followed up with analysis similar to the sVd stimuli, I also subset the bV stimuli. Namely, I annotated according to the acoustics, and only included the majority pellet in the analysis. Again, I would like to remind the reader that this annotation was prior to any analysis. For most speakers, the T3 pellet best matches the vowel acoustics in the back and been stimuli. (16) (17) Stimuli been back IPA V Pellet [bin] [bæææk] T3 T3 Table 2.20 bV stimuli for claim (5b). The stimuli been is from task 9 and back is from task 100 of the Wisconsin X-ray Microbeam Database. The target CV sequences are in boldface and underlined. The results for the subset can be found in Figure 2.16 and Table 2.21. For bV stimuli, there is a significant positive correlation between sonority difference and lag in CV syllables. I believe that there are some significant results for bV but not sVd due to the separate lip and tongue measurements for C and V respectively for bilabial stimuli. 52 Figure 2.16 CV lag based on target onset for a subset of bV real words. Target Onset (Intercept) Sonority difference Estimate Std. Error -63.13 21.06 72.48 6.03 df 34.26 33.59 t value Pr(>|t|) -0.87 3.49 0.39 0.001 Table 2.21 Mixed effects model results for a subset of bV real words. Random intercepts: C duration, participant. 2.2.4 Could the significant positive correlation be an artifact of vowel displacement? Shaw and Chen (2019) observed that CV lag based on gestural onsets is negatively correlated with the displacement of the vowel from gesture onset to the achievement of the target. Essentially, if the tongue body has to move more to achieve the vowel target, then the movement starts earlier, and consequently, the CV lag based on gestural onsets is smaller. It is therefore logically possible that my results are somehow artifactual and based on the relation observed by Shaw and Chen (2019). To check for this possibility, I ran another analysis wherein I added another fixed effect, namely, the horizontal distance from vowel gesture onset to target achievement.6 This additional analysis involved all the stimuli. I chose the whole dataset as it was the largest stimulus set and therefore the analysis would suffer the least in terms of statistical power from the addition of a post-hoc variable. Note that vowel displacement could be an estimate of jaw movement. 
Therefore, the post-hoc analysis also serves as another exploration of the potential effect of jaw movement on gestural coordination. 6Shaw and Chen (2019) observed that this was a better predictor than the Euclidean distance traversed. Therefore, I employ this measure. 53 2.2.4.1 Post-hoc analyses with vowel displacement using all the stimuli The model with both sonority difference and vowel displacement as independent variables is in Table 2.22. As can be seen from the table, the model shows that there is a negative relationship between CV lag based on target onsets and vowel displacement. This replicates the findings of Shaw and Chen (2019). Crucially, for our purposes, the effect of sonority difference is still clearly present and to almost the same degree as in the original model presented before (Estimate in original model = 16.49 vs. estimate in the current model = 15.76). Again, I interpret the result as showing that the main finding in this article, that there is a positive relationship between sonority difference and CV lag, once there is an adjustment for the contributory effect of vowel displacement. Estimate Std. Error (Intercept) Sonority difference Vowel displacement 29.20 15.76 -3.97 25.93 2.06 0.67 df 30.83 29.00 901.04 t value 1.13 7.65 -5.95 Pr(>|t|) 0.27 < 0.00001 < 0.00001 Table 2.22 Mixed effect model for all stimuli with sonority difference and vowel displacement as fixed effects. Random intercepts: stimuli, participants, and consonant duration. 2.2.5 Could the significant positive correlation be a confound of jaw movement? It is obvious that jaw movement can vary by consonant (Gracco and Lofqvist, 1994). Furthermore, it has been observed that jaw movement correlates with variation in gestural coordination (Gracco, 1994; Gracco and Lofqvist, 1994; Mooshammer et al., 2003; Redford, 1999; MacNeilage and Davis, 2000). Could the significant positive correlation in the current study be actually due to a confound of jaw movement? I observed both the voiced C-voiceless C comparison, as well as the nasal C-oral C comparison exhibited a positive correlation between sonority and gestural timing, despite having putatively similar jaw movements. However, it is still worth confirming for those comparisons that jaw movement is not the (unique) source CV lag variation observed here. In the following subsections, I tested for this possibility by adjusting for any effect of jaw movement in our data. Consonant displacement was chosen to be an approximation of jaw movement. Inspired by Shaw and Chen (2019), consonant displacement means the horizontal distance from consonant gesture onset to target achievement. I evaluated 54 consonant displacement on the whole dataset, and all pairs that I claimed are controlled for jaw movement. These are pairs of stimuli that differ in voicing or nasality. 2.2.5.1 Post-hoc analyses with consonant displacement using all the stimuli The model with both sonority difference and consonant displacement as fixed effects is in Table 2.23. I can see that there is a significant positive correlation between consonant displacement and CV lag (estimate = 2.19). However, this does not show that the correlation with sonority difference was confounded since the sonority difference effect remains effectively unaltered (estimate in original model = 16.49 vs. estimate in the current model = 16.76). 
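As a minimal sketch of this kind of post-hoc adjustment (again with the hypothetical cv_data columns used earlier, plus a hypothetical displacement column), one can add the displacement measure as a fixed effect and compare the sonority-difference estimate with that of the original model.

```r
# Post-hoc model sketch: displacement (vowel or consonant) added as a fixed effect.
m_disp <- lmer(cv_lag ~ son_diff + displacement +
                 (1 | participant) + (1 | word) + (1 | c_dur),
               data = cv_data)
fixef(m_disp)["son_diff"]   # compare with the sonority-difference estimate of the original model
```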
In the following subsections, I am going to confirm that the pairwise comparison which I believed controlled for jaw movement indeed exhibits little effect of consonant displacement. Estimate Std. Error (Intercept) Sonority difference C displacement 24.78 16.76 2.19 24.31 1.93 0.54 df 30.07 28.02 266.73 t value 1.02 8.70 4.03 Pr(>|t|) 0.32 < 0.00001 0.0001 Table 2.23 Mixed effect model for all stimuli with sonority difference and consonant displacement as fixed effects. Random intercepts: stimuli, participants, and consonant duration. 2.2.5.2 Post-hoc analyses with consonant displacement using the voicing pairs There were two pairs of stimuli which differ in consonant voicing used to control for jaw movement — [pA, bA] and [tA, dA]. Here, I test whether the pairs truly control for consonant displacement, which is an approximation of jaw movement. I first look at the [pA, bA] pair. The model with both sonority difference and consonant displacement as fixed effects is in Table 2.24. There is an insignificant negative correlation between consonant displacement and CV lag variation. Since the effect size for sonority difference still remains similar when one considers consonant displacement (estimate in original model = 14.40 vs. estimate in the current model = 15.03), I conclude that for [pA, bA] comparison, the observed positive correlation is not a confound of jaw movement. 55 (Intercept) Sonority difference C displacement Estimate Std. Error -53.17 15.03 -2.31 58.65 3.76 1.28 df 53.91 46.18 80.64 t value -0.91 4.00 -1.81 Pr(>|t|) 0.37 0.0002 0.07 Table 2.24 Mixed effect model for [pA], [bA] stimuli with sonority difference and consonant displacement as fixed effects. Random intercepts: participants and consonant duration. I then looked at another pair where the stimuli differ in consonant voicing — [tA, dA]. The model with both sonority difference and consonant displacement as fixed effects is in Table 2.25. When considering consonant displacement, there is still a significant positive correlation between CV lag and sonority difference (Estimate in original model = estimate in the current model = 5.84). The effect size, though statistically significant, is smaller than other pairs. This is probably due to the fact that coronal consonants share the same tongue articulator with vowels. Including consonant displacement in the model shows that when controlled for jaw movement, the [tA, dA] pair exhibited a positive correlation between sonority difference and CV lag. (Intercept) Sonority difference C displacement Estimate Std. Error 136.09 5.84 -1.34 39.03 2.63 1.47 df 44.78 41.90 64.39 t value 3.49 2.22 -0.91 Pr(>|t|) 0.00 0.03 0.36 Table 2.25 Mixed effect model for [dA], [tA] stimuli with sonority difference and consonant displacement as fixed effects. Random intercepts: participants and consonant duration. 2.2.5.3 Post-hoc analyses with consonant displacement using the nasality pairs In the previous subsection, I looked at two pairs that differ in consonant voicing. I confirmed that those two pairs showed significant positive correlations between CV lag and sonority, even when controlled for jaw movement. In this subsection, I conduct similar analyses for pairs of stimuli that differ in nasality of the consonant — [bA, mA] and [nA, dA]. I claimed that the pairs should have controlled jaw movement, but I am going to confirm it here. The model for [bA, mA] is shown in Table 2.26. 
Even though there is an insignificant negative correlation between consonant displacement and CV lag, there is still a significant positive correlation between sonority and CV lag (estimate in the original model = 14.73 vs. estimate in the current model = 13.73).
(Intercept) Sonority difference C displacement Estimate Std. Error -35.18 13.73 -2.24 44.39 3.76 1.53 df 52.30 47.69 51.63 t value -0.79 3.65 -1.47 Pr(>|t|) 0.43 0.001 0.15
Table 2.26 Mixed effect model for [bA], [mA] stimuli with sonority difference and consonant displacement as fixed effects. Random intercepts: participants and consonant duration.
The model for [nA, dA] is shown in Table 2.27. Again, there is a negative correlation between consonant displacement and CV lag. However, the relationship between sonority and CV lag appears to be unchanged (estimate in the original model = 22.09, and estimate in the current model = 22.68).
(Intercept) Sonority difference C displacement Estimate Std. Error -81.43 22.68 -5.08 48.35 4.18 2.57 df 89.00 89.00 89.00 t value -1.68 5.43 -1.98 Pr(>|t|) 0.10 < 0.00001 0.05
Table 2.27 Mixed effect model for [nA], [dA] stimuli with sonority difference and consonant displacement as fixed effects. Random intercepts: participants and consonant duration.
2.2.6 Summary
In general, experiment 1, using English corpus data, showed that there is a positive correlation between CV lag and sonority difference. The overall results, including results for subgroups, are summarized in Table 2.28, and the pairwise comparison results can be found in Table 2.29. In Table 2.28, all the data as well as subgroups of the whole dataset showed the expected positive correlation except for the sVd stimulus group. The expected positive correlation can still be observed when considering vowel or consonant displacement. In Table 2.29, the pairs that control for voicing, nasality, or vowel height mostly exhibited a significant positive correlation between CV lag and sonority difference. However, the comparison between voiced and voiceless fricatives did not show the expected pattern. The non-significant results for the sVd stimuli may be due to the interaction between the C and V measures, since both use tongue sensors. Also, the non-significant results for the fricative pairs may be due to the fact that those pairs do not differ in voicing in their actual realization. Future research may consider coding the sonority index according to the voicing realizations of obstruents.
Dataset (English corpus data): Estimate (sonority diff) / Estimate (displacement)
All English corpus data: 16.49 ***
All English corpus data, V displacement: 15.76 *** / -3.97 ***
All English corpus data, C displacement: 16.76 *** / 2.19 ***
sVd stimuli (subset): 12.09
T1 C stimuli: 15.53 **
Alveolar C stimuli (la, na, za, da, sa, ta): 14.24 *
Lip aperture (wa, ma, ba, pa): 14.71 ***
Table 2.28 Summarizing the results of experiment 1. *** means that p ⩽ 0.001; ** means that p ⩽ 0.01; * means that p ⩽ 0.05.
Pairwise comparison / Stimulus pair: Estimate (sonority difference)
Nasality differ / ma, ba: 14.73 ***
Nasality differ / na, da: 22.09 ***
Voicing differ, stop / pa, ba: 14.40 ***
Voicing differ, stop / da, ta: 5.84 *
Voicing differ, fricative / fa, va: 7.11
Voicing differ, fricative / sa, za: 9.74
Vowel height (subset) / been, back: 21.06 ***
Table 2.29 Summarizing pairwise comparison of experiment 1. *** means that p ⩽ 0.001; ** means that p ⩽ 0.01; * means that p ⩽ 0.05.
2.3 Conclusion
Even though the main claim was supported, this experiment on corpus data was not ideal for the following reasons.
First, the stimuli in the corpus study were restricted because of using an existing corpus. For claim (5a) that for CV syllables with the same V, a less sonorous C leads to a larger CV lag, only nonce words with low vowel [A] were involved. It is not clear whether real words with other vowels, and even [A] will support the claim. Moreover, for claim (5b) that for CV syllables with the same C, a more sonorous V leads to a larger CV lag, only [s] and [b] were tested as the onsets, and analyzing a variety of consonants is necessary. Second, experiment 1 was based on English corpus data, lacking a cross-linguistic validation. To address the above issues, EMA 58 experiments using carefully designed English and Mandarin real words are proposed to evaluate the main claim. 59 CHAPTER 3 EXPERIMENT 2: ENGLISH EMA STUDY The study of English corpus data showed that there was a positive correlation between CV lag and sonority difference. The same pattern has been consistently observed for stimuli with controlled C and varying V, as well as for stimuli with controlled V and varying C. The positive correlation has also been observed when considering alternative factors such as jaw movement, voicing, or nasality. However, it has limited stimulus selection. To address this issue, I conducted an EMA experiment in English, as described in the following sections of this chapter. 3.1 Methods 3.1.1 Data collection The NDI Vox-EMA System (VOX) manufactured by Northern Digital Inc. (NDI) was used to collect articulatory data of the study in the Phonetics Lab of Michigan State University. Articulatory data was collected at a sampling rate of 400 Hz, and acoustic data was recorded simultaneously at a sampling frequency of 16 kHz. 8 sensors were attached to the participants’ articulators and other reference points using PeriAcryl Oral Tissue Adhesive. Specifically, 3 sensors were glued to the tongue — one on the tongue tip (TT), about 1 cm from the anatomical tip; one on the tongue dorsum (TD), as far back as comfortable; one between tongue tip (TT) and tongue dorsum (TD) so that there was equal distance between two sensors — tongue blade (TB). Two sensors, used to track lip movements, were glued to the vermillion border of the upper and lower lips respectively. Reference sensors were glued to the left and right mastoid, and nasion to correct head movement. Some rough positions of the sensors can be found in Figure 3.1. 3.1.2 Participant recruitment English participants were recruited through email. Recruitment emails were sent to sections of LIN 401 Introduction to Linguistics and IAH 231C Roles of Language in Society of the 2023 Fall semester. See the recruitment email in Appendix A. Potential participants first completed a pre-screening survey via Google Forms. The pre-screening survey can be found in Appendix B. 60 LM(Ref) RM(Ref) Ref TD TBTT UL LL Figure 3.1 Approximate pellet placement locations. Ref: reference sensor. LM: left mastoid; RM: right mastoid. UL: upper lip. LL: lower lip. TT: tongue tip. TB: tongue blade. TD: tongue dorsum. Then, they were contacted by the experimenter to schedule the actual experiment. 3.1.3 Experimental set-up and procedure The participants were given time to read and sign the consent forms when they first greeted the experimenters. Before or after the participant read the consent form, one experimenter briefly introduced EMA, its setup, and the participants’ task in layman terms. 
Participants were then instructed to use a disposable toothbrush to brush the midline of their tongue, before they rinsed their mouth with water. Participants were provided with water to drink during the experiment, and they were advised to use the restroom before the actual experiment, which lasted 45 minutes to 1.5 hours. When ready, participants sat facing a computer screen. The field generator was placed towards the left side of the participant, and its position was adjusted after the sensor application so that the sensors were roughly at the center of the field. See Figure 3.2 for a photo of the lab setup. During the experiment, one experimenter sat to the right of the participant to monitor the recording process. Another experimenter was sitting behind the participant to fill in the protocol file, which was a document about the experimental procedure, containing information about mispronounced words or falling sensors. After the attachment of sensors on mastoids and the nasion, participants were instructed to bite a bite plane in a still position, and the still position of the sensors was recorded 3 times where each recording was 2 seconds. Then the lingual sensors were glued to the participants’ mouths. First, sensors were glued to 61 Figure 3.2 The set-up of the EMA lab from the view of the participant. the tongue, from the back to the front. Then, sensors were glued to the upper and lower lips of the participants. Dental edible pigments were used to denote 3 points for tongue sensors. The first mark was the 1cm point to the tongue tip, and this was the Tongue Tip sensor. The second mark was about 5-6 cm from the tongue tip (not Tongue Tip sensor point), or in some cases, the furthest back that the participant could tolerate without discomfort. This was the Tongue Dorsum sensor. Lastly, the third mark denoting the position of the Tongue Blade sensor was on the midpoint of the first two marks. At the end of the experiment session, 30 dollars in cash was given to the participant, and participants signed the receipt to indicate that they had received the money. Altogether data from 18 English participants were collected, and those from 10 were annotated and analyzed in the current study. For the 10 participants, there were already 7268 manual annotations of gestures for the EMA English data. I annotated the data according to the reverse order of data collection, and the rest of the data were not annotated due to time constraints. Among the 10 participants, 9 were female and 1 was male, and the average age was 19.9 years old. 62 3.1.4 Data processing and annotation The data collected were head-corrected and annotated in Matlab. I used the findgest algorithm, where gestural onset, gestural offset, nucleus onset, and nucleus offset used the 20% threshold of the velocity profile (Tiede, 2005). The tangential velocity rather than absolute velocity was used since multiple components are displayed. I annotated the data according to the following assumptions. The gesture was selected based on previous literature, the understanding of articulatory movement, and by considering the consistency of the overall analysis. For instance, alveolar consonants like [t,d,n,s,l] were measured by tongue tip, and [p, b, m, w] were measured by lip aperture. The vowels were annotated by tongue dorsum, which is to ensure the consistency of the comparison across all stimuli. 
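For concreteness, the sketch below illustrates the landmark logic just described, namely locating gesture onset and target onset at 20% of peak tangential velocity, and deriving the CV lag and displacement measures from those landmarks. The actual annotation was done with the findgest algorithm (Tiede, 2005) in Matlab; this R fragment is only an illustrative re-implementation of the logic under stated assumptions, and all function and variable names here are hypothetical.

```r
# Illustrative sketch of the 20%-of-peak-velocity landmark logic (not the findgest code).
fs <- 400  # EMA sampling rate (Hz)

# Tangential velocity (mm/s) of a sensor from its x/y position traces (mm)
tangential_velocity <- function(x, y, fs) {
  vx <- c(0, diff(x)) * fs
  vy <- c(0, diff(y)) * fs
  sqrt(vx^2 + vy^2)
}

# Within a hand-delimited movement window: gesture onset is the first sample where
# velocity rises above 20% of the peak; target onset (start of the gestural plateau)
# is the first sample after the peak where velocity drops back below that threshold.
find_landmarks <- function(vel, threshold = 0.20) {
  peak <- which.max(vel)
  cutoff <- threshold * vel[peak]
  onset <- which(vel[1:peak] >= cutoff)[1]
  target_onset <- peak + which(vel[(peak + 1):length(vel)] <= cutoff)[1]
  c(onset = onset, target_onset = target_onset)
}

# CV lag (ms): V target onset minus C target onset, converted from samples to ms
cv_lag_ms <- function(c_target, v_target, fs) (v_target - c_target) / fs * 1000

# Displacement proxy (used later as a jaw-movement estimate): distance of a sensor
# coordinate between gesture onset and target onset
displacement <- function(x, onset, target_onset) abs(x[target_onset] - x[onset])
```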
The disadvantage of the decision is that some sensors may not be the exact tongue gesture used, but some tradeoff has to be made and arguably tongue gestures are not independent of each other. In each annotation, the click was on the mid-point of the gestural movement area. In annotating a lip aperture, for instance, the closure was located roughly by the acoustics and then automatically by the algorithm. During the annotation process, sometimes I was not confident about the annotation. In such cases, the annotation was marked “Questionable”, “MultipleMeasure”, “NoneDefault”, “SensorUnavailable”, “Mispronounced”. The label names are intuitive, but their meanings can be found in Appendix E. For instance, if a word (A) was pronounced incorrectly (as B), then the actual pronunciation (B) as well as the label “Mispronounced” were coded in the datasheet. Out of 3528 syllables in question, 3241 syllables (92%) did not have any labels, and these unambiguous annotations were analyzed in the current study.1 3.1.5 Data analysis The data analysis process for the English study is similar to that of experiment 1. After collecting data from the corpus and computing CV lags, the relationship between the CV lag and the sonority difference was analyzed and visualized using the tidyverse package (Wickham et al., 2019). Subsequent mixed-effects modeling was done using the lme4 (Bates et al., 2014) and 1When syllables regardless of label or certainty level were analyzed, similar results were shown but not presented here. 63 lmerTest (Kuznetsova et al., 2017) packages in R (R Core Team, 2017), where the CV lag was modeled as a function of the sonority difference, with participant and consonant duration as random intercepts. Word was chosen as another random effect in the comparison involving different types of stimuli.2 For smaller subgroups where stimuli were controlled for C or V and only had one varying V or C, Word was not considered as a random effect since the stimuli in the subset perfectly correlate with sonority difference.3. For instance, in the pairwise comparison of peak, pack, Word was not considered as a random effect since the potential random effect of Word is perfectly correlated with the fixed effect of vowel height. Since the subgroups have 5 or less than 5 stimuli, the decision also follows the “convention” in mixed-effect modeling that there should be at least 5 levels of a variable to be considered as a random effect (Gelman, 2007; Kéry and Royle, 2020; Harrison et al., 2018; Arnqvist, 2020; Harrison, 2015). 3.2 Stimuli There were 24 English stimuli, and each participant repeated them in 15 randomized lists with filler words between the blocks. This means that there were 15 repetitions of each stimulus. When organized in different ways, the subgroups of the stimuli can be used to test the two sub-hypotheses of the dissertation. I will first present subgroups of stimuli used to test the claim that for CV syllables with the same V, a less sonorous C leads to a larger CV lag. Then, I will present subgroups of the stimuli used to test the claim that for CV syllables with the same C, a more sonorous V leads to a larger CV lag. A summary of the English stimuli can be found at the end of this section. 3.2.1 Same V different C The groups of stimuli in this subsection were used to test the claim that for CV syllables with the same V, a less sonorous C leads to a larger CV lag. There were two types of consonants: bilabial (labial) or coronal. 
In each of the subsections, from the top to the bottom of each table of stimuli, CV gestural lag is expected to decrease since sonority difference decreases due to C sonority increases. 2The code is lmer(CV lag based on target onset∼Sonority difference+(1|Participant)+(1|Stimuli)+(1|C duration) 3The code is lmer(CV lag based on target onset∼Sonority difference+(1|Participant)+(1|C duration) 64 3.2.1.1 Same V different bilabial C Presented in this subsection are stimuli with the same vowel and different bilabial consonants. Since the vowel uses the tongue gesture and bilabial (labial) consonants use the lip as the primary articulator, words with bilabial consonants have been the top choice for speech production studies. As in Table 3.1, 3.2, and 3.3, high, mid, and low vowels are considered and the coda environment is controlled in each subgroup. The first column in each table has the index for each stimulus. As the reader can see in subsection 3.2.2, the stimuli are organized in a different way to test another sub-claim of the study. The stimulus tables for the Mandarin EMA study also have a similar index column. Index Stimuli C V C category V category C sonority V sonority Sonority diff peak 1 beak 2 3 meek 4 week i p i b m i w i bilabial bilabial bilabial bilabial high high high high 1 4 7 12 15 15 15 15 14 11 8 3 Table 3.1 Same high V different bilabial C. Index Stimuli C V C category V category C sonority V sonority Sonority diff pain 5 6 bane 7 main 8 wane e p b e m e w e bilabial bilabial bilabial bilabial mid mid mid mid 1 4 7 12 16 16 16 16 15 12 9 4 Table 3.2 Same mid V different bilabial C. Index Stimuli C V C category V category C sonority V sonority Sonority diff 9 back b æ bilabial 10 pack p æ bilabial m æ bilabial 11 Mac 12 whack w æ bilabial low low low low 4 1 7 12 17 17 17 17 13 16 10 5 Table 3.3 Same low V different bilabial C. 3.2.1.2 Same V different coronal C In the following Tables 3.4, 3.5, and 3.6, there were stimuli with the same vowel and varying coronal consonants in each subgroup of stimuli, and the vowels were high, mid, and low vowels 65 respectively. Index Stimuli C V C category V category C sonority V sonority Sonority diff 13 two sue 14 15 do 16 new t s d n u u u u coronal coronal coronal coronal high high high high 1 3 4 7 15 15 15 15 14 12 11 8 Table 3.4 Same high V different coronal C. Index Stimuli C V C category V category C sonority V sonority Sonority diff toe so 17 18 19 doe 20 know t s d n o o o o coronal coronal coronal coronal mid mid mid mid 1 3 4 7 16 16 16 16 15 13 12 9 Table 3.5 Same mid V different coronal C. Index Stimuli C V C category V category C sonority V sonority Sonority diff 21 talk sock 22 23 dock 24 knock t s d n A A A A coronal coronal coronal coronal low low low low 1 3 4 7 17 17 17 17 16 14 13 10 Table 3.6 Same low V different coronal C. 3.2.2 Same C different V The following stimuli groups have the same C and different V. The stimuli in the subsection are a rearrangement of the stimuli above for the purpose of testing another sub-claim. To control for the coda environment, only high and low vowel words are considered for the bilabial group, and only high and mid-vowel words are considered for the coronal group. For the bilabial stimuli in Table 3.7, in each subgroup, the high vowel stimuli should have a smaller lag than low vowel stimuli — since a high vowel is less sonorous than a low vowel, a high vowel also has a smaller sonority difference than a low vowel. 
Similarly, for the coronal subgroup in Table 3.8, the high vowel stimuli should have a smaller lag than the mid-vowel stimuli. 66 Index Stimuli C V C category V category C sonority V sonority Sonority diff i i 10 1 peak pack 2 beak 9 back 3 meek 11 Mac 4 week p bilabial p æ bilabial b bilabial b æ bilabial m i bilabial m æ bilabial bilabial w i 12 whack w æ bilabial high low high low high low high low 1 1 4 4 7 7 12 12 15 17 15 17 15 17 15 17 14 16 11 13 8 10 3 5 Table 3.7 Same bilabial C different V. Index Stimuli C V C category V category C sonority V sonority Sonority diff two 13 toe 17 sue 14 18 so 15 do 19 doe 16 new 20 know t t s s d d n n u o u o u o u o coronal coronal coronal coronal coronal coronal coronal coronal high mid high mid high mid high mid 1 1 3 3 4 4 7 7 15 16 15 16 15 16 15 16 14 15 12 13 11 12 8 9 Table 3.8 Same coronal C different V. 3.2.3 Summary of English experiment stimuli A summary of all English stimuli can be found in Table 3.9. The sonority index for the consonant and vowel can be found in the C sonority and V sonority columns. The sonority difference of each stimulus can be found in the last column. 67 Index Stimuli C V C category V category C sonority V sonority Sonority diff 1 peak bilabial i p 2 beak bilabial b i 3 meek bilabial m i 4 week bilabial w i 5 pain bilabial e p 6 bane bilabial b e 7 main bilabial m e 8 wane w e bilabial 9 back b æ bilabial 10 pack p æ bilabial m æ bilabial 11 Mac 12 whack w æ bilabial coronal t two 13 coronal s sue 14 coronal d 15 do coronal n 16 new coronal t toe 17 coronal s 18 so coronal d 19 doe coronal n 20 know talk coronal t 21 coronal s 22 sock coronal d 23 dock coronal n 24 knock u u u u o o o o A A A A high high high high mid mid mid mid low low low low high high high high mid mid mid mid low low low low 1 4 7 12 1 4 7 12 4 1 7 12 1 3 4 7 1 3 4 7 1 3 4 7 15 15 15 15 16 16 16 16 17 17 17 17 15 15 15 15 16 16 16 16 17 17 17 17 14 11 8 3 15 12 9 4 13 16 10 5 14 12 11 8 15 13 12 9 16 14 13 10 Table 3.9 English stimuli summary. Target Sounds Bilabial [p, b, m, w] Alveolar [t, d, n, s] Vowel Articulatory Sensor Gesture lower and upper lip tongue tip tongue dorsum lip aperture tongue tip tongue dorsum Table 3.10 Articulatory sensors and gestures for each type of sounds – English. 3.3 Results 3.3.1 Overall analysis The analysis of all the data based on target onset can be found in Figure 3.3 and the mixed effect model can be found in Table 3.11. There was a significant positive correlation as predicted. As mentioned in the previous chapter, while analysis with all the data and the use of a sonority 68 difference score as a predictor is easier to statistically model and has higher statistical power, it does have some issues. First, the analysis assumes that the sonority scale is linear and not just relative, which is contrary to most phonologists’ beliefs. Furthermore, it collapses across different articulators or gestures. For these reasons, I analyzed more nuanced sub-groups of the data. The results for the subsets can be found in the following sections. Figure 3.3 CV lag based on target onset for English participants. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 156.70 11.24 22.73 1.64 df 33.01 21.92 t value 6.89 6.86 Pr(>|t|) <0.00001 <0.00001 Table 3.11 Mixed effects model results for English participants. 
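Before turning to the subgroups, the sketch below spells out how the overall model reported in Table 3.11 can be fit, following the formulas given in footnotes 2 and 3 above. It is a minimal illustration with hypothetical object and column names (ema_data, cv_lag, onset_c, v_height, word, participant, c_duration), not the original analysis script; the sonority indices are those listed in Table 3.9.

```r
# Minimal sketch of the overall English EMA model (cf. Table 3.11 and footnote 2).
# Hypothetical data frame and column names; sonority indices as in Table 3.9.
library(lme4)
library(lmerTest)
library(ggplot2)

c_sonority <- c(p = 1, t = 1, s = 3, b = 4, d = 4, m = 7, n = 7, w = 12)
v_sonority <- c(high = 15, mid = 16, low = 17)

ema_data$son_diff <- v_sonority[as.character(ema_data$v_height)] -
                     c_sonority[as.character(ema_data$onset_c)]

# Full data set: Word included as a random intercept (footnote 2)
m_all <- lmer(cv_lag ~ son_diff +
                (1 | participant) + (1 | word) + (1 | c_duration),
              data = ema_data)
summary(m_all)

# Controlled subgroups: Word dropped, since word identity is perfectly
# correlated with sonority difference within the subset (footnote 3)
m_sub <- lmer(cv_lag ~ son_diff + (1 | participant) + (1 | c_duration),
              data = subset(ema_data, onset_c %in% c("p", "b", "m", "w") &
                                       v_height == "high"))

# A plot in the spirit of Figure 3.3
ggplot(ema_data, aes(x = son_diff, y = cv_lag)) +
  geom_jitter(width = 0.1, alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(x = "Sonority difference", y = "CV lag based on target onset (ms)")
```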
3.3.2 Results for claim 5a: the same vowel with different consonants I first tested the claim 5a that for the CV syllables with the same vowel and different consonants, a less sonorant consonant leads to larger CV lag. Results for stimuli with the same vowel and varying bilabial consonant will be presented before results for stimuli with the same vowel and varying coronal consonant. 69 3.3.2.1 Same V and varying bilabial C This subsection shows the result for subgroups of stimuli that have the same V and varying bilabial C. The results for bilabial consonants and high vowels are shown in Figure 3.4 and Table 3.12. There was a significant positive correlation between sonority difference and gestural lag for stimuli with different bilabial consonants and the same high vowel. Figure 3.4 CV lag based on target onset for English participants: bilabial C and high V. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 95.46 14.00 19.47 1.20 df 22.19 439.70 t value 4.90 11.71 Pr(>|t|) 0.00 < 0.00001 Table 3.12 Mixed effects model results for English participants: bilabial C and high V. As for stimuli with different bilabial C and the same mid V, there was also a significant positive correlation between gestural lag and sonority difference, as in Figure 3.5 and Table 3.13. 70 Figure 3.5 CV lag based on target onset for English participants: bilabial C and mid V. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 145.14 14.35 20.87 1.26 df 28.11 453.34 t value 6.96 11.42 Pr(>|t|) <0.00001 < 0.00001 Table 3.13 Mixed effects model results for English participants: bilabial C and mid V. When there were the same low vowel and different bilabial consonants, there also was a significant positive correlation as in Figure 3.6 and Table 3.14. The effect size was slightly lower for the stimuli with low vowel, as compared to the stimuli with high and mid vowels. Figure 3.6 CV lag based on target onset for English participants: bilabial C and low V. 71 Gestural Onset (Intercept) Sonority difference Estimate Std. Error 124.19 10.47 23.04 1.47 df 42.61 363.78 t value 5.39 7.11 Pr(>|t|) 0.000003 < 0.00001 Table 3.14 Mixed effects model results for English participants: bilabial C and low V. Previous studies have found that consonant manner and place could lead to gestural coordination variation (Bombien et al., 2013; Wright, 1996; Pouplier et al., 2022). Comparing the CV lag for bilabial stimuli differing in the voicing of C can control for the potential confounding factors such as manner, place, and jaw movement, since the pair has the same manner and place of articulation, as well as jaw movement. The results for the peak, beak comparison can be found in Figure 3.7 and Table 3.15. There was a positive correlation between CV lag and sonority difference. As in Appendix H, I also added C displacement as a fixed effect. There was no effect of C displacement, confirming that the jaw movement was controlled. Figure 3.7 CV lag based on target onset for English participants: peak, beak. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 162.74 8.92 64.22 4.87 df 203.01 200.26 t value Pr(>|t|) 2.53 1.83 0.01 0.07 Table 3.15 Mixed effects model results for English participants: peak, beak. The results for pain, bane can be found in Table 3.16 and Figure 3.8. There was a significant positive correlation between CV lag and sonority difference. 72 Figure 3.8 CV lag based on target onset for English participants: pain, bane. 
Gestural Onset (Intercept) Sonority difference Estimate Std. Error 171.58 12.81 69.45 5.01 df 207.31 186.32 t value Pr(>|t|) 2.47 2.56 0.01 0.01
Table 3.16 Mixed effects model results for English participants: pain, bane.
The results for the low vowel group pack, back can be found in Figure 3.9 and Table 3.17. There was a significant positive correlation between sonority difference and CV lag. Adding C displacement as a fixed effect showed that there was no clear relationship between CV lag and C displacement. This showed that when jaw movement or manner of articulation was controlled, there was still a significant positive correlation between CV lag and sonority difference.
Figure 3.9 CV lag based on target onset for English participants: pack, back.
Gestural Onset (Intercept) Sonority difference Estimate Std. Error 92.79 12.66 82.98 5.52 df 166.39 149.64 t value Pr(>|t|) 1.12 2.29 0.27 0.02
Table 3.17 Mixed effects model results for English participants: pack, back.
Below, I compared nasal and stop onsets with the same vowel and coda environment. A significant difference in this comparison strengthens the main claim because [m] and [b] differ in nasality while jaw movement, frontness, and voicing are controlled. As we can see in this section, a significant positive correlation was found for the bilabial nasal and stop pairs at each vowel height. As in Figure 3.10 and Table 3.18, for the same rime environment with high vowels, the bilabial stop had a significantly larger lag than the bilabial nasal. The major difference between the [m] and [b] articulations is the lowering or raising of the velum, and there is no obvious articulatory reason that velum movement by itself should cause gestural lag variation between the lips and the tongue. The finding suggests that jaw movement may not be a valid alternative account of the observed gestural lag variation, since in this pair the two segments [m] and [b] differ in nasality and involve similar degrees of jaw movement. Therefore, by comparing CV lag for the pair meek, beak, our claim is more strongly supported.
Figure 3.10 CV lag based on target onset for English participants: meek, beak.
Gestural Onset (Intercept) Sonority difference Estimate Std. Error 19.11 21.84 49.92 4.98 df 231.70 233.55 t value Pr(>|t|) 0.38 4.38 0.70 0.00002
Table 3.18 Mixed effects model results for English participants: meek, beak.
As in Figure 3.11 and Table 3.19, for the stimulus pairs with mid vowels, the bilabial stop had a larger lag than the bilabial nasal. This supports the main claim tested in the dissertation by confirming that the results are unlikely to be confounded by voicing or jaw movement, since the pair controlled for both.
Figure 3.11 CV lag based on target onset for English participants: main, bane.
Gestural Onset (Intercept) Sonority difference Estimate Std. Error 62.26 21.59 57.34 5.21 df 220.29 204.31 t value Pr(>|t|) 1.09 4.14 0.28 0.0001
Table 3.19 Mixed effects model results for English participants: main, bane.
The results for low vowels with bilabial onsets can be found in Figure 3.12 and Table 3.20. A significant correlation between sonority difference and gestural lag can be found in this subgroup as well, showing that the result was not confounded by jaw movement or the voicing of consonants.
Figure 3.12 CV lag based on target onset for English participants: Mac, back.
Gestural Onset (Intercept) Sonority difference Estimate Std.
Error 106.32 11.20 67.63 5.65 df 216.79 213.27 t value Pr(>|t|) 1.57 1.98 0.12 0.05 Table 3.20 Mixed effects model results for English participants: Mac, back. 3.3.2.2 Same V and varying coronal C In this subsection, stimulus subgroups with varying coronal C and the same V were analyzed. Figure 3.13 and Table 3.21 showed the results for stimuli with varying coronal consonants and high vowel, that there was a significant positive correlation between sonority difference and CV gestural lag. Figure 3.13 CV lag based on target onset for English participants: coronal C and high V. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 155.68 8.02 32.88 2.57 df 144.16 506.29 t value Pr(>|t|) 0.00001 0.002 4.74 3.12 Table 3.21 Mixed effects model results for English participants: coronal C and high V. For the varying coronal consonants and mid vowel group, the expected positive correlation was found as in the plot in Figure 3.14 and Table 3.22. 76 Figure 3.14 CV lag based on target onset for English participants: coronal C and mid V. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 191.71 9.40 29.69 2.11 df 142.13 446.85 t value 6.46 4.46 Pr(>|t|) 0.00 < 0.0001 Table 3.22 Mixed effects model results for English participants: coronal C and mid V. The varying coronal consonants and low vowel group also exhibited significant positive correlation as in Figure 3.15 and Table 3.23. Note that the effect size for the coronal C subgroups was generally smaller than that of the bilabial C subgroups. This could be due to the fact that coronal consonants and vowels all used tongue as the main articulator, and therefore separating the gestures became less straightforward. Figure 3.15 CV lag based on target onset for English participants: coronal C and low V. 77 Gestural Onset (Intercept) Sonority difference Estimate Std. Error 174.68 7.43 27.62 1.82 df 146.50 524.96 t value 6.33 4.08 Pr(>|t|) 0.00 < 0.0001 Table 3.23 Mixed effects model results for English participants: coronal C and low V. I also compared the stimuli with voiceless and voiced coronal C, to control for jaw movement and manner of articulation. C duration was not used as a random intercept since voiced C and voiceless C differ in duration (Denes, 1955). In other words, since the durational difference is correlated to C voicing and I am testing the effect of voicing, adding C duration as a random intercept would counteract the pattern related to the target effect. There was no significant positive correlation found for the two, do pair, as in Figure 3.16 and Table 3.24. Figure 3.16 CV lag based on target onset for English participants: two, do. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 341.11 -5.64 72.43 5.64 df 249.13 246.90 t value 4.71 -1.00 Pr(>|t|) <0.00001 0.32 Table 3.24 Mixed effects model results for English participants: two, do. For the toe, doe comparison, there was an insignificant positive correlation between sonority difference and CV lag as in Table 3.25 and Figure 3.17. 78 Figure 3.17 CV lag based on target onset for English participants: toe, doe. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 228.02 6.47 55.96 3.99 df 268.06 272.97 t value Pr(>|t|) 0.0001 0.11 4.08 1.62 Table 3.25 Mixed effects model results for English participants: toe, doe. Also, there was no clear relationship observed between sonority and lag found for talk, dock, as in Figure 3.18 and Table 3.26. 
In general, there was no clear relationship observed for coronal C stimulus pairs that differ in voicing. Figure 3.18 CV lag based on target onset for English participants: talk, dock. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 296.47 -0.23 57.11 3.72 df 233.89 275.03 t value 5.19 -0.06 Pr(>|t|) <0.00001 0.95 Table 3.26 Mixed effects model results for English participants: talk, dock. 79 Stimulus pairs of coronal nasal and stop with the same vowel were analyzed. As mentioned before, this is to control the jaw movement and voicing. The results show that when the nasal and stop pair is with a high, mid, or low vowel, significant correlations were found. The high vowel pair comparison can be found in Figure 3.19 and Table 3.27, and the high vowel with coronal stop had a significantly larger lag than that with coronal nasal. Figure 3.19 CV lag based on target onset for English participants: do, new. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 18.06 23.74 51.89 5.26 df 248.12 240.84 t value Pr(>|t|) 0.35 4.52 0.73 0.00001 Table 3.27 Mixed effects model results for English participants: do, new. Figure 3.19 and Table 3.28 show that the coronal stop with mid vowel had a larger lag than the coronal nasal with the same mid vowel. Figure 3.20 CV lag based on target onset for English participants: doe, know. 80 Gestural Onset (Intercept) Sonority difference Estimate Std. Error 165.83 11.83 50.11 4.49 df 230.31 244.86 t value Pr(>|t|) 0.001 0.01 3.31 2.64 Table 3.28 Mixed effects model results for English participants: doe, know. The syllable that starts with a coronal stop and ends with a low vowel had a significantly larger lag than that with a coronal nasal as in Figure 3.21 and Table 3.29. Figure 3.21 CV lag based on target onset for English participants: dock, knock. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 68.00 17.58 44.00 3.59 df 226.27 269.80 t value 1.55 4.90 Pr(>|t|) 0.12 0.000002 Table 3.29 Mixed effects model results for English participants: dock, knock. 3.3.3 Results for claim 5b: the same C and different V 3.3.3.1 Same bilabial C and different V For the syllables with the same bilabial C [p] and different vowel, there was a positive correlation between gestural lag and sonority difference as in Figure 3.22. However, the positive correlation was not statistically significant as in Table 3.30. 81 Figure 3.22 CV lag based on target onset for English participants: peak, pack. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 239.06 3.20 119.20 7.86 df 154.58 143.32 t value Pr(>|t|) 2.01 0.41 0.05 0.69 Table 3.30 Mixed effects model results for English participants: peak, pack. The same bilabial consonant [b] with a low vowel had a shorter lag than those with a high vowel, but this difference was not statistically significant, as in Table 3.31 and Figure 3.23. Figure 3.23 CV lag based on target onset for English participants: b and different V. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 267.69 -0.22 92.36 7.46 df 171.80 156.97 t value Pr(>|t|) 2.90 -0.03 0.00 0.98 Table 3.31 Mixed effects model results for English participants: b and different V. 82 Words that have onset [m] and a rime of a low vowel had a larger lag than that with a high vowel. However, the difference was not significant. Figure 3.24 CV lag based on target onset for English participants: meek, Mac. Gestural Onset (Intercept) Sonority difference Estimate Std. 
Error 134.53 7.61 70.24 7.65 df 260.07 246.81 t value Pr(>|t|) 1.92 1.00 0.06 0.32 Table 3.32 Mixed effects model results for English participants: m and different V. Words with bilabial consonant [w] and a low vowel had a significantly larger lag than those with a high vowel, as in Figure 3.25 and Table 3.33. The reason why [w] and different vowels exhibited a difference but other bilabial consonants did not show the expected pattern is unclear. It is possible that the vowel and the coda consonant in each word all involved tongue movement, so the tongue movement from the following coda consonant may affect the preceding vowel. Therefore, there were no significant patterns in most bilabial consonants. Syllables with bilabial consonant [w] and vowels did show that sonority difference is positively correlated to gestural lag. It may be that the observation was confounded by the fact that [w] also involved tongue movement. The tongue movement in [w] also affected the tongue movement of vowels, which surfaced as a significant observation. 83 Figure 3.25 CV lag based on target onset for English participants: w and different V. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 80.66 18.23 33.96 6.80 df 57.56 250.31 t value Pr(>|t|) 2.38 2.68 0.02 0.01 Table 3.33 Mixed effects model results for English participants: w and different V. 3.3.3.2 Same coronal C and different V I look at results for the same coronal C and different V. For stimuli with [t] and different V, there was a positive correlation between CV lag and sonority difference as in Figure 3.26 and Table 3.34. The effect sizes in this subsection are bigger than those in other analyses. The bigger effect size may not indicate that coronal C stimuli had a more significant correlation. Rather, it may come from some measuring confounds since C and V used the same articulator tongue. Figure 3.26 CV lag based on target onset for English participants: t and different V. 84 Gestural Onset (Intercept) Sonority difference Estimate Std. Error -604.05 61.90 202.26 13.86 df 259.52 256.03 t value Pr(>|t|) -2.99 0.003 0.00001 4.47 Table 3.34 Mixed effects model results for English participants: t and different V. For stimuli with [d] and different vowels, there was a positive correlation between CV lag and sonority difference, as in Figure 3.27 and Table 3.35. Figure 3.27 CV lag based on target onset for English participants: d and different V. Gestural Onset (Intercept) Sonority difference Estimate Std. Error -23.95 27.51 175.16 15.17 df 264.96 262.21 t value Pr(>|t|) -0.14 1.81 0.89 0.07 Table 3.35 Mixed effects model results for English participants: d and different V. For stimuli with [s] and different vowels, there was a significant positive correlation between CV lag and sonority difference, as in Figure 3.28 and Table 3.36. Figure 3.28 CV lag based on target onset for English participants: s and different V. 85 Gestural Onset (Intercept) Sonority difference Estimate Std. Error -678.84 76.03 177.35 14.13 df 257.04 253.25 t value -3.83 5.38 Pr(>|t|) 0.0002 0.0000002 Table 3.36 Mixed effects model results for English participants: s and different V. There was also a significant positive correlation found for stimuli with [n] and different vowels, as in Figure 3.29 and Table 3.37. Figure 3.29 CV lag based on target onset for English participants: n and different V. Gestural Onset (Intercept) Sonority difference Estimate Std. 
Error -278.23 60.78 120.45 14.00 df 251.69 242.75 t value Pr(>|t|) -2.31 4.34 0.02 0.00002 Table 3.37 Mixed effects model results for English participants: n and different V. 3.3.4 The vowel displacement and C displacement analysis As justified by experiment 1, it is logically possible that our results are somehow artifactual and based on the relation observed by Shaw and Chen (2019) — that CV lag based on gestural onsets is negatively correlated with the displacement of the vowel from gesture onset to the achievement of the target. Furthermore, as mentioned before, jaw movement correlates with variation in gestural coordination (Gracco, 1994; Gracco and Lofqvist, 1994; Mooshammer et al., 2003; Redford, 1999; MacNeilage and Davis, 2000). To check for this possibility that the significant correlation is not due to consonant or vowel displacement, I ran another analysis wherein I added random intercepts which are vowel 86 displacement and consonant displacement. The vowel displacement is the horizontal distance from vowel gesture onset to target achievement. For measuring consonant displacements, I subtracted the gesture onset value from the target onset value. Bilabial consonant displacement was measured by lip aperture displacement difference between gesture onset and target onset. Coronal consonant displacement was measured by the y-axis (vertical) distance between gesture onset and target onset. This additional analysis involved all the stimuli. I chose the whole dataset as it was the largest stimulus set and therefore the analysis would suffer the least in terms of statistical power from the addition of a post-hoc variable. Note that both vowel displacement and consonant displacement could be an estimate of jaw movement. Therefore, the post-hoc analysis also serves as another exploration of the potential effect of jaw movement on gestural coordination. Besides the previous random intercepts of participant, word, consonant duration, there are also two more random intercepts of vowel displacement and consonant displacement.4 Since the C displacement is measured differently for stimuli with bilabial C and stimuli with coronal C, there were separate analyses conducted, one for the bilabial C stimuli as in Table 3.38, one for the coronal C stimuli as in Table 3.39. In both cases, there were significant positive correlations between sonority difference and CV gestural lag. I also tested the C displacement and V displacement as fixed effects. As in Appendix G, vowel displacement in fact contributes to the CV lag variation in both positive and negative directions, and the effect of sonority difference was still there. Specifically, for bilabial C stimuli, the effect size for sonority difference was 14.07 with V displacement as a fixed effect. For coronal C stimuli, the effect size for sonority difference was 6.74 with V displacement as a fixed effect. It is likely that C displacement does not have a significant effect on CV lag in the dataset tested here. 4The R formula is shown here: CV lag based on target onset∼Sonority difference+(1|Participant)+(1|Word)+(1|C duration)+(1|V displacement)+(1|C displacement). 87 Gestural Onset (Intercept) Sonority difference Estimate Std. Error 120.97 12.74 29.37 2.46 df 13.98 10.11 t value Pr(>|t|) 0.001 0.0004 4.12 5.17 Table 3.38 Mixed effects model results for all bilabial C stimuli. Random intercepts: participant, word, consonant duration, vowel displacement, consonant displacement. Gestural Onset (Intercept) Sonority difference Estimate Std. 
Error 182.46 8.31 44.14 3.39 df 11.80 10.00 t value Pr(>|t|) 4.13 2.45 0.00 0.03 Table 3.39 Mixed effects model results for all coronal C stimuli. Random intercepts: participant, word, consonant duration, vowel displacement, consonant displacement. 3.3.5 Summary The summary of the results can be found in the following Table 3.40 and Table 3.41. In Table 3.40, all the subgroups showed that there is a significant positive correlation between CV lag and sonority difference. The bilabial group had a slightly larger effect size than the coronal group. Dataset (English EMA data) All English EMA data Bilabial C All bilabial C data, C and V displacement High V (week, meek, beak, peak) Mid V (wane, main, bane, pain) Low V (whack, Mac, back, pack) Coronal C All coronal C data, C and V displacement High V (new, do, sue, two) Mid V (know, doe, so, toe) Low V (knock, dock, sock, talk) Estimate (sonority difference) 11.24 *** 12.74 *** 14.00 *** 14.35 *** 10.47 *** 8.31 * 8.02 ** 9.40 *** 7.43 *** Table 3.40 Summary of English EMA results. *** means that p ⩽ 0.001; ** means that p ⩽ 0.01; * means that p ⩽ 0.05. In Table 3.41, the pairs differ in nasality for both bilabial C and coronal C stimuli exhibited significant positive correlations between CV lag and sonority difference. The coronal C stimuli with differences in C voicing did not show the expected pattern. Also, the bilabial C stimuli with differences in vowel height did not consistently show the expected pattern. The non-significant result is potentially due to the same V measure for different vowel heights, as well as the imprecise C voicing coding. 88 Pairwise comparison Stimulus pair Estimate (sonority difference) Bilabial C Nasality differ Voicing differ Vowel height differ Coronal C Nasality differ Voicing differ Vowel height differ Mac, back meek, beak main, bane beak, peak bane, pain back, pack peak, pack beak, back meek, Mac week, whack new, do know, doe knock, dock two, do toe, doe talk, dock two, toe do, doe sue, so new, know 11.20 * 21.84 *** 21.59 *** 8.92 12.81 ** 12.66 * 3.2 -0.22 7.61 18.23 ** 23.74 *** 11.83 ** 17.58 *** -5.64 6.47 -0.23 61.90 *** 27.51 76.03 *** 60.78 *** Table 3.41 Summary of English EMA pairwise comparison. *** means that p ⩽ 0.001; ** means that p ⩽ 0.01; * means that p ⩽ 0.05. 3.4 Conclusion Overall the English experiment showed the expected pattern, on both bilabial and coronal consonants combined with different vowel heights. I argued that the observed correlation cannot simply be attributed to vowel quality or the effect of vowel displacement. Groups of stimuli with the same vowel quality showed a significant positive correlation. Therefore, vowel quality is unlikely to be the driver of observed CV lag variation. Additionally, in a post-hoc analysis, I showed there were still positive correlation between sonority difference and CV lag observed for both stimuli with bilabial C and stimuli with coronal C. My results also suggest that jaw movement is unlikely to be an important factor driving the observed correlation between sonority and gestural lag variation. First, using consonant displacement as an approximation of jaw movement, I showed that the observed positive correlation is not confounded by jaw movement. Second, for the pairwise comparisons controlled 89 for jaw movement, I used post-hoc analyses to confirm that consonant displacement was indeed controlled. Relatedly, the comparison using vowel displacement (which is also an approximation of jaw movement) makes a similar point. 
The positive correlation between sonority difference and CV lag is still exhibited when including vowel displacement. Based on the above analysis, jaw movement is unlikely to be the factor leading to the observed relationship between CV lag and sonority difference. However, for some pairs the observation was not significant, though still in the expected direction. Here are some potential reasons why the expected pattern is not shown – that the lag could be affected by the coda environment and that the vowel sensor is not precise. First, the coda consonant [k] may affect the articulation of the preceding vowel, and this may be the reason why the three pairs – peak, pack; beak, back; meek, Mac – did not show the expected significant correlation. The question remain why week, whack exhibited the expected pattern. It may be that the significant effect is confounded by the velar or tongue movement in [w]. Similarly, in the pair main, bane, the vowel may be nasalized due to the following coda nasal. Since [n] uses the tongue and nasality is more sonorous, the CV timing may be affected. Furthermore, the consistent tongue sensor is used to measure vowels of different heights, so maybe for this reason the pair do, doe did not exhibit expected patterns. It might be that some characteristic of the [d] articulation made the pair more sensitive to the consistent vowel sensor measure. 90 CHAPTER 4 EXPERIMENT 3: MANDARIN EMA STUDY Both the English corpus study and the English EMA study showed a significant positive correlation between sonority difference and CV lag. The relationship between sonority sequence and CV lag has been found for stimuli with coronal C or bilabial C, combined with high, mid, or low vowels in English. As mentioned earlier, Crouch (2022) and Crouch et al. (2023) observed that sonority difference positively correlates to CC onset lag in Georgian, and they argued it is due to language-specific mechanism in Georgian. However, as shown in the previous two chapters, a positive correlation has also been found in English. Since the main claim of the study is intended to be language- independent, a further question would be whether CV lag in languages other than English still exhibits this correlation between sonority and CV lag. To provide a cross-linguistic perspective, I conducted an EMA study on Mandarin. 4.1 Methods The Mandarin experiments shared the same experimental procedure despite the differences discussed below. The first difference in terms of method is the language used when communicating with participants. For the Mandarin experiment, the communications with participants include recruitment messages (see Appendix C), pre-screening surveys (see Appendix D), and communications during experimental sessions. All the spoken communications were in Mandarin Chinese and all the written communications were in simplified Chinese. The second difference is that Mandarin participants were recruited through WeChat, the primary social media platform among Chinese people. Thirdly, the Mandarin experiment had a carrier phrase. Since the English experiments without carrier phrases sometimes had unreasonably large gestures extended into the pause, in the later conducted Mandarin experiments, the carrier phrase zhe4 ge4 __ mo [úù@ k@ __ m@] ‘this __’ was used. The carrier phrase was chosen because before the target word, there is a schwa, which is the neutral position. 
Another reason for choosing the carrier phrase is that after the vowel of the target word CV, there is a bilabial consonant, which uses a 91 different articulator (i.e., lips) than the vowel (i.e., tongue). The carrier phrase did seem to serve this purpose because there were fewer gestures that were annotated with uncertainty — English 8% uncertain labels, Mandarin 4% uncertain labels. Lastly, the uncertainty labels in Mandarin were slightly different than those in the English experiment.1 Even though most labels were the same, in Mandarin there were new labels such as "NaLamispron", which means that either [l] is pronounced as [n], or [n] is pronounced as [l]. Some participants also told the experimenters during the sessions that they could not distinguish between [n] and [l]. This is evidence of the merger-in-progress of word-initial lateral [l] and [n], which occurred in many Chinese languages such as Nanjing Mandarin, Chengdu, Southwestern Mandarin, languages in Southern China (Shi, 2015; Johnson and Song, 2016; Zhang and Levis, 2021; Cheng et al., 2023). This merger of [n] and [l] has been observed in both production and perception of Chinese languages (Cheng et al., 2023). Of the 10 participants analyzed, 2 from Jiangsu, which belongs to Southern China, had the [n-l] merger. Altogether data from 20 Mandarin participants were collected, and those from 10 were annotated and analyzed in the current study. The data were annotated in the reverse order of data collection, which means that the last 10 participants’ data were analyzed. The data of the first 10 participants is not considered in the current dissertation due to time constraints. Altogether there were 4004 annotated Mandarin syllables, and 3849 of them (96.1%) were not marked with any uncertainty labels. The results of these unambiguous annotations can be found in the Results section. Among the 10 participants of the EMA Mandarin experiments, 9 participants were female and 1 was male. The 10 participants aged from 23 to 56 years old, with an average age of 33.8 years old. 4.2 Stimuli There were 27 Mandarin stimuli, and each participant repeated them in 15 randomized lists with filler words between the blocks. This means that there were 15 repetitions of each stimulus. Just like in the English experiment, when organized in different ways, the subgroups of the stimuli can be used to test the two sub-hypotheses of the dissertation. I will first present subgroups of stimuli 1See the Mandarin annotation labels in Appendix F. 92 used to test the claim that for CV syllables with the same V, a less sonorous C leads to a larger CV lag. Then, I will present subgroups of the stimuli used to test the claim that for CV syllables with the same C, a more sonorous V leads to a larger CV lag. Similarly, a summary of the Mandarin stimuli can be found at the end of this section. Similar to the English EMA experiment, the first column in each stimulus table below has the index for each stimulus.2 4.2.1 Same V different C The stimuli in this subsection were used to test the claim that for a CV syllable, the larger the sonority difference, the larger the CV lag. In each of the subsections, from the top to the bottom of each table of stimuli, CV gestural lag decreases since sonority difference decreases due to C sonority increases. 4.2.1.1 Same V different bilabial C There are two sets of stimuli for bilabial consonants in order to involve more variation of bilabial consonants. 
For the low vowel stimuli, for instance, there were low vowel stimuli with nasalized vowels and non-nasalized vowels. Index Word Pinyin T C V C cat 1 僻 pi 2 臂 bi 3 秘 mi i p 4 4 i b 4 m i bilabial bilabial bilabial V cat Gloss distant high arm high secret high C son V son S dif 15 1 15 4 15 7 14 11 8 Table 4.1 Same high V different bilabial C. C cat means C category, and V cat means V category. T stands for tone. This is also true for other Mandarin stimuli tables. Index Word Pinyin T C V C cat 4 帕 pa 5 坝 ba 6 骂 ma 7 袜 wa a p 4 a b 4 4 m a 4 w a bilabial bilabial bilabial bilabial V cat Gloss low low low low handkerchief dam scold sock C son V son S dif 17 1 17 4 17 7 17 12 16 13 10 5 Table 4.2 Same low V different bilabial C. 2Similar to the English experiment, I annotate the obstruents with indexes assuming true voicing distinction. A more careful study in the future may consider the realization of voicing and code accordingly. 93 Index Word Pinyin T C V C cat V cat Gloss C son V son S dif 8 配 pei 9 贝 bei 10 妹 mei 11 味 wei p 4 @ 4 b @ 4 m @ 4 w @ bilabial mid bilabial mid bilabial mid bilabial mid match shell sister flavor 1 4 7 12 14 14 14 14 13 10 7 2 Table 4.3 Same mid-V different bilabial C. Index Word Pinyin T C V C cat 13 盼 pan 14 半 ban 15 曼 man 16 万 wan p æ bilabial 4 b æ bilabial 4 4 m æ bilabial 4 w æ bilabial V cat Gloss hope low half low grace low ten thousand low C son V son S dif 17 1 17 4 17 7 17 12 16 13 10 5 Table 4.4 Same low nasalized V different bilabial C. 4.2.1.2 Same V different coronal C This subsection has stimuli of the same V and different coronal C. Index Word Pinyin T C V C cat 18 兔 tu 19 素 su 20 度 du 21 怒 nu 22 路 lu 4 4 4 4 4 t s d n l u u u u u coronal coronal coronal coronal coronal V cat Gloss rabbit high plain high degree high anger high road high C son V son S dif 15 1 15 3 15 4 15 7 15 9 14 12 11 8 6 Table 4.5 Same high V different coronal C. Index Word Pinyin T C V C cat 23 踏 ta 24 飒 sa 25 大 da 26 那 na 27 腊 la 4 4 4 4 4 t s d n l a a a a a coronal coronal coronal coronal coronal V cat Gloss C son V son S dif low low low low low step cool big that wax 17 17 17 17 17 16 14 13 10 8 1 3 4 7 9 Table 4.6 Same low V different coronal C. 4.2.2 Same C different V For this subsection, the stimuli are organized in another way to test another sub-claim. In each of the subsections, the high vowel group should have a smaller lag than the mid vowel group, and the mid vowel group should have a smaller lag than the low vowel group - since high vowel is less 94 sonorous than mid vowel than low vowel, high vowel also has smaller sonority difference than mid vowel than low vowel. For a same labial C, the syllable with lower vowel is predicted to have larger CV gestural lag since its sonority difference is larger. All the syllable pairs in question share the same coda environment. Index Word Pinyin T C V C cat 1 僻 pi 4 帕 pa bi 2 臂 ba 5 坝 mi 3 秘 ma 6 骂 4 p i 4 p a 4 b i a b 4 4 m i 4 m a bilabial bilabial bilabial bilabial bilabial bilabial V cat Gloss distant high handkerchief low arm high dam low secret high scold low C son V son S dif 15 1 17 1 15 4 17 4 15 7 17 7 14 16 11 13 8 10 Table 4.7 Same labial C different V. 
Index Word Pinyin T C V C cat tu 18 兔 23 踏 ta 19 素 su 24 飒 sa du 20 度 da 25 大 nu 21 怒 na 26 那 lu 22 路 la 27 腊 4 4 4 4 4 4 4 4 4 4 t t s s d d n n l l u a u a u a u a u a coronal coronal coronal coronal coronal coronal coronal coronal coronal coronal V cat Gloss rabbit high step low plain high cool low degree high big low anger high that low road high wax low C son V son S dif 15 1 17 1 15 3 17 3 15 4 17 4 15 7 17 7 15 9 17 9 14 16 12 14 11 13 8 10 6 8 Table 4.8 Same coronal C different V. 4.2.3 Summary of Mandarin experiment stimuli A summary of all Mandarin stimuli can be found in Table 4.9. The consonant and vowel categories can be found in C cat and V cat columns. The sonority index for the consonant and vowel can be found in the C son and V son columns. The sonority difference of each stimulus can be found in the last column S diff. 95 Index Word Pinyin T C V C cat 1 僻 2 臂 3 秘 4 帕 5 坝 6 骂 7 袜 8 配 9 贝 10 妹 11 味 12 肺 13 盼 14 半 15 曼 16 万 17 饭 18 兔 19 素 20 度 21 怒 22 路 23 踏 24 飒 25 大 26 那 27 腊 pi bi mi pa ba ma wa pei bei mei wei fei pan ban man wan fan tu su du nu lu ta sa da na la i p 4 4 i b 4 m i a p 4 4 a b 4 m a 4 w a p 4 @ 4 b @ 4 m @ 4 w @ f 4 @ p æ bilabial 4 b æ bilabial 4 4 m æ bilabial 4 w æ bilabial 4 4 4 4 4 4 4 4 4 4 4 V cat Gloss distant high bilabial arm high bilabial secret high bilabial napkin low bilabial dam low bilabial scold low bilabial sock low bilabial match bilabial mid shell bilabial mid sister bilabial mid flavor bilabial mid lung mid labial hope low half low grace low ten thousand low meal low rabbit high plain high degree high anger high road high step low cool low big low that low wax low æ labial u u u u u a a a a a coronal coronal coronal coronal coronal coronal coronal coronal coronal coronal f t s d n l t s d n l C son V son S dif 15 1 15 4 15 7 17 1 17 4 17 7 17 12 14 1 14 4 14 7 14 12 14 3 17 1 17 4 17 7 17 12 17 3 15 1 15 3 15 4 15 7 15 9 17 1 17 3 17 4 17 7 17 9 14 11 8 16 13 10 5 13 10 7 2 11 16 13 10 5 14 14 12 11 8 6 16 14 13 10 8 Table 4.9 Summary of Mandarin stimuli. The consonant and vowel categories can be found in C cat and V cat columns. The sonority index for the consonant and vowel can be found in the C son and V son columns. The sonority difference of each stimuli can be found in the last column S diff. Target Sounds Bilabial [p, b, m, w] Labial-dental [f] Alveolar [t, d, n, s, l] Vowel Articulatory Sensor Gesture lower and upper lip lower lip tongue tongue lip aperture lower lip tongue tip tongue dorsum Table 4.10 Articulatory sensors and gestures for each type of sounds – Mandarin. 96 4.3 Results 4.3.1 Overall analysis For all the Mandarin data in the study, there was a positive correlation between sonority difference and gestural lag, as in the plot in Figure 4.1 and Table 4.1. Following the reasoning in the English experiment, results for analyzing the subgroups of the stimuli can be found in the following subsections. Figure 4.1 CV lag based on target onset for Mandarin participants. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 120.24 7.11 24.51 2.03 df 30.74 25.00 t value Pr(>|t|) 0.00003 0.002 4.91 3.51 Table 4.11 Mixed effects model results for Mandarin participants. 4.3.2 Results for claim 5a: the same V and different C I first show results for the claim 5a that for the same V and different C, a less sonorous C leads to a larger CV lag. I will first show results for the bilabial C subgroup, then I will present the results for the coronal C subgroup. 
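Before turning to those subgroup results, the modeling setup itself can be summarized in a small R sketch. This is not the actual analysis script: the data frame mandarin_ema and the column names (cv_lag, son_diff, participant, word, c_dur) are hypothetical placeholders, and the sonority indices are simply the ones listed in the C son and V son columns of the stimulus tables above. The model formula follows the description of the analysis (sonority difference as the fixed effect; participant, word, and C duration as random intercepts), with lmerTest assumed for the degrees of freedom and p-values reported in the tables.

```r
## Minimal sketch (assumed packages and placeholder names, not the analysis script)
library(lmerTest)   # lmer() with Satterthwaite df and p-values

## Sonority indices as listed in the C son / V son columns of Table 4.9
c_son <- c(p = 1, t = 1, s = 3, f = 3, b = 4, d = 4, m = 7, n = 7, l = 9, w = 12)
v_son <- c(i = 15, u = 15, "@" = 14, a = 17)   # low [a]/[ae] share the index 17

## Sonority difference = V sonority - C sonority, e.g. for tone-4 "mi"
v_son["i"] - c_son["m"]   # 15 - 7 = 8, matching the S dif column in Table 4.1

## Overall mixed effects model corresponding to Table 4.11:
# m <- lmer(cv_lag ~ son_diff + (1 | participant) + (1 | word) + (1 | c_dur),
#           data = mandarin_ema)   # mandarin_ema is a hypothetical data frame
# summary(m)                       # estimate, SE, df, t value, Pr(>|t|)
```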
97 4.3.2.1 Same V and varying bilabial C The analyses in this subgroup have the same vowel and varying bilabial consonants. Figure 4.2 and Table 4.12 showed that different bilabial consonants with the same high vowel in Mandarin had a significant positive correlation between CV lag and sonority difference. Figure 4.2 CV lag based on target onset for Mandarin participants: bilabial C and high V. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 57.43 10.23 19.19 1.48 df 103.69 419.31 t value 2.99 6.90 Pr(>|t|) 0.00 < 0.00001 Table 4.12 Mixed effects model results for Mandarin participants: bilabial C and high V. For Mandarin tone 4 stimuli with bilabial C and mid V, there was a significant positive correlation between sonority difference and gestural lag as in Figure 4.3 and Table 4.13. 98 Figure 4.3 CV lag based on target onset for Mandarin participants: bilabial C and mid V. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 174.17 5.55 14.32 0.88 df 16.50 575.69 t value 12.17 6.28 Pr(>|t|) 0.00 < 0.00001 Table 4.13 Mixed effects model results for Mandarin participants: bilabial C and mid V. When only low vowel stimuli without coda nasal were analyzed, there was a significant positive correlation as in Figure 4.4 and Table 4.14. Figure 4.4 CV lag based on target onset for Mandarin participants: bilabial C and low V. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 123.98 5.73 15.89 1.17 df 71.23 542.16 t value 7.80 4.91 Pr(>|t|) <0.00001 <0.00001 Table 4.14 Mixed effects model results for Mandarin participants: bilabial C and low V. 99 When only low vowel stimuli with coda nasal were analyzed, there was a significant positive correlation as in Figure 4.5 and Table 4.15. Figure 4.5 CV lag based on target onset for Mandarin participants: bilabial C and low V, with coda nasal. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 107.92 5.61 15.26 1.10 df 61.94 543.82 t value 7.07 5.11 Pr(>|t|) 0.00 < 0.00001 Table 4.15 Mixed effects model results for Mandarin participants: Bilabial C and low V, with coda nasal. As mentioned in the corpus English experiment which is experiment 1, there is no consensus on whether voiceless stops are more sonorous than voiced stops. To resolve this potential ambiguity and to control for jaw movement, the Mandarin bilabial nasals and stops were compared.3 If for the same vowel, the syllable beginning with a stop has a larger gestural lag, then the main claim of the dissertation will be supported. For the same high vowel, the syllable with the bilabial stop had a significant larger lag than the syllable with the bilabial nasal, as in Figure 4.6 and Table 4.16. 3The pairwise comparison of two stimuli differ in voicing can be found in Appendix I. 100 Figure 4.6 CV lag based on target onset for Mandarin participants: bi4, mi4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 58.84 10.18 25.53 2.46 df 176.97 276.23 t value Pr(>|t|) 2.31 4.14 0.02 0.00005 Table 4.16 Mixed effects model results for Mandarin participants: bi4, mi4. For the same mid vowel, there was also a significant positive correlation between gestural lag and sonority difference as in Figure 4.7 and Table 4.17. Figure 4.7 CV lag based on target onset for Mandarin participants: bei4, mei4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 58.96 17.62 30.26 3.18 df 146.55 286.01 t value 1.95 5.55 Pr(>|t|) 0.05 < 0.00001 Table 4.17 Mixed effects model results for Mandarin participants: bei4, mei4. 
101 For the same low vowel, syllables with bilabial stops had a significantly larger lag than the one with bilabial nasals as in Figure 4.8 and Table 4.18. Figure 4.8 CV lag based on target onset for Mandarin participants: ba4, ma4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error -32.56 17.93 50.42 4.29 df 284.01 277.47 t value Pr(>|t|) -0.65 4.18 0.52 0.00004 Table 4.18 Mixed effects model results for Mandarin participants: ba4, ma4. For the syllable with low vowel and a coda nasal, syllables with bilabial nasals had larger lags than those with bilabial stops as in Figure 4.9 and Table 4.19. Figure 4.9 CV lag based on target onset for Mandarin participants: ban4, man4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error -80.27 20.54 49.06 4.09 df 244.41 264.72 t value -1.64 5.02 Pr(>|t|) 0.10 < 0.00001 Table 4.19 Mixed effects model results for Mandarin participants: ban4, man4. 102 4.3.2.2 Same V and varying coronal C Below are the results of analyzing coronal C and the same V. For Mandarin tone 4 stimuli with coronal C and high V, there was a significant positive correlation between gestural lag and sonority difference as in Figure 4.10 and Table 4.20. Figure 4.10 CV lag based on target onset for Mandarin participants: coronal C and high V. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 41.60 16.54 17.76 1.30 df 46.64 698.27 t value 2.34 12.71 Pr(>|t|) 0.02 < 0.00001 Table 4.20 Mixed effects model results for Mandarin participants: coronal C and high V. For Mandarin stimuli with coronal C and low V, there was a significant positive correlation between CV lag and sonority difference as in Figure 4.11 and Table 4.21. 103 Figure 4.11 CV lag based on target onset for Mandarin participants: coronal C and low V. Gestural Onset (Intercept) Sonority difference Estimate Std. Error -41.02 19.84 21.55 1.37 df 59.64 691.61 t value -1.90 14.48 Pr(>|t|) 0.06 < 0.00001 Table 4.21 Mixed effects model results for Mandarin participants: coronal C and low V. Syllables with coronal nasals and stops were compared in the following subsection. For high vowel [u], syllables with coronal stops have significantly larger lag than those with coronal stops, as in Figure 4.12 and Table 4.22. Figure 4.12 CV lag based on target onset for Mandarin participants: du4, nu4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error -83.39 30.89 39.46 3.71 df 153.88 273.54 t value -2.11 8.33 Pr(>|t|) 0.04 < 0.00001 Table 4.22 Mixed effects model results for Mandarin participants: du4, nu4. 104 For low vowel Mandarin stimuli, syllables with oral stops had significantly larger lag than those with nasal stops as in Figure 4.13 and Table 4.23. Figure 4.13 CV lag based on target onset for Mandarin participants: da4, na4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error -311.41 43.65 46.45 3.75 df 210.16 270.35 t value -6.70 11.65 Pr(>|t|) 0.00 < 0.00001 Table 4.23 Mixed effects model results for Mandarin participants: na4, da4. Figure 4.14 CV lag based on target onset for Mandarin participants: tu4, du4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 292.34 -3.31 40.22 2.96 df 192.30 281.53 t value Pr(>|t|) 7.27 -1.12 0.00 0.26 Table 4.24 Mixed effects model results for Mandarin participants: tu4, du4. 105 Figure 4.15 CV lag based on target onset for Mandarin participants: ta4, da4. Gestural Onset (Intercept) Sonority difference Estimate Std. 
Error 241.93 1.12 54.23 3.62 df 280.05 280.00 t value Pr(>|t|) 4.46 0.31 0.00 0.76 Table 4.25 Mixed effects model results for Mandarin participants: ta4, da4. 4.3.3 Results for claim 5b: the same C and different V In the previous subsection, I showed that there was a significant positive correlation between CV lag and sonority difference. In the current subsection, I showed the results for claim 5b that for the same C and different V, a more sonorous V leads to a larger CV lag. In the following subsections, I present results for the same bilabial C first, then the same coronal C. 4.3.3.1 Same bilabial C and different V The analyses below were used to test the claim on the same bilabial C and different V. If syllables with low vowels have larger lags than those with high vowels, the main claim of the current study will be supported. For target Mandarin syllables with [p], lower vowel syllables had a larger lag as in Figure 4.16 and Table 4.26. 106 Figure 4.16 CV lag based on target onset for Mandarin participants: pi4, pa4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 2.42 14.09 87.02 5.78 df 274.81 270.81 t value Pr(>|t|) 0.03 2.44 0.98 0.02 Table 4.26 Mixed effects model results for Mandarin participants: pi4, pa4. For syllables starting with [b], the high vowel syllables had significantly larger lags than low vowel syllables as indicated by Figure 4.17 and Table 4.27. Figure 4.17 CV lag based on target onset for Mandarin participants: bi4, ba4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error -12.33 16.72 69.25 5.69 df 282.51 274.16 t value Pr(>|t|) -0.18 2.94 0.86 0.004 Table 4.27 Mixed effects model results for Mandarin participants: bi4, ba4. 107 For bilabial nasal syllables, those with high vowels had a larger lag than those with low vowels. However, the fitted mixed effect model did not exhibit a significant pattern (as in Table 4.28) and the descriptive plot in Figure 4.18 also did not show any obvious difference. Figure 4.18 CV lag based on target onset for Mandarin participants: mi4, ma4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 119.54 2.58 49.58 5.41 df 288.91 280.01 t value Pr(>|t|) 2.41 0.48 0.02 0.63 Table 4.28 Mixed effects model results for Mandarin participants: mi4, ma4. 4.3.3.2 Same coronal C and different V This subsection shows the results for stimuli with the same coronal consonant and different vowels. If the syllables with low vowels have larger lags than those with higher vowels, the main claim of the dissertation will be supported. Figure 4.19 and Table 4.29 shows the comparison of tu4 and ta4 – ta4 had a larger lag than tu4. 108 Figure 4.19 CV lag based on target onset for Mandarin participants: tu4, ta4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 144.45 7.23 79.73 5.27 df 291.31 284.04 t value Pr(>|t|) 1.81 1.37 0.07 0.17 Table 4.29 Mixed effects model results for Mandarin participants: tu4, ta4. The syllable pair su4 and sa4 had similar CV lags, and high vowel syllables had larger lags than low vowel ones as in Figure 4.20 and Table 4.30. Figure 4.20 CV lag based on target onset for Mandarin participants: su4, sa4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 354.88 -8.47 80.82 6.12 df 290.98 282.06 t value Pr(>|t|) 4.39 -1.38 0.00 0.17 Table 4.30 Mixed effects model results for Mandarin participants: sa4, su4. 109 The coronal consonant syllable du4 and da4 also had similar CV lags as in Figure 4.21 and Table 4.31. 
In other words, the expected pattern was not exhibited for this pair. Figure 4.21 CV lag based on target onset for Mandarin participants: du4, da4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 249.92 0.51 60.55 4.88 df 279.19 281.01 t value Pr(>|t|) 4.128 0.10 0.00 0.92 Table 4.31 Mixed effects model results for Mandarin participants: du4, da4. The syllables nu4 had a significantly larger lag than na4, as in Figure 4.22 and Table 4.32. This observation was opposite to the main claim of the dissertation. It could be that the vowels were nasalized, and this changed the sonority difference. Figure 4.22 CV lag based on target onset for Mandarin participants: nu4, na4. 110 Gestural Onset (Intercept) Sonority difference Estimate Std. Error 308.76 -18.67 58.87 6.26 df 248.16 260.49 t value Pr(>|t|) 5.25 -2.98 0.00 0.003 Table 4.32 Mixed effects model results for Mandarin participants: nu4, na4. The coronal consonant syllables lu4 and la4 had similar CV lags, which means that the expected positive correlation was not found here. See Figure 4.23 and Table 4.33. Figure 4.23 CV lag based on target onset for Mandarin participants: lu4, la4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 154.44 -3.70 45.42 6.20 df 215.13 266.08 t value Pr(>|t|) 0.001 0.55 3.40 -0.60 Table 4.33 Mixed effects model results for Mandarin participants: lu4, la4. 4.3.4 The vowel displacement and C displacement analysis As mentioned earlier, Shaw and Chen (2019) found that there was a negative correlation between CV lag and vowel displacement. Also, C displacement was used as an estimate of jaw movement. Just like the English experiment, in the following, I considered V displacement and C displacement as random intercepts.4 Since C displacement was measured differently for stimuli with coronal C vs. stimuli with bilabial C, the two types of stimuli were analyzed separately. The mixed- effect model for stimuli with bilabial C can be found in Table 4.34, where there was a positive 4The R formula is shown here: CV lag based on target onset∼Sonority difference+(1|Participant)+(1|Word)+(1|C duration)+(1|V displacement)+(1|C displacement). 111 correlation between sonority difference and CV lag. However, the effect of sonority difference was not significant. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 136.54 3.65 22.71 1.96 df 16.13 13.21 t value Pr(>|t|) 6.01 1.86 0.00 0.09 Table 4.34 Mixed effects model results for Mandarin participants. The bilabial C stimuli. Random intercepts: participant, word, C duration, V displacement, C displacement. To test whether this insignificance is due to vowel displacement or consonant displacement, I added V displacement or C displacement as fixed effects in the model. The results of considering vowel displacement can be found in Table 4.35.5 After considering vowel displacement, there was a significant positive correlation between sonority difference and CV lag. Also, there was a significant positive correlation between sonority difference and vowel displacement. The positive correlation between sonority and vowel displacement was surprising as it did not replicate either Shaw and Chen (2019) or experiment 1. Further research needs to be conducted to conclude a reason. (Intercept) Sonority difference V displacement Estimate Std. 
Error 107.87 5.73 11.90 21.38 1.87 0.64 df 15.42 13.12 1860.38 t value 5.04 3.07 18.61 Pr(>|t|) 0.00 0.01 < 0.00001 Table 4.35 Mixed effect model for bilabial C stimuli with sonority difference and V displacement as fixed effects. Random intercepts: participants, word, C duration. I also added C displacement as a fixed effect, and there was no significant relationship between C displacement and CV lag as in Table 4.36.6 5CV lag based on target onset∼Sonority difference + V displacement + (1 | Participant) + (1 |Word) + (1 | C duration) + (1 | C displacement). 6The code is: CV lag based on target onset∼Sonority difference+C displacement+(1|Participant)+(1|Word)+(1|V displacement). 112 (Intercept) Sonority difference C displacement Estimate Std. Error 134.04 3.34 -1.07 23.15 2.00 1.01 df 16.41 13.38 758.35 t value 5.79 1.67 -1.06 Pr(>|t|) 0.00 0.12 0.29 Table 4.36 Mixed effect model for bilabial C stimuli with sonority difference and C displacement as fixed effects. Random intercepts: participant, word, V displacement. Also, the mixed effect model result for considering C and V displacement for stimuli with coronal C can be found in Table 4.37. When considering C and V displacements, there was a significant positive correlation between CV lag and sonority difference. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 29.54 15.02 34.29 2.84 df 9.15 7.80 t value Pr(>|t|) 0.86 5.30 0.41 0.001 Table 4.37 Mixed effects model results for Mandarin participants. The coronal C stimuli. Random intercepts: participant, word, C duration, V displacement, C displacement. 4.3.5 Summary As in Table 4.38, most groups in Mandarin EMA data exhibited a significant positive correlation between CV lag and sonority difference. Pairwise comparison in Table 4.39 showed that the stimuli with bilabial C generally had significant positive correlations when the stimuli differed in voicing, nasality, and vowel height. In those cases, the stimuli had controlled nasality, voicing, as well as C place and manner. 113 Dataset (Mandarin EMA data) All Mandarin EMA data Bilabial C C, V displacement as random intercepts V displacement as fixed effect High V (mi4, bi4, pi4) Mid V (wei4, mei4, bei4, pei4) Low V no coda (wa4, ma4, ba4, pa4) Low V with coda (wan4, man4, ban4, pan4) Coronal C C, V displacement as random intercepts High V (lu4, nu4, du4, su4, tu4) Low V (la4, na4, da4, sa4, ta4) 11.90 *** Est (son diff) Est (displ) 7.11 ** 3.65 5.73 ** 10.23 *** 5.55 *** 5.73 *** 5.61 *** 15.02 16.54 *** 19.84 *** Table 4.38 Summary of Mandarin EMA results. Son diff means sonority difference, est means estimate, and displ means displacement. *** means that p ⩽ 0.001; ** means that p ⩽ 0.01; * means that p ⩽ 0.05. Pairwise comparison Stimulus pair Estimate (sonority difference) Bilabial C Nasality differ Voicing differ Vowel height differ Coronal C Nasality differ Voicing differ Vowel height differ mi4, bi4 mei4, bei4 ma4, ba4 man4, ban4 bi4, pi4 bei4, pei4 ba4, pa4 ban4, pan4 mi4, ma4 bi4, ba4 pi4, pa4 nu4, du4 na4, da4 tu4, du4 ta4, da4 tu4, ta4 su4, sa4 du4, da4 nu4, na4 lu4, la4 10.18 *** 17.62 *** 17.93 *** 20.54 *** 10.57 *** 6.95 * 9.69 * 8.11 * 2.58 16.72 ** 14.09 * 30.89 *** 43.65 *** -3.31 1.12 7.23 -8.47 0.51 -18.67 ** -3.7 Table 4.39 Summary of EMA Mandarin pairwise comparisons. *** means that p ⩽ 0.001; ** means that p ⩽ 0.01; * means that p ⩽ 0.05. 
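For reference, the three model specifications compared in the displacement analysis of section 4.3.4 can be written out as R formulas. This is a hedged sketch with hypothetical variable names standing in for the annotated measures; it mirrors the footnoted formulas rather than reproducing the actual analysis script.

```r
## Displacements as random intercepts (Tables 4.34 and 4.37)
f_random <- cv_lag ~ son_diff + (1 | participant) + (1 | word) +
  (1 | c_dur) + (1 | v_displacement) + (1 | c_displacement)

## V displacement added as a fixed effect (Table 4.35)
f_vdisp <- cv_lag ~ son_diff + v_displacement + (1 | participant) +
  (1 | word) + (1 | c_dur) + (1 | c_displacement)

## C displacement added as a fixed effect (Table 4.36)
f_cdisp <- cv_lag ~ son_diff + c_displacement + (1 | participant) +
  (1 | word) + (1 | v_displacement)

## e.g., lmer(f_vdisp, data = mandarin_bilabial)  # bilabial-C subset (placeholder)
```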
4.4 Conclusion
Overall, the Mandarin experiment replicated the observations of the English experiment: it showed that there is a positive correlation between sonority difference and CV gestural lag for tone 4 Mandarin words. This is generally true both for claim 5a, which concerns the same V and different C, and for claim 5b, which concerns the same C and different V. However, just as in the English EMA experiment, there were a few sub-analyses that did not exhibit the expected pattern. The possible reasons for the non-significant observations are the following: the vowel sensor was the same regardless of vowel height, and the nasality of surrounding sounds may have affected the sonority level of the vowel. First, the groups with coronal consonants and different vowels had some unexpected patterns. This is probably because a) the C and V share the tongue as the articulator, and b) one consistent tongue measure was used regardless of vowel height. For a similar reason, do and doe in English also did not exhibit the expected pattern. Second, nasalization may affect the sonority level or the gestural lag of adjacent sounds. This could be why the expected pattern was not found in the mi4, ma4 pair. Also, when stimuli without a controlled environment were included (as in the bilabial group), the pattern was not significant.
CHAPTER 5 DISCUSSION
The dissertation showed that there is a significant positive correlation between CV lag and sonority difference for both English monosyllabic words and Mandarin tone 4 words. The positive correlation was found when the C was the same and the V differed, as well as when the V was the same and the C differed. Furthermore, the claim was also supported by more controlled comparisons of labial and coronal consonants, as well as of vowels of different heights. In experiment 1, I observed the positive correlation in English corpus data. In experiment 2, a similar positive correlation was observed in English EMA data, which had more variation in the vowels and consonants of the stimuli. Furthermore, the correlation was not limited to English, since a similar positive correlation was found in the Mandarin EMA data of experiment 3. In all three experiments, I argued that the observed correlation cannot simply be attributed to voicing, vowel quality, or the effect of vowel displacement. Specifically, I showed that voicing is unlikely to be the sole factor leading to CV lag variation, because two stimuli that differ in nasality and share the same voicing (such as [bA, mA]) exhibited the positive correlation. Moreover, groups of stimuli with the same vowel quality, such as the bilabial C group and the alveolar C group, showed a significant positive correlation. Therefore, vowel quality is unlikely to be the driver of the observed CV lag variation. Additionally, in a post-hoc analysis, I showed that vowel displacement in fact contributes to CV lag variation in the direction opposite to what I observed for sonority difference, and that the effect of sonority difference was almost unchanged. The results of this dissertation also suggest that jaw movement is unlikely to be an important factor driving the observed correlation between sonority and gestural lag. First, using consonant displacement as an approximation of jaw movement, I showed that the observed positive correlation is not confounded by jaw movement. Second, I looked at voicing pairs and nasality pairs which controlled for jaw movement.
The voiced and voiceless pairs differ in voicing and involve a similar level of jaw movement showed the expected effect. Furthermore, in nasality pairs [mA, bA] and [nA, dA], the stimuli with the nasal segment have a larger lag than the oral 116 segment of the same place of articulation. Since the two segments are mainly different in nasality and involve similar degrees of jaw movement, the lag difference cannot be attributed to jaw movement. For the pairwise comparisons controlled for jaw movement, I used post-hoc analyses to confirm that consonant displacement was indeed controlled. Relatedly, the comparison using vowel displacement (which is also an approximation of jaw movement) makes a similar point. The positive correlation between sonority difference and CV lag is still exhibited when including vowel displacement. Based on the above analysis, jaw movement is unlikely to be the factor leading to the observed relationship between CV lag and sonority difference. As noted in Chapter 1.1, previous theoretical claims about speech production have typically predicted a consistent relationship of CV coordination, assuming prosodic factors are held constant (Browman and Goldstein, 1989, 1992; Nam, 2007; Liu et al., 2020; Durvasula and Wang, 2023; Liu et al., 2022). However, the results of the current study suggest that CV lags are correlated with the sonority difference between the consonant and the vowel; therefore, sonority should be a factor in modeling articulatory timing. In addition, the results of the dissertation also suggest that sonority needs to be considered in experiments studying gestural coordination, particularly in making comparisons between segment sequences consisting of different segments. In my 3 experiments, Mandarin and English both showed significant positive correlations between CV lag and sonority difference. The estimates were around 10 as in the following Table 5.1, 5.2, and 5.3. Dataset (English corpus data) All English corpus data All English corpus data, V displacement All English corpus data, C displacement sVd stimuli (subset) T1 C stimuli Alveolar C stimuli (la, na, za, da, sa, ta) Lip aperture (wa, ma, ba, pa) Estimate (sonority diff) Estimate (displace) 16.49 *** 15.76 *** 16.76 *** 12.09 15.53 ** 14.24 * 14.71 *** -3.97 *** 2.19 *** Table 5.1 Summarizing the results of experiment 1. *** means that p ⩽ 0.001; ** means that p ⩽ 0.01; * means that p ⩽ 0.05. This Table is a repetition of Table 2.28, for the readers’ convenience. 117 Dataset (English EMA data) All English EMA data Bilabial C All bilabial C data, C and V displacement High V (week, meek, beak, peak) Mid V (wane, main, bane, pain) Low V (whack, Mac, back, pack) Coronal C All coronal C data, C and V displacement High V (new, do, sue, two) Mid V (know, doe, so, toe) Low V (knock, dock, sock, talk) Estimate (sonority difference) 11.24 *** 12.74 *** 14.00 *** 14.35 *** 10.47 *** 8.31 * 8.02 ** 9.40 *** 7.43 *** Table 5.2 Summary of English EMA results. *** means that p ⩽ 0.001; ** means that p ⩽ 0.01; * means that p ⩽ 0.05. This Table is a repetition of Table 3.40, for the readers’ convenience. 
Dataset (Mandarin EMA data) All Mandarin EMA data Bilabial C C, V displacement as random intercepts V displacement as fixed effect High V (mi4, bi4, pi4) Mid V (wei4, mei4, bei4, pei4) Low V no coda (wa4, ma4, ba4, pa4) Low V with coda (wan4, man4, ban4, pan4) Coronal C C, V displacement as random intercepts High V (lu4, nu4, du4, su4, tu4) Low V (la4, na4, da4, sa4, ta4) 11.90 *** Est (son diff) Est (displ) 7.11 ** 3.65 5.73 ** 10.23 *** 5.55 *** 5.73 *** 5.61 *** 15.02 16.54 *** 19.84 *** Table 5.3 Summary of Mandarin EMA results. Son diff means sonority difference, est means estimate, and displ means displacement. *** means that p ⩽ 0.001; ** means that p ⩽ 0.01; * means that p ⩽ 0.05. This Table is a repetition of Table 4.38, for the readers’ convenience. For the pairwise comparisons, the stimulus pairs that differ in nasality exhibited the most consistent results. There was less consistency in the differ-in-voicing pairs as well as the differ- in-vowel-height pairs. The non-consistent results could be due to a) the same tongue sensor measurements for different vowels or b) the imprecise coding of C voicing of the stimuli. 118 Pairwise comparison Nasality differ Voicing differ, stop Voicing differ, fricative Vowel height (subset) Stimulus pair Estimate (sonority difference) ma, ba na, da pa, ba da, ta fa, va sa, za been, back 14.73 *** 22.09 *** 14.40 *** 5.84 * 7.11 9.74 21.06 *** Table 5.4 Summarizing pairwise comparison of experiment 1. *** means that p ⩽ 0.001; ** means that p ⩽ 0.01; * means that p ⩽ 0.05. This Table is a repetition of Table 2.29. Pairwise comparison Stimulus pair Estimate (sonority difference) Bilabial C Nasality differ Voicing differ Vowel height differ Coronal C Nasality differ Voicing differ Vowel height differ Mac, back meek, beak main, bane beak, peak bane, pain back, pack peak, pack beak, back meek, Mac week, whack new, do know, doe knock, dock two, do toe, doe talk, dock two, toe do, doe sue, so new, know 11.20 * 21.84 *** 21.59 *** 8.92 12.81 ** 12.66 * 3.2 -0.22 7.61 18.23 ** 23.74 *** 11.83 ** 17.58 *** -5.64 6.47 -0.23 61.90 *** 27.51 76.03 *** 60.78 *** Table 5.5 Summary of English EMA pairwise comparison. *** means that p ⩽ 0.001; ** means that p ⩽ 0.01; * means that p ⩽ 0.05. This Table is a repetition of Table 3.41. 119 Pairwise comparison Stimulus pair Estimate (sonority difference) Bilabial C Nasality differ Voicing differ Vowel height differ Coronal C Nasality differ Voicing differ Vowel height differ mi4, bi4 mei4, bei4 ma4, ba4 man4, ban4 bi4, pi4 bei4, pei4 ba4, pa4 ban4, pan4 mi4, ma4 bi4, ba4 pi4, pa4 nu4, du4 na4, da4 tu4, du4 ta4, da4 tu4, ta4 su4, sa4 du4, da4 nu4, na4 lu4, la4 10.18 *** 17.62 *** 17.93 *** 20.54 *** 10.57 *** 6.95 * 9.69 * 8.11 * 2.58 16.72 ** 14.09 * 30.89 *** 43.65 *** -3.31 1.12 7.23 -8.47 0.51 -18.67 ** -3.7 Table 5.6 Summary of EMA Mandarin pairwise comparisons. *** means that p ⩽ 0.001; ** means that p ⩽ 0.01; * means that p ⩽ 0.05. This Table is a repetition of Table 4.39. In the following, I discuss potential explanations of the current finding in subsection 5.1. Moreover, I discuss the potential impact of the link between sonority and gestural timing. 
Specifically, the study provides a potential basis to account for several cross-linguistic typological patterns such as the Sonority Sequencing Principle (SSP) (Sievers, 1881, 1901; Greenberg, 1965; Pike, 1972; Hooper and Bybee, 1976; Steriade, 1982; Selkirk, 1984; Clements, 1990; Kenstowicz, 1994; Blevins, 1995; Parker, 2002, 2011), the Sonority Dispersion Principle (Clements, 1990; Parker, 2011), and the asymmetry between CV and VC syllable frequencies (Ohala, 1990; Tabain et al., 2004; Nam et al., 2009). These general typological patterns of human language may be accounted for based on our results along with another premise of a preference for larger gestural lag over shorter ones. Furthermore, in subsection 5.3, I propose a sonority-driven speech production constraint. Lastly, in subsection 5.4, I discuss some caveats and directions for future studies. 120 5.1 Potential explanations and theory for the finding In this section, I intend to evaluate some claims that could potentially explain the finding of the study that there is a positive correlation between CV lag and sonority difference. Firstly, the language-specific mechanism in Georgian mentioned by Crouch (2022) and Crouch et al. (2023) seems the least likely since similar patterns have been found in other languages such as English and Mandarin (Gao, 2008; Shaw and Chen, 2019). For the rest of the section, to explain the findings of the dissertation, I consider principles in speech production such as parallel transmission (Mattingly, 1981), coarticulatory resistance (Bladon and Al-Bamerni, 1976), and perceptual recoverability (Chitoran et al., 2002). I also consider the prosodic gesture model to model the results (Byrd and Saltzman, 2003). The first claim I evaluate here comes from Mattingly (1981), who argued that segments are grouped into syllables because listeners have certain expectations regarding the parallel transmission of information. Specifically, segments from different classes of articulatory manners should be ordered in a way that a more closed constriction must occur in the process of being released to a more open constriction. They suggest that this kind of ordering or organization of articulatory gestures ensures the parallel transmission of a syllable, which could be argued to be more efficient than decoding isolated consonants or vowels. More specifically, they suggest that a sonority rise is more likely to transmit in parallel over a sonority plateau (and a sonority plateau over a sonority fall). Furthermore, they suggest that this view can be used to explain the Sonority Sequencing Principle (SSP), which requires that each syllable should exhibit one peak of sonority in the nucleus, and that, cross-linguistically, a sonority rise (such as [pl]) is preferred in onsets over a sonority plateau (such as [pt]) which in turn is preferred over a sonority fall (such as [lp]) (Sievers, 1881, 1901; Greenberg, 1965; Pike, 1972; Hooper and Bybee, 1976; Steriade, 1982; Selkirk, 1984; Clements, 1990; Kenstowicz, 1994; Blevins, 1995; Parker, 2002, 2011). Following their reasoning, one would expect a sonority rise to have the least lag and a sonority fall to have the longest lag — but this prediction is contrary to the observations, both in the current study and those in Crouch (2022) and Crouch et al. (2023). Therefore, parallel transmission could not 121 account for the primary finding of the paper. 
Another concept relevant to the interpretation of the finding is coarticulatory resistance, which was originally used to account for the coarticulatory variation in English /l/ (Bladon and Al-Bamerni, 1976). Coarticulatory resistance is likely not to play a role here because it is non-directional. On the contrary, the main claim of the dissertation requires a directional calculation of sonority difference and gestural lag. Furthermore, Kent and Minifie (1977) argue that though the concept of coarticulatory resistance can be used to model the observed variation, it seems unable to predict or explain the general link between gestural timing and sonority difference, because it is expected to vary by language, segment, and even potentially context. Rather, it serves as the numerical redescription of the articulatory variation. Therefore, even though we could say that CV syllables of larger sonority difference have higher coarticulatory resistance, the claim does not explain the observations. A third possible way to account for the claim is based on the prosodic gesture model (Byrd and Saltzman, 2003), which suggests gestural lag variation. The prosodic or 𝜋-gesture model suggests that prosodic gestures “temporally stretch gestural activation trajectories” (p. 149) and prosodic gestures make the gestures in their activation domain longer, larger, and further apart (Byrd and Saltzman, 2003). In Figure 5.1, prosodic gesture occurs in the prosodic tier, and it slows down the gestural coordination of gesture 1 and gesture 2 between the two dashed lines. Cho (2006) suggested that there are various strengths of prosodic gestures. Therefore, it is worth considering sonority as a prosodic gesture. Note that if sonority is a prosodic gesture, the C and V should also be lengthened as the lags are lengthened. To evaluate this, I test whether C duration positively correlates to sonority difference and whether V duration positively correlates to sonority difference. Specifically, C or V duration was modeled as a function of sonority difference, where participants and words were modeled as random intercepts. For English corpus data from experiment 1, C duration (estimate = 4.66, p = 0.15) or V duration (estimate = 4.42, p = 0.35) did not have a relationship with sonority difference. For English EMA data from experiment 2, C duration (estimate = 4.53, p = 0.05) had a positive correlation with sonority difference, but V duration (estimate = 2.55, p = 0.20) did not 122 have a positive correlation with sonority difference. For Mandarin EMA data from experiment 3, C duration (estimate = -3.31, p = 0.12) did not have a relationship with sonority difference, but V duration had a positive correlation with sonority difference (estimate = 7.024, p= 0.01). These results show that C and V gestures did not lengthen consistently according to sonority. Therefore, sonority may not be modeled as a prosodic gesture. Prosodic Gesture Gesture 1 Gesture 2 Figure 5.1 Prosodic gesture. A fourth relevant claim regarding the modulation of gestural timing comes from Chitoran et al. (2002), who suggests that perceptual recoverability could be the underlying reason for the gestural coordination they observed in Georgian. Specifically, Chitoran et al. (2002) proposed that, cross- linguistically, a syllable should have structures that allow maximum gestural overlap with minimal loss of information. Even though the hypothesis was rejected by Crouch (2022) and Crouch et al. 
(2023) based on their results on Georgian CC timing, perceptual recoverability can provide a valid explanation of the finding if we assume that sonority is essentially an abstraction of intensity. Intensity is taken to be the phonetic correlate of sonority in Parker's work (Parker, 2002, 2008), which is also the source of the sonority scale assumed in this dissertation. The speculative explanation for the link between gestural timing and sonority that I would like to suggest is that a larger intensity difference requires a larger gestural lag to ensure perceptual recoverability. On this view, the observed correlation between sonority difference and CV lag could actually be due to a correlation between intensity difference and CV lag. It could be that if two adjacent sounds are very different in intensity, a large degree of overlap is likely to result in the masking of the lower-intensity sound. On the other hand, if two sounds are similar in intensity, they may be more likely to withstand a large degree of overlap. Since the assumption is that a consonant and a vowel in a given CV syllable should have maximum overlap with minimal loss of information, a C and a V that differ more in intensity should have a larger lag (less overlap) than a C and a V with a smaller intensity difference. The remaining question is about the directional sensitivity of the intensity difference. It is possible that the human perceptual system is sensitive to the direction of the intensity change: a lower-intensity sound is more likely to be masked by a following higher-intensity sound than by a preceding higher-intensity sound. The two predictions of this explanation can be tested by independently designed speech perception experiments: 1) two adjacent sounds in a syllable with a larger intensity difference need a larger lag (less overlap) to be perceived; and 2) a lower-intensity sound followed by a higher-intensity sound is more difficult to perceive than the same sound preceded by a higher-intensity sound. The perceptual experiments sketched here could be used to test the relationship between perceptual recoverability and intensity difference. Synthesized stimuli made up of two segments could be used to test the claim, with different stimuli instantiating different intensity-difference values. For instance, there could be stimuli such as AB, where the intensity of B minus the intensity of A is 20 dB, or CD, where the intensity of D minus the intensity of C is 40 dB. For each pair of segments with a given intensity difference, different degrees of overlap of the two segments would be created. For example, for AB with a 20 dB intensity difference, A and B would have various degrees of overlap, with B aligned to the 0%, 20%, 40%, 60%, 80%, and 100% point of A. The recordings would be played to participants to test whether they can identify the stimuli. The expectation is that as the intensity difference increases, participants would need more lag (less overlap) to correctly identify the segment pairs. To test directionality, the perception of stimuli AB would be compared with that of stimuli BA, and the prediction is that a lower-intensity segment followed by a higher-intensity segment (e.g., AB) should require more lag than a higher-intensity segment followed by a lower-intensity segment (e.g., BA).
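To make the proposed design concrete, the factorial structure can be sketched as a small condition grid in R. The 20 dB and 40 dB differences and the 0% to 100% alignment steps are the values named above; the labels and the number of levels are illustrative only.

```r
## Condition grid for the proposed perception experiment (illustrative values)
conditions <- expand.grid(
  intensity_diff_db = c(20, 40),              # e.g., AB with 20 dB, CD with 40 dB
  alignment_pct     = seq(0, 100, by = 20),   # onset of segment 2 at % of segment 1
  order             = c("low_then_high",      # e.g., AB
                        "high_then_low")      # e.g., BA, for the directionality test
)
nrow(conditions)   # 2 x 6 x 2 = 24 cells; each cell would be synthesized and presented
```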
This hypothesis of perceptual recoverability and how it connects to gestural timing aligns with the argument in Wright (1996), who argued that the perceptual demands of the listener contribute 124 to the production strategy of the speaker. In fact, scholars have explored the perception-production link and observed altered speech production patterns when the auditory feedback changes (Katseff et al., 2012). After collecting perceptual evidence, we will be more equipped to evaluate the explanation of the observation in the future. Note that the preference for larger intensity difference over smaller difference was also suggested by Henke et al. (2012). They claimed that greater amplitude change results in more robust information. As mentioned in Chapter 1, Henke et al. (2012) argued that different natural classes of sounds have internal and transitional perception cues of various robust levels. Therefore, it is expected that there are different overlapping degrees of adjacent sounds for different natural classes to ensure the transmission of information in the perception. 5.2 Providing a basis for some phonological universals The findings of the dissertation can provide a basis for several phonological universals if we add a premise that humans prefer larger gestural lags in articulation. Notably, the implications do not depend on sonority as a primitive. In other words, even if sonority were ultimately derived from some other set of factors, and the gestural timing of CV sequences was really correlated with those other factors, the implications of the gestural timing differences for the various typological observations that I discuss below will still hold true. 5.2.1 Providing a basis for the Sonority Dispersion Principle and the SSP The main finding of the current study has the potential to help explain phonological universals related to the syllable structure — such as the Sonority Dispersion Principle and the Sonority Sequencing Principle (SSP). Firstly, the Sonority Dispersion Principle states that in a syllable CV, the onset and nucleus differ from each other in sonority as much as possible (Clements, 1990; Parker, 2011). In other words, this principle requires that a CV syllable should have larger sonority difference. The Sonority Dispersion Principle is potentially derivable from the finding of the current dissertation along with another premise that a larger gestural lag is preferred, potentially for reasons of perceptual recoverability (Chitoran et al., 2002). For instance, it may be that a larger lag is correlated with more perceptual salience so it is preferred. It is also possible that a larger 125 lag is preferred because it is easier to articulate. Ohala (1990) argued that some sequences may be disfavored due to their being difficult to articulate — it is possible that having a shorter gestural lag for gestures serves as the physical manifestation of articulatory difficulty to implement a certain ordered sequence. More work needs to be done to assess articulatory ease, which has been elusive to define or study (Shariatmadari, 2006). Secondly, the findings of the dissertation could be used to explain the Sonority Sequencing Principle (SSP) if we generalize the link to CC onset clusters. 
The SSP requires that a sonority rise (such as [pl]) is preferred in onsets over a sonority plateau (such as [pt]) which in turn is preferred over a sonority fall (such as [lp]) cross-linguistically (Sievers, 1881, 1901; Greenberg, 1965; Pike, 1972; Hooper and Bybee, 1976; Steriade, 1982; Selkirk, 1984; Clements, 1990; Kenstowicz, 1994; Blevins, 1995; Parker, 2002, 2011). Based on the observations of the current study, the SSP can be accounted for as follows. If the sonority index of the clusters C1C2 is coded according to Table 1.6, and if the sonority difference of clusters is calculated by subtracting the sonority index of the first consonant from that of the second consonant (i.e., C2 - C1), then the sonority difference of the clusters [pl], [pt], and [lp] are 8, 0, -8 respectively (Table 5.7). In this current dissertation, I found that gestural lag positively correlates to sonority difference. If one generalizes the finding on CV sequences to CC sequences, one would predict that sonority rise has a larger lag than sonority plateau, which has a larger lag than sonority fall. This prediction was already been supported by Georgian (Crouch, 2022; Crouch et al., 2023). If the premise that humans prefer larger gestural lag within a syllable is true, I would predict the phonological constraint that sonority rise is preferred over plateau over fall. Onset cluster Sonority rise [pl] Sonority plateau [pt] Sonority fall [lp] Sonority difference 9-1=8 1-1=0 1-9=-8 Table 5.7 Providing a basis for the SSP. In the Sonority difference column, the first two numbers refer to the sonority indexes of the two consonants in the cluster, and the last number represents the sonority difference. 126 5.2.2 Relevance to the Syllable Contact Law A related question would be how the current finding can inform the explanation of the Syllable Contact Law, which specifies that the structure A.B would be more preferable if a-b is larger (Hooper and Bybee, 1976; Murray and Vennemann, 1983). I would argue that a general preference toward larger gestural lag within the syllable could derive the Syllable Contact Law. Consider the syllable CV1A.BV2, where there are two syllables CV1A and BV2, with A.B at the syllable boundary. If both syllables need to satisfy the requirement that larger lags are preferred in a syllable, then it comes as a consequence that a-b is larger (a and b refer to the sonority of A and B respectively). Since in each syllable, every segment sequence should satisfy the large lag requirement, V2-b and a-V1 should have larger sonority differences. This means that V2 and a should be larger, and b and V1 should be less sonorous. If a is large and b is small, and a-b would be larger. See the stepwise derivation in (8). (8) Preference in question: larger lags (larger sonority difference) are preferred within a syllable, where sonority difference is calculated by Sonoritylater - Sonorityformer. a. Sample syllables: CV1A.BV2 b. First syllable: CV1A c. First syllable satisfying preference: a-V1 larger → a large; V1 small d. Second syllable: BV2 e. Second syllable satisfying preference: V2-b larger → V2 large, b small f. a large, b small → a-b large The derivation shows that the Syllable Contact Law may be the consequence of adjacent syllables satisfying the larger lag requirement within each syllable. Admittedly, the derivation is not an explanation of the Syllable Contact Law, but rather a prediction about tendencies. 
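Returning briefly to the SSP calculation in Table 5.7, the cluster computation can be written as a short R helper. The indices are simply the ones assumed in that table (p = t = 1, l = 9), and the predicted lag ordering follows from the main finding together with the generalization to CC sequences; the helper itself is only an illustration.

```r
## Sonority difference of a C1C2 onset = C2's index minus C1's index (Table 5.7)
son <- c(p = 1, t = 1, l = 9)
cluster_diff <- function(c1, c2) unname(son[c2] - son[c1])
cluster_diff("p", "l")   #  8  sonority rise:    predicted largest lag
cluster_diff("p", "t")   #  0  sonority plateau
cluster_diff("l", "p")   # -8  sonority fall:    predicted smallest lag
```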
5.2.3 Providing a basis for the syllable frequency asymmetry: CV versus VC The premises used above can also be used to provide a basis to explain why cross-linguistically, CV syllables are much more common than VC syllables (Ohala, 1990; Tabain et al., 2004; Nam 127 et al., 2009). Specifically, VC will have a shorter gestural lag than CV, which is disfavored if the premise that human prefers larger lag is true. Of course, we also need to account for the fact that VC sequences do exist in many languages. It is likely that the observed relationship between sonority and gestural timing should be seen akin to a force that has a certain effect, keeping all other things constant (Kröger et al., 1995); however, if there is a sufficiently strong antagonistic force that requires VC as a sequence (perhaps as a language-specific segmental sequence), then VC sequences could still surface. Such an analysis makes the prediction that VC is dispreferred, but is still possible under the right circumstances — this correctly predicts the asymmetry in CV and VC sequences within syllables across languages. 5.2.4 Providing a basis for the link among sonority, stress, and vowel height There are several observations or correlations involving sonority, stress, and vowel height, and the findings of the dissertation may provide explanations for those too. First, the finding of the dissertation allows us to partially understand why lower vowels are acoustically longer and have higher intensity than higher vowels (Lehiste, 1970; Gordon et al., 2012). Following the main claim of the paper, since lower vowels are more sonorous, they will have a larger lag with the preceding consonant. Therefore, there is less overlap with the preceding consonant, and thus there is less acoustic “hiding” of the vowel. Consequently, a low vowel is likely to be acoustically longer and louder than a high vowel. Second, our proposal leads to the prediction that lower vowels are perceptually and acoustically longer, making them better tone and stress holders than higher vowels. Such cases can be found in many languages. For example, Zuraw (2003) found that in Palauan, more sonorous vowels are dispreferred in unstressed syllables. Similarly, Gordon et al. (2012) found that in Armenian, Javanese, and Kwak’wala, the reduced phonological sonority of schwa relative to peripheral vowels is manifested in the rejection of stress on schwa. This indicates a positive correlation between sonority and stress — reduced sonority correlates to the absence of stress. Moreover, avoidance of stressed high vowels has been observed in Takia (Ross, 2002, 2003; De Lacy, 2007) — while the final syllable is stressed by default, if the final vowel is a high vowel, stress falls on a non-high 128 vowel elsewhere in the word. Finally, it is also argued that languages where stress seeks out vowels of lower sonority and disregards higher sonority ones are unattested (De Lacy, 2007). The relationship between sonority difference and lag could be one of the reasons why stress favors high sonority vowels. As observed by Gu (2023), Katsika (2016) and Katsika (2012), stressed vowels are correlated with larger gestural lags than unstressed syllables. Since given the same consonant, vowels high in sonority are correlated with larger gestural lags than vowels low in sonority, this in turn leads to an acoustically and perceptually longer lower vowel, which therefore provides a better holder for stress.1 But why are larger lags preferred? 
Chitoran (2016) extended the argument in Pouplier and Beňuš (2011) and claimed that a larger lag provides a favorable environment for an energy peak to emerge. So perhaps humans prefer this energy-peak environment between two consonants.
5.3 A sonority-driven speech production model
The current study has the potential to support a new speech production model that assumes sonority determines gestural coordination patterns. The assumption of this sonority-based speech production model is that, within a syllable, all gestures coordinate according to the sonority difference between adjacent segments. Consequently, the findings of the current study should generalize to CC, CV, and VC sequences within a syllable in all languages, where a positive correlation between sonority difference and gestural lag is predicted for cases beyond CV syllables. It is also possible that this claim holds across syllable boundaries or even larger prosodic boundaries. This claim of a sonority-driven speech production model is consistent with the results in Aziz (2024), who interpreted the model as a Sonority-Driven Gestural Timing constraint ranked high in languages such as Malagasy. To argue for this model, admittedly, one also needs to test gestural coordination in coda positions, CV coordination in syllables with consonant clusters, and various other conditions, which I plan to undertake in future work. Extending the claim to all languages is not meant to be immodest but to make the claim testable.1 If I were to say that such a generalization holds only for the dialect of English studied here, it would not be clear how someone else could test our claim beyond simply replicating our results for the relevant dialect, nor how it would inform phonetic theory more generally. Section 5.4 discusses the relevant caveats. Indeed, there are many exceptions to the SSP, as mentioned in Chapter 1. I want to reiterate that the generalization is about syllables and needs to be judged on segment sequences that occur in the same syllable structure context. For example, in English, sC sequences form putative exceptions to the SSP; however, it has been argued that the [s] is not part of the onset in such cases and instead forms a foot-level appendix (Vaux and Wolfe, 2009). Similarly, Moroccan Arabic and Jazani Arabic have word-initial consonant sequences; however, they do not violate the SSP, as it has been argued that such sequences do not form complex onsets and that all but the last consonant are not in the same syllable as the last consonant (Goldstein et al., 2007; Shaw et al., 2009, 2011; Hermes et al., 2013, 2017). Claims about the SSP and the SDP are, strictly speaking, about syllable-internal sequencing, so it is likely that the apparent cross-linguistic exceptions are amenable to analyses along the lines of the superficial exceptions discussed above. The sonority-driven speech production model assumes that sonority causes gestural coordination variation, and I have discussed why this is likely. It is logically possible that a third factor is causing both the sonority differences and the gestural lag variation.
1 There are languages where stress systems are insensitive to sonority (De Lacy, 2007), and the detailed discussion of those phenomena requires a substantial amount of careful literature review and experimentation. Therefore, they are left for future studies.
There is a question about whether abstract sonority or the phonetic correlate of sonority is related to CV lag. I would argue that primarily it is the abstract sonority that leads to a certain gestural coordination pattern. Also, it would appear that the phonetic correlate of sonority systematically relates to gestural lag variation. 5.4 Caveats and directions for future studies Even though the main claim of the current study has been largely supported by a series of experiments, there are some caveats and future directions that are discussed in this section. One caveat is that it is unclear if the underlying concept is sonority or a strength hierarchy (Honeybone, 2008) since they are mirror images of each other. At this point, I am unsure of how 130 to distinguish between the two concepts, given they are inverses of each other. In the following, I am going to discuss the necessity of more comprehensive cross-linguistic analysis, as well as the nuances of lag measurement and gestural parsing techniques. 5.4.1 Cross-linguistic analysis Future studies on other languages are necessary to test the cross-linguistic impact of the current finding. For instance, it may be worth replicating the study in German since German onset clusters exhibited gestural coordination variation, but not in the same direction of the dissertation (Bombien et al., 2013). Specifically, Bombien et al. (2013) found that /kn/ has a larger target onset lag than /kl/ and that /ks, ps/ has a larger lag than /kl, pl/. One potential reason for the discrepant results between the current study and that of Bombien et al. (2013) is that their study also explored prosodic effects but the segmental and syllabic material in the context carrier phrases was not controlled across the stimuli, so it is possible that the effect they found was driven by these other characteristics of the stimuli. Another reason for the result difference could be that in the case of /kn/ and /kl/ (and for that matter /ks/), the measurements for both consonants were both based on tongue movement, so it is possible that this led to misparsing of the gestures (as did in our own /kA/ and /gA/ stimuli). Therefore, it would be optimal to re-run the study on German with these considerations in mind. If the original pattern observed in Bombien et al. (2013) sustains despite the change to the stimuli that I am suggesting, then the sonority hypothesis will be difficult to maintain as is — it will either have to be relativized to just CV sequences or some other independent factor would have to be added to the theory. 5.4.2 The lag measurement A disclaimer I would like to note is that the current results, without considering C duration in the model, are about observed CV lags — that is, CV lag duration was treated as informing us about the lag relationship between CV. However, as pointed out before, there are at least two different sources of observed CV lags: an absolute/constant lag and a proportional lag (Solé, 1992; Mücke et al., 2020; Durvasula and Wang, 2023). The observed lag due to the former is not expected to vary with different durations of the first gesture, but it is expected to vary with different durations 131 of the first gesture for the latter. To address the potential confound due to measurement choice, consonant duration is considered in the mixed effect models of all three experiments. 
Therefore, the significant positive correlation between CV lag and sonority difference arguably holds regardless of whether we consider absolute lag or proportional lag. However, it is worth mentioning that I did not calculate proportional lag according to the method in Durvasula and Wang (2023), so future work may need to verify whether similar results can be found for proportional lag.

5.4.3 The gestural parsing method

In speech production studies, kinematic data need to be analyzed and gestures need to be parsed in order to answer questions about articulation. There are, logically, infinitely many ways of measuring and parsing an articulatory gesture, as discussed in section 1.8 of Chapter 1. However, it is not clear what the right way to parse gestures is. The decision is arbitrary, and it may directly affect the answers to the research questions. One potential way to evaluate different methods is to assume that the right measurement will yield consistent observations. But since we are not sure what observations to expect in the first place, the process of figuring out the right method involves some circular argumentation, and there is no obvious way around it. One strategy future studies could adopt is to use more than one measurement technique for each research question. If several methods lead to the same conclusion, then the conclusion is likely to be true. However, if the conclusion depends on the method, then one needs to try to make sense of the differences where possible. Some remaining questions include whether the observed difference in outcomes is significant and how to interpret it: is it in the grammar, in the phonetics, or is it trivial?

In the current dissertation, the threshold algorithm lp_findgest was used, but there are some issues and ambiguities related to it. First, Shaw et al. (2023) mention that lp_findgest can sometimes yield unrealistic parses of gestures. For instance, sometimes the peak velocity is not large enough for a gesture to be parsed out, and in some cases tangential velocities inaccurately sum over distinct gestures, where component velocities should be used instead. The question remains what a realistic gesture is. Should it map one-to-one onto the acoustic signal? Should the gesture be marked based on the acoustic output? There is no straightforward answer to these questions, but more attempts to address them are needed to push forward our understanding of speech production. Second, the gesture boundaries determined by the threshold technique are sensitive to articulatory stiffness. Liu et al. (2022) argued that some research concluded that a consonant and a vowel in a CV syllable are coordinated sequentially only because consonants are articulated with higher stiffness than vowels. Third, there is in fact variation in the assumptions of the threshold technique and in how the method is applied. Comparing the lp_findgest algorithm with the threshold technique in Hoole et al. (1994) shows that the gesture identification procedures differ, even though both are threshold techniques. While lp_findgest first identifies the nucleus and expands the gesture outward from the center, the original technique first identifies the maximum velocity points at the edges, with the nucleus stage emerging as a by-product. Therefore, it is incorrect to assume that all threshold techniques operate under the same underlying assumptions. For concreteness, a simplified sketch of a threshold-based parse is given below.
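The sketch is an illustration of the general idea only, not a reimplementation of lp_findgest: it parses a single movement from a one-dimensional position trace with a 20%-of-peak-velocity criterion, assuming the trace begins and ends with near-zero velocity (the names pos and fs are hypothetical).

# Simplified threshold parse of a single gesture from a 1-D position signal.
# pos: numeric vector of sensor positions; fs: sampling rate in Hz.
parse_gesture <- function(pos, fs, threshold = 0.20) {
  vel  <- c(0, abs(diff(pos)) * fs)   # first-difference speed estimate, padded to length(pos)
  pk   <- which.max(vel)              # frame of peak velocity
  crit <- threshold * vel[pk]         # e.g., 20% of peak velocity
  gestural_onset <- max(which(vel[1:pk] < crit))                      # last sub-threshold frame before the peak
  target_onset   <- pk - 1 + min(which(vel[pk:length(vel)] < crit))   # plateau (target) begins
  c(gestural_onset = gestural_onset / fs, target_onset = target_onset / fs)  # timestamps in seconds
}

# The CV lag used in this dissertation would then be the vowel's target onset
# minus the consonant's target onset; dividing that lag by the consonant's
# duration would give one version of a proportional lag.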
Furthermore, a survey of uses of the threshold technique suggests that it has been applied at different scales, to a single articulatory gesture or to a whole syllable, even though the intended scope of the technique has rarely been made explicit. Moreover, there is variation in the exact threshold used. In previous studies, the 20% threshold was used frequently, probably because it is the default in the lp_findgest algorithm. However, the justification for the 20% threshold is missing in many studies. It is possible that this threshold was chosen for lp_findgest on the basis of the earlier study by Hoole et al. (1994). It is worth noting, though, as mentioned earlier, that the two threshold techniques are not entirely the same, and that the 20% value was chosen in Hoole et al. (1994) because it reproduced the expected durational difference between German tense and lax vowels. Note also that even though changing the threshold from 10% to 30% does not affect the conclusion in Durvasula and Wang (2023), this does not mean that the choice of threshold never affects results. For instance, Kuberski and Gafos (2023) show that increasing threshold values yield better performance in linear regression models, based on an analysis of thresholds of 0%, 5%, 10%, 15%, 20%, and 25%. (Their analysis is based on observed lag and does not distinguish between proportional lag and observed lag; one may argue that using proportional lag could lead to consistent results across thresholds.) This suggests that, ideally, the threshold should be justified.

Fourth, regarding the specific threshold algorithm lp_findgest, many decisions in the algorithm are arbitrary, but this arbitrariness is not always acknowledged. One specific decision is that, when the algorithm identifies "the closest velocity minimum", it needs to specify the range of that search. However, it is unclear what the appropriate range is, and it is not clear how to balance the two constraints, closest and minimum. The challenge of determining an appropriate range also arises when identifying the "minimal velocity point before PVEL" and "the velocity minimum following PVEL2". Another decision concerns the 20% of peak velocity (PVEL) between the minimal velocity point and the peak velocity point: it is not always clear whether this refers to the 20% point while the velocity is increasing or while it is decreasing. Also, the 20% criterion presupposes that we can find a velocity sample that is exactly 20% of the peak. A remaining question is what to do when no recorded sample is exactly at 20% of the peak velocity. One option is to interpolate the velocity contour from the existing data and locate the point that is exactly at 20%; alternatively, one could select a velocity point only from among the samples actually recorded. Additionally, there is a general question about whether to use the velocity of one axis, of multiple axes, or of some quantity derived from them. Some possible velocity calculations are: a) taking the maximum velocity among the three axes; b) taking the sum of the three absolute component velocities. Besides velocity, some EMA systems such as the NDI Vox output a 6D data frame: for each timestamp, the system records not only the x, y, and z positions but also four values representing a quaternion rotation. Algorithms could potentially include the quaternion rotation in their calculations as well. These are challenges for the field as a whole. The sketch below illustrates some of the velocity-calculation and interpolation options just discussed.
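The sketch assumes an n-by-3 matrix xyz of sensor positions sampled at fs Hz (hypothetical names, not tied to any particular EMA package); it computes a tangential velocity alongside the two combination options listed above, and shows a linear interpolation for locating an exact 20%-of-peak crossing between recorded frames.

# Per-axis speeds and three ways of combining them into a single velocity signal.
speed_options <- function(xyz, fs) {
  v <- abs(apply(xyz, 2, diff)) * fs        # component speeds for x, y, z
  list(tangential = sqrt(rowSums(v^2)),     # magnitude of the 3-D velocity vector
       max_axis   = apply(v, 1, max),       # option a): maximum across the three axes
       sum_abs    = rowSums(v))             # option b): sum of the absolute component velocities
}

# Locate the time at which the velocity first rises through crit (e.g., 20% of
# peak velocity), interpolating linearly between recorded frames instead of
# taking the nearest existing sample. v: velocity samples; t: their timestamps.
interp_crossing <- function(v, t, crit) {
  i <- which(v[-length(v)] < crit & v[-1] >= crit)[1]   # first upward crossing
  t[i] + (crit - v[i]) / (v[i + 1] - v[i]) * (t[i + 1] - t[i])
}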
While I have chosen to use the threshold technique in order to have results comparable to most others in the field, and because it has fewer issues than the comparative technique, as mentioned in section 1.8, I leave a more detailed comparison of possible techniques for future work.

CHAPTER 6 CONCLUSION

The dissertation found that there is a positive correlation between gestural lag and sonority difference for Mandarin and English CV syllables. There are a few questions that the current study aimed to address. The first question concerns gestural coordination in speech production: the current study found systematic variation in gestural coordination related to sonority. The second question concerns the articulatory correlate of sonority: the current study suggests that sonority may be a fundamental factor in speech production. The third question concerns the source of phonological constraints: the dissertation suggests that a few phonological universals could arise from a general human preference for larger gestural lags.

The current study probed the link between gestural timing and sonority. Based on corpus data and newly collected EMA data, I found a positive correlation between gestural lag and sonority difference in English and Mandarin CV syllables. This finding provides a basis for typological universals (the SSP, the SDP, and the CV-VC syllable frequency asymmetry) if one adds the premise that larger gestural lags are preferred. It can also account for the correlation among sonority, stress, and vowel height. A potential explanation of the finding is available if we take intensity to be the phonetic correlate of sonority: larger directional sonority differences between adjacent sounds require larger gestural lags to ensure perceptual recoverability. The current study provides empirical evidence suggesting that current speech production models need to be revised. In general, the dissertation provides evidence and claims that can help us form a more nuanced understanding of speech production.

BIBLIOGRAPHY

Albert, A. (2023). A model of sonority based on pitch intelligibility. BoD–Books on Demand.

Arnqvist, G. (2020). Mixed models offer no freedom from degrees of freedom. Trends in ecology & evolution, 35(4):329–335.

Aziz, J. (2024). The Phonetics and Phonology of So-Called Vowel Devoicing in Malagasy. PhD thesis, UCLA.

Bates, D., Mächler, M., Bolker, B., and Walker, S. (2014). Fitting linear mixed-effects models using lme4. arXiv preprint arXiv:1406.5823.

Benguerel, A.-P. and Cowan, H. A. (1974). Coarticulation of upper lip protrusion in french. Phonetica, 30(1):41–55.

Berent, I., Steriade, D., Lennertz, T., and Vaknin, V. (2007). What we know about what we have never heard: Evidence from perceptual illusions. Cognition, 104(3):591–630.

Blackwood Ximenes, A., Shaw, J. A., and Carignan, C. (2017). A comparison of acoustic and articulatory methods for analyzing vowel differences across dialects: Data from american and australian english. The Journal of the Acoustical Society of America, 142(1):363–377.

Bladon, R. A. W. and Al-Bamerni, A. (1976). Coarticulation resistance in english /l/. Journal of Phonetics, 4(2):137–150.

Blevins, J. (1995). The syllable in phonological theory. In Handbook of phonological theory, pages 206–244. Blackwell.

Bombien, L., Mooshammer, C., and Hoole, P. (2013).
Articulatory coordination in word-initial clusters of german. Journal of Phonetics, 41(6):546–561. Browman, C. P. and Goldstein, L. (1989). Articulatory gestures as phonological units. Phonology, 6(2):201–251. Browman, C. P. and Goldstein, L. (1992). Articulatory phonology: An overview. Phonetica, 49(3-4):155–180. Browman, C. P. and Goldstein, L. (2000). Competing constraints on intergestural coordination and self-organization of phonological structures. Les Cahiers de l’ICP. Bulletin de la communication parlée, (5):25–34. Browman, C. P., Goldstein, L., et al. (1990). Tiers in articulatory phonology, with some implications for casual speech. Papers in laboratory phonology I: Between the grammar and physics of speech, 1:341–397. 137 Byrd, D. (1996). Influences on articulatory timing in consonant sequences. Journal of phonetics, 24(2):209–244. Byrd, D. and Krivokapić, J. (2021). Cracking prosody in articulatory phonology. Annual Review of Linguistics, 7:31–53. Byrd, D. and Saltzman, E. (2003). The elastic phrase: Modeling the dynamics of boundary-adjacent lengthening. Journal of Phonetics, 31(2):149–180. Byrd, D. M. (1994). Articulatory timing in English consonant sequences. University of California, Los Angeles. Chao, Y.-R. (1930). A system of tone letters. Le maître phonétique, 30:24–30. Cheng, R., Jongman, A., and Sereno, J. A. (2023). Production and perception evidence of a merger:[l] and [n] in fuzhou min. Language and Speech, 66(3):533–563. Chitoran, I. (2016). Relating the sonority hierarchy to articulatory timing patterns: A cross- linguistic perspective. Challenging sonority: Cross-linguistic evidence, pages 45–62. Chitoran, I., Goldstein, L., and Byrd, D. (2002). Gestural overlap and recoverability: Articulatory evidence from georgian. Laboratory phonology, 7(4-1):419–447. Cho, T. (2006). Manifestation of prosodic structure in articulatory variation: Evidence from lip kinematics in english. Laboratory phonology, 8:519–548. Cho, Y.-m. Y. and King, T. H. (2003). Semisyllables and universal syllabification. The syllable in optimality theory, pages 183–212. Clements, G. N. (1990). The role of the sonority cycle in core syllabification. Papers in Laboratory Phonology: Volume 1, Between the Grammar and Physics of Speech, 1:283. Clements, G. N. (2005). Does sonority have a phonetic basis? comments on the chapter by In In Raimy, E. & Cairns, C.(Eds.), Contemporary Views on Architecture and vaux. 14 pp. Representations in Phonological Theory. Citeseer. Clements, G. N. (2009). Does sonority have a phonetic basis. Contemporary views on architecture and representations in phonology, 48:165. Collier, R., Bell-Berti, F., and Raphael, L. J. (1982). Some acoustic and physiological observations on diphthongs. Language and Speech, 25(4):305–323. Crouch, C. (2022). Postcards from the syllable edge: sonority and articulatory timing in complex onsets in Georgian. PhD thesis, UC Santa Barbara. 138 Crouch, C., Katsika, A., and Chitoran, I. (2023). Sonority sequencing and its relationship to articulatory timing in georgian. Journal of the International Phonetic Association, pages 1–24. Davis, S. and Shin, S.-H. (1999). The syllable contact constraint in korean: An optimality-theoretic analysis. Journal of East Asian Linguistics, 8(4):285–312. De Lacy, P. (2007). The interaction of tone, sonority, and prosodic structure. The Cambridge handbook of phonology, 281:281–308. Dell, F. and Elmedlaoui, M. (1985). Syllabic consonants and syllabification in imdlawn tashlhiyt berber. Denes, P. (1955). 
Effect of duration on the perception of voicing. The Journal of the Acoustical Society of America, 27(4):761–764. Du, S. and Gafos, A. I. (2023). Articulatory overlap as a function of stiffness in german, english and spanish word-initial stop-lateral clusters. Laboratory Phonology, 14(1). Durvasula, K. (2024). Lecture 8 handout of laboratory phonology. Unpublished lecture notes. Durvasula, K., Ruthan, M. Q., Heidenreich, S., and Lin, Y.-H. (2021). Probing syllable structure through acoustic measurements: case studies on american english and jazani arabic. Phonology, 38(2):173–202. Durvasula, K. and Wang, Y. (2023). Revisiting cv timing with a new technique to identify inter-gestural proportional timing. Proceedings of the 20th International Congress of Phonetic Sciences (ICPhS), pages 2284–2288. Fowler, C. A. and Saltzman, E. (1993). Coordination and coarticulation in speech production. Language and speech, 36(2-3):171–195. Gafos, A. I. (2002). A grammar of gestural coordination. Natural language & linguistic theory, 20(2):269–337. Gafos, A. I., Hoole, P., Roon, K., Zeroual, C., Fougeron, C., Kühnert, B., D’Imperio, M., and Vallée, N. (2010). Variation in overlap and phonological grammar in moroccan arabic clusters. Laboratory phonology, 10:657–698. Gao, M. (2008). Mandarin tones: An articulatory phonology account. PhD thesis, Yale University. Gelfer, C. E., Bell-Berti, F., and Harris, K. S. (1989). Determining the extent of coarticulation: Effects of experimental design. The Journal of the Acoustical Society of America, 86(6):2443– 2445. Gelman, A. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge 139 university press. Gibson, M., Sotiropoulou, S., Tobin, S., and Gafos, A. (2017). On some temporal properties of spanish consonant-liquid and consonant-rhotic clusters. Proceedings of the 13th Tagung Phonetik und Phonologie im deutschsprachigen Raum (PP13), pages 73–76. Gibson, M., Sotiropoulou, S., Tobin, S., and Gafos, A. I. (2019). Temporal aspects of word initial single consonants and consonants in clusters in spanish. Phonetica, 76(6):448–478. Goldstein, L. (2011). Back to the past tense in english. Representing language: Essays in honor of Judith Aissen, pages 69–88. Goldstein, L., Byrd, D., and Saltzman, E. (2006). The role of vocal tract gestural action units in understanding the evolution of phonology. Action to language via the mirror neuron system, pages 215–249. Goldstein, L., Chitoran, I., and Selkirk, E. (2007). Syllable structure as coupled oscillator modes: evidence from georgian vs. tashlhiyt berber. In Proceedings of the XVIth international congress of phonetic sciences, pages 241–244. Saarbrücken Univ. des Saarlandes Saarbrücken, Germany. Gordon, M., Ghushchyan, E., McDonnell, B., Rosenblum, D., and Shaw, P. A. (2012). Sonority and central vowels: A cross-linguistic phonetic study. The sonority controversy, pages 219–256. Gracco, V. L. (1994). Some organizational characteristics of speech movement control. Journal of Speech, Language, and Hearing Research, 37(1):4–27. Gracco, V. L. and Lofqvist, A. (1994). Speech motor coordination and control: evidence from lip, jaw, and laryngeal movements. Journal of Neuroscience, 14(11):6585–6597. Greenberg, J. H. (1965). Some generalizations concerning initial and final consonant sequences. Linguistics, 3(18):5–34. Gu, Y. (2023). Exploring the effect of stress on gestural coordination. Proceedings of the Linguistic Society of America, 8(1):5539. Hall, N. (2010). Articulatory phonology. 
Language and Linguistics Compass, 4(9):818–830. Hall, T. A. (2002). Against extrasyllabic consonants in german and english. Phonology, 19(1):33– 75. Hankamer, J. and Aissen, J. (1974). The sonority hierarchy. Papers from the parasession on natural phonology. Chicago: Chicago Linguistic Society, 11. Hardcastle, W. J. (1985). Some phonetic and syntactic constraints on lingual coarticulation during/kl/sequences. Speech Communication, 4(1-3):247–263. 140 Harrison, X. A. (2015). A comparison of observation-level random effect and beta-binomial models for modelling overdispersion in binomial data in ecology & evolution. PeerJ, 3:e1114. Harrison, X. A., Donaldson, L., Correa-Cano, M. E., Evans, J., Fisher, D. N., Goodwin, C. E., Robinson, B. S., Hodgson, D. J., and Inger, R. (2018). A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ, 6:e4794. Henke, E., Kaisse, E. M., and Wright, R. (2012). Is the sonority sequencing principle an epiphenomenon. The sonority controversy, 18:65–100. Hermes, A., Mücke, D., and Auris, B. (2017). The variability of syllable patterns in tashlhiyt berber and polish. Journal of Phonetics, 64:127–144. Hermes, A., Mücke, D., and Grice, M. (2013). Gestural coordination of italian word-initial clusters: the case of ‘impure s’. Phonology, 30(1):1–25. Honeybone, P. (2008). Lenition, weakening and consonantal strength: tracing concepts through the history of phonology. Lenition and fortition, pages 9–93. Hoole, P., Bombien, L., Kühnert, B., and Mooshammer, C. (2009). Intrinsic and prosodic effects on articulatory coordination in initial consonant clusters. Hoole, P., Mooshammer, C., and Tillmann, H. G. (1994). Kinematic analysis of vowel production in german. In Third international conference on spoken language processing. Hooper, J. B. and Bybee, J. L. (1976). An introduction to natural generative phonology. Academic Press. Hsieh, F.-Y. (2017). A gestural approach to the phonological representation of English diphthongs. PhD thesis, University of Southern California. Iskarous, K. and Pouplier, M. (2022). As time goes by: A critical appraisal of space and time in articulatory phonology in the 21st century. USC online articles. Iverson, G. K. and Salmons, J. C. (1995). Aspiration and laryngeal representation in germanic. Phonology, 12(3):369–396. Jespersen, O. (1904). Lehrbuch der phonetik (leipzig and berlin). G. Teubner. Johnson, K. and Song, Y. (2016). Gradient phonemic contrast in nanjing mandarin. UC Berkeley Phonlab Annual Report, 12(1). Kang, Y., van Oostendorp, M., Ewen, C. J., Hume, E., and Rice, K. (2011). The Blackwell companion to phonology. 141 Katseff, S., Houde, J., and Johnson, K. (2012). Partial compensation for altered auditory feedback: A tradeoff with somatosensory feedback? Language and speech, 55(2):295–308. Katsika, A. (2012). Coordination of prosodic gestures at boundaries in Greek. Yale University. Katsika, A. (2016). The role of prominence in determining the scope of boundary-related lengthening in greek. Journal of phonetics, 55:149–181. Kenstowicz, M. J. (1994). Phonology in generative grammar, volume 7. Blackwell Cambridge, MA. Kent, R. D. and Minifie, F. D. (1977). Coarticulation in recent speech production models. Journal of phonetics, 5(2):115–133. Kéry, M. and Royle, J. A. (2020). Applied hierarchical modeling in ecology: Analysis of distribution, abundance and species richness in R and BUGS: Volume 2: Dynamic and advanced models. Academic Press. Kreitman, R. (2010). 
Mixed voicing word-initial onset clusters. Laboratory phonology, 10(4):4. Krivokapić, J. (2020). Prosody in articulatory phonology. Prosodic theory and practice. Kröger, B. J., Schröder, G., and Opgen-Rhein, C. (1995). A gesture-based dynamic model describing articulatory movement data. The Journal of the Acoustical Society of America, 98(4):1878–1889. Kroos, C., Hoole, P., Kühnert, B., and Tillmann, H. G. (1996). Phonetic evidence for the phonological status of the tense-lax distinction in german. Journal of the Acoustical Society of America, 100:2691. Kuberski, S. R. and Gafos, A. I. (2023). How thresholding in segmentation affects the regression performance of the linear model. JASA Express Letters, 3(9). Kuznetsova, A., Brockhoff, P. B., and Christensen, R. H. (2017). lmertest package: tests in linear mixed effects models. Journal of statistical software, 82:1–26. Ladefoged, P. and Johnson, K. (2014). A course in phonetics. Cengage learning. Lehiste, I. (1970). Suprasegmentals, cambridge, massachusetts & london, uk. Liu, Z., Xu, Y., and Hsieh, F. (2020). Coarticulation as synchronised sequential target approximation: An ema study. In Proceedings of the Annual Conference of the International INTERSPEECH, volume 2020, pages 1381–1385. Speech Communication Association, International Speech Communication Association (ISCA). Liu, Z., Xu, Y., and Hsieh, F.-f. (2022). Coarticulation as synchronised cv co-onset–parallel 142 evidence from articulation and acoustics. Journal of Phonetics, 90:101116. Luo, S. (2017). Gestural overlap across word boundaries: Evidence from english and mandarin speakers. Canadian Journal of Linguistics/Revue canadienne de linguistique, 62(1):56–83. MacNeilage, P. F. and Davis, B. L. (2000). On the origin of internal structure of word forms. Science, 288(5465):527–531. Marin, S. and Goldstein, L. (2012). A gestural model of the temporal organization of vowel clusters in romanian. Consonant Clusters and Structural Complexity. Berlin/Boston, De Gruyter, pages 177–203. Mattingly, I. G. (1981). Phonetic representation and speech synthesis by rule. In Advances in Psychology, volume 7, pages 415–420. Elsevier. Mielke, J. (2008). The emergence of distinctive features. Oxford University Press. Mooshammer, C. and Fuchs, S. (2002). Stress distinction in german: simulating kinematic parameters of tongue-tip gestures. Journal of Phonetics, 30(3):337–355. Mooshammer, C., Geumann, A., Hoole, P., Alfonso, P., van Lieshout, P. H., and Fuchs, S. (2003). Coordination of lingual and mandibular gestures for different manners of articulation. In Proceedings of the 15th International Congress of Phonetic Sciences, August 3-9, 2003, Barcelona, Spain, number 1, pages 81–84. International Phonetic Association. Mücke, D., Hermes, A., and Tilsen, S. (2020). Incongruencies between phonological theory and phonetic measurement. Phonology, 37(1):133–170. Mücke, D., Nam, H., Hermes, A., and Goldstein, L. (2012). Coupling of tone and constriction gestures in pitch accents. Consonant clusters and structural complexity, 26:205. Murray, R. W. and Vennemann, T. (1983). Sound change and syllable structure in germanic phonology. Language, pages 514–528. Nam, H. (2007). Syllable-level intergestural timing model: Split-gesture dynamics focusing on positional asymmetry and moraic structure. Laboratory phonology, 9:483–506. Nam, H., Goldstein, L., and Saltzman, E. (2009). Self-organization of syllable structure: A coupled oscillator model. Approaches to phonological complexity, 16:299–328. Nam, H. 
and Saltzman, E. (2003). A competitive, coupled oscillator model of syllable structure. In Proceedings of the 15th international congress of phonetic sciences, volume 1. Ohala, J. J. (1990). Alternatives to the sonority heirarchy. In Papers from the 26th Regional Meeting of the Chicago Linguistics Society, volume 2. 143 Ohala, J. J. and Kawasaki, H. (1997). Alternatives to the sonority hierarchy for explaining segmental sequential constraints. Language and its ecology: Essays in memory of Einar Haugen, 100:343. Öhman, S. E. (1966). Coarticulation in vcv utterances: Spectrographic measurements. The Journal of the Acoustical Society of America, 39(1):151–168. Parker, S. (2002). Quantifying the sonority hierarchy. PhD thesis, University of Massachusetts at Amherst. Parker, S. (2008). Sound level protrusions as physical correlates of sonority. Journal of phonetics, 36(1):55–90. Parker, S. (2011). Sonority. The Blackwell companion to phonology, pages 1–25. Parker, S. (2012). The sonority controversy, volume 18. Walter de Gruyter. Pike, K. L. (1972). Phonetics: A critical analysis of phonetic theory and a technic for the practical description of sounds. Pons-Moll, C. (2008). The sonority scale: categorical or gradient. In Poster presented at the CUNY Conference on the Syllable. Pouplier, M. (2020). Articulatory phonology. In Oxford research encyclopedia of linguistics. Pouplier, M. and Beňuš, Š. (2011). On the phonetic status of syllabic consonants: Evidence from slovak. Laboratory phonology, 2(2):243–273. Pouplier, M., Pastätter, M., Hoole, P., Marin, S., Chitoran, I., Lentz, T. O., and Kochetov, A. (2022). Language and cluster-specific effects in the timing of onset consonant sequences in seven languages. Journal of Phonetics, 93:101153. R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Redford, M. A. (1999). The mandibular cycle and reversed-sonority onset clusters in russian. In Proceedings from the 14th International Congress of Phonetic Sciences, pages 1893–1896. Ross, M. (2002). Takia. the oceanic languages ed. by john lynch, malcolm ross & terry crowley, 216-248. Ross, M. (2003). Seminar on takia, a papuanised austronesian language of papua new guinea. Saltzman, E. L. and Munhall, K. G. (1989). A dynamical approach to gestural patterning in speech production. Ecological psychology, 1(4):333–382. 144 Selkirk, E. (1984). On the major class features and syllable theory. Language sound structure. Seo, M. (2011). Syllable contact. The Blackwell companion to phonology, pages 1–18. Shariatmadari, D. (2006). Sounds difficult? why phonological theory needs ‘ease of articulation’. School of Oriental and African Studies Working Papers in Linguistics, 14:207–226. Shaw, J. and Chen, W.-r. (2019). Spatially conditioned speech timing: evidence and implications. Frontiers in Psychology, 10:2726. Shaw, J., Gafos, A. I., Hoole, P., and Zeroual, C. (2009). Syllabification in moroccan arabic: evidence from patterns of temporal stability in articulation. Phonology, 26(1):187–215. Shaw, J., Kawahara, S., and Shaw, J. A. (2023). Limits on gestural reorganization following vowel deletion: The case of tokyo japanese. Laboratory Phonology, 14(1). Shaw, J., Oh, S., Durvasula, K., and Kochetov, A. (2021). Articulatory coordination distinguishes complex segments from segment sequences. Phonology, 38(3):437–477. Shaw, J. A., Gafos, A. I., Hoole, P., and Zeroual, C. (2011). 
Dynamic invariance in the phonetic expression of syllable structure: a case study of moroccan arabic consonant clusters. Phonology, 28(3):455–490. Shi, X. (2015). 成都话响音的鼻化度——兼论其/n, l/不分的实质及类型 [the nasality degree of sonorants in chengdu dialect]. 中国语音学报. Chinese Journal of Phonetics, 10:92–100. Sievers, E. (1881). Grundzüge der phonetik: Breitkopf und hartel. Sievers, E. (1901). Grundzüge der Phonetik: zur Einführung in das Studium der Lautlehre der indogermanischen Sprachen, volume 1. Breitkopf & Härtel. Smolensky, P. (1995). On the structure of the constraint component con of ug. Handout of the talk presented at UCLA. Solé, M.-J. (1992). Phonetic and phonological processes: The case of nasalization. Language and speech, 35(1-2):29–43. Steriade, D. (1982). Greek prosodies and the nature of syllabification. PhD thesis, Massachusetts Institute of Technology. Svensson Lundmark, M., Frid, J., Ambrazaitis, G., and Schötz, S. (2021). Word-initial consonant– vowel coordination in a lexical pitch-accent language. Phonetica, 78(5-6):515–569. Tabain, M., Breen, G., and Butcher, A. (2004). Vc vs. cv syllables: a comparison of aboriginal languages with english. Journal of the International Phonetic Association, 34(2):175–200. 145 Team, R. C. et al. (2013). R: A language and environment for statistical computing. Tiede, M. (2005). Mview: software for visualization and analysis of concurrently recorded movement data. New Haven, CT: Haskins Laboratories. Tilsen, S. (2020). Detecting anticipatory information in speech with signal chopping. Journal of Phonetics, 82:100996. Vaux, B. and Wolfe, A. (2009). 5 the appendix. Contemporary views on architecture and representations in phonology, 48:101. Westbury, J., Milenkovic, P., Weismer, G., and Kent, R. (1990). X-ray microbeam speech production database. The Journal of the Acoustical Society of America, 88(S1):S56–S56. Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., et al. (2019). Welcome to the tidyverse. Journal of open source software, 4(43):1686. Wright, R. (2004). A review of perceptual cues and cue robustness. Phonetically based phonology, 34:57. Wright, R. A. (1996). Consonant clusters and cue preservation in Tsou. University of California, Los Angeles. Xhaferaj, A. et al. (2022). The sonority dispersion principle in albanian. European Journal of Social Science Education and Research, 9(1):40–47. Xu, Y., Liu, F., et al. (2006). Tonal alignment, syllable structure and coarticulation: Toward an integrated model. Italian Journal of Linguistics, 18(1):125. Yanagawa, M. (2006). Articulatory timing in first and second language: a cross-linguistic study. Yale University. Yin, R., van de Weijer, J., and Round, E. R. (2023). Frequent violation of the sonority sequencing principle in hundreds of languages: how often and by which sequences? Linguistic Typology, 27(2):381–403. Zhang, M., Geissler, C., and Shaw, J. (2019). Gestural representations of tone in mandarin: evidence from timing alternations. In Proceedings of the 19th International Congress of Phonetic Sciences, Melbourne, Australia 2019, pages 1803–1807. Australasian Speech Science and Technology Association Inc Canberra, ACT. Zhang, W. and Levis, J. M. (2021). The southwestern mandarin/n/-/l/merger: effects on production in standard mandarin and english. Frontiers in Communication, 6:639390. 146 Zuraw, K. (2003). Vowel reduction in palauan reduplicants. 
In Proceedings of the 8th Annual Meeting of the Austronesian Formal Linguistics Association [AFLA 8], pages 385–398. 147 APPENDIX A ENGLISH RECRUITMENT EMAIL Subject: Interested in being part of a speech production study? Hello, We are conducting a speech production experiment on native speakers of English/Mandarin/Japanese/Spanish. The experiment will involve the participant producing sentences in their native language while they are connected to an Electromagnetic Articulograph machine that will track their speech articulations through sensors. From the project, we hope to learn how speakers coordinate articulatory events (gestures) during speech production. An Electromagnetic Articulograph allows researchers to measure the positions of parts of the mouth as they are moved during speech articulation. The machine is connected to sensors that will be placed both on the face/lips and on the tongue of the participant for the duration of the experiment. The sensors will be affixed in place with a standard dental adhesive. Note, long- term exposure to the electromagnetic fields of the Electromagnetic Articulograph machine has not been shown to be harmful to human health, but it is recommended to avoid subjects who are pregnant or who utilize pacemakers. Guidelines place the limit for safe continuous exposure between 100𝜇T and 200𝜇T. The interested participant can read more about the technology at this link: https://en.wikipedia.org/wiki/Electromagnetic_articulography. A prospective participant will be a healthy individual with no history of hearing or speech deficiencies who is at least 18 years old, is not pregnant and does not use a pacemaker. The experiment will last at most 2 hours and you will be paid $30 for your participation. If you are interested, fill out this pre-screening survey, and we will contact you if you meet our criteria. Sincerely, Yunting 148 APPENDIX B ENGLISH PRE-SCREENING SURVEY Google form title: Linguistics experiment (speech production) 1. A prospective participant will be a healthy individual with no history of hearing or speech deficiencies who is at least 18 years old, is not pregnant and does not use a pacemaker. • Yes, I have read the above text. • No, I haven’t read the above text. 2. Do you have a history of hearing or speech deficiencies? • Yes. • No. • Prefer not to answer. 3. Are you at least 18 years old? • Yes. • No. • Prefer not to answer. 4. What’s your preferred email address? (We will contact you if you pass the pre-screening.) 5. What is your first/primary language? (Multiple-choice question) • English • Mandarin • Spanish • Japanese • Other ___ 149 6. What state did you grow up in if you grew up in the US? (Answer NA if you did not grow up in the US.) 150 APPENDIX C MANDARIN RECRUITMENT MESSAGE 有兴趣参加语言学实验吗? 您好,我们正在进行一项关于英语/普通话/日语/西班牙语母语者的语言学实验。 在 该 实 验 中 , 参 与 者 将 用 他 们 的 母 语 说 出 一 些 句 子 , 同 时 他 们 将 与 一 台 电 磁 发 音 仪(Electromagnetic Articulography;EMA) 相连,该仪器可通过传感器跟踪他们的发音。 我们希望通过这个项目了解说话者在发音过程中如何协调发声。研究人员通过电磁 发音仪测量口部各部分在发音过程中的位置。该机器连接有传感器。整场实验中, 这 些 传 感 器 都 将 通 过 标 准 牙 科 粘 合 剂 固 定 在 参 与 者 的 脸 部/嘴 唇 和 舌 头 上 。 需 要 注 意的是,尽管持续暴露于电磁发音仪的电磁场对人类健康未显示出有害影响,我们 仍建议孕妇或使用心脏起搏器者避免参与实验。实验规范显示:连续暴露于磁场强 度100μT至200μT之间是安全的。受试者如有兴趣,可以通过此链接了解更多该技术的 相关信息:https://en.wikipedia.org/wiki/Electromagnetic_articulography。 符合要求的实验参与者应是健康的,没有听力或言语缺陷的记录,年龄至少为18岁, 不在孕期,并且不使用心脏起搏器。实验最长将进行2小时,您将获得30美元的参与费 用。 如果您感兴趣,请填写此表: https://forms.gle/NPSEwv9V9xj9N3V46。如果符合我们的 条件,我们将与您联系。 谢谢! 密歇根州立大学语音实验室 151 APPENDIX D MANDARIN PRE-SCREENING SURVEY 语言学实验调查 1. 
参与试验者必须是健康的,没有听力或言语缺陷的,18岁或18岁以上的,没有怀孕 的,不使用心脏起搏器的成年人。 • 是的,我已阅读以上信息。 • 不,我没阅读以上信息。 2. 您是18岁或18岁以上吗? • 是 • 否 • 无法提供信息 3. 您有听力或言语缺陷记录吗? • 有 • 没有 • 无法提供信息 4. 您的电子邮箱地址是?我们会在初筛结束之后联系您。 5. 您的母语是?(如有多种母语,请全部填写。) 6. 您会说哪些语言/方言? 7. 您是否在中国出生和长大? • 是 • 否 152 8. 您在哪个省份/城市出生?哪个省份/城市长大? 9. 您在中国生活过几年? 153 APPENDIX E ANNOTATION LABELS AND THEIR MEANINGS IN EXPERIMENT 2 Label "Questionable" "MultipleMeasure" "NoneDefault" Meaning I am unsure about the annotation. More than one gesture is measured for a sound. Example: TB and TD were measured. The non-default gesture is used. Example: TT was measured for vowel. "SensorUnavailable" The target sensor is unavailable. "Mispronounced" 1 2 3 4 5 Table E.1 Annotation labels and their meanings in experiment 2: the English EMA experiment. 154 APPENDIX F ANNOTATION LABELS AND THEIR MEANINGS IN EXPERIMENT 3 Meaning I am unsure about the annotation. Label "Questionable" "MultipleMeasure" More than one gesture is measured. The non-default gesture is used. "NoneDefault" "SensorUnavailable" The target sensor is unavailable. "UnclearPro" "NaLamispron" Unclear pronunciation. There is a [n, l] merger. Example: 腊 (la) is pronounced as 那 (na). 1 2 3 4 5 6 Table F.1 Annotation labels and their meanings in experiment 3: the Mandarin EMA experiment. 155 APPENDIX G ENGLISH EXPERIMENTS RESULTS WITH VOWEL DISPLACEMENT AS FIXED EFFECT (Intercept) Sonority difference Vowel displacement Estimate Std. Error 159.69 10.96 -0.45 23.19 1.70 0.70 df 34.58 23.88 1710.92 t value 6.89 6.46 -0.65 Pr(>|t|) <0.00001 <0.00001 0.52 Table G.1 Mixed effect model for all stimuli with sonority difference and vowel displacement as fixed effects. Random intercepts: stimuli, participants, and consonant duration. (Intercept) Sonority difference Vowel displacement Estimate Std. Error 111.58 14.07 4.05 26.61 1.92 0.99 df 21.28 10.57 1233.65 t value 4.19 7.34 4.11 Pr(>|t|) 0.0004 0.00002 0.00004 Table G.2 Mixed effect model for bilabial C stimuli with sonority difference and vowel displacement as fixed effects. Random intercepts: stimuli, participants, and consonant duration. (Intercept) Sonority difference Vowel displacement Estimate Std. Error 415.92 -8.29 -8.17 132.53 8.73 2.87 df 187.41 180.30 158.34 t value 3.14 -0.95 -2.85 Pr(>|t|) 0.00 0.34 0.005 Table G.3 Mixed effect model for peak, pack with sonority difference and vowel displacement as fixed effects. Random intercepts: participants and consonant duration. (Intercept) Sonority difference Vowel displacement Estimate Std. Error 173.50 7.34 5.29 108.30 8.70 2.90 df 171.41 161.65 171.37 t value 1.60 0.84 1.83 Pr(>|t|) 0.11 0.40 0.07 Table G.4 Mixed effect model for beak, back with sonority difference and vowel displacement as fixed effects. Random intercepts: participants and consonant duration. Estimate Std. Error (Intercept) Sonority difference Vowel displacement 72.02 13.98 4.75 80.61 8.620 2.86 df 260.30 255.34 229.61 t value 0.89 1.62 1.66 Pr(>|t|) 0.37 0.11 0.10 Table G.5 Mixed effect model for meek, Mac with sonority difference and vowel displacement as fixed effects. Random intercepts: participants and consonant duration. 156 (Intercept) Sonority difference Vowel displacement Estimate Std. Error -23.66 34.53 10.70 38.13 6.98 1.76 df 68.16 224.40 217.54 t value -0.62 4.95 6.09 Pr(>|t|) 0.54 0.000001 <0.00001 Table G.6 Mixed effect model for week, whack with sonority difference and vowel displacement as fixed effects. Random intercepts: participants and consonant duration. (Intercept) Sonority difference Vowel displacement Estimate Std. 
Error 183.60 6.74 -6.70 41.69 3.19 1.12 df 12.13 10.15 1177.57 t value 4.40 2.11 -6.01 Pr(>|t|) 0.001 0.06 <0.00001 Table G.7 Mixed effect model for coronal C stimuli with sonority difference and vowel displacement as fixed effects. Random intercepts: stimuli, participants, and consonant duration. (Intercept) Sonority difference Vowel displacement Estimate Std. Error -527.73 56.36 -1.93 240.84 16.79 3.30 df 264.38 264.45 205.60 t value -2.19 3.36 -0.59 Pr(>|t|) 0.03 0.001 0.56 Table G.8 Mixed effect model for two, toe with sonority difference and vowel displacement as fixed effects. Random intercepts: participants and consonant duration. (Intercept) Sonority difference Vowel displacement Estimate Std. Error -185.42 33.80 -17.18 205.99 16.78 4.02 df 258.89 256.49 256.98 t value -0.90 2.02 -4.28 Pr(>|t|) 0.37 0.05 0.00003 Table G.9 Mixed effect model for sue, so with sonority difference and vowel displacement as fixed effects. Random intercepts: participants and consonant duration. (Intercept) Sonority difference Vowel displacement Estimate Std. Error 321.94 -4.24 -9.05 198.28 17.47 3.20 df 205.77 202.99 197.04 t value 1.62 -0.24 -2.83 Pr(>|t|) 0.11 0.81 0.01 Table G.10 Mixed effect model for do, doe with sonority difference and vowel displacement as fixed effects. Random intercepts: participants and consonant duration. 157 (Intercept) Sonority difference Vowel displacement Estimate Std. Error -66.85 34.19 -8.22 132.19 15.63 2.30 df 261.62 257.73 251.65 t value -0.51 2.19 -3.57 Pr(>|t|) 0.61 0.03 0.0004 Table G.11 Mixed effect model for new, know with sonority difference and vowel displacement as fixed effects. Random intercepts: participants and consonant duration. 158 APPENDIX H ENGLISH EXPERIMENT RESULTS WITH CONSONANT DISPLACEMENT AS FIXED EFFECT (Intercept) Sonority difference C displacement Estimate Std. Error 137.07 12.22 -0.76 28.00 2.23 0.89 df 16.97 10.47 1434.78 t value 4.90 5.49 -0.86 Pr(>|t|) 0.0001 0.0002 0.39 Table H.1 Mixed effect model for bilabial C stimuli with sonority difference and consonant displacement as fixed effects. Random intercepts: stimuli, participants, and consonant duration. (Intercept) Sonority difference C displacement Estimate Std. Error 178.57 8.50 2.88 65.18 4.87 2.11 df 202.80 204.82 172.16 t value 2.74 1.75 1.37 Pr(>|t|) 0.01 0.08 0.17 Table H.2 Mixed effect model for peak, beak with sonority difference and consonant displacement as fixed effects. Random intercepts: participants and consonant duration. (Intercept) Sonority difference C displacement Estimate Std. Error 208.80 11.68 6.24 71.22 5.00 2.57 df 198.13 186.08 170.60 t value 2.93 2.33 2.43 Pr(>|t|) 0.003 0.02 0.02 Table H.3 Mixed effect model for pain, bane with sonority difference and consonant displacement as fixed effects. Random intercepts: participants and consonant duration. (Intercept) Sonority difference C displacement Estimate Std. Error 108.33 12.87 4.62 83.25 5.50 2.81 df 162.65 149.96 159.42 t value 1.30 2.34 1.65 Pr(>|t|) 0.20 0.02 0.10 Table H.4 Mixed effect model for pack, back with sonority difference and consonant displacement as fixed effects. Random intercepts: participants and consonant duration. (Intercept) Sonority difference C displacement Estimate Std. Error 181.35 8.33 -1.63 43.99 3.38 2.33 df 11.76 10.01 1431.03 t value 4.12 2.46 -0.70 Pr(>|t|) 0.001 0.03 0.48 Table H.5 Mixed effect model for coronal C stimuli with sonority difference and consonant displacement as fixed effects. Random intercepts: stimuli, participants, and consonant duration. 
159 (Intercept) Sonority difference C displacement Estimate Std. Error 318.42 -3.97 -15.75 72.36 5.65 8.02 df 248.90 244.01 251.99 t value 4.40 -0.70 -1.97 Pr(>|t|) 0.00002 0.48 0.05 Table H.6 Mixed effect model for two, do with sonority difference and consonant displacement as fixed effects. Random intercepts: participants. (Intercept) Sonority difference C displacement Estimate Std. Error 214.86 7.20 -13.63 56.22 3.99 6.30 df 265.08 271.23 278.76 t value 3.82 1.80 -2.16 Pr(>|t|) 0.0002 0.07 0.03 Table H.7 Mixed effect model for toe, doe with sonority difference and consonant displacement as fixed effects. Random intercepts: participants. (Intercept) Sonority difference C displacement Estimate Std. Error 295.35 -0.17 -1.12 57.40 3.73 4.95 df 234.82 274.04 277.74 t value 5.15 -0.05 -0.23 Pr(>|t|) <0.00001 0.96 0.82 Table H.8 Mixed effect model for talk, dock with sonority difference and consonant displacement as fixed effects. Random intercepts: participants. Estimate Std. Error (Intercept) Sonority difference C displacement 11.21 23.95 -13.15 52.02 5.29 6.58 df 251.54 249.89 259.00 t value 0.22 4.53 -2.00 Pr(>|t|) 0.83 <0.00001 0.05 Table H.9 Mixed effect model for do, new with sonority difference and consonant displacement as fixed effects. Random intercepts: participants. (Intercept) Sonority difference C displacement Estimate Std. Error 158.88 12.27 0.54 50.78 4.54 6.40 df 234.77 263.26 271.89 t value 3.13 2.70 0.09 Pr(>|t|) 0.002 0.01 0.93 Table H.10 Mixed effect model for doe, know with sonority difference and consonant displacement as fixed effects. Random intercepts: participants. 160 Estimate Std. Error (Intercept) Sonority difference C displacement 67.29 17.52 2.79 44.39 3.61 4.91 df 223.92 273.96 279.36 t value 1.52 4.86 0.57 Pr(>|t|) 0.13 <0.00001 0.57 Table H.11 Mixed effect model for dock, knock with sonority difference and consonant displacement as fixed effects. Random intercepts: participants. 161 MANDARIN RESULTS FOR PAIRWISE COMPARISON DIFFER IN C VOICING APPENDIX I Figure I.1 CV lag based on target onset for Mandarin participants: bi4, pi4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 54.04 10.57 40.86 3.15 df 268.60 270.81 t value Pr(>|t|) 1.32 3.35 0.19 0.001 Table I.1 Mixed effects model results for Mandarin participants: bi4, pi4. Figure I.2 CV lag based on target onset for Mandarin participants: ba4, pa4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 73.45 9.69 58.80 3.99 df 269.06 259.96 t value Pr(>|t|) 1.25 2.43 0.21 0.02 Table I.2 Mixed effects model results for Mandarin participants: ba4, pa4. 162 Figure I.3 CV lag based on target onset for Mandarin participants: ban4, pan4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 82.40 8.11 55.82 3.75 df 269.80 262.96 t value Pr(>|t|) 1.48 2.16 0.14 0.03 Table I.3 Mixed effects model results for Mandarin participants: ban4, pan4. Figure I.4 CV lag based on target onset for Mandarin participants: bei4, pei4. Gestural Onset (Intercept) Sonority difference Estimate Std. Error 165.91 6.95 36.18 2.90 df 189.85 282.52 t value Pr(>|t|) 0.00001 0.02 4.59 2.40 Table I.4 Mixed effects model results for Mandarin participants: bei4, pei4. 163