A SYSTEMATIC EVALUATION OF COMPUTATIONAL MODELS OF PHONOTACTICS By Isaac Sarver A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Linguistics – Master of Arts 2020 ABSTRACT A SYSTEMATIC EVALUATION OF COMPUTATIONAL MODELS OF PHONOTACTICS By Isaac Sarver In this thesis, recent computational models of phonotactics are discussed and evaluated and two new models are implemented. Prior phonotactic modeling, motivated by gradient acceptability judgments in nonce word judgment tasks (Albright 2009), claim that phonotac- tic grammaticality is gradient, and these models are evaluated by their ability to judge nonce words with scores that correlate with human acceptability judgments. Gorman (2013) argues that these gradient models do not account for the facts sufficiently and claims phonotactic grammaticality is categorical. In this thesis, the account of Gorman (2013) is implemented as well as a prominent gradient model from Hayes and Wilson (2008) and compared with the performance of two machine learning models (a support vector machine and a recurrent neural network), with all models trained on a corpus of English onsets. Results in this thesis show that the computational models are unable to correlate with human judgment data from Scholes (1966) as well as a categorical prediction of acceptability based on whether a sequence is attested in the lexicon or not, and that these models rely on assumptions which when challenged show that the models do not convincingly capture the gradience of the human judgment data used for evaluation. This thesis is dedicated to my siblings. iii ACKNOWLEDGEMENTS First and foremost I would like to thank the linguistics faculty for their support and guidance throughout my time at MSU. This includes Hannah Forsythe, who taught my Introduction to Language course in Spring 2015 and first introduced the field of linguistics to me, as well as Marcin Morzycki, who taught my second linguistics class I took and encouraged me to add a linguistics minor in my undergrad, and who also advised me throughout my semantics research my first year of grad school. I am hugely indebted as well to Suzanne Wagner, Yen-Hwei Lin, Alan Munn, and Cristina Schmitt for all that I’ve learned in the program during classes, colloquiums, and personal discussion. Thank you also to Kristen Johnson in the CSE department for NLP advice and instruction. And special thanks to my advisor, Karthik Durvasula, who persuaded me to apply to the MA program, taught four of my classes, and gave me support and advice throughout the whole process. Thank you also to my fellow students. I’m so grateful for the friendships and discussions of this time in classes, the grad office, and colloquium dinners. Thank you to those who listened to my ideas and presented their own in Awkward Time, the Phono group, and informal study sessions. And thank you as well to my advisors and colleagues in the Center for Language Teaching Advancement who have given me so many professional opportunities this year. Lastly, there is no way I would have made it through the last few years without my friends and family. 
Thank you to Megan Wixom and Daniela Diaz for listening to all of my practice presentations, thank you to Suzanna Feldkamp for keeping me accountable through our late-night writing sessions, and thank you to Abdullah Karaaslanlı and Thanaphong Phongpreecha for being daily listening ears for everything I had on my mind and for helping me through learning Python and machine learning. Thank you to my siblings for giving me the ability to laugh on any day no matter the circumstances, and to my parents who have supported me and fostered my curiosity about the world from the beginning. Thank you all.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
2 RELATED WORKS
  2.1 Origins of phonotactics
  2.2 Nature of constraints and output
  2.3 Dealing with Gradient Acceptability: Recent Models
  2.4 Frequency structure in the data
3 METHODS
  3.1 Phonotactic models for this study
  3.2 Data Preparation
  3.3 SVM model
  3.4 RNN model
  3.5 Maximum Entropy grammar
4 RESULTS
  4.1 Gross Phonotactic Violation
  4.2 SVM results
  4.3 RNN results
  4.4 Changing frequency structures in the data for RNN training
  4.5 Changing frequency structures in the data for MaxEnt training
5 DISCUSSION
  5.1 Model Comparisons
  5.2 Training data structure
  5.3 Experiment data used for evaluation
6 CONCLUSION
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Examples of the nonce words used in Scholes' experiment, along with number of positive responses.
Table 3.1: Onsets used for training
Table 3.2: Toy example of different frequency structures in training data.
Table 3.3: Maxent Grammar (note: '·' represents multiplication here)
Table 4.1: Type frequency model results on withheld test set
Table 4.2: Equalized Frequency Model results on withheld test set
Table 4.3: Type frequency model results predicting Scholes data
Table 4.4: Equalized frequency model results predicting Scholes data
Table 4.5: Correlations of the Phonotactic Learner scores with the Scholes data

LIST OF FIGURES

Figure 2.1: Mean ratings of subjects asked how representative the stimuli were of the categories "even" and "odd" respectively, where "1" indicates most representative (Gorman 2013; Armstrong, L. Gleitman, and H. Gleitman 1983).
Figure 3.1: Illustration of an SVM in 2 dimensions
Figure 3.2: Sketch of the input nodes, hidden layer, and output of a Recurrent Neural Network.
Figure 3.3: Graph representation of the sigmoid function, with x = input and y = output.
Figure 4.1: Distribution of normalized ratings in the Scholes experiment for attested and unattested onsets.
Figure 4.2: The accuracy of the SVM model's classifications of the test data based on the type of embedding.
Figure 4.3: SVM results on withheld test set
Figure 4.4: SVM predictions for Scholes data, with ground truth set to gross status
Figure 4.5: Loss values during RNN training, showing how wrong the model is for each iteration.
Figure 4.6: Confusion matrix showing the model's accuracy for guessing each class, and its error.
Figure 4.7: Scatter plot showing the model's % confidence that an onset is grammatical (value=1), compared to the percentage of "yes" responses in the Scholes experiment.
Figure 4.8: Equalized Frequency, Tuned
Figure 4.9: Equalized Frequency, Untuned
Figure 4.10: Type Frequency, Tuned
Figure 4.11: Type Frequency, Untuned

CHAPTER 1

INTRODUCTION

Phonotactic theory is concerned with understanding the knowledge that a speaker has about possible and impossible phonological sequences in their native language. This knowledge allows for speaker judgments of novel words: upon hearing a novel sequence of sounds, a speaker can determine whether the new word is an acceptable sequence that could reasonably be a member of their lexicon. For example, when speakers are presented with the non-English words [blIk] and [bnIk], they can easily assess [blIk] to be well-formed and [bnIk] to be ill-formed.1 Note, however, that the reason for their exclusion from the lexicon is categorically different. The former is acceptable, yet happens not to occur in the lexicon of most speakers; it is an accidental gap in the lexicon rather than a systematic one. The latter, on the other hand, violates some structural requirement, and is judged to be an impossible construction in the language (Halle 1962; Chomsky and Halle 1965).

In recent work, phonotactic judgments have been argued to be based on gradient knowledge (Albright and Hayes 2003; Albright 2009).
For example, [wIs] and [ploUmf] can both be judged as acceptable, but [wIs] is consistently judged as ‘more acceptable’ than [ploUmf]. This can be considered the more prominent view, often adopted in modern research on phonotactic knowledge. Generally, this view equates gradient acceptability with gradient grammaticality2. There has also been some recent work concerned with the potential of modeling phonotac- tic knowledge with machine learning and deep learning techniques (Mayer and Nelson 2019; Mirea and Bicknell 2019). This follows the current enthusiasm for the ways deep learning 1<...> denotes orthographic representations, [...] denotes surface representations, and /.../ denotes underlying representations. 2Although, as many have argued, gradient acceptability does not necessitate gradient grammaticality(Gorman 2013; Chomsky 1965; Schütze 2011) 1 and more powerful computational methods might provide windows into linguistic theory and knowledge (Pater 2019). Phonotactic knowledge is particularly well-suited to this approach, because any phonotactic model can be based on a small number of features, whether seg- mental or featural, and only need produce a simple output: in a categorical system, valid or invalid; in a gradient system, some probability of a sequence’s acceptability. In this thesis, I show that recent computational models in the phonotactics literature, as well as my own models, can learn phonotactic generalizations from corpus data equally well without any information in the data regarding the frequency of a sequence in the lexicon. This suggests that gradient phonotactic acceptability judgments are not the result of varying sequence frequencies in the lexicon; that lexical frequency ratios of phonological sequences are perhaps not relevant to modeling speakers’ phonotactic judgments. In a dissertation that looked specifically at different types of models that can account for acceptability judgments, Gorman (2013) argued that a categorical baseline can outperform gradient models such as an n-gram model or a maximum entropy model (Jurafsky and Martin 2009; Hayes and Wilson 2008). However, beyond Gorman’s analysis, which did not investigate the implications of different machine learning techniques on the issue at hand, none of the recent work has considered the nature of phonotactic knowledge and how modeling choices affect the conclusions which are derived from them. When designing a model of phonotactic knowledge, multiple modeling decisions must be made, some of which are discussed in recent literature. First, does phonotactic knowledge apply to features, or does it apply only at the segmental level? It is quite surprising given the success of features in accounting for phonological patterns that recent work has shown through both machine learning and traditional computational models that training on seg- ments provides better results and easier learning than training on features (Albright 2009; Mirea and Bicknell 2019). Second, what is the appropriate representational unit, or window size, for onsets to be grouped in as models are trained? Per Gorman’s 2013 model for monosyllabic nonce words, 2 a model of gross status treating each onset as a single unit suffices. 
[S T R] does not need to be recognized as a sequence of segments, but simply as a unit contained in a set of attested onsets.3 However, machine learning models can easily accommodate the difference in window size as a model parameter, and this representation needs to be decided.4 Third, is phonotactic grammaticality a categorical or gradient measure? Both of these decisions will greatly impact model decisions and performance. Many available learning models can take any number of features as their inputs and could be adapted to a featural or segmental view, but the flexibility of some models for output as a binary, categorical classification, or gradient, is not as clear. Additionally, how well model output is analogous to a true probability is also murky (Hayes and Wilson 2008). I will discuss this issue with each model presented. For my thesis, I do the following: a. I implement four models: i. Gross phonotactic violation, following Gorman (2013); ii. A support vector machine, or SVM; a classical machine learning model frequently used for binary classification tasks. iii. A maximum entropy model as employed by Hayes and Wilson (2008) and referred to by phonotactic literature as a gradient model; iv. A recurrent neural network, or RNN; a deep learning model quite flexible to different configurations, and can be used to classify using a threshold or argmax over normalized outputs. 3Gorman (2013) acknowledges that this does not capture all phonotactic patterns, but operationalizing this model to include other aspects of phonotactics immediately becomes quite difficult and has not been explored. 4As an example, the onset[S T R] would be represented in the following ways for window size n: n = 1: ((S),(T),(R)), n = 2: ((ST), (TR)), and n = 3: ((STR)). 3 b. I compare each model’s performance on a withheld test set to ensure that the model has made some generalizations about the data. c. I then test each model’s ability to categorize data from human judgment experiments collected by Scholes (1966), to see which approach is better to model human judgments. d. I train well-performing models on datasets which have had the frequency of types equalized, and those which preserve the type frequency from the original data. I train each model on a corpus of word onsets, controlling for three factors: First, previous literature has not been concerned with the appropriate context window size for phonotactic restrictions, so I train each model embedding the onsets by unigrams, bigrams, and trigrams. Second, I also train each model with both the full dataset, including all tokens taken from the dictionary, as well as the set of all onsets, removing any duplicates. This is to test if the models can learn which onsets are grammatical with only the set of attested and unattested onsets. Third, each model is trained only on segments, and not features. I do this following the work of Albright (2009) and Mirea and Bicknell (2019), whichs shows that phonotactic models trained on segments learn and perform better than models trained on features. I show that the maximum entropy model is able to correlate the most with human judgments and crucially that the RNN model as well as the maximum entropy model performance is not significantly affected when the phonotactic frequency differences are neutralized in the training data. 
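As a concrete illustration of these window sizes (the n = 1, 2, 3 representations described in footnote 4), the short Python sketch below shows one way an onset could be split into windows. It is illustrative only, and is not drawn from the implementation documented with this thesis.

    # Minimal sketch: embedding an onset at different window sizes (n = 1, 2, 3).
    def windows(onset, n):
        """Return the size-n windows of an onset given as a list of segments."""
        if n >= len(onset):
            return [tuple(onset)]
        return [tuple(onset[i:i + n]) for i in range(len(onset) - n + 1)]

    onset = ["S", "T", "R"]
    print(windows(onset, 1))  # [('S',), ('T',), ('R',)]
    print(windows(onset, 2))  # [('S', 'T'), ('T', 'R')]
    print(windows(onset, 3))  # [('S', 'T', 'R')]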
CHAPTER 2

RELATED WORKS

2.1 Origins of phonotactics

The study of phonotactics began with Halle (1962) and Chomsky and Halle (1965), where the aforementioned distinction between [blIk] and [bnIk] is introduced. This is a simple but powerful demonstration of phonotactic knowledge, which the authors define as restrictions on the underlying representation of words. These restrictions can be segment structure constraints, but also sequence structure constraints (Halle 1959). These are named morpheme structure constraints (MSCs) by the authors. Segment structure constraints contain the constraints dictated by the available underlying segments in a given language. Though [D, d] are both possible surface segments in Spanish, [D] only appears as an allophone of /d/, so a nonce word such as /Dano/ cannot appear as an underlying representation of a word in Spanish and violates the phonotactic system of the language. My thesis will not be concerned with this aspect of phonotactic knowledge, as it can be modeled quite simply in terms of set membership: given a set of available segments S, for each x in sequence s, if x ∈ S, then no segment structure constraint is violated.

Sequence structure constraints are not so simply defined. Chomsky and Halle (1965) prefer a featural, rule-based constraint structure on underlying representations. An example of an English sequence structure constraint (from Gorman (2013)) is seen in (1).

(1) [−cont] → [+liquid] / # [−cont] ___

This constraint prohibits a stop consonant after a word-initial stop, and so prohibits the underlying */bnIk/ but allows /blIk/. A set of these MSCs would then represent the language-specific phonotactic knowledge that is present in any given speaker.

If this knowledge is represented in such a way, it can be tested experimentally, which is the path Scholes (1966) took. Scholes (1966) used acceptability judgement tasks to probe these constraints and evaluate them. In his study (particularly, experiment 5 in his book), 33 seventh grade students were asked to judge whether each word in a list of nonce words was an acceptable English word or not. They were to give binary "yes" or "no" answers, and each word received the number of "yes" votes as its score. Examples of this data can be seen in Table 2.1. Nonce words are transcribed using the ARPAbet system1, along with the corresponding IPA.

Word (ARPAbet)   Word (IPA)   Number of "yes" responses
K R AH1 N        [kɹʌn]       33
F L ER1 K        [flɝk]       31
M R AH1 NG       [mɹʌŋ]       27
V R AH1 N        [vɹʌn]       19
N R AH1 N        [nɹʌn]        8
V P EY1 L        [vpeɪl]       0

Table 2.1: Examples of the nonce words used in Scholes' experiment, along with number of positive responses.

Scholes's data shows one problem of studying the categorical vs. gradient distinction. There is no way, given this experiment, to find gradience in the individual speaker judgments. Each speaker gave a categorical "yes" or "no" answer. Nevertheless, the proportion of "yes" to "no" answers can be split roughly evenly in some cases, where the subjects in the study disagree with each other. The word V R AH1 N [vɹʌn], for example, was considered acceptable by only about half of the subjects.

The data from the Scholes (1966) experiments raise questions about how to model phonotactics. The stimuli receive a range of responses that suggests even if a word is judged as unacceptable, some words might be more unacceptable than others.
Incidentally, Chomsky and Halle (1968) acknowledge this; along with [blIk] and [bnIk] as examples to represent acceptable and unacceptable, they introduce [bznk], which they claim should be less accept- 1See Section 3.2 for more discussion on this system. 6 able than [bnIk]. A theory that is simply a set of MSCs, designed to exclude */bnIk/ but allow /blIk/ does not express these levels of unacceptability. These facts about the acceptability of various nonce words has led to the theory that the (phonotactic) grammaticality is gradient, and further evidence for this comes from ex- periments showing a spread of different levels of acceptability for nonce word rating tasks (Albright 2009; Hayes and Wilson 2008; Shademan 2006). These experiments find that participants always give gradient responses when given a scale as a method of response. 2.2 Nature of constraints and output The central question raised by Gorman (2013) is the that of gradient vs. categorical knowledge. Under Gorman’s view, three questions need to be addressed by any theory of gradient grammaticality phonotactics: a. Do experiment participants have the ability to accurately perceive the grammaticality of a test item and map their acceptability judgment to that grammaticality? b. Do intermediate acceptability judgments constitute evidence for gradient grammati- cality? c. Is the gradient theory compared to categorical alternatives in a way that clearly shows the advantages of the gradient theory? Speakers have difficulty with perception of nonce words that have phonotactic violations (Dupoux et al. 2004; Kabak and Idsardi 2007), and sometimes repair them with illusory vowels dependent on native language phonology (Durvasula et al. 2018). If this is the case, speakers are giving an acceptability judgment on a perceived item that is different than the stimulus. A full theory of gradient phonotactics would necessitate accounting for phonotactic violation repairs systematically, so it is clear what the speaker is making their acceptability judgments based off of. Though Gorman (2013) points this out as a complication for gradient phonotactics, it should also be considered as a potential complication for a categorical theory 7 as well; a categorical judgment still relies on a perceived item that could exhibit differences from the intended stimulus. Secondly, intermediate acceptability judgments do not constitute direct evidence for gra- dient grammaticality. Participants asked to make a judgment task on numbers presented to them with a scale of how “representative” those numbers were of “even” and “odd” use the intermediate measures even though there is a clear, categorical distinction between the two categories (Figure 2.1). Figure 2.1: Mean ratings of subjects asked how representative the stimuli were of the categories “even” and “odd” respectively, where “1” indicates most representative (Gorman 2013; Armstrong, L. Gleitman, and H. Gleitman 1983). Though there are potential explanations for why participants believe 23 is more odd than 57, Armstrong, L. Gleitman, and H. Gleitman (1983) point out that subjects’ ability to compute with these numbers and explain the definition of odd and even numbers are the fundamental facts of the task, and a theory dedicated to explaining the gradient judgments will find it difficult to represent basic human knowledge about odd numbers. 
Gorman (2013) then suggests a parallel phonology example: orthodox beliefs about phonotactic acceptability dictate that acceptability is closely related to frequency; the more often a sequence appears 8 in a language, the more acceptable it is. For example, Pierrehumbert (1993) argues that there is a relationship between perceived well-formedness of a phoneme combination and its frequency in the language. However, if sequences [bl] and [kl] have different acceptability measures due to their frequency ([bl] appears roughly twice more than [kl]), how are they to be treated equally under independent phonological processes like syllabification, etc.? Lastly, gradient models should be able to demonstrate an advantage over a baseline categorical model and address the challenges raised above. As stated, current methods assume the correlation of gradient acceptability and frequency of patterns in the lexicon is a causal relationship, but fail to test their models to see if learning is affected by removing frequency information from the model. As mentioned above, two principled ways exist to represent phonological segments: as the discrete segments themselves or as vectors of phonological features. Hayes and Wilson (2008) use a featural system to train their model and do not compare model perforance with a segment-based model. This was addressed by the work of Albright (2009), and now Mirea and Bicknell (2019), who both show support for a segmental phonotactic system rather than a featural system. Because of this, I will choose to focus on the problems of gradience in the model structure and input, and train my own models using segmental representations. 2.3 Dealing with Gradient Acceptability: Recent Models Now I will turn to a discussion of prior work conducted on gradient phonotactics models (Albright 2009; Bailey and Hahn 2001; Hayes and Wilson 2008; Mayer and Nelson 2019). Albright (2007) and Albright and Hayes (2003) conducted experiments where participants responded to nonce words on a 7-point Likert scale, with some variation of “completely impossible as an English word” on one end of the scale, and “would make a fine English word” on the other end. These studies highlight the existence of gradient acceptability: [bwIk] is not as acceptable as [blIk], but is more acceptable than [bnIk]. Albright (2009) introduced a model of probabilistic phonotactics to account for these experimental results. 9 The model Albright (2009) used to account for the human acceptability judgments is an n- gram model, common in natural language processing for language modeling; it is a language model that predicts the most likely next segment given a preceding sequence of segments that can range in length (Jurafsky and Martin 2009). For example, a trigram model (tri- meaning the model predicts an upcoming segment based on the previous two segments) for the onset [spl], where # represents initial word boundary and N represents the syllable nucleus, would look like the following: (2) ˆp(spl) = p(s|##) · p(p|#s) · p(l|sp)· (N|pl) The probability of the whole onset is modeled as the product of the sequential probabilities of each segment given the two preceding segments. Another aspect that can be modeled as affecting the probability of a sequence is neigh- borhood density: what number of licit sequences exist that are one step away from the sequence in question, where a step can be either a single insertion, single deletion, or single substitution of a segment (Coltheart 1977)? 
Neighborhood density is a measure that is not directly phonotactic; but neighborhood density does have a high correlation with bigram probability (Bailey and Hahn 2001). Neighborhood density also has a clear and strong effect in longer nonce words with recognizable morphology, and even if these have strong violations of phonotactics (e.g. mrupation) they will receive English-like ratings (Hay, Pierrehumbert, and Beckman 2004). Hayes and Wilson (2008) created a phonotactic learner based on maximum entropy gram- mar which they claim can correlate well with the gradience in human judgments (Goldwater and Johnson 2003; Jaynes 1983). They train their phonotactic learner on onsets, and use a type frequency structure in the training data. They claim to replicate gradience of hu- man judgments by reporting correlation co-efficients with the results from Scholes (1966). As discussed, this makes an assumption of individual speaker judgment gradience based on the aggregate responses of the Scholes participants, when each participant gave a yes/no answer. Thus, it can be argued that Hayes and Wilson are really showing they can correlate 10 model predictions with the aggregate score of onsets in the pool of participants in Scholes’ experiment. While I think this is an important point of discussion, I do not explore this in my thesis, assuming that the Scholes data is reasonably representative of individual speaker judgments. Mayer and Nelson (2019) introduce a recurrent neural network language model (RNNLM) which performs better than the Hayes and Wilson phonotactic learner at correlating with judgment data. Their model is different from the RNN used in my work in that it is not trained and tested over onsets, but is rather trained on the whole words to learn probability distributions over the transitions between segments, and these probability distributions can be used to estimate the probability of a any string as a licit phonotactic sequence. RNNLMs are more recently utilized models that have come out of the natural language processing literature in the last two decades due to the rise in increased computing power needed to train the models (Mikolov et al. 2010). Mayer and Nelson (2019) train RNNLMs on various corpora of several languages to see if these models can learn phonotactic phenomena in those languages. They test model performance in predicting vowel harmony rules in Finnish, Cochabamba Quechua laryngeal co-occurrence restrictions, and sonority sequencing rules in English. They find that the neural nets perform better than the maximum entropy model of Hayes and Wilson in correlating with human judgments. For English, Mayer and Nelson (2019) evaluate their model and the phonotactic learner on experimental results from Daland et al. (2011). This experiment was meant to probe judgments regarding the phonotactics of sonority sequencing, but was not compared to any other commonly used phonotactic judgments dataset. 2.4 Frequency structure in the data A crucial part of probabilistic phonotactics is what frequency structure is important in learning. Models can be trained on either type or token frequency.2 Hayes and Wilson (2008) 2The terms type and token need to be explained further in terms of onset frequencies. One can imagine that token frequency is the frequency of onsets in the corpus of speech or 11 argue that type frequency is what is more accurate for training data (as opposed to token frequency), and that is what they use as training for their phonotactic learner. 
As I have also done, the authors have removed what they term exotic onsets from the CMU dictionary and train the learner on a nonexotic corpus of onsets along with their type frequencies. However, what if the frequency structure of types is not important at all? Again, many recent gradient models fail to adequately consider the assumptions made in abandoning a categorical model. I train the models I will be using on a corpus of onsets with type frequencies dictating their distribution in the training data, and then also train all of the models with the frequency of the onsets equalized across the training data. I do this to expand on the comments in Hayes and Wilson (2008) regarding the role of frequency structure in the training data to include approaches that still question the relationship between frequency structure and phonotactic acceptability. I will refer to data with type frequency (the frequency that is present in the dictionary) as type frequency data, because there is a cline of onsets, from those that are marginally represented to those that are present in the thousands. Data where this frequency structure has been removed I will refer to as equalized frequency data. It is important to distinguish that this categorical vs. gradient distinction in judgments is not the same as the comparison being made in the frequency structure of the data. This frequency structure comparison can instead be explained as a deterministic learning method vs. a probabilistic learning method. If the frequency structure is integral to proper acquisition of phonotactics, this should be reflected in the ability of the model to predict human judgment. Though the distinction between judgments and learning should be made, phonotactic knowledge built on the probability distribution of possible onsets suggests that the knowl- text, and this would lead to an high frequency for D, for example, since this is a common onset in highly occuring function words (e.g. the, that, these, those, this). Of course, this onset actually has very low occurences in the lexicon at large, and a count of occurences in the CMU dictionary would yield a count of types and not tokens, since what is being counted is the number of words in the lexicon that exhibit the onset and not the number of words in some naturalistic corpus. 12 edge must be gradient in nature, whereas phonotactic knowledge built on a set of attested onsets suggests that the knowledge is categorical in nature. This is what links the frequency structure of the data with the nature of phonotactic judgments. 13 CHAPTER 3 METHODS 3.1 Phonotactic models for this study For this study I will compare my own implementations of machine learning models and compare them to other computational models in the literature. They will be fitted with word onset data from the Carnegie Mellon University (CMU) Pronouncing Dictionary (Weide 1998). I use this data because the stimuli are designed to test the phonotactic acceptability of the nonce word onset only, outfitted with a set of rhymes that have only simplex codas and are all attested. Other available data (Albright and Hayes 2003; Albright 2009) commonly used for evaluations (Gorman 2013; Mirea and Bicknell 2019) do not follow this with their stimuli, using complex codas and much more variability in their experiments. Due to this, these experiments are not suitable for evaluation of models trained only on onsets, because participant responses are affected by the presence or lack of phonotactic violations in the rhyme. 
As is standard practice in machine learning, models will be evaluated based on a withheld test set of data removed from the dataset before training, and then used to predict the data of Scholes (1966) This practice has been absent from much of the phonotactics literature, with models trained and tested on the same data, or tuned to the test set of data in some way (Hayes and Wilson 2008; Goldwater and Johnson 2003). The reason this should be avoided is that it does not allow for sufficient generalization from the training. The goal is to produce a model that can adequately account for unseen data, but if the model is evaluated on the data it learned from, it will produce artificially accurate results. 14 3.2 Data Preparation The CMU Pronouncing Dictionary is an open-source pronunciation dictionary of North American English. It contains roughly 134,000 words and their pronunciations. Pronuncia- tions are transcribed using the Advanced Research Projects Agency phonetic transcription codes, commonly known as the ARPAbet system. The CMU dictionary uses 39 ARPAbet phonemes as well as primary and secondary stress markers to transcribe entries. For data preprocessing, the dictionary entries and their pronunciations were imported to Python (Python Software Foundation n.d.), using the Pandas library DataFrame object (McKinney 2010). The 160 unique onsets are then isolated for analysis, and padded to the length of the longest onset. This padding is used because the models require inputs of a fixed length, and the standard natural language processing procedure is to add some null character to pad any piece of data shorter than the longest item needed. In the case of onsets, the longest onsets in English have a length of three, so those can be represented as [S P L], whereas a simplex onset is represented with two added null characters as in [# # B]. Some of the onsets in the CMU dictionary occur only once or a few times. I examined these onsets and, based on my judgment as a native speaker, decided that they are not attested in my own lexicon, and thus they are not representative of the phonotactic knowledge I want to model, and should be left out for analysis. I removed any unique onset which occurs less than 35 times in the dictionary. To decide this cutoff point, I examined the onsets manually to see if the low-frequency onsets were indeed unacceptable for my judgment, and found that removing onsets with a frequency of less than 35 removed all the onsets I found unacceptable. Examples of such onsets are [# Z B] and [# HH M]. Hayes and Wilson (2008) employ this same method, whereas Gorman (2013) opts to reduce the onsets represented in the dictionary by eliminating each word that has a frequency less than one per one million words in the SubtlexUS corpus, a dictionary containing word frequencies in American English (New and Pallier 2009). 
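The preprocessing just described can be sketched in a few lines of Pandas. The sketch below is illustrative only: it assumes CMU-style entries given as (word, list of ARPAbet symbols) pairs, and the helper names (onset, prepare) are hypothetical; it is not the implementation archived with this thesis.

    # Illustrative preprocessing sketch: extract onsets, drop marginal ones, pad to length 3.
    import pandas as pd

    VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
              "IH", "IY", "OW", "OY", "UH", "UW"}

    def onset(pron):
        """Consonants before the first vowel of an ARPAbet pronunciation."""
        segs = []
        for s in pron:
            if s.rstrip("012") in VOWELS:   # strip stress digits before checking
                break
            segs.append(s)
        return tuple(segs)

    def prepare(entries, min_count=35, pad_to=3):
        df = pd.DataFrame({"onset": [onset(p) for _, p in entries]})
        counts = df["onset"].value_counts()
        kept = counts[counts >= min_count]                              # drop low-frequency onsets
        padded = [("#",) * (pad_to - len(o)) + o for o in kept.index]   # e.g. ('#', '#', 'B')
        return dict(zip(padded, kept.values))

    example = [("bliss", ["B", "L", "IH1", "S"]), ("street", ["S", "T", "R", "IY1", "T"])]
    print([onset(p) for _, p in example])     # [('B', 'L'), ('S', 'T', 'R')]
    print(prepare(example, min_count=1))      # toy call; the thesis uses min_count=35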
Negative data is generated by permuting all possible consonant combinations of lengths 1 to 3, and subtracting from these the set of unique onsets found in the CMU dictionary. This resulted in a set of 12,500 negative onsets. Each positive example is left at its frequency in the CMU dictionary, totalling 128,000 tokens of positive onsets.

Onset      Frequency    Onset    Frequency    Onset    Frequency
no onset     20,285     s          12,571     k          13,042
b             9,777     p           7,866     m           9,547
d             7,779     h           6,734     r           7,483
f             5,633     g           4,986     l           5,511
t             4,941     n           3,429     w           3,864
S             2,496     st          2,362     v           2,445
dZ            2,182     br          1,607     pr          1,796
kr            1,514     gr          1,285     j           1,383
tS            1,278     kl            999     tr          1,197
sk              984     sp            915     z             959
fr              914     T             651     bl            705
fl              651     dr            483     pl            528
kw              471     sl            408     str           460
gl              393     hw            367     sw            384
sm              249     kj            191     sn            244
Sr              164     skr           153     hj            163
bj              149     tw            123     mj            142
spr             121     fj            107     Tr            114
Sw              101     gw             86     Z              99
skw              85     Sl             81     pj             84
D                72     Sn             57     Sm             66
dw               44                           spl            40

Table 3.1: Onsets used for training

In order to generate the equalized data, the positive data is reduced to the set of all onsets represented in Table 3.1, and this set is multiplied to reflect the size of the data with frequency information. For example, consider a made-up language where there are 100 unique words and five unique onsets (sn, spl, r, m, j), and this data is used to train a model that requires at least 100 data points. The onsets could be structured in the training data proportionally to their appearance in the lexicon of this made-up language, or they could be equally represented at a number large enough to train the model, as in Table 3.2.

Onset         Gradient data count   Equalized data count
sn                    40                    20
spl                   25                    20
r                     15                    20
m                     12                    20
j                      8                    20
Total count          100                   100

Table 3.2: Toy example of different frequency structures in training data.

3.3 SVM model

SVMs are supervised learning models and are powerful, easy-to-use, discriminative binary classifiers. These facts make them well-suited for baseline classification and for categorical classification tasks. Standard SVM models do not provide a probability as an output, because they only categorize the data as being on one side of a dividing hyperplane, and distance from the hyperplane cannot correlate with probability of class membership. The SVM model takes each data point as an n-dimensional vector, and seeks to find the hyperplane in (n − 1)-dimensional space such that the distance between the hyperplane and the nearest data points of both classes, referred to as the support vectors, is maximized. An example of such a hyperplane is provided in Figure 3.1, where two classes are separated by a hyperplane (which in 2 dimensions is just a line). Supporting packages in the Scikit-Learn library in Python (Pedregosa et al. 2011) allow the onsets from the CMU dictionary to be vectorized and fed into the SVM. Each vector has length n, where n is the number of segments in the data, and the vector for a given onset has a value of 1 at each position corresponding to a segment in that onset.

How is this hyperplane determined from the training data? Suppose the training data is a set of pairs (x, y) such that x^(i) is the vector for the ith data point and y^(i) is the label for that data point (y ∈ {1, −1}), where 1 represents attested data and −1 the generated unattested data.
With this form, the classifier looks like the following:1

1This is in a sense more deterministic than logistic regression or neural nets, as will be discussed in the next section. This is because the SVM classifier directly predicts the class of the input, whereas in logistic regression, there is an intermediate step of estimating the probability of class membership before classification.

Figure 3.1: Illustration of an SVM in 2 dimensions

(3) h_{w,b}(x) = g(w^T x + b), where g(z) = 1 if z ≥ 0, and g(z) = 0 otherwise

In (3), the parameters (w, b) represent some hyperplane. Given the training pair (x^(i), y^(i)), the functional margin of (w, b) is as in (4).

(4) γ̂^(i) = y^(i)(w^T x^(i) + b)

This functional margin value represents the distance between the hyperplane and the data point, and should be large to reflect a confident prediction far away from the separation line. So in order to attain a large functional margin for (x^(i), y^(i)), if y^(i) is negative, (w^T x^(i) + b) should be a large negative number, and if y^(i) is positive, (w^T x^(i) + b) should be a large positive number. The functional margin should always be positive (if it is negative, the point is on the wrong side of the hyperplane), and this is reflected in the fact that if h_{w,b}(x^(i)) = y^(i), then y^(i)(w^T x^(i) + b) > 0.

Though this is for only one data point, this process can be expanded to an entire set of data, where the functional margin with respect to the set is the smallest of the functional margins for the individual data points. These data points are the support vectors.

(5) Given training set S = {(x^(i), y^(i)), i = 1, ..., m}, then γ̂ = min_{i=1,...,m} γ̂^(i)

Lastly, the margin needs to be maximized. A number of hyperplanes can be drawn that fail to maximize the margin between the hyperplane and the support vectors. However, the function in (4) can be made arbitrarily large, because the classifier in (3) cares only about the sign of (w^T x + b), not the magnitude.2 To prevent this, the functional margin can be divided by the Euclidean norm of the vector w, which normalizes the margin and ensures that the margin is not maximized artificially through the classifier parameters. Putting this all together,3 the maximization problem is as in (6):

(6) max_{γ̂,w,b} γ̂/‖w‖ subject to y^(i)(w^T x^(i) + b) ≥ γ̂, i = 1, ..., m

For my model, I am using the sklearn.svm.SVC class, with the kernel parameter set to "linear". I ran the SVM three times, testing the best unit to vectorize the data over. In order to train the SVM, the data must be embedded numerically, and this can be done by counting unique unigrams (the segments themselves), unique bigrams (pairs of segments), or trigrams (triads of segments); unigrams perform best by a slight amount.

2For example, g(w^T x + b) = g(2w^T x + 2b).
3This overview of SVM classifiers is greatly simplified and is only meant to provide an intuition of how the functions are conceptualized. In reality, the function in (6) is non-convex and quite difficult to solve, and more steps are necessary to implement this classifier from scratch.

Figure 3.2: Sketch of the input nodes, hidden layer, and output of a Recurrent Neural Network.

3.4 RNN model

The probabilistic model I am using is a Recurrent Neural Network (RNN), which is a simple neural net well-suited to sequential data (Elman 1990).4 The RNN is a network of nodes and edges, with layers of nodes, and edges between each layer.
Each node and each edge can have a corresponding number associated with it, which is added or multiplied to the input and sent to the next layer as its input. The input layer of size n can take an n-dimensional vector, which traverses through the layers to the output layer; the output is then fed back in as input for the next iteration. This allows each segment in the onset to be embedded as a number and fed into the network one at a time. Each iteration, the network makes a preliminary guess, and this guess is fed into the network again while it simultaneously analyzes the next segment. Thus the final output of the network takes into account the sequential information of the onset and does not treat it as a bag of segments. A simplified visualization of the network architecture is available in Figure 3.2.

I am using an input layer of size n, where n is the number of unique values in the data: for unigrams, this is the number of consonants; for bigrams, the number of unique bigrams; and so on. I am using a hidden layer size of 128. When the network is being trained, a loss function calculates how far off the model's guess is after each iteration, and an optimization function uses that information to traverse backwards through the network and update all the weights and biases accordingly. I am using the Cross-Entropy Loss function and Stochastic Gradient Descent for optimization. Training is done over 50,000 iterations, where for each iteration a random training pair is selected from the training data and fitted to the model. The model is implemented in the PyTorch framework (Paszke et al. 2017). The model has the same accuracy level regardless of whether the data is embedded in unigrams, bigrams, or trigrams, so the RNN model is easily able to classify these onsets regardless of the embedding strategy. Results discussed in my thesis are from the model iteration that uses the bigram embeddings. This model runs the fastest, as it has the fewest parameters and a relatively small input layer.

4Model implementations for the SVM and RNN are documented at https://osf.io/f76zn/ (Sarver 2020).

The output layer consists of two nodes, one representing a valid segment and one representing an invalid segment. The output values are plugged into a sigmoid function (seen in (7)), which places the values on a scale between 0 and 1. This function is displayed visually in Figure 3.3. An extremely high value will be placed at 1 on the scale and represents that the model has 100% confidence in that output; an input value of 0 is placed at 0.5 and represents 50% confidence; an extremely low value represents 0% confidence. The node with the higher value is selected as the model's prediction.

(7) σ(x) = 1 / (1 + e^(−x))

The neural net model has structural differences from the SVM that require extra care in determining how the data is fed to the model. While the SVM can receive the entire onset as an n-dimensional vector, the recurrent neural net model is sequential; each segment of the onset is fed into the model sequentially. This can of course also vary by window size: for the onset [str], the sequence can be fed in as single segments, segment pairs, or segment triplets. This gives the RNN more power to represent the dependencies between segments and the effects they have on acceptability.

Figure 3.3: Graph representation of the sigmoid function, with x = input and y = output.
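A minimal PyTorch sketch of a classifier along these lines is given below. It follows the description above (hidden size 128, two output nodes, Cross-Entropy Loss, stochastic gradient descent), but it is an illustration under those assumptions rather than the implementation archived at https://osf.io/f76zn/; in particular it uses the built-in nn.RNN module, and the input size and toy training pair are placeholders.

    # Illustrative RNN onset classifier: hidden size 128, two output classes,
    # Cross-Entropy Loss, SGD. Not the archived thesis implementation.
    import torch
    import torch.nn as nn

    class OnsetRNN(nn.Module):
        def __init__(self, n_inputs, n_hidden=128, n_classes=2):
            super().__init__()
            self.rnn = nn.RNN(n_inputs, n_hidden, batch_first=True)
            self.out = nn.Linear(n_hidden, n_classes)

        def forward(self, x):              # x: (batch, sequence length, n_inputs)
            _, h = self.rnn(x)             # final hidden state summarizes the sequence
            return self.out(h[-1])         # scores for "valid" vs. "invalid"

    model = OnsetRNN(n_inputs=50)          # e.g. one-hot segments or bigrams
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.005)

    x = torch.zeros(1, 3, 50)              # one padded onset of length 3 (toy input)
    y = torch.tensor([1])                  # 1 = attested, 0 = unattested
    for _ in range(10):                    # training loop (50,000 iterations in the thesis)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()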
3.5 Maximum Entropy grammar

A maximum entropy model, notably used by Hayes and Wilson (2008) to build a phonotactic grammar, is also suited to providing a probabilistic output. This model has also been used in phonology as a method of learning OT constraints (Goldwater and Johnson 2003). A maximum entropy model can express the probability of a member x of the set of possible forms Ω. Given a set of observed data, the learning algorithm generates a model of constraints and weights each constraint such that the probability of the observed data is maximized. The resulting model will look like the toy example provided by Hayes and Wilson (2008) in Table 3.3. Assume that the grammar has two constraints, *#V and *C#, that have the weights 3 and 2 respectively. This grammar would assign different maxent values to the lexical items CV, CVC, and V. CV does not trigger any violation, so its rating is highest, followed by CVC, and then V.

x      *#V (w = 3)   *C# (w = 2)   Score (h(x))             Maxent value (P*(x))
CV     3 · 0         2 · 0         (3 · 0) + (2 · 0) = 0    exp(−0) = 1
CVC    3 · 0         2 · 1         (3 · 0) + (2 · 1) = 2    exp(−2) ≅ 0.14
V      3 · 1         2 · 0         (3 · 1) + (2 · 0) = 3    exp(−3) ≅ 0.05

Table 3.3: Maxent Grammar (note: '·' represents multiplication here)

Assume briefly that the constraints and weights are optimized for the observed data. The model sums the product of constraint violations and weights to obtain a score, which is expressed as in (8):

(8) h(x) = Σ_{i=1}^{N} w_i C_i(x)

Here, w_i represents the weight of the ith constraint, C_i represents the number of violations of that constraint, and Σ_{i=1}^{N} represents the sum over all N constraints. The maxent value P*(x) is calculated with (9). This is not a probability, but rather demonstrates the relative probability of the input.

(9) P*(x) = e^(−h(x))

The probability is then calculated from the maxent value as in (10):

(10) P(x) = P*(x)/Z, where Z = Σ_{y∈Ω} P*(y)

The output used in phonotactic learning is not the probability of the input itself, which due to the large number of possible forms contained in Ω is impractical to report. Rather, the maxent value, meant to show the relative probability between the forms, is given.

How are the constraints and weights for the model determined? The model name refers to its function of maximizing the entropy, a measure of randomness in the system,5 which S. Della Pietra, V. Della Pietra, and Lafferty (1997) show is equivalent to maximizing the probability of the observed forms (see (11)).

(11) P(D) = Π_{x∈D} P(x), where D = the set of observed data

5Entropy is defined as −Σ_{x∈Ω} P(x) log(P(x)); Cover and Thomas (1991).

This probability is maximized by an iterative search algorithm similar to the stochastic gradient descent described in the discussion of neural nets. The N constraint weights and the total probability together create a surface in (N + 1)-dimensional space, and though the surface is never calculated as a whole, at each stage the local gradient is determined and the search iterates upwards (in the direction of higher total probability of observed forms) until a maximum is reached. Unlike neural nets, this surface is always convex, with only one global maximum for the search to find (S. Della Pietra, V. Della Pietra, and Lafferty 1997). The specific algorithm used by Hayes and Wilson (2008) in their phonotactic learner is the conjugate gradient method (Vetterling et al. 1992).
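To make the scoring side of the model concrete before returning to how the weights are found, the following is a worked Python sketch of the toy grammar in Table 3.3 and of equations (8)–(10). The constraints, weights, and candidate set are the toy values from the table, not learned ones, and the actual learner additionally searches for constraints and weights.

    # Worked sketch of the maxent scoring in Table 3.3: two toy constraints with
    # weights 3 and 2, evaluated over a tiny stand-in candidate set. Illustrative only.
    from math import exp

    weights = {"*#V": 3.0, "*C#": 2.0}

    def violations(form):
        return {"*#V": int(form[0] == "V"),      # form begins with a vowel
                "*C#": int(form[-1] == "C")}     # form ends with a consonant

    def h(form):                                 # equation (8): weighted violation sum
        v = violations(form)
        return sum(weights[c] * v[c] for c in weights)

    def maxent_value(form):                      # equation (9): P*(x) = exp(-h(x))
        return exp(-h(form))

    omega = ["CV", "CVC", "V"]                   # stand-in for the full candidate set
    Z = sum(maxent_value(f) for f in omega)      # equation (10): normalizing constant
    for f in omega:
        print(f, h(f), round(maxent_value(f), 2), round(maxent_value(f) / Z, 2))
    # CV has score 0 and maxent value 1, CVC has 2 and about 0.14, V has 3 and about 0.05,
    # matching Table 3.3.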
Hayes and Wilson (2008) set up their learner to maximize log(P(D)) for mathematical convenience, since the log function is monotonic and adjusting the weights to maximize log(P(D)) will necessarily maximize P(D). The partial derivative ∂/∂w_i log(P(D)) expresses the rate at which log(P(D)) will change in relation to the weight w_i, and the gradient is a vector of these partial derivatives. According to S. Della Pietra, V. Della Pietra, and Lafferty (1997), ∂/∂w_i log(P(D)) is additionally interpretable as the difference between observed violations of constraint C_i and expected violations of that constraint, formally O[C_i] − E[C_i].

Calculating E[C_i] necessitates a limit on the length of forms in Ω (all possible forms for our model; in my case, all logically possible onsets), otherwise the set is infinite. In accordance with the other models used in this thesis, I limit all forms in Ω to a length of three segments or less. Now that Ω is a finite set, E[C_i] is expressed in (12):

(12) E[C_i] = Σ_{x∈Ω} P(x)C_i(x), where P(x) = the probability of x and C_i(x) = the number of C_i violations by x

Only one more piece is needed before presenting the full learning algorithm, which is a measure of accuracy for the constraints. This accuracy measure compares observed constraint violations (O[C_i]) with expected constraint violations (E[C_i]) as the ratio O/E. Hayes and Wilson (2008) also implement a statistical upper confidence limit on O/E to reflect a difference in accuracy between an O/E that equals 0/10 and one that equals 0/1000. Effectively, this means that instead of both accuracies being zero, 0/10 receives an accuracy score of 0.22 and 0/1000 receives 0.002. This is because if there are only 10 logically possible violations, a low number of observed violations does not imply as strong a constraint as it would if there were 1000 logically possible violations. With these pieces, the learning algorithm (Hayes and Wilson 2008) is constructed as follows:

(13) Phonotactic Learning Algorithm

Input: a set Σ of segments classified by a set F of features, a set D of surface forms drawn from Σ*, an ascending set A of accuracy levels, and a maximum constraint size N

 1: procedure PhonotacticLearner(A, D, F, N, Σ)
 2:   Initialize empty grammar G
 3:   for each accuracy level a in A do
 4:     while there exists a constraint with accuracy < a do   ▷ Constraints by D, F, N, Σ
 5:       select the most general constraint and add it to G
 6:       train the weights of the constraints in G   ▷ Gradient ascent; in the form of Table 3.3
 7:     end while
 8:   end for
 9:   return G
10: end procedure

CHAPTER 4

RESULTS

Two main metrics are used to report the results of the models. The RNN and SVM are both classifiers, but with slight distinctions: the SVM assigns a direct binary label predicting whether the input is attested or unattested, whereas the RNN assigns a score between 0 and 1 and then assigns a binary label based on whether that score is above or below the threshold value of 0.5. For these classifiers, one metric is the accuracy of classification. Accuracy is the ratio of onsets whose classification correctly matches whether that onset is truly attested or not (gross status). Accuracy is collapsed across classes, so for unbalanced data sets, if one class has much more data than the other and that class is predicted correctly, the accuracy can be very high even if the smaller class is poorly predicted.
For this reason, I will break down the results into confusion matrices which display the results by class and will be explained further. Ultimately however, the human acceptability judgments are not predictable by classi- fication since they are gradient. To compare model output to these judgments, I will use Pearson’s r correlation coefficient, which is a measure of linear correlation between the model output and the judgment data. A correlation of 1 means that the model perfectly predicts each score (in this case plotting the results would result in all the points falling along a straight line). A correlation value of 0 means that there is no relationship between the model predictions and the human judgment data. 4.1 Gross Phonotactic Violation The simplest way to account for phonotactic acceptability is that onsets that are attested in the lexicon are judged highly and those that are unattested receive a lower acceptability judgment. With respect to the modeling of onsets and Scholes (1966) data for this study, attested is defined as present in the set of onsets described in (3.1). The distribution of 26 ratings for attested and unattested onsets is shown in Figure 4.1. Both the attested and unattested classes show a concentrated distribution around the high and low Scholes ratings, respectively, with thin tails representing outliers. Onsets that were rated as acceptable by a large number of Scholes participants that were unattested were mr and Sl, and onsets that were rated as acceptable by a low number of partipants that were attested were Sm and sf. The correlation value (Pearson’s r) for gross status and the Scholes data is 0.803. Figure 4.1: Distribution of normalized ratings in the Scholes experiment for attested and unattested onsets. 4.2 SVM results Each model performs above 90% accuracy regardless of how the data is embedded, but a window size of one (e.g. an onset spl is represented as ((S), (P), (L))) does lead to the best performance by a slight amount. See Figure 4.2 for comparisons. Though the SVM does present a good model for separating categorical data, it is not well-suited to this task for a few reasons. Consider figures 4.3 and 4.4. They show a matrix of probabilities where the upper left quadrant is the probability that the model is given a negative onset and it guesses correctly, and the lower right quadrant is the probability that the model is given a positive onset and it guesses correctly. The lower left quadrant shows the probability of false negatives, where a positive data point is falsely classified as negative 27 Figure 4.2: The accuracy of the SVM model’s classifications of the test data based on the type of embedding. by the model; and the upper right quadrant shows the probability of false positives, where a negative data point is falsely classified as positive by the model. The imbalance of negative and positive examples in the training data skew the model towards modeling of the negative examples. Looking at high rate of false positives, the model significantly under-performs in predicting positive examples. The fact that the training data is skewed towards the negative data likely plays a role in this. The high rate of false positives also holds in the predictions for the Scholes data when evaluated against the gross status of the stimuli. When correlated with the ratings of the participants, the Pearson’s r is 0.328. 
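For reference, metrics of this kind can be computed with standard library calls. The sketch below, using toy stand-in data rather than the thesis datasets, shows the general shape of the evaluation: fit a linear SVM, compute a normalized confusion matrix and accuracy, and correlate predictions with normalized acceptability ratings. All data values here are invented for illustration.

    # Illustrative evaluation sketch with toy data; not the thesis code or results.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics import confusion_matrix, accuracy_score
    from scipy.stats import pearsonr

    # Toy stand-ins for vectorized onsets (rows = onsets, columns = segment indicators).
    X_train = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 1, 1]])
    y_train = np.array([1, 1, 1, -1, -1, -1])
    X_test, y_test = X_train, y_train          # in the thesis these are disjoint sets

    svm = SVC(kernel="linear")
    svm.fit(X_train, y_train)

    test_pred = svm.predict(X_test)
    print(confusion_matrix(y_test, test_pred, normalize="true"))
    print(accuracy_score(y_test, test_pred))

    # Correlating binary predictions with (toy) normalized acceptability ratings.
    ratings = np.array([1.0, 0.94, 0.82, 0.24, 0.06, 0.0])
    r, _ = pearsonr(test_pred, ratings)
    print(r)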
Though the SVM does not get a good result when compared against the acceptability data, it is important to note that this does not discount a categorical model, given the performance of the gross phonotactic violation model (r = 0.803). Perhaps this rather has to do with the unbalanced data and the nature of SVM training, which will be discussed below.

Figure 4.3: SVM results on withheld test set

Figure 4.4: SVM predictions for Scholes data, with ground truth set to gross status

4.3 RNN results

The RNN performs quite well on the classification task, with accuracy above 92% for all iterations. The RNN outperforms the SVM and is also more robust to different n-gram embeddings, but the differences are minimal. The RNN learns very quickly, within the first 10,000 iterations. In Figure 4.5, the loss value is plotted over the iterations of the model. For each iteration, a loss value is calculated which measures the distance of the model's confidence in the output from the desired output. The higher the value, the farther the model is from the desired predictions.

Figure 4.5: Loss values during RNN training, showing how wrong the model is at each iteration.

Figure 4.6 shows a matrix of probabilities like the one discussed above, where the upper left quadrant is the probability of true positives and the lower right quadrant is the probability of true negatives.

Figure 4.6: Confusion matrix showing the model's accuracy for guessing each class, and its error.

When the Scholes Experiment 5 data is compared against the model predictions, the neural net categorizes it with 81.4% accuracy (48 out of the 59 tokens). The RNN's confidence that an onset is grammatical is part of its output and can be compared to the percentage of subjects who judge an onset as grammatical. This comparison is shown in Figure 4.7 (note that some data points overlap in the figure). The figure shows the onsets plotted by the model's percent confidence that the onset is grammatical (coded as having a value of 1) or ungrammatical (coded as having a value of 0), together with the corresponding normalized judgments from the Scholes (1966) experiment. If the normalized judgment is 1, all 33 participants selected "yes" when asked if an onset was grammatical, and if the normalized judgment is 0, none of the 33 participants selected "yes." The onsets are also color-coded for whether they are actually attested onsets of English or not. The plot shows where the model is misclassifying onsets, and also shows that the model output falls strictly at the ends of the scale.

Figure 4.7: Scatter plot showing the model's percent confidence that an onset is grammatical (value = 1), compared to the percentage of "yes" responses in the Scholes experiment.

4.4 Changing frequency structures in the data for RNN training

When the RNN is trained on the type frequency and the equalized frequency datasets, the accuracy remains similar in both cases. (Recall that the training data can represent the onsets using the frequencies with which they appear in the CMU dictionary, the gradient or type frequency data, or represent each onset an equal number of times, the equalized data.) For both withheld test sets, the precision is higher than the recall. Both models perform better at identifying positive cases than negative cases on the test data. Ultimately, the model trained on the equalized frequency dataset lags in accuracy at 91%, while the type frequency model accurately classifies 94%. However, for the Scholes results, the models differ more.
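For reference, the precision and recall referred to in the following comparison (and reported alongside Tables 4.1–4.4) can be computed from raw confusion-matrix counts as in this minimal sketch; the counts below are invented for illustration and are not the actual test-set counts from this thesis.

    import numpy as np

    # Hypothetical counts: rows are the true class (unattested, attested),
    # columns are the predicted class.
    cm = np.array([[990, 10],
                   [ 11, 89]])

    tn, fp = cm[0]
    fn, tp = cm[1]

    accuracy  = (tp + tn) / cm.sum()   # proportion correct, collapsed over classes
    precision = tp / (tp + fp)         # of the onsets predicted attested, how many truly are
    recall    = tp / (tp + fn)         # of the truly attested onsets, how many were found

    print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")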
The model with gradient (type frequency) training has a higher precision than recall, but the model with equalized training has perfect recall and lower precision, which leads to a higher overall accuracy in predicting the Scholes results. In terms of correlation, on the other hand, the correlation value comparing model predictions to the participant responses is significantly higher for the type frequency model. Ultimately, this is the more important measure, as it reflects the model's prediction of the human responses.

Table 4.1: Type frequency model results on withheld test set
Confusion matrix:
0.99  0.01
0.11  0.89
Total accuracy: 94%

Table 4.2: Equalized frequency model results on withheld test set
Confusion matrix:
0.98  0.02
0.16  0.84
Total accuracy: 91%

Table 4.3: Type frequency model results predicting Scholes data
Confusion matrix:
0.83  0.17
0.28  0.72
Total accuracy: 78%
Pearson's r: 0.635

Table 4.4: Equalized frequency model results predicting Scholes data
Confusion matrix:
0.63  0.37
0.00  1.00
Total accuracy: 82%
Pearson's r: 0.458

4.5 Changing frequency structures in the data for MaxEnt training

As discussed before, the RNN's output exhibits little gradience. The model depends on thousands of iterations of training that reward confident predictions. It is worth comparing it to the maximum entropy phonotactic learner (Hayes and Wilson 2008), which can provide a more gradient output. It is important to note that Hayes and Wilson use the following method to make their model output proportional to the normalized Scholes judgments: first, a maxent value is computed from the model's score of a test item, as in (14), and then a free parameter T, which is tuned to the test data to maximize the correlation values, is incorporated as in (15).

(14) P*(x) = e^(−x)

(15) predicted-rating(x) = P*(x)^(1/T)

The tuning parameter is a value that can be freely changed to morph the predictions into values with the highest correlation to the evaluation data, because the parameter can warp and expand the existing differences between the data points to a larger scale or desired shape. With this parameter, the model can achieve a high correlation regardless of the structure of the training data, because it can warp the predictions so that judgments are more evenly spread through the intermediate range (see Figures 4.8, 4.9, 4.10, and 4.11). When the parameter is removed, the correlation falls slightly, but both models still perform relatively well. However, it is important to note that using such a parameter goes against standard machine learning practice: because the tuning parameter is optimized to whatever data is being used to evaluate the model, it will not generalize well to predicting new data. If the goal is to define a general model of phonotactics that can correlate well with new human judgments, the model without the tuning parameter must be used. In Table 4.5, the correlations between the MaxEnt model results and the Scholes data show how the correlation drops when the tuning parameter is removed, and how the model predictions are extremely similar regardless of the training data used.
Table 4.5: Correlations of the Phonotactic Learner scores with the Scholes data

                           Score with tuning    Score without tuning
Equalized frequency data   0.861 (T = 10.05)    0.761
Type frequency data        0.880 (T = 4.90)     0.769

Figure 4.8: Equalized Frequency, Tuned
Figure 4.9: Equalized Frequency, Untuned
Figure 4.10: Type Frequency, Tuned
Figure 4.11: Type Frequency, Untuned

CHAPTER 5

DISCUSSION

5.1 Model Comparisons

Both the existing model of Hayes and Wilson (2008) and the baseline case of gross phonotactic violation perform better at correlating with the experimental data of Scholes (1966) than either the SVM or the RNN. However, it is insightful to note that with the removal of the tuning parameter, the plotted predictions of the phonotactic learner and the RNN are strikingly similar (see Figures 4.9, 4.11, and 4.7). Specifically, these plots show a distribution of predicted data that falls into two groups at the top and the bottom of the plot, without any intermediate data in between. Though in theory the models can output any value between 0 and 1, the model output is strikingly binary. In the RNN's case, the model is trained to make binary judgments and is penalized for intermediate judgments. For the phonotactic learner, however, there is no built-in operation that trains the model in this direction.

None of the models presented in this paper, without tuning, exceeds the correlation value achieved by correlating gross phonotactic violation with the Scholes data. This is an important comparison to make: though a certain amount of manipulation might produce a gradient model that also achieves a high correlation value, the degree to which that model is evidence for underlying gradient grammaticality depends on the success of alternative explanations. If the gross status of phonotactic sequences can explain the judgment data with a similar level of success as a proposed gradient model, no one model can be claimed as evidence for the nature of the human behavior that the models are explaining. All that has been shown is that models with certain assumptions about gradience can also explain the data in some way. These models are still far from providing evidence about the nature of the phonotactic grammar.

It is also important to consider that, for evaluating these models, the data from Scholes (1966) is not an extensive test set; it simply represents 60 onsets for which participants could only choose a yes/no answer. A future step is to continue to evaluate these models against human judgments in different experimental settings to ensure that these models are truly correlating with human judgments.

Taking a wider point of view, how do these models compare not on the basis of performance, but in terms of the information that can be derived about human phonotactics? While performance is an important indicator of the principles that might govern human phonotactic judgments, it is important to note the shortcomings of some of these models. First, the use of negative evidence to train the SVM and RNN is clearly not analogous to the human learner. Secondly, while the MaxEnt model chooses explicit constraints to optimize, the neural net architecture is exceedingly difficult to interpret. These are both concerns that should be taken into account along with model performance.

5.2 Training data structure

Though the RNN model does correlate more highly with the Scholes data when trained on the type frequency training data, the maximum entropy model does not change much in its performance.
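As a reminder of what the two training regimes amount to (cf. the toy example in Table 3.2), the following minimal sketch builds both kinds of training sets from an invented three-onset lexicon; the onsets and counts here are made up and are not the CMU dictionary figures. The type frequency set repeats each onset once per lexical entry it appears in, while the equalized set lists each unique onset exactly once.

    from collections import Counter

    # Invented onset counts over lexical entries (not the actual CMU dictionary counts).
    onset_counts = Counter({"st": 5, "pl": 3, "sf": 1})

    # Type frequency data: each onset appears as often as it does in the lexicon.
    type_frequency_data = [onset for onset, n in onset_counts.items() for _ in range(n)]
    # -> ['st', 'st', 'st', 'st', 'st', 'pl', 'pl', 'pl', 'sf']

    # Equalized data: every unique onset appears exactly once.
    equalized_data = sorted(onset_counts)
    # -> ['pl', 'sf', 'st']

    print(type_frequency_data)
    print(equalized_data)

The comparison in this section is between models trained on these two kinds of sets.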
The maximum entropy grammar's insensitivity to the frequency structure of its training data also creates a problem: though the model trained with the type frequency training data does have a higher correlation, with or without the tuning parameter, the difference is incredibly small and will undoubtedly vary with other datasets. I find this challenges the ability of the maximum entropy grammar to be an accurate model of gradient grammaticality. Though the correlation value might be high, looking at the input (in the frequency-equalized case) and the model output (without the tuning parameter), it is not clear that gradient grammaticality is learned or expressed by the model at all. In fact, this model could be used as a binary classifier by drawing a decision boundary through the model outputs, such that if the maxent value is greater than 0.5 the onset is predicted as grammatical, and if it is lower than 0.5 it is predicted as ungrammatical. This would be analogous to the classification done by the RNN model.

This also complicates the claim of Pierrehumbert (1993) about the relationship between frequency structure in the lexicon and phonotactic acceptability. If this claim is true, models should perform much better when trained with a type frequency structure. The RNN results do not directly contradict the predictions of this claim. Though it competently learns to classify the gross status of onsets without frequency structure, which suggests that a model can learn whether an onset is attested or unattested without frequency structure, it still has a higher correlation with the Scholes participant responses when trained with the type frequency data. On the other hand, the MaxEnt model maintains a nearly identical correlation with the Scholes data regardless of which dataset it is trained on, specifically when untuned scores are reported.

5.3 Experiment data used for evaluation

One concern for current work is the lack of judgment data to use as evaluation for computational models. The Scholes dataset is quite small and could be improved upon greatly. In future work, it will be necessary to run a judgment experiment designed as evaluation data for these models. Evaluating these models on further experimental data will either strengthen or weaken their merit. A future experiment could use a Likert scale rating task to capture gradience in each participant's individual judgments, in which case a correlation metric could assess the likeness of each participant's responses to the group, to ensure that intermediate judgments are produced consistently across speakers. This would more accurately represent the phenomenon of gradient acceptability than the Scholes data, since the gradience would be represented as an intra-speaker measure and not as an average over yes/no responses.

CHAPTER 6

CONCLUSION

In conclusion, the three main findings are the following. First, for the models used in this thesis, it was found that the maximum entropy model correlates best with the acceptability judgments from the Scholes data, followed by the RNN model and the SVM model. Second, although the models with the capability to provide gradient scores can in principle place their predictions anywhere on a continuum, in practice they predict onsets as either highly acceptable or highly unacceptable.
Lastly, using training data with an equalized frequency structure (where each unique onset appears an equal number of times in the training data) does not significantly impact the maximum entropy model's performance, slightly worsens the RNN's performance on a withheld test set, and significantly hinders the RNN's performance in correlating with Scholes' judgment data.

I claim that this shows that a model providing a high correlation value with human judgment data is not enough to make a convincing case for gradient phonotactic grammaticality. The motivation for a probabilistic account is the nature of the gradient acceptability judgments found by Albright and Hayes (2003) and Albright (2007). Regardless of the correlation value or metric used for evaluation, if a model does not output a range of gradient judgments which not only correlate well with the data but also accurately predict the intermediate judgments, it has failed to capture the gradience that prompted the modeling to begin with. More merit should be given to categorical approaches when modeling phonotactics, and the nuances of the interaction between the probabilistic and categorical pieces should be carefully explained and tested.

Phonotactics is only one of many areas of linguistics where researchers are increasingly interested in testing the potential of deep learning methods as a way to learn more about linguistic knowledge. However, a number of assumptions make it difficult not to be misled by the results of these models, as they can only offer analogies and can be difficult to interpret.

Both the SVM and the RNN underperform compared to previous work. I think this could be due to the imbalance of the two-class training data. These are both discriminative models requiring positive and negative data to learn, but generating negative data creates an imbalanced dataset, resulting in both models overfitting to the negative data, with a high rate of false negatives.

The phonotactic learner provides an advantage over the neural models in providing specifically generated constraints, and it correlates well with the Scholes data. However, the model fails to capture any notion of intermediate judgments or gradience without being tuned to the very test data it is being validated on. Moreover, it does not seem to rely on any frequency structure in the data to perform well.

Correlation values can mislead interpretation if they are not presented alongside a visual plot of the data. Anscombe (1973) presented four data distributions, now known as Anscombe's Quartet, which have very different shapes but extremely similar correlation values. For this reason, the results of phonotactic models should not be boiled down to a descriptive statistic like Pearson's r, as this is not a full description of the result. Though the phonotactic learner itself is probabilistic in nature, I believe it cannot be claimed as a model of gradient grammaticality or acceptability for this reason.

If the goal, above all else, is to find a model that correlates with the data in a way that reproduces the intermediate judgments, the best possibility might lie in an RNN language model (Mikolov et al. 2010; Mayer and Nelson 2019). Mayer and Nelson (2019) do train a model that correlates better with judgment data than the maxent phonotactic learner. However, I believe the results of this thesis show that recent modeling approaches have not fully investigated the different assumptions these models rely on.
It is not clear to me that the maxent phonotactic learner is evidence for gradient grammaticality; and though a neural network might perform marginally better with more gradient output, neural nets are notoriously difficult to interpret. There are no constraints that are generated, and everything the model has learned is contained in a "black box" of weights and biases inside the network. The ability of a neural net to correlate with human judgments is exciting and should be pursued further, but at this point I do not think anything can be said about how the neural network's performance informs us about human grammaticality knowledge. Though I agree with Pater (2019), Mayer and Nelson (2019), and Mirea and Bicknell (2019) that the integration of neural network modeling and linguistics is a promising and thrilling future for the field, I believe extra care must be taken to continually make sure that when these models are learning, we the researchers are learning something as well.

BIBLIOGRAPHY

Albright, Adam (2009). "Feature-based generalization as a source of gradient acceptability." In: Phonology.

— (2007). "Natural classes are not enough: Biased generalization in novel onset clusters." In: 15th Manchester Phonology Meeting, Manchester, UK.

Albright, Adam and Bruce Hayes (2003). "Rules vs. analogy in English past tenses: A computational/experimental study." In: Cognition.

Anscombe, Francis J. (1973). "Graphs in statistical analysis." In: The American Statistician 27.1, pp. 17–21.

Armstrong, Sharon Lee, Lila Gleitman, and Henry Gleitman (June 1983). "What Some Concepts Might Not Be." In: Cognition 13, pp. 263–308. doi: 10.1016/0010-0277(83)90012-4.

Bailey, Todd M. and Ulrike Hahn (2001). "Determinants of wordlikeness: Phonotactics or lexical neighborhoods?" In: Journal of Memory and Language 44.4, pp. 568–591.

Chomsky, Noam (1965). Aspects of the Theory of Syntax. 50th ed. The MIT Press. isbn: 9780262527408. url: http://www.jstor.org/stable/j.ctt17kk81z.

Chomsky, Noam and Morris Halle (1965). "Some controversial questions in phonological theory." In: Journal of Linguistics.

— (1968). The Sound Pattern of English. New York: Harper and Row.

Coltheart, M. (1977). "Access to the internal lexicon." In: The Psychology of Reading. url: https://ci.nii.ac.jp/naid/10018074200/en/.

Cover, Thomas M. and Joy A. Thomas (1991). Elements of Information Theory. New York: Wiley.

Daland, Robert et al. (Aug. 2011). "Explaining sonority projection effects." In: Phonology 28. doi: 10.1017/S0952675711000145.

Della Pietra, Stephen, Vincent Della Pietra, and John Lafferty (1997). "Inducing Features of Random Fields." In: IEEE Transactions on Pattern Analysis and Machine Intelligence.

Dupoux, Emmanuel et al. (Feb. 2004). "Epenthetic Vowels in Japanese: A Perceptual Illusion?" In: Journal of Experimental Psychology: Human Perception & Performance 25. doi: 10.1037//0096-1523.25.6.1568.

Durvasula, Karthik et al. (Feb. 2018). "Phonology modulates the illusory vowels in perceptual illusions: Evidence from Mandarin and English." In: Laboratory Phonology 9. doi: 10.5334/labphon.57.

Elman, Jeffrey L. (1990). "Finding structure in time." In: Cognitive Science.

Goldwater, Sharon and Mark Johnson (2003). "Learning OT constraint rankings using a maximum entropy model." In: Proceedings of the Stockholm Workshop on Variation within Optimality Theory, pp. 111–120.

Gorman, Kyle (2013). "Generative Phonotactics." PhD thesis. University of Pennsylvania.

Halle, Morris (1962). "Phonology in generative grammar." In: Word.
— (1959). The Sound Pattern of Russian. The Hague: Mouton.

Hay, Jennifer, Janet Pierrehumbert, and Mary Beckman (2004). "Speech Perception, Well-formedness, and the Statistics of the Lexicon." In: Papers in Laboratory Phonology VI, pp. 58–74.

Hayes, Bruce and Colin Wilson (2008). "A Maximum Entropy Model of Phonotactics and Phonotactic Learning." In: Linguistic Inquiry.

Jaynes, Edwin T. (1983). Papers on Probability, Statistics, and Statistical Physics. USA: Kluwer Boston. isbn: 9027714487.

Jurafsky, Daniel and James H. Martin (2009). Speech and Language Processing. 2nd ed. USA: Prentice-Hall, Inc. isbn: 0131873210.

Kabak, Baris and William Idsardi (Feb. 2007). "Perceptual Distortions in the Adaptation of English Consonant Clusters: Syllable Structure or Consonantal Contact Constraints?" In: Language and Speech 50, pp. 23–52. doi: 10.1177/00238309070500010201.

Mayer, Connor and Max Nelson (Oct. 2019). "Phonotactic learning with neural language models." In:

McKinney, Wes (2010). "Data Structures for Statistical Computing in Python." In: Proceedings of the 9th Python in Science Conference. Ed. by Stéfan van der Walt and Jarrod Millman, pp. 51–56.

Mikolov, Tomas et al. (Jan. 2010). "Recurrent neural network based language model." In: vol. 2, pp. 1045–1048.

Mirea, Nicole and Klinton Bicknell (2019). "Using LSTMs to Assess the Obligatoriness of Phonological Distinctive Features for Phonotactic Learning." In: ACL.

New, Boris and Christophe Pallier (2009). SubtlexUS – Lexique. url: lexique.org/?page_id=241 (visited on 04/01/2020).

Paszke, Adam et al. (2017). "Automatic differentiation in PyTorch." In: NIPS-W.

Pater, Joe (2019). "Generative linguistics and neural networks at 60: foundation, friction, and fusion." In: Language.

Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." In: Journal of Machine Learning Research 12, pp. 2825–2830.

Pierrehumbert, Janet (1993). "Prosody, Intonation, and Speech Technology." In: ed. by M. Bates and R. Weischedel. Cambridge, UK: Cambridge University Press, pp. 257–282.

Python Software Foundation (n.d.). Python Language Reference. Version 3.6.6. url: https://python.org.

Sarver, Isaac (May 2020). Phonotactics Models. doi: 10.17605/OSF.IO/F76ZN. url: osf.io/f76zn.

Scholes, Robert J. (1966). Phonotactic Grammaticality. Mouton.

Schütze, Carson (Mar. 2011). "Linguistic Evidence and Grammatical Theory." In: Wiley Interdisciplinary Reviews: Cognitive Science 2, pp. 206–221. doi: 10.1002/wcs.102.

Shademan, Shabnam (2006). "Is Phonotactic Knowledge Grammatical Knowledge?" In:

Vetterling, William T. et al. (Nov. 1992). Numerical Recipes Example Book C (The Art of Scientific Computing). 2nd ed. Cambridge University Press. isbn: 0521437202.

Weide, Robert L. (1998). The CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict. Accessed: 2019-09-14.