A SYSTEMATIC EVALUATION OF COMPUTATIONAL MODELS OF
PHONOTACTICS
By
Isaac Sarver
A THESIS
Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of
Linguistics – Master of Arts
2020
ABSTRACT
A SYSTEMATIC EVALUATION OF COMPUTATIONAL MODELS OF
PHONOTACTICS
By
Isaac Sarver
In this thesis, recent computational models of phonotactics are discussed and evaluated
and two new models are implemented. Prior phonotactic modeling, motivated by gradient acceptability judgments in nonce word judgment tasks (Albright 2009), claims that phonotactic grammaticality is gradient, and these models are evaluated by their ability to judge nonce
words with scores that correlate with human acceptability judgments. Gorman (2013) argues
that these gradient models do not account for the facts suﬃciently and claims phonotactic
grammaticality is categorical. In this thesis, the account of Gorman (2013) is implemented
as well as a prominent gradient model from Hayes and Wilson (2008) and compared with
the performance of two machine learning models (a support vector machine and a recurrent
neural network), with all models trained on a corpus of English onsets. Results in this thesis
show that the computational models do not correlate with the human judgment data from Scholes (1966) as well as a categorical prediction of acceptability based on whether a sequence is attested in the lexicon or not, and that these models rely on assumptions which, when challenged, show that the models do not convincingly capture the gradience of the human judgment data used for evaluation.
This thesis is dedicated to my siblings.
ACKNOWLEDGEMENTS
First and foremost I would like to thank the linguistics faculty for their support and
guidance throughout my time at MSU. This includes Hannah Forsythe, who taught my
Introduction to Language course in Spring 2015 and ﬁrst introduced the ﬁeld of linguistics to
me, as well as Marcin Morzycki, who taught the second linguistics class I took and encouraged
me to add a linguistics minor in my undergrad, and who also advised me throughout my
semantics research my ﬁrst year of grad school. I am hugely indebted as well to Suzanne
Wagner, Yen-Hwei Lin, Alan Munn, and Cristina Schmitt for all that I’ve learned in the
program during classes, colloquiums, and personal discussion. Thank you also to Kristen
Johnson in the CSE department for NLP advice and instruction. And special thanks to my
advisor, Karthik Durvasula, who persuaded me to apply to the MA program, taught four of
my classes, and gave me support and advice throughout the whole process.
Thank you also to my fellow students. I’m so grateful for the friendships and discussions
of this time in classes, the grad oﬃce, and colloquium dinners. Thank you to those who
listened to my ideas and presented their own in Awkward Time, the Phono group, and
informal study sessions. And thank you as well to my advisors and colleagues in the Center
for Language Teaching Advancement who have given me so many professional opportunities
this year.
Lastly, there is no way I would have made it through the last few years without my
friends and family. Thank you to Megan Wixom and Daniela Diaz for listening to all of my
practice presentations, thank you to Suzanna Feldkamp for keeping me accountable through
our late-night writing sessions, and thank you to Abdullah Karaaslanlı and Thanaphong
Phongpreecha for being daily listening ears for everything I had on my mind and for helping
me through learning Python and machine learning. Thank you to my siblings for giving me
the ability to laugh on any day no matter the circumstances, and to my parents who have
supported me and fostered my curiosity about the world from the beginning. Thank you all.
TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

1 INTRODUCTION

2 RELATED WORKS
   2.1 Origins of phonotactics
   2.2 Nature of constraints and output
   2.3 Dealing with Gradient Acceptability: Recent Models
   2.4 Frequency structure in the data

3 METHODS
   3.1 Phonotactic models for this study
   3.2 Data Preparation
   3.3 SVM model
   3.4 RNN model
   3.5 Maximum Entropy grammar

4 RESULTS
   4.1 Gross Phonotactic Violation
   4.2 SVM results
   4.3 RNN results
   4.4 Changing frequency structures in the data for RNN training
   4.5 Changing frequency structures in the data for MaxEnt training

5 DISCUSSION
   5.1 Model Comparisons
   5.2 Training data structure
   5.3 Experiment data used for evaluation

6 CONCLUSION

BIBLIOGRAPHY
LIST OF TABLES

Table 2.1: Examples of the nonce words used in Scholes’ experiment, along with number of positive responses.
Table 3.1: Onsets used for training
Table 3.2: Toy example of different frequency structures in training data.
Table 3.3: Maxent Grammar (note: ‘·’ represents multiplication here)
Table 4.1: Type frequency model results on withheld test set
Table 4.2: Equalized frequency model results on withheld test set
Table 4.3: Type frequency model results predicting Scholes data
Table 4.4: Equalized frequency model results predicting Scholes data
Table 4.5: Correlations of the Phonotactic Learner scores with the Scholes data
LIST OF FIGURES

Figure 2.1: Mean ratings of subjects asked how representative the stimuli were of the categories “even” and “odd” respectively, where “1” indicates most representative (Gorman 2013; Armstrong, L. Gleitman, and H. Gleitman 1983).
Figure 3.1: Illustration of an SVM in 2 dimensions
Figure 3.2: Sketch of the input nodes, hidden layer, and output of a Recurrent Neural Network.
Figure 3.3: Graph representation of the sigmoid function, with x = input and y = output.
Figure 4.1: Distribution of normalized ratings in the Scholes experiment for attested and unattested onsets.
Figure 4.2: The accuracy of the SVM model’s classifications of the test data based on the type of embedding.
Figure 4.3: SVM results on withheld test set
Figure 4.4: SVM predictions for Scholes data, with ground truth set to gross status
Figure 4.5: Loss values during RNN training, showing how wrong the model is for each iteration.
Figure 4.6: Confusion matrix showing the model’s accuracy for guessing each class, and its error.
Figure 4.7: Scatter plot showing the model’s % confidence that an onset is grammatical (value=1), compared to the percentage of “yes” responses in the Scholes experiment.
Figure 4.8: Equalized Frequency, Tuned
Figure 4.9: Equalized Frequency, Untuned
Figure 4.10: Type Frequency, Tuned
Figure 4.11: Type Frequency, Untuned
CHAPTER 1
INTRODUCTION
Phonotactic theory is concerned with understanding the knowledge that a speaker has
about possible and impossible phonological sequences in their native language. This knowl-
edge allows for speaker judgments of novel words, where, upon hearing a novel sequence
of sounds, a speaker can determine if the new word is an acceptable sequence that could
reasonably be a member of their lexicon or not. For example, when speakers are presented
with the non-English words [blɪk] and [bnɪk], speakers can easily assess ⟨blick⟩ to be well-formed and ⟨bnick⟩ to be ill-formed.1 Note, however, that the reason
for their exclusion from the lexicon is categorically diﬀerent. The former is acceptable, yet
happens to not occur in the lexicon of most speakers; an accidental gap in the lexicon rather
than a systemic one. The latter, on the other hand, violates some structural requirement,
and is judged to be an impossible construction in the language (Halle 1962; Chomsky and
Halle 1965).
In recent work, phonotactic judgments have been argued to be based on gradient knowl-
edge (Albright and Hayes 2003; Albright 2009). For example, [wɪs] and [ploʊmf] can both be judged as acceptable, but [wɪs] is consistently judged as ‘more acceptable’ than [ploʊmf].
This can be considered the more prominent view, often adopted in modern research on
phonotactic knowledge. Generally, this view equates gradient acceptability with gradient
grammaticality2.
There has also been some recent work concerned with the potential of modeling phonotac-
tic knowledge with machine learning and deep learning techniques (Mayer and Nelson 2019;
Mirea and Bicknell 2019). This follows the current enthusiasm for the ways deep learning
1⟨...⟩ denotes orthographic representations, [...] denotes surface representations, and /.../ denotes underlying representations.
2Although, as many have argued, gradient acceptability does not necessitate gradient grammaticality (Gorman 2013; Chomsky 1965; Schütze 2011).
and more powerful computational methods might provide windows into linguistic theory and
knowledge (Pater 2019). Phonotactic knowledge is particularly well-suited to this approach,
because any phonotactic model can be based on a small number of input units, whether segmental or featural, and needs only produce a simple output: in a categorical system, valid or invalid; in a gradient system, some probability of a sequence’s acceptability.
In this thesis, I show that recent computational models in the phonotactics literature, as
well as my own models, can learn phonotactic generalizations from corpus data equally well
without any information in the data regarding the frequency of a sequence in the lexicon.
This suggests that gradient phonotactic acceptability judgments are not the result of varying
sequence frequencies in the lexicon; that lexical frequency ratios of phonological sequences
are perhaps not relevant to modeling speakers’ phonotactic judgments.
In a dissertation that looked speciﬁcally at diﬀerent types of models that can account for
acceptability judgments, Gorman (2013) argued that a categorical baseline can outperform
gradient models such as an n-gram model or a maximum entropy model (Jurafsky and
Martin 2009; Hayes and Wilson 2008). However, beyond Gorman’s analysis, which did
not investigate the implications of diﬀerent machine learning techniques on the issue at
hand, none of the recent work has considered the nature of phonotactic knowledge and how
modeling choices aﬀect the conclusions which are derived from them.
When designing a model of phonotactic knowledge, multiple modeling decisions must be
made, some of which are discussed in recent literature. First, does phonotactic knowledge
apply to features, or does it apply only at the segmental level? Given the success of features in accounting for phonological patterns, it is quite surprising that recent work has shown, through both machine learning and traditional computational models, that training on segments provides better results and easier learning than training on features (Albright 2009; Mirea and Bicknell 2019).
Second, what is the appropriate representational unit, or window size, for onsets to be
grouped in as models are trained? Per Gorman’s 2013 model for monosyllabic nonce words,
a model of gross status treating each onset as a single unit suﬃces. [S T R] does not need to
be recognized as a sequence of segments, but simply as a unit contained in a set of attested
onsets.3 However, machine learning models can easily accommodate the diﬀerence in window
size as a model parameter, and this representation needs to be decided.4
Third, is phonotactic grammaticality a categorical or gradient measure? All of these decisions will greatly impact model design and performance. Many available learning
models can take any number of features as their inputs and could be adapted to a featural
or segmental view, but the ﬂexibility of some models for output as a binary, categorical
classiﬁcation, or gradient, is not as clear. Additionally, how well model output is analogous
to a true probability is also murky (Hayes and Wilson 2008). I will discuss this issue with
each model presented.
For my thesis, I do the following:
a. I implement four models:
i. Gross phonotactic violation, following Gorman (2013);
ii. A support vector machine, or SVM; a classical machine learning model frequently
used for binary classiﬁcation tasks.
iii. A maximum entropy model as employed by Hayes and Wilson (2008) and referred
to by phonotactic literature as a gradient model;
iv. A recurrent neural network, or RNN; a deep learning model quite ﬂexible to
diﬀerent conﬁgurations, and can be used to classify using a threshold or argmax
over normalized outputs.
3Gorman (2013) acknowledges that this does not capture all phonotactic patterns, but
operationalizing this model to include other aspects of phonotactics immediately becomes
quite diﬃcult and has not been explored.
4As an example, the onset [S T R] would be represented in the following ways for window
size n: n = 1: ((S),(T),(R)), n = 2: ((ST), (TR)), and n = 3: ((STR)).
b. I compare each model’s performance on a withheld test set to ensure that the model
has made some generalizations about the data.
c. I then test each model’s ability to categorize data from human judgment experiments
collected by Scholes (1966), to see which approach is better to model human judgments.
d. I train well-performing models on datasets which have had the frequency of types
equalized, and those which preserve the type frequency from the original data.
I train each model on a corpus of word onsets, controlling for three factors: First, previous
literature has not been concerned with the appropriate context window size for phonotactic
restrictions, so I train each model embedding the onsets by unigrams, bigrams, and trigrams.
Second, I also train each model with both the full dataset, including all tokens taken from
the dictionary, as well as the set of all onsets, removing any duplicates. This is to test if the
models can learn which onsets are grammatical with only the set of attested and unattested
onsets. Third, each model is trained only on segments, and not features. I do this following
the work of Albright (2009) and Mirea and Bicknell (2019), which shows that phonotactic
models trained on segments learn and perform better than models trained on features. I
show that the maximum entropy model correlates the most with human judgments and, crucially, that the performance of both the RNN model and the maximum entropy model is not significantly affected when the phonotactic frequency differences are neutralized in the training data.
CHAPTER 2
RELATED WORKS
2.1 Origins of phonotactics
The study of phonotactics began with Halle (1962) and Chomsky and Halle (1965), where
the aforementioned distinction of [blɪk] and [bnɪk] is introduced. This is a simple but powerful
demonstration of phonotactic knowledge, which the authors deﬁne as restrictions on the
underlying representation of words. These restrictions can be segment structure constraints,
but also sequence structure constraints (Halle 1959). These are named morpheme structure
constraints (MSCs) by the authors.
Segment structure constraints contain the constraints dictated by the available underlying
segments in a given language. Though [ð] and [d] are both possible surface segments in Spanish, [ð] only appears as an allophone of /d/; thus a nonce word such as /ðano/ cannot appear as an underlying representation of a word in Spanish and violates the phonotactic system of the
language. My thesis will not be concerned with this aspect of phonotactic knowledge, as it
can be modeled quite simply in terms of set membership: Given a set of available segments
S, for each x in sequence s, if x ∈ S, then no segment structure constraint is violated.
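To make the set-membership formulation concrete, here is a minimal sketch in Python; the segment inventory below is a hypothetical stand-in for illustration, not a full Spanish inventory:

# Minimal sketch of a segment structure check as set membership.
# The inventory here is a hypothetical stand-in, not the full Spanish inventory.
UNDERLYING_SEGMENTS = {"p", "t", "k", "b", "d", "g", "a", "n", "o", "l"}

def violates_segment_structure(sequence, inventory=UNDERLYING_SEGMENTS):
    """Return True if any segment falls outside the underlying inventory."""
    return any(segment not in inventory for segment in sequence)

print(violates_segment_structure(["d", "a", "n", "o"]))   # False: all segments underlying
print(violates_segment_structure(["ð", "a", "n", "o"]))   # True: [ð] is allophonic only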
Sequence structure constraints are not so simply deﬁned. Chomsky and Halle (1965)
prefer a featural, rule-based constraint structure on underlying representations. An example
of an English sequence structure constraint (from Gorman (2013)) is seen in (1).
(1) [−cont] → [+liquid] / # [−cont] ___
This constraint prohibits a stop consonant after a word-initial stop, and prohibits the underlying */bnɪk/, but allows /blɪk/. A set of these MSCs then would represent language-specific
phonotactic knowledge that is present in any given speaker.
If this knowledge is represented in such a way, it can be tested experimentally, which is
the path Scholes (1966) took, using acceptability judgment tasks to probe
these constraints and evaluate them. In his study (particularly, experiment 5 in his book),
33 seventh grade students were asked to judge whether each word in a list of nonce words
was an acceptable English word or not. They were to give binary “yes” or “no” answers, and
each word received the number of “yes” votes as its score. Examples of this data can be
seen in Table 2.1. Nonce words are transcribed using the ARPAbet system1, and with the
corresponding IPA.
Word (ARPAbet)   Word (IPA)   Number of “yes” responses
K R AH1 N        kɹʌn         33
F L ER1 K        flɝk         31
M R AH1 NG       mɹʌŋ         27
V R AH1 N        vɹʌn         19
N R AH1 N        nɹʌn         8
V P EY1 L        vpeɪl        0

Table 2.1: Examples of the nonce words used in Scholes’ experiment, along with number of positive responses.
Scholes’s data shows one problem of studying the categorical vs. gradient distinction.
There is no way given this experiment to ﬁnd gradience in the individual speaker judgments.
Each speaker gave a categorical “yes” or “no” answer. Nevertheless, the proportion of “yes”
to “no” answers can be split evenly in some cases, where the subjects in the study disagree
with each other. The word V R AH1 N [vɹʌn], for example, was only considered acceptable by roughly half of the subjects.
The data from the Scholes (1966) experiments raise questions for any model of phonotactics. The stimuli receive a range of responses suggesting that even if a word is judged as unacceptable, some words might be more unacceptable than others. Incidentally, Chomsky
and Halle (1968) acknowledge this; along with [blɪk] and [bnɪk] as examples to represent acceptable and unacceptable, they introduce [bznk], which they claim should be
1See Section 3.2 for more discussion on this system.
less acceptable than [bnɪk]. A theory that is simply a set of MSCs, designed to exclude */bnɪk/ but allow /blɪk/, does not express these levels of unacceptability.
These facts about the acceptability of various nonce words have led to the theory that (phonotactic) grammaticality is gradient, and further evidence for this comes from experiments showing a spread of different levels of acceptability in nonce word rating tasks
(Albright 2009; Hayes and Wilson 2008; Shademan 2006). These experiments ﬁnd that
participants always give gradient responses when given a scale as a method of response.
2.2 Nature of constraints and output
The central question raised by Gorman (2013) is that of gradient vs. categorical knowledge. Under Gorman’s view, three questions need to be addressed by any theory of gradient phonotactic grammaticality:
a. Do experiment participants have the ability to accurately perceive the grammaticality
of a test item and map their acceptability judgment to that grammaticality?
b. Do intermediate acceptability judgments constitute evidence for gradient grammati-
cality?
c. Is the gradient theory compared to categorical alternatives in a way that clearly shows
the advantages of the gradient theory?
Speakers have diﬃculty with perception of nonce words that have phonotactic violations
(Dupoux et al. 2004; Kabak and Idsardi 2007), and sometimes repair them with illusory
vowels dependent on native language phonology (Durvasula et al. 2018). If this is the case,
speakers are giving an acceptability judgment on a perceived item that is different from the
stimulus. A full theory of gradient phonotactics would necessitate accounting for phonotactic
violation repairs systematically, so it is clear what the speaker is making their acceptability
judgments based on. Though Gorman (2013) points this out as a complication for gradient phonotactics, it should also be considered a potential complication for a categorical theory; a categorical judgment still relies on a perceived item that could exhibit differences
from the intended stimulus.
Secondly, intermediate acceptability judgments do not constitute direct evidence for gra-
dient grammaticality. Participants asked to rate, on a scale, how “representative” numbers presented to them were of the categories “even” and “odd” used the intermediate values even though there is a clear, categorical distinction between the two categories (Figure 2.1).
Figure 2.1: Mean ratings of subjects asked how representative the
stimuli were of the categories “even” and “odd” respectively, where
“1” indicates most representative (Gorman 2013; Armstrong,
L. Gleitman, and H. Gleitman 1983).
Though there are potential explanations for why participants believe 23 is more odd
than 57, Armstrong, L. Gleitman, and H. Gleitman (1983) point out that subjects’ ability
to compute with these numbers and explain the deﬁnition of odd and even numbers are the
fundamental facts of the task, and a theory dedicated to explaining the gradient judgments
will ﬁnd it diﬃcult to represent basic human knowledge about odd numbers. Gorman (2013)
then suggests a parallel phonology example: orthodox beliefs about phonotactic acceptability
dictate that acceptability is closely related to frequency; the more often a sequence appears
in a language, the more acceptable it is. For example, Pierrehumbert (1993) argues that
there is a relationship between perceived well-formedness of a phoneme combination and its
frequency in the language. However, if sequences [bl] and [kl] have diﬀerent acceptability
measures due to their frequency ([bl] appears roughly twice as often as [kl]), how are they to
be treated equally under independent phonological processes like syllabiﬁcation, etc.?
Lastly, gradient models should be able to demonstrate an advantage over a baseline
categorical model and address the challenges raised above. As stated, current methods
assume the correlation of gradient acceptability and frequency of patterns in the lexicon is
a causal relationship, but fail to test their models to see if learning is aﬀected by removing
frequency information from the model.
As mentioned above, two principled ways exist to represent phonological segments: as
the discrete segments themselves or as vectors of phonological features. Hayes and Wilson
(2008) use a featural system to train their model and do not compare model performance with
a segment-based model. This was addressed by the work of Albright (2009), and now Mirea
and Bicknell (2019), who both show support for a segmental phonotactic system rather than
a featural system. Because of this, I will choose to focus on the problems of gradience in the
model structure and input, and train my own models using segmental representations.
2.3 Dealing with Gradient Acceptability: Recent Models
Now I will turn to a discussion of prior work conducted on gradient phonotactics models
(Albright 2009; Bailey and Hahn 2001; Hayes and Wilson 2008; Mayer and Nelson 2019).
Albright (2007) and Albright and Hayes (2003) conducted experiments where participants
responded to nonce words on a 7-point Likert scale, with some variation of “completely
impossible as an English word” on one end of the scale, and “would make a ﬁne English
word” on the other end. These studies highlight the existence of gradient acceptability:
[bwIk] is not as acceptable as [blIk], but is more acceptable than [bnIk]. Albright (2009)
introduced a model of probabilistic phonotactics to account for these experimental results.
The model Albright (2009) used to account for the human acceptability judgments is an n-
gram model, common in natural language processing for language modeling; it is a language
model that predicts the most likely next segment given a preceding sequence of segments that
can range in length (Jurafsky and Martin 2009). For example, a trigram model (tri- meaning
the model predicts an upcoming segment based on the previous two segments) for the onset
[spl], where # represents initial word boundary and N represents the syllable nucleus, would
look like the following:
(2) p̂(spl) = p(s | ##) · p(p | #s) · p(l | sp) · p(N | pl)
The probability of the whole onset is modeled as the product of the sequential probabilities
of each segment given the two preceding segments.
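As an illustration, an n-gram model of this kind can be sketched in a few lines of Python. The counting scheme below is my own minimal reconstruction of a trigram model, not the exact implementation used by Albright (2009):

from collections import defaultdict

def train_trigram(onsets):
    """Estimate p(segment | two preceding segments) from a list of onsets.

    Each onset is a list of segments; '#' pads the left edge and 'N'
    marks the following nucleus, as in example (2).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for onset in onsets:
        seq = ["#", "#"] + list(onset) + ["N"]
        for i in range(2, len(seq)):
            counts[(seq[i - 2], seq[i - 1])][seq[i]] += 1
    return counts

def trigram_prob(counts, onset):
    """Product of the conditional probabilities of each segment."""
    seq = ["#", "#"] + list(onset) + ["N"]
    p = 1.0
    for i in range(2, len(seq)):
        context = counts[(seq[i - 2], seq[i - 1])]
        total = sum(context.values())
        p *= context[seq[i]] / total if total else 0.0
    return p

counts = train_trigram([["s", "p", "l"], ["s", "p", "r"], ["p", "l"]])
print(trigram_prob(counts, ["s", "p", "l"]))  # product of the four factors in (2)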
Another aspect that can be modeled as aﬀecting the probability of a sequence is neigh-
borhood density: what number of licit sequences exist that are one step away from the
sequence in question, where a step can be either a single insertion, single deletion, or single
substitution of a segment (Coltheart 1977)? Neighborhood density is a measure that is not
directly phonotactic, but it does have a high correlation with bigram
probability (Bailey and Hahn 2001). Neighborhood density also has a clear and strong eﬀect
in longer nonce words with recognizable morphology, and even if these have strong violations
of phonotactics (e.g. mrupation) they will receive English-like ratings (Hay, Pierrehumbert,
and Beckman 2004).
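For concreteness, a sketch of Coltheart’s one-step neighbor count follows, under the simplifying assumption that the lexicon is a set of segment tuples; the toy lexicon and segment inventory are invented for illustration:

def one_step_neighbors(word, segments, lexicon):
    """Count attested forms one substitution, insertion, or deletion away
    (Coltheart 1977). `word` is a tuple of segments; `lexicon` a set of tuples."""
    neighbors = set()
    for i in range(len(word)):
        neighbors.add(word[:i] + word[i + 1:])              # single deletion
        for s in segments:
            neighbors.add(word[:i] + (s,) + word[i + 1:])   # single substitution
    for i in range(len(word) + 1):
        for s in segments:
            neighbors.add(word[:i] + (s,) + word[i:])       # single insertion
    neighbors.discard(word)
    return len(neighbors & lexicon)

lexicon = {("b", "l", "ɪ", "k"), ("b", "r", "ɪ", "k"), ("s", "l", "ɪ", "k")}
print(one_step_neighbors(("b", "l", "ɪ", "k"), {"b", "r", "s", "l"}, lexicon))  # 2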
Hayes and Wilson (2008) created a phonotactic learner based on maximum entropy gram-
mar which they claim can correlate well with the gradience in human judgments (Goldwater
and Johnson 2003; Jaynes 1983). They train their phonotactic learner on onsets, and use
a type frequency structure in the training data. They claim to replicate gradience of hu-
man judgments by reporting correlation co-eﬃcients with the results from Scholes (1966).
As discussed, this makes an assumption of individual speaker judgment gradience based on
the aggregate responses of the Scholes participants, when each participant gave a yes/no
answer. Thus, it can be argued that Hayes and Wilson are really showing they can correlate
model predictions with the aggregate score of onsets in the pool of participants in Scholes’
experiment. While I think this is an important point of discussion, I do not explore this in
my thesis, assuming that the Scholes data is reasonably representative of individual speaker
judgments.
Mayer and Nelson (2019) introduce a recurrent neural network language model (RNNLM)
which performs better than the Hayes and Wilson phonotactic learner at correlating with
judgment data. Their model is diﬀerent from the RNN used in my work in that it is not
trained and tested over onsets, but is rather trained on whole words to learn probability distributions over the transitions between segments, and these probability distributions can be used to estimate the probability of any string as a licit phonotactic sequence.
RNNLMs are more recently utilized models that have come out of the natural language processing literature in the last two decades, enabled by the increased computing power needed to train them (Mikolov et al. 2010). Mayer and Nelson (2019) train RNNLMs on
various corpora of several languages to see if these models can learn phonotactic phenomena
in those languages. They test model performance in predicting vowel harmony rules in
Finnish, Cochabamba Quechua laryngeal co-occurrence restrictions, and sonority sequencing
rules in English. They ﬁnd that the neural nets perform better than the maximum entropy
model of Hayes and Wilson in correlating with human judgments. For English, Mayer and
Nelson (2019) evaluate their model and the phonotactic learner on experimental results
from Daland et al. (2011). This experiment was meant to probe judgments regarding the
phonotactics of sonority sequencing, but was not compared to any other commonly used
phonotactic judgments dataset.
2.4 Frequency structure in the data
A crucial part of probabilistic phonotactics is what frequency structure is important in
learning. Models can be trained on either type or token frequency.2 Hayes and Wilson (2008)
2The terms type and token need to be explained further in terms of onset frequencies.
One can imagine that token frequency is the frequency of onsets in the corpus of speech or
argue that type frequency is what is more accurate for training data (as opposed to token
frequency), and that is what they use as training for their phonotactic learner. As I have
also done, the authors have removed what they term exotic onsets from the CMU dictionary
and train the learner on a nonexotic corpus of onsets along with their type frequencies.
However, what if the frequency structure of types is not important at all? Again, many
recent gradient models fail to adequately consider the assumptions made in abandoning
a categorical model.
I train the models I will be using on a corpus of onsets with type
frequencies dictating their distribution in the training data, and then also train all of the
models with the frequency of the onsets equalized across the training data. I do this to expand
on the comments in Hayes and Wilson (2008) regarding the role of frequency structure in the
training data to include approaches that still question the relationship between frequency
structure and phonotactic acceptability.
I will refer to data with type frequency (the frequency that is present in the dictionary)
as type frequency data, because there is a cline of onsets, from those that are marginally
represented to those that are present in the thousands. Data where this frequency structure
has been removed I will refer to as equalized frequency data. It is important to distinguish
that this categorical vs. gradient distinction in judgments is not the same as the comparison
being made in the frequency structure of the data. This frequency structure comparison can
instead be explained as a deterministic learning method vs. a probabilistic learning method.
If the frequency structure is integral to proper acquisition of phonotactics, this should be
reﬂected in the ability of the model to predict human judgment.
Though the distinction between judgments and learning should be made, phonotactic
knowledge built on the probability distribution of possible onsets suggests that the knowl-
text, and this would lead to a high frequency for [ð], for example, since this is a common onset in frequently occurring function words (e.g. the, that, these, those, this). Of course, this onset actually has very low occurrence in the lexicon at large, and a count of occurrences in the CMU dictionary would yield a count of types and not tokens, since what is being counted is the number of words in the lexicon that exhibit the onset and not the number of words in some naturalistic corpus.
edge must be gradient in nature, whereas phonotactic knowledge built on a set of attested
onsets suggests that the knowledge is categorical in nature. This is what links the frequency
structure of the data with the nature of phonotactic judgments.
CHAPTER 3
METHODS
3.1 Phonotactic models for this study
For this study I will take my own implementations of machine learning models and compare them to other computational models in the literature. They will be fitted with word
onset data from the Carnegie Mellon University (CMU) Pronouncing Dictionary (Weide
1998). I use this data because the stimuli are designed to test the phonotactic acceptability
of the nonce word onset only, outﬁtted with a set of rhymes that have only simplex codas and
are all attested. Other available data (Albright and Hayes 2003; Albright 2009) commonly
used for evaluations (Gorman 2013; Mirea and Bicknell 2019) do not follow this with their
stimuli, using complex codas and much more variability in their experiments. Due to this,
these experiments are not suitable for evaluation of models trained only on onsets, because
participant responses are aﬀected by the presence or lack of phonotactic violations in the
rhyme.
As is standard practice in machine learning, models will be evaluated based on a withheld
test set of data removed from the dataset before training, and then used to predict the data
of Scholes (1966). This practice has been absent from much of the phonotactics literature,
with models trained and tested on the same data, or tuned to the test set of data in some way
(Hayes and Wilson 2008; Goldwater and Johnson 2003). The reason this should be avoided
is that it does not allow for suﬃcient generalization from the training. The goal is to produce
a model that can adequately account for unseen data, but if the model is evaluated on the
data it learned from, it will produce artiﬁcially accurate results.
3.2 Data Preparation
The CMU Pronouncing Dictionary is an open-source pronunciation dictionary of North
American English. It contains roughly 134,000 words and their pronunciations. Pronuncia-
tions are transcribed using the Advanced Research Projects Agency phonetic transcription
codes, commonly known as the ARPAbet system. The CMU dictionary uses 39 ARPAbet
phonemes as well as primary and secondary stress markers to transcribe entries.
For data preprocessing, the dictionary entries and their pronunciations were imported
to Python (Python Software Foundation n.d.), using the Pandas library DataFrame object
(McKinney 2010). The 160 unique onsets are then isolated for analysis, and padded to the
length of the longest onset. This padding is used because the models require inputs of a
ﬁxed length, and the standard natural language processing procedure is to add some null
character to pad any piece of data shorter than the longest item needed.
In the case of
onsets, the longest onsets in English have a length of three, so those can be represented as [S
P L], whereas a simplex onset is represented with two added null characters as in [# # B].
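A minimal sketch of this padding step, assuming ‘#’ as the null character:

MAX_LEN = 3  # longest English onset, e.g. [S P L]

def pad_onset(onset, pad="#", max_len=MAX_LEN):
    """Left-pad an onset with null characters to a fixed length."""
    return [pad] * (max_len - len(onset)) + list(onset)

print(pad_onset(["S", "P", "L"]))  # ['S', 'P', 'L']
print(pad_onset(["B"]))            # ['#', '#', 'B']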
Some of the onsets in the CMU dictionary occur only once or a few times. I examined
these onsets and, based on my judgment as a native speaker, decided that they are not
attested in my own lexicon, and thus they are not representative of the phonotactic knowledge
I want to model, and should be left out for analysis. I removed any unique onset which occurs
less than 35 times in the dictionary. To decide this cutoﬀ point, I examined the onsets
manually to see if the low-frequency onsets were indeed unacceptable by my judgment, and
found that removing onsets with a frequency of less than 35 removed all the onsets I found
unacceptable. Examples of such onsets are [# Z B] and [# HH M]. Hayes and Wilson (2008)
employ this same method, whereas Gorman (2013) opts to reduce the onsets represented in
the dictionary by eliminating each word that has a frequency less than one per one million
words in the SubtlexUS corpus, a dictionary containing word frequencies in American English
(New and Pallier 2009).
Negative data is generated by permuting all possible consonant combinations of lengths 1
Onset      Frequency   Onset   Frequency   Onset   Frequency
no onset   20,285      s       12,571      k       13,042
b          9,777       p       7,866       m       9,547
d          7,779       h       6,734       r       7,483
f          5,633       g       4,986       l       5,511
t          4,941       n       3,429       w       3,864
ʃ          2,496       st      2,362       v       2,445
dʒ         2,182       br      1,607       pr      1,796
kr         1,514       gr      1,285       j       1,383
tʃ         1,278       kl      999         tr      1,197
sk         984         sp      915         z       959
fr         914         θ       651         bl      705
fl         651         dr      483         pl      528
kw         471         sl      408         str     460
gl         393         hw      367         sw      384
sm         249         kj      191         sn      244
ʃr         164         skr     153         hj      163
bj         149         tw      123         mj      142
spr        121         fj      107         θr      114
ʃw         101         gw      86          ʒ       99
skw        85          ʃl      81          pj      84
ð          72          ʃn      57          ʃm      66
dw         44                              spl     40

Table 3.1: Onsets used for training
to 3, and subtracting from these the set of unique onsets found in the CMU dictionary. This
resulted in a set of 12,500 negative onsets. Each positive example is left at its frequency in
the CMU dictionary, totaling 128,000 tokens of positive onsets. In order to generate the
equalized data, the positive data is reduced to a set of all onsets represented in Table 3.1,
and the set is multiplied to reﬂect the size of the data with frequency information.
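A sketch of this generation procedure follows; the truncated consonant inventory, the stand-in attested set, and the repetition factor are my own placeholder choices for illustration:

from itertools import product

consonants = ["B", "D", "F", "G", "K", "L", "P", "R", "S", "T"]  # truncated inventory
attested = {("S", "P", "L"), ("S", "T", "R"), ("B", "L")}        # stand-in for Table 3.1

# All logically possible onsets of length 1-3, minus the attested set.
all_onsets = set()
for n in (1, 2, 3):
    all_onsets.update(product(consonants, repeat=n))
negative = all_onsets - attested

# Equalized positive data: repeat each attested type a fixed number of times,
# so the positive set approximates the size of the frequency-weighted data.
equalized_positive = [onset for onset in attested for _ in range(2000)]
print(len(negative), len(equalized_positive))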
For example, consider a made-up language with 100 unique words and five unique onsets {sn, spl, r, m, j}, and suppose this data is used to train a model that requires at least 100 data points. The onsets could be structured in the training data proportionally to their
appearance in the lexicon of this made up language, or they could be equally represented at
a number large enough to train the model, as in Table 3.2.
Onset         Gradient data count   Equalized data count
sn            40                    20
spl           25                    20
r             15                    20
m             12                    20
j             8                     20
Total count   100                   100

Table 3.2: Toy example of different frequency structures in training data.
3.3 SVM model
SVMs are supervised learning models and are powerful, easy-to-use, discriminative binary
classiﬁers. These facts make them well-suited for baseline classiﬁcation and for categorical
classiﬁcation tasks. Standard SVM models do not provide a probability as an output, because
they only categorize the data as being on one side of a dividing hyperplane, and distance from the hyperplane does not directly correspond to a probability of class membership.
The SVM model takes each data point as an n-dimensional vector, and seeks to ﬁnd the
hyperplane in n − 1 space such that the distance between the hyperplane and the nearest
data points, referred to as the support vectors, of both classes is maximized. An example of
such a hyperplane is provided in Figure 3.1, where two classes are separated by a hyperplane
(which in 2 dimensions is just a line). Supporting packages in the Scikit-Learn library in
Python (Pedregosa et al. 2011) allow for the onsets from the CMU dictionary to be vectorized
and fed into the SVM. Each vector has length n, where n is the number of unique segments in the data, and the vector for a given onset has a value of 1 at each position corresponding to a segment in that onset.
How is this hyperplane determined from the training data? Let the training data be a set of pairs (x, y) such that x^(i) is the vector for the ith data point and y^(i) is the label for that data point (y ∈ {1, −1}), where 1 represents attested data and −1 the generated unattested data. With this form, the classifier looks like the following:1
1This is in a sense more deterministic than logistic regression or neural nets, as will be
Figure 3.1: Illustration of an SVM in 2 dimensions (two classes separated by a hyperplane, with the margin marked)
(3) h_{w,b}(x) = g(w^T x + b), where g(z) = 1 if z ≥ 0 and g(z) = −1 otherwise
In (3), the parameters (w, b) represent some hyperplane. Given the training pair (x^(i), y^(i)), the functional margin of (w, b) is as in (4).

(4) γ̂^(i) = y^(i)(w^T x^(i) + b)

This functional margin value represents the distance between the hyperplane and the data point and should be large to reflect a confident prediction far away from the separation line. So in order to attain a large functional margin for (x^(i), y^(i)), if y^(i) is negative, (w^T x^(i) + b) should be a large negative number, and if y^(i) is positive, (w^T x^(i) + b) should be a large positive number. The functional margin should always be positive (if it is negative, the point is on the wrong side of the hyperplane), and this is reflected in the fact that if h_{w,b}(x^(i)) = y^(i), then y^(i)(w^T x^(i) + b) > 0.
discussed in the next section. This is because the SVM classiﬁer directly predicts the class
of the input, whereas in logistic regression, there is an intermediate step of estimating the
probability of class membership before classiﬁcation.
Though this is for only one data point, this process can be expanded to an entire set of
data where the functional margin with respect to the set is the smallest of the functional
margins for the individual data points. These data points are the support vectors.
(5) Given training set S = {(x^(i), y^(i)) : i = 1, ..., m},
    γ̂ = min_{i=1,...,m} γ̂^(i)
Lastly, the margin needs to be maximized. A number of hyperplanes can be drawn that
fail to maximize the margin between the hyperplane and the support vectors. However, the
function (4) can be made arbitrarily large, because the classifier (3) cares only about the sign of (w^T x + b), not the magnitude.2
To prevent this, the functional margin can be divided by the Euclidean norm of the vector w, which normalizes the margin and ensures the margin is not maximized artificially through the classifier parameters. Putting this all together3, the maximization problem is as in (6):

(6) max_{γ̂,w,b} γ̂ / ‖w‖
    subject to y^(i)(w^T x^(i) + b) ≥ γ̂, i = 1, ..., m
For my model, I am using the sklearn.svm.SVC class, with the kernel parameter set to "linear". I ran the SVM three times, testing for the best unit to vectorize the data over. In order to train the SVM, the data must be embedded numerically, and this can be done by counting unique unigrams (the segments themselves), bigrams (pairs of segments), or trigrams (triads of segments); unigrams perform best by a slight amount.
2For example, g(w^T x + b) = g(2w^T x + 2b).
3This overview of SVM classiﬁers is greatly simpliﬁed and is only meant to provide an
intuition of how the functions are conceptualized. In reality, the function in (6) is non-convex
and quite diﬃcult to solve and more steps are necessary to implement this classiﬁer from
scratch.
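To make the training setup above concrete, here is a condensed sketch of the pipeline; the toy onsets and the CountVectorizer settings are my own illustration of the embedding step, not necessarily the exact preprocessing used:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Onsets as space-separated segment strings; label 1 = attested, -1 = generated negative.
X_train = ["S P L", "B L", "S T R", "B N", "Z B", "T L"]
y_train = [1, 1, 1, -1, -1, -1]

# Unigram embedding: each vector counts which segments occur in the onset.
# ngram_range=(2, 2) or (3, 3) would give the bigram/trigram embeddings instead.
vectorizer = CountVectorizer(analyzer="word", token_pattern=r"\S+", ngram_range=(1, 1))
X_vec = vectorizer.fit_transform(X_train)

clf = SVC(kernel="linear")
clf.fit(X_vec, y_train)
print(clf.predict(vectorizer.transform(["B L", "B N"])))  # expected: [ 1 -1 ]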
Figure 3.2: Sketch of the input nodes, hidden layer, and output of a Recurrent Neural Network.
3.4 RNN model
The probabilistic model I am using is a Recurrent Neural Network (RNN), which is a
simple neural net well-suited to sequential data (Elman 1990)4. The RNN is a network of
nodes and edges, with layers of nodes, and edges between each layer. Each node and each
edge can have a corresponding number associated with it, which is added to or multiplied with the input and sent to the next layer as its input. The input layer of size n can take an
n-dimensional vector, traverse through the layers to the output layer, where the output layer
is fed back in to the input of the next iteration. This allows for each segment in the onset
to be embedded as a number and fed into the network one at a time. Each iteration, the
network makes a preliminary guess, and this guess is fed into the network again while it
simultaneously analyzes the next segment. Thus the ﬁnal output of the network takes into
account the sequential information of the onset and does not treat it as a bag of segments.
A simpliﬁed visualization of the network architecture is available in Figure 3.2.
I am using an input layer of size n, where n is the number of unique values in the data: for unigrams, this is the number of consonants; for bigrams, the number of unique bigrams; etc.
4Model implementations for the SVM and RNN are documented at https://osf.io/f76zn/
(Sarver 2020).
I am using a hidden layer size of 128. When the network is being trained, a loss function
calculates how far oﬀ the model’s guess is after each iteration, and an optimization function
uses that information to traverse backwards through the network and update all the weights
and biases accordingly. I am using the Cross-Entropy Loss function and Stochastic Gradient
Descent for optimization. Training is done over 50,000 iterations, where for each iteration,
a random training pair is selected from the training data and ﬁtted to the model.
The model is implemented in the PyTorch framework (Paszke et al. 2017). The model has
the same accuracy level regardless of whether the data is embedded in unigrams, bigrams, or
trigrams, so the RNN model is easily able to classify these onsets regardless of the embedding
strategy. Results discussed in my thesis are from the model iteration that uses the bigram
embeddings. This model runs the fastest due to having the fewest parameters and a relatively small input layer.
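A condensed sketch of the architecture and training loop under the settings stated above (hidden size 128, cross-entropy loss, stochastic gradient descent, one random training pair per iteration) follows; the embedding dimension, learning rate, and toy data are my own placeholder choices, not the documented configuration:

import random
import torch
import torch.nn as nn

class OnsetRNN(nn.Module):
    """Minimal Elman-style RNN over segment embeddings, with two output classes."""
    def __init__(self, n_symbols, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, 32)
        self.rnn = nn.RNN(32, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 2)  # valid vs. invalid

    def forward(self, seq):                    # seq: (1, length) of symbol indices
        outputs, _ = self.rnn(self.embed(seq))
        return self.out(outputs[:, -1, :])     # prediction from the final step

model = OnsetRNN(n_symbols=40)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.005)

data = [(torch.tensor([[3, 7, 12]]), torch.tensor([1])),   # toy attested onset
        (torch.tensor([[5, 5, 9]]), torch.tensor([0]))]    # toy unattested onset

for _ in range(50_000):
    seq, label = random.choice(data)           # one random training pair per iteration
    optimizer.zero_grad()
    loss = criterion(model(seq), label)
    loss.backward()
    optimizer.step()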
The output layer consists of two nodes, one representing a valid segment and one repre-
senting an invalid segment. The output values are plugged into a sigmoid function (seen in
(7)) which places the values on a scale between 0 and 1. This function is displayed visually in
Figure 3.3. An extremely high value will be placed at 1 on the scale, representing that the model has 100% confidence in that output; an input value of 0 is placed at 0.5, representing 50% confidence; and an extremely low value represents 0% confidence. The node with the higher value is selected as the model’s predicted class.
(7) σ(x) = 1 / (1 + e^(−x))
The neural net model has structural differences from the SVM that require extra care in
determining how the data is fed to the model. While the SVM can receive the entire onset
as an n-dimensional vector, the recurrent neural net model is sequential; each segment of
the onset is fed into the model sequentially. This can of course also vary by window size:
for the onset [str], this sequence can occur over single segments, segment pairs, or segment
triplets. This gives the RNN more power to represent the dependencies between segments
and the eﬀects they have on acceptability.
Figure 3.3: Graph representation of the sigmoid function σ(x) (plotted with its derivative σ′(x)), with x = input and y = output.
3.5 Maximum Entropy grammar
A maximum entropy model, notably used by Hayes and Wilson (2008) to build a phono-
tactic grammar, is also suited to providing a probabilistic output. This model has also been
used in phonology as a method of learning OT constraints (Goldwater and Johnson 2003). A
maximum entropy model can express the probability of member x to the set of possible forms
Ω. Given a set of observed data, the learning algorithm generates a model of constraints and
weights each constraint such that the probability of observed data is maximized.
The resulting model will look like this toy example provided by Hayes and Wilson (2008)
in Table 3.3. Assume that the grammar has two constraints, *#V and *C#, that have the
weights 3 and 2 respectively. This grammar would assign diﬀerent maxent values to the
lexical items CV, CVC, and V. CV does not trigger any violation, so its rating is highest, followed by CVC, and then V.
x     *#V (w = 3)   *C# (w = 2)   Score (h(x))            Maxent value (P*(x))
CV    3 · 0         2 · 0         (3 · 0) + (2 · 0) = 0   exp(−0) = 1
CVC   3 · 0         2 · 1         (3 · 0) + (2 · 1) = 2   exp(−2) ≈ 0.14
V     3 · 1         2 · 0         (3 · 1) + (2 · 0) = 3   exp(−3) ≈ 0.05

Table 3.3: Maxent Grammar (note: ‘·’ represents multiplication here)
Assume brieﬂy that the constraints and weights are optimized for the observed data.
The model sums the product of constraint violations and weights to achieve a score, which
is expressed as in (8):
(8) h(x) = Σ_{i=1}^{N} w_i C_i(x)

Here, w_i represents the weight of the ith constraint, C_i represents the number of violations of that constraint, and Σ_{i=1}^{N} represents the sum over all N constraints. The maxent value is calculated with (9).
This is not a probability, but rather demonstrates the relative probability of the input.
(9) P*(x) = e^(−h(x))
And probability is calculated from the maxent value as in (10):

(10) P(x) = P*(x) / Z, where Z = Σ_{y∈Ω} P*(y)
The output used in phonotactic learning is not the probability of the input itself, which
due to the large number of possible forms contained in Ω is impractical to report. Rather,
the maxent value, meant to show the relative probability between the forms, is given.
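The toy grammar in Table 3.3 can be reproduced in a few lines of Python; this sketch restricts Ω to the three forms shown, purely for illustration:

import math

# Toy grammar from Table 3.3: constraint weights and violation counts.
weights = {"*#V": 3.0, "*C#": 2.0}
violations = {"CV":  {"*#V": 0, "*C#": 0},
              "CVC": {"*#V": 0, "*C#": 1},
              "V":   {"*#V": 1, "*C#": 0}}

def score(x):                       # h(x) in (8): weighted sum of violations
    return sum(weights[c] * violations[x][c] for c in weights)

def maxent_value(x):                # P*(x) in (9)
    return math.exp(-score(x))

Z = sum(maxent_value(y) for y in violations)   # normalizer over Omega, as in (10)
for x in violations:
    print(x, score(x), round(maxent_value(x), 2), round(maxent_value(x) / Z, 2))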
How are the constraints and weights for the model determined? The model name refers
to its function of maximizing the entropy, a measure of randomness in the system5, which
S. Della Pietra, V. Della Pietra, and Laﬀerty (1997) show is equivalent to maximizing the
probability (see (11)) of the observed forms.
(11) P(D) = Π_{x∈D} P(x), where D = the set of observed data
This probability is maximized by an iterative search algorithm similar to the stochastic
gradient descent described in the discussion of neural nets. The N constraint weights and the
5Entropy: −Σ_{x∈Ω} P(x) log(P(x)) (Cover and Thomas 1991).
total probability create a surface in (N + 1)-dimensional space, and though the surface is never calculated as a whole, at each stage the local gradient is determined and the search iterates upwards (in the direction of higher total probability of observed forms) until a maximum is reached. Unlike with neural nets, this surface is always convex, with only one global maximum for the search to find (S. Della Pietra, V. Della Pietra, and Lafferty 1997).
The speciﬁc algorithm used by Hayes and Wilson (2008) in their phonotactic learner is the
conjugate gradient method (Vetterling et al. 1992).
Hayes and Wilson (2008) set up their learner to maximize log(P(D)) for mathematical convenience, since the log function is monotonic and adjusting the weights to maximize log(P(D)) will necessarily maximize P(D). The partial derivative ∂/∂w_i log(P(D)) expresses the rate at which log(P(D)) will change in relation to the weight w_i, and the gradient is a vector of these partial derivatives. According to S. Della Pietra, V. Della Pietra, and Lafferty (1997), ∂/∂w_i log(P(D)) is additionally interpretable as the difference between observed violations of constraint C_i and expected violations of the constraint, formally O[C_i] − E[C_i].
Calculating E[C_i] necessitates a limit on the length of forms in Ω (all possible forms for our model; in my case, all logically possible onsets), otherwise the set is infinite. In accordance with other models used in this thesis, I limit all forms in Ω to a length of three segments or less. Now that Ω is a finite set, E[C_i] is expressed in (12):

(12) E[C_i] = Σ_{x∈Ω} P(x) C_i(x), where P(x) = the probability of x and C_i(x) = the number of C_i violations by x
Only one more piece is needed before presenting the full learning algorithm, which is a measure of accuracy for the constraints. This accuracy measure is the ratio of observed constraint violations (O[C_i]) to expected constraint violations (E[C_i]), or O/E. Hayes and Wilson (2008) also implement a statistical upper confidence limit on O/E to reflect a difference in accuracy between an O/E that equals 0/10 and one that equals 0/1000. Effectively, this means that instead of both accuracies being zero, 0/10 has a 0.22 accuracy score and 0/1000 has 0.002. This is because if there are only 10 logically possible violations, a low number of observed violations does not imply as strong a constraint as if there were 1000 logically possible violations.
With these pieces, the learning algorithm (Hayes and Wilson 2008) is constructed as follows:

(13) Phonotactic Learning Algorithm
Input: a set Σ of segments classified by a set F of features, a set D of surface forms drawn from Σ*, an ascending set A of accuracy levels, and a maximum constraint size N

Algorithm 1 Phonotactic Learning Algorithm
1: procedure PhonotacticLearner(A, D, F, N, Σ)
2:     Initialize empty grammar G
3:     for each accuracy level a in A do
4:         while there exists a constraint with accuracy < a do   ▷ Constraints given by D, F, N, Σ
5:             select the most general constraint and add it to G ▷ In the form of Table 3.3
6:             train the weights of the constraints in G          ▷ Gradient ascent
7:         end while
8:     end for
9:     return G
10: end procedure
CHAPTER 4
RESULTS
Two main metrics are used to report the results of the models. The RNN and SVM are
both classifiers, but with slight distinctions: the SVM assigns a direct binary label to its input, predicting that the input is attested or unattested, whereas the RNN assigns a score between 0 and 1, then assigns a binary label based on whether that score is above or below the threshold value of 0.5. For these classifiers, one metric is the accuracy of
classiﬁcation. Accuracy is the ratio of onsets whose classiﬁcation correctly matches whether
that onset is truly attested or not (gross status).
Accuracy is collapsed across classes, so for unbalanced data sets, if one class has much
more data than the other and that class is predicted correctly, the accuracy can be very high
even if the smaller class is poorly predicted. For this reason, I will break down the results
into confusion matrices which display the results by class and will be explained further.
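As a toy illustration of why collapsed accuracy is misleading on unbalanced data, consider the following sketch using Scikit-Learn’s metrics (the labels are invented):

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])  # imbalanced toy labels
y_pred = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])  # a model that always says "attested"

print(accuracy_score(y_true, y_pred))    # 0.8, despite ignoring the small class entirely
print(confusion_matrix(y_true, y_pred))  # rows = true class, columns = predicted class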
Ultimately however, the human acceptability judgments are not predictable by classi-
ﬁcation since they are gradient. To compare model output to these judgments, I will use
Pearson’s r correlation coeﬃcient, which is a measure of linear correlation between the model
output and the judgment data. A correlation of 1 means that the model perfectly predicts
each score (in this case plotting the results would result in all the points falling along a
straight line). A correlation value of 0 means that there is no relationship between the
model predictions and the human judgment data.
4.1 Gross Phonotactic Violation
The simplest way to account for phonotactic acceptability is that onsets that are attested
in the lexicon are judged highly and those that are unattested receive a lower acceptability
judgment. With respect to the modeling of onsets and Scholes (1966) data for this study,
attested is deﬁned as present in the set of onsets described in (3.1). The distribution of
ratings for attested and unattested onsets is shown in Figure 4.1. Both the attested and
unattested classes show a concentrated distribution around the high and low Scholes ratings,
respectively, with thin tails representing outliers. Onsets that were rated as acceptable by a
large number of Scholes participants that were unattested were mr and Sl, and onsets that
were rated as acceptable by a low number of partipants that were attested were Sm and sf.
The correlation value (Pearson’s r) for gross status and the Scholes data is 0.803.
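The gross status baseline itself amounts to set membership plus a correlation. The following sketch uses invented onsets and ratings purely to illustrate the computation (the value reported above comes from the full Scholes data):

from scipy.stats import pearsonr

# Hypothetical slice of the evaluation: attested status (1/0) for each test
# onset against its normalized Scholes rating (proportion of "yes" responses).
attested_set = {"bl", "fl", "kr"}                  # stand-in for Table 3.1
onsets = ["kr", "fl", "mr", "vr", "nr", "vp"]
scholes = [1.0, 0.94, 0.82, 0.58, 0.24, 0.0]       # illustrative values only

gross_status = [1 if o in attested_set else 0 for o in onsets]
r, p = pearsonr(gross_status, scholes)
print(round(r, 3))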
Figure 4.1: Distribution of normalized ratings in the Scholes
experiment for attested and unattested onsets.
4.2 SVM results
Each model performs above 90% accuracy regardless of how the data is embedded, but
a window size of one (e.g. an onset spl is represented as ((S), (P), (L))) does lead to the
best performance by a slight amount. See Figure 4.2 for comparisons.
Though the SVM does present a good model for separating categorical data, it is not
well-suited to this task for a few reasons. Consider Figures 4.3 and 4.4. They show a matrix
of probabilities where the upper left quadrant is the probability that the model is given a
negative onset and it guesses correctly, and the lower right quadrant is the probability that
the model is given a positive onset and it guesses correctly. The lower left quadrant shows
the probability of false negatives, where a positive data point is falsely classiﬁed as negative
Figure 4.2: The accuracy of the SVM model’s classiﬁcations of the
test data based on the type of embedding.
by the model; and the upper right quadrant shows the probability of false positives, where
a negative data point is falsely classiﬁed as positive by the model.
The imbalance of negative and positive examples in the training data skews the model
towards modeling the negative examples. Looking at the high rate of false negatives, the
model signiﬁcantly under-performs in predicting positive examples. The fact that the training
data is skewed towards the negative data likely plays a role in this. The high rate of false
negatives also holds in the predictions for the Scholes data when evaluated against the gross
status of the stimuli.
When correlated with the ratings of the participants, the Pearson's r is 0.328. Though
the SVM does not correlate well with the acceptability data, it is important to note that
this does not discount a categorical model, given the performance of the gross phonotactic
violation model (r = 0.803). Rather, the poor correlation likely has to do with the unbalanced
data and the nature of SVM training, which will be discussed below.

Figure 4.3: SVM results on withheld test set

Figure 4.4: SVM predictions for Scholes data, with ground truth set
to gross status
4.3 RNN results
The RNN performs quite well on the classiﬁcation task, with accuracy above 92% for
all iterations. The RNN outperforms the SVM and is also more robust to diﬀerent ngram
embeddings, but the diﬀerences are minimal. The RNN learns very quickly, within the ﬁrst
10,000 iterations. In Figure 4.5, the loss value is plotted over the iterations of the model. For
each iteration, a loss value is calculated which measures how far the model's conﬁdence in
the output is from the desired output: the higher the value, the farther the model is from
the desired predictions.
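The sketch below illustrates how such a loss rewards conﬁdent predictions, using a
binary cross-entropy objective in PyTorch (Paszke et al. 2017); the loss function and the
conﬁdence values here are assumptions for illustration, not the exact training conﬁguration:

    import torch
    import torch.nn as nn

    loss_fn = nn.BCELoss()
    target = torch.tensor([1.0])      # desired output: grammatical

    confident = torch.tensor([0.99])  # confidence near the target
    hedging = torch.tensor([0.60])    # intermediate confidence

    print(loss_fn(confident, target).item())  # ~0.01: low loss
    print(loss_fn(hedging, target).item())    # ~0.51: hedging is penalized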
Figure 4.6 shows a matrix of probabilities like those discussed above, where the upper left
quadrant is the probability of true positives, and the lower right quadrant is the probability
of true negatives.

Figure 4.5: Loss values during RNN training, showing how wrong the
model is for each iteration.
Figure 4.6: Confusion matrix showing the model’s accuracy for
guessing each class, and its error.
When the Scholes Experiment 5 data is compared against the model predictions, the
neural net can categorize it with 81.4% accuracy (48 out of the 59 tokens). The RNN’s
conﬁdence that an onset is grammatical is part of its output, and can be compared to the
percentage of subjects that judge an onset as grammatical. This comparison is shown in
Figure 4.7.¹
¹Note that some data points are overlapping in the ﬁgure.
The ﬁgure shows the onsets plotted based on the model’s % conﬁdence that the onset
is grammatical (coded as having a value of 1), or ungrammatical (coded as having a value
of 0), and the corresponding normalized judgments from the Scholes (1966) experiment. If
the normalized judgment is 1, that means all 33 participants selected “yes” when asked if
an onset was grammatical, and if the normalized judgment is 0, none of the 33 participants
selected “yes.” The onsets are also color-coded for whether they are actually attested onsets
in English or not. The plot shows where the model is misclassifying onsets, and also shows
that the model output falls almost entirely at the two ends of the scale.
Figure 4.7: Scatter plot showing the model’s % conﬁdence that an
onset is grammatical (value=1), compared to the percentage of “yes”
responses in the Scholes experiment.
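The normalization itself is simply the proportion of the 33 participants answering "yes"
for each onset; a sketch with hypothetical counts:

    # Hypothetical "yes" counts out of the 33 Scholes participants
    yes_counts = {"spl": 33, "mr": 20, "zl": 1}

    normalized = {onset: count / 33 for onset, count in yes_counts.items()}
    print(normalized)  # 1.0 = unanimous "yes"; 0.0 would be unanimous "no"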
4.4 Changing frequency structures in the data for RNN training
When the RNN is trained on the type frequency and equalized frequency datasets, the
accuracy remains similar in both cases.² For both withheld test sets, the precision is higher
than the recall. Both models perform better at identifying positive cases than negative cases
on the test data. Ultimately, the model trained on the equalized frequency dataset lags in
accuracy at 91%, while the type frequency model accurately classiﬁes 94%. However, the
models diﬀer more on the Scholes results. The model with type frequency training has higher
precision than recall, but the model with equalized training has perfect recall and lower
precision, which leads to a higher overall accuracy in predicting the Scholes results. In terms
of correlation, on the other hand, the value comparing model predictions to the participant
responses is substantially higher for the type frequency model. Ultimately, this is the more
important measure, as it reﬂects how well the models predict the human responses.
²Recall that the training data can represent the onsets using the frequencies with which
they appeared in the CMU dictionary (type frequency data), or represent each onset an
equal number of times (equalized data).
Confusion matrix:  0.99  0.11
                   0.01  0.89
Total accuracy: 94%
Table 4.1: Type frequency model results on withheld test set

Confusion matrix:  0.98  0.16
                   0.02  0.84
Total accuracy: 91%
Table 4.2: Equalized frequency model results on withheld test set

Confusion matrix:  0.83  0.28
                   0.17  0.72
Total accuracy: 78%
Pearson's r: 0.635
Table 4.3: Type frequency model results predicting Scholes data

Confusion matrix:  0.63  0.00
                   0.37  1.00
Total accuracy: 82%
Pearson's r: 0.458
Table 4.4: Equalized frequency model results predicting Scholes data
4.5 Changing frequency structures in the data for MaxEnt training
As discussed before, the RNN’s output exhibits little gradience. The model depends on
thousands of iterations of training that reward conﬁdent predictions. It is worth comparing to
the maximum entropy phonotactic learner model (Hayes and Wilson 2008), which can provide
a more gradient output. It is important to note that Hayes and Wilson use the following
method to make their model output proportional to the normalized Scholes judgments:
ﬁrst, computing a maxent value from the model’s score of a test item, as in (14), and then
incorporating a free parameter T which is tuned to the test data to maximize the correlation
values (15).
(14) P∗(x) = e^(−x)

(15) predicted-rating(x) = P∗(x)^(1/T)
The tuning parameter is a value that can be freely adjusted to morph the model's outputs
into values with the highest correlation to the evaluation data: the exponent 1/T warps
and expands the existing diﬀerences between values to a larger scale or a desired shape.
With this parameter, the model can achieve a high correlation regardless of the structure
of the training data, because the outputs can be warped so that predicted judgments are
spread more evenly through the intermediate range (see Figures 4.8, 4.9, 4.10, and 4.11).
When the parameter is removed, the correlation falls slightly, but both models still perform
relatively well.
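To make the concern concrete, the sketch below ﬁts T directly to the evaluation data
by maximizing Pearson's r, following (14) and (15). This is a reconstruction of the procedure
rather than Hayes and Wilson's own code, and the scores and ratings are hypothetical:

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import pearsonr

    maxent_scores = np.array([0.2, 1.5, 3.0, 6.0])  # hypothetical scores per onset
    human_ratings = np.array([0.9, 0.7, 0.4, 0.1])  # normalized "yes" proportions

    def predicted_rating(scores, T):
        p_star = np.exp(-scores)     # (14): P*(x) = e^(-x)
        return p_star ** (1.0 / T)   # (15): warp the spacing with the exponent 1/T

    def neg_r(T):
        return -pearsonr(predicted_rating(maxent_scores, T), human_ratings)[0]

    best = minimize_scalar(neg_r, bounds=(0.1, 50.0), method="bounded")
    print(best.x, -best.fun)  # the T that maximizes r on this very evaluation set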
However, it is important to note that using such a parameter goes against standard
machine learning practices. Because this tuning parameter is optimized to whatever data is
being used to evaluate the model, it will not generalize well to predicting new data. If the goal
is deﬁning a general model of phonotactics that can correlate well to new human judgments,
the model without the tuning parameter must be used. In Table 4.5, the correlations between
the Maxent model results and the Scholes data show how the correlation drops when the
tuning parameter is removed, and how the model predictions are extremely similar regardless
of the training data used.
                        Equalized frequency data    Type frequency data
Score with tuning       0.861 (T = 10.05)           0.880 (T = 4.90)
Score without tuning    0.761                       0.769

Table 4.5: Correlations of the Phonotactic Learner scores with the
Scholes data
Figure 4.8: Equalized Frequency, Tuned
Figure 4.9: Equalized Frequency, Untuned
Figure 4.10: Type Frequency, Tuned
Figure 4.11: Type Frequency, Untuned
CHAPTER 5
DISCUSSION
5.1 Model Comparisons
Both the existing model of Hayes and Wilson (2008) and the baseline case of gross
phonotactic violation perform better at correlating with the experimental data of Scholes
(1966) than either the SVM or the RNN. However, it is insightful to note that with the
removal of the tuning parameter, the plotted predictions of both the phonotactic learner and
the RNN are strikingly similar (see ﬁgures 4.9, 4.11, and 4.7). Speciﬁcally, these plots show
a distribution of predicted data that falls in two groups at the top or the bottom of the plot,
without any intermediate data in between. Though in theory the models can output any
value between 0 and 1, the model output is strikingly binary. In the RNN's case, the model is
trained to make binary judgments, and penalized for intermediate judgments. However for
the phonotactic learner, there is no built-in operation that trains the model in this direction.
Of the models presented in this thesis, none exceeds the correlation value achieved by
correlating gross phonotactic violation with the Scholes data without tuning. This is an
important comparison to make; though certain amounts of manipulation might produce
a gradient model that also achieves a high correlation value, to what degree that model
is evidence for underlying gradient grammaticality depends on the success of alternative
explanations. If the gross status of phonotactic sequences can explain judgment data with
a similar level of success as a proposed gradient model, no one model can be claimed as
evidence for the nature of the human behavior that the models are explaining. All that has
been shown is that models with certain assumptions about gradience can also explain the
data in some way. These models are still far away from providing evidence for the nature of
phonotactic grammar.
It is also important to consider that for evaluating these models, the data from Scholes
(1966) is not an extensive test set; it simply represents 60 onsets where participants could
only choose a yes/no answer. A future step is to continue to evaluate these models against
human judgments in diﬀerent experimental settings to assure that these models are truly
correlating with human judgments.
Taking a wider point of view, how do these models compare not on the basis of perfor-
mance, but in terms of information that can be derived about human phonotactics? While
performance is an important indicator of the principles that might govern human phono-
tactic judgments, it is important to note the shortcomings of some of these models. First,
the use of negative evidence to train the SVM and RNN is clearly not analogous to the
human learner. Secondly, while the MaxEnt model chooses explicit constraints to optimize,
the neural net architecture is exceedingly diﬃcult to interpret. These are both concerns that
should be taken into account along with model performance.
5.2 Training data structure
Though the RNN model does correlate more highly with the Scholes data when trained
on the type frequency training data, the maximum entropy model does not change much in
its performance. This creates a problem for the maximum entropy grammar: though the
version trained with type frequency data does have a higher correlation with or without
the tuning parameter, the discrepancy is incredibly small and would likely vary with other
datasets. I ﬁnd this challenges the ability of the maximum entropy grammar to be an
accurate model of gradient grammaticality. Though the correlation value might be high,
looking at the input (in the frequency-equalized case) and the model output (without the
tuning parameter), it is not clear that gradient grammaticality is learned or expressed by
the model at all.
In fact, this model could be used as a binary classiﬁer by drawing
a decision boundary through the model outputs, such that if the Maxent value is greater
than 0.5 the onset is predicted as grammatical, and if it is lower than 0.5 it is predicted as
ungrammatical. This would be analogous to the classiﬁcation done by the RNN model.
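A sketch of this decision rule, with hypothetical values that mimic the observed clustering
at the ends of the scale:

    # Hypothetical maxent values clustered near 0 and 1
    maxent_values = [0.98, 0.97, 0.95, 0.04, 0.02, 0.01]

    predictions = [int(v > 0.5) for v in maxent_values]  # 1 = grammatical
    print(predictions)  # [1, 1, 1, 0, 0, 0]: thresholding loses almost nothing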
This also complicates the claims of Pierrehumbert (1993) about the relationship between
frequency structure in the lexicon and phonotactic acceptability. If this claim is true, models
should perform much better when trained with a type frequency structure. The RNN results
do not directly contradict this prediction: though the RNN competently learns to classify
the gross status of onsets from training data without frequency structure, suggesting that
attestedness can be learned without it, the RNN still has a higher correlation with the
Scholes participant responses when trained with the type frequency data. On the other
hand, the Maxent model maintains a nearly identical correlation
with the Scholes data regardless of which dataset it is trained on, speciﬁcally when untuned
scores are reported.
5.3 Experiment data used for evaluation
One concern of current work is the lack of judgment data to use as evaluation for com-
putational models. The Scholes dataset is quite small and could be improved upon greatly.
In future work, it will be necessary to run a judgment experiment designed as evaluation
data for these models. Evaluating these models on further experimental data will either
strengthen or weaken their merit. A future experiment could use a Likert scale rating task
to capture gradience in each participant's individual judgments, in which case a correlation
metric could assess the similarity of each participant's responses to the group's to ensure
that intermediate judgments are produced consistently across speakers. This would more
accurately represent the phenomenon of gradient acceptability than the Scholes data, since
the gradience would be represented as an intra-speaker measure and not an average over
yes/no responses.
CHAPTER 6
CONCLUSION
In conclusion, the three main ﬁndings are the following: ﬁrst, for the models used in this
thesis, it was found that the maximum entropy model correlates best with the acceptability
judgments from the Scholes data, followed by the RNN model and the SVM model. Second,
the models with the capability to provide gradient scores for the onset predictions do provide
scores on a continuum, but predict onsets as either highly acceptable or highly unacceptable.
Lastly, using training data with equalized frequency structure (where the number of onsets
in the training is equal for each unique onset) does not signiﬁcantly impact the maximum
entropy model performance, slightly worsens RNN performance on a withheld test set, and
signiﬁcantly hinders RNN performance in correlating with Scholes’ judgment data.
I claim that this shows that a model providing a high correlation value with human judg-
ment data is not enough to make a convincing case for gradient phonotactic grammaticality.
The motivation for a probabilistic account is the nature of gradient acceptability judgments
found by Albright and Hayes (2003) and Albright (2007). Regardless of the correlation value
or metric used for evaluation, if the model does not output a range of gradient judgments
which not only correlate well with the data, but can accurately predict the intermediate
judgments, it has failed to capture the gradience that prompted the modeling to begin with.
More merit should be given to categorical approaches when modeling phonotactics, and
nuance around the interaction between the probabilistic and categorical pieces should be
carefully explained and tested. Phonotactics is only one of many areas of linguistics where
researchers are increasingly interested in testing the potential of deep learning methods as a
way to learn more about linguistic knowledge. However, a number of assumptions make it
diﬃcult not to be misled by the results of the models, as they can only oﬀer analogies and
can be diﬃcult to interpret.
Both the SVM and the RNN underperform compared to previous work. I think this
could be due to the imbalance of the two-class training. These are both discriminative
models requiring positive and negative data to learn, but generating negative data creates
an imbalanced dataset, resulting in both models overﬁtting to the negative data, with a high
rate of false negatives.
The phonotactic learner provides an advantage over neural models in generating explicit
constraints, and it correlates well with the Scholes data. However, the model
fails to capture any notion of intermediate judgments or gradience without being tuned to
the test data that it is being validated upon. Moreover, it does not seem to rely on any
frequency structure in the data to perform well.
Correlation values can mislead interpretation if not presented alongside a visual plot of
the data. Anscombe (1973) presented four datasets, now known as Anscombe's Quartet,
which have very diﬀerent distributions but extremely similar correlation values. For this
reason, the results of phonotactics models should not be boiled down to a descriptive statistic
like Pearson's r, as this is not a full description of the result. Though the phonotactic learner
itself is probabilistic in nature, I believe it cannot be claimed as a model of gradient
grammaticality or acceptability for this reason.
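A small demonstration of the same point with hypothetical numbers: a prediction set
that tracks the gradience and one that clusters at the ends of the scale can both correlate
highly with the same ratings, even though only the ﬁrst reproduces the intermediate
judgments:

    from scipy.stats import pearsonr

    ratings = [0.05, 0.20, 0.40, 0.60, 0.80, 0.95]         # gradient judgments
    gradient_preds = [0.10, 0.25, 0.35, 0.65, 0.75, 0.90]  # intermediate values kept
    binary_preds = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]          # clustered at the ends

    print(pearsonr(ratings, gradient_preds)[0])  # high r
    print(pearsonr(ratings, binary_preds)[0])    # also high r, gradience lost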
If the goal, above all else, is to ﬁnd a model that correlates with the data in a way that
reproduces the intermediate judgments, the best possibility might lie in an RNN language
model (Mikolov et al. 2010; Mayer and Nelson 2019). Mayer and Nelson (2019) do train a
model that correlates better with judgment data than the maxent phonotactic learner.
However, I believe the results of this thesis show that recent modeling approaches have
not fully investigated the assumptions these models rely on. It is not clear to
me that the maxent phonotactic learner is evidence for gradient grammaticality; and though
a neural network might be performing marginally better with more gradient output, neural
nets are notoriously diﬃcult to interpret. There are no constraints that are generated, and
everything the model has learned is contained in a “black box” of weights and biases inside
the network.
The ability of a neural net to correlate with human judgments is exciting and should
be pursued further, but at this point I do not think anything can be said about how the
neural network performance informs us about human grammaticality knowledge. Though I
agree with Pater (2019), Mayer and Nelson (2019), and Mirea and Bicknell (2019) that the
integration of neural network modeling and linguistics is a promising and thrilling future for
the ﬁeld, I believe extra care must be taken to continually ensure that when these models
are learning, we the researchers are learning something as well.
BIBLIOGRAPHY
Albright, Adam (2009). “Feature-based generalization as a source of gradient acceptability.”
In: Phonology.
— (2007). “Natural classes are not enough: Biased generalization in novel onset clusters.”
In: 15th Manchester Phonology Meeting, Manchester, UK.
Albright, Adam and Bruce Hayes (2003). “Rules vs. analogy in English past tenses: A com-
putational/experimental study.” In: Cognition.
Anscombe, Francis J (1973). “Graphs in statistical analysis”. In: The American Statistician
27.1, pp. 17–21.
Armstrong, Sharon Lee, Lila Gleitman, and Henry Gleitman (June 1983). “What Some
Concepts Might Not Be”. In: Cognition 13, pp. 263–308. doi: 10.1016/0010-0277(83)
90012-4.
Bailey, Todd M and Ulrike Hahn (2001). “Determinants of wordlikeness: Phonotactics or
lexical neighborhoods?” In: Journal of Memory and Language 44.4, pp. 568–591.
Chomsky, Noam (1965). Aspects of the Theory of Syntax. 50th ed. The MIT Press. isbn:
9780262527408. url: http://www.jstor.org/stable/j.ctt17kk81z.
Chomsky, Noam and Morris Halle (1965). “Some controversial questions in phonological
theory.” In: Journal of Linguistics.
— (1968). The Sound Pattern of English. New York: Harper and Row.
Coltheart, M. (1977). “Access to the internal lexicon”. In: The psychology of reading. url:
https://ci.nii.ac.jp/naid/10018074200/en/.
Cover, Thomas M. and Joy A. Thomas (1991). Elements of information theory. New York:
Wiley.
Daland, Robert et al. (Aug. 2011). “Explaining sonority projection eﬀects”. In: Phonology
28. doi: 10.1017/S0952675711000145.
Della Pietra, Stephen, Vincent Della Pietra, and John Laﬀerty (1997). “Inducing Features of
Random Fields”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence.
Dupoux, Emmanuel et al. (Feb. 2004). “Epenthetic Vowels in Japanese: a Perceptual Illu-
sion?” In: Journal of Experimental Psychology: Human Perception & Performance 25.
doi: 10.1037//0096-1523.25.6.1568.
Durvasula, Karthik et al. (Feb. 2018). “Phonology modulates the illusory vowels in perceptual
illusions: Evidence from Mandarin and English”. In: Laboratory Phonology 9. doi: 10.
5334/labphon.57.
Elman, Jeﬀrey L. (1990). “Finding structure in time.” In: Cognitive Science.
Goldwater, Sharon and Mark Johnson (2003). “Learning OT constraint rankings using a
maximum entropy model”. In: Proceedings of the Stockholm workshop on variation within
Optimality Theory. Vol. 111120.
Gorman, Kyle (2013). “Generative Phonotactics”. PhD thesis. University of Pennsylvania.
Halle, Morris (1962). “Phonology in generative grammar.” In: Word.
— (1959). The Sound Pattern of Russian. The Hague: Mouton.
Hay, Jennifer, Janet Pierrehumbert, and Mary Beckman (2004). “Speech Perception, Well-
formedness, and the Statistics of the Lexicon”. In: Papers in Laboratory Phonology VI,
pp. 58–74.
Hayes, Bruce and Colin Wilson (2008). “A Maximum Entropy Model of Phonotactics and
Phonotactic Learning”. In: Linguistic Inquiry.
Jaynes, Edwin T. (1983). Papers on probability, statistics, and statistical physics. USA:
Kluwer Boston. isbn: 9027714487.
Jurafsky, Daniel and James H. Martin (2009). Speech and Language Processing (2nd Edition).
USA: Prentice-Hall, Inc. isbn: 0131873210.
Kabak, Baris and William Idsardi (Feb. 2007). “Perceptual Distortions in the Adaptation
of English Consonant Clusters: Syllable Structure or Consonantal Contact Constraints?”
In: Language and speech 50, pp. 23–52. doi: 10.1177/00238309070500010201.
Mayer, Connor and Max Nelson (Oct. 2019). “Phonotactic learning with neural language
models”.
McKinney, Wes (2010). “Data Structures for Statistical Computing in Python”. In: Proceed-
ings of the 9th Python in Science Conference. Ed. by Stéfan van der Walt and Jarrod
Millman, pp. 51–56.
Mikolov, Tomas et al. (Jan. 2010). “Recurrent neural network based language model”. In:
vol. 2, pp. 1045–1048.
Mirea, Nicole and Klinton Bicknell (2019). “Using LSTMs to Assess the Obligatoriness of
Phonological Distinctive Features for Phonotactic Learning”. In: ACL.
New, Boris and Christophe Pallier (2009). SubtlexUS - Lexique. url: lexique.org/?page_id=241
(visited on 04/01/2020).
Paszke, Adam et al. (2017). “Automatic diﬀerentiation in PyTorch”. In: NIPS-W.
Pater, Joe (2019). “Generative linguistics and neural networks at 60: foundation, friction,
and fusion.” In: Language.
Pedregosa, F. et al. (2011). “Scikit-learn: Machine Learning in Python”. In: Journal of Ma-
chine Learning Research 12, pp. 2825–2830.
Pierrehumbert, Janet (1993). “Prosody, Intonation, and Speech Technology”. In: ed. by M.
Bates and R. Weischedel. Cambridge, UK: Cambridge University Press, pp. 257–282.
Python Software Foundation (n.d.). Python Language Reference. Version 3.6.6. url: https:
//python.org.
Sarver, Isaac (May 2020). Phonotactics Models. doi: 10.17605/OSF.IO/F76ZN. url: osf.io/
f76zn.
Scholes, Robert J. (1966). Phonotactic Grammaticality. Mouton.
Schütze, Carson (Mar. 2011). “Linguistic Evidence and Grammatical Theory”. In: Wiley
Interdisciplinary Reviews: Cognitive Science 2, pp. 206–221. doi: 10.1002/wcs.102.
Shademan, Shabnam (2006). “Is Phonotactic Knowledge Grammatical Knowledge?”
Vetterling, William T. et al. (Nov. 1992). Numerical Recipes Example Book C (The Art of
Scientiﬁc Computing). 2nd. Cambridge University Press. isbn: 0521437202.
Weide, Robert L. (1998). The CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
Accessed: 2019-09-14.