MODELING AND 
PREDICTION OF GENETIC REDUNDANCY IN 
ARABIDOPSIS 
THALIANA AND SACCHAROMYCES CEREVISIAE
 
 
By
 
Siobhan Anne Cusack
 
 
A DISSERTATION
 
Submitted to
 
Michigan State University
 
in partial fulfillment of the 
requirements
 
for the degree of
 
 
Cell a
nd Molecular Biology

Doctor of Philosophy
 
20
20
 
A
BSTRACT
 
 
MODELING AND PREDICTION OF GENETIC REDUNDANCY IN ARABIDOPSIS 
THALIANA AND SACCHAROMYCES CEREVISIAE
 
 
By
 
Siobhan Anne Cusack
 
Genetic redunda
ncy is a phenomenon where more than one gene encodes product
s that 
perform the same function. This frequently manifests experimentally as a single gene knockout 
mutant which does not demonstrate a phenotypic change compared to the wild type due to the 
pres
ence of a paralogous gene performing the same function; a ph
enotype is only observed when 
one or more paralogs are knocked out in combination. 
This presents a challenge in a fundamental 
goal of genetics, linking genotypes to phenotypes, especially because 
it is difficult to determine 
a 
priori 
which gene pairs are r
edundant. Furthermore, while some factors that are associated with 
redundant genes have been identified, little is known about factors contributing to long
-
term 
maintenance of genetic redundancy. 
Here, 
we applied a machine learning approach to predict 
redu
ndancy among 
benchmark redundant and nonredundant 
gene pairs in the model plant 
Arabidopsis thaliana
.
 
Predictions were validated using well
-
characterized 
redundant and 
nonredundant 
gene pairs. Additionally, 
we leveraged the availability of fitness and mult
i
-
omics 
data in 
the 
budding yeast 
Saccharomyces cerevisiae
 
to build machine learning models for 
pred
icting genetic redundancy and related phenotypic outcomes (single and double mutant 
fitness) among paralogs, and to identify features important in generatin
g these predictions. 
Collectively, our models of genetic redundancy 
provide quantitative assessments
 
of how well 
existing data allow predictions of fitness and genetic redundancy, 
shed light on characteristics 
that may contribute to long
-
term maintenance o
f paralogs that are seemingly functionally 
 
redundant, and will ultimately allow for more targeted ge
neration of 
phenotypically
 
informative 
mutants, advancing functional genomic studies.
 
iv
 
 
ACKNOWLEDGEMENTS
 
 
First and foremost, I would like to 
give my hear
tf
elt 
thank
s to
 
Shin
-
Han Shiu for 
taking a 
chance on me and accepting me into his lab. 
Without his guidance and mentorship I would not 
have succeeded
 
in graduate school
. I 
would 
also 
like to 
t
hank 
my
 
guidance 
committee members
: 
Rob Last and Kathy Osteryoun
g
,
 
for their advice and support, and Yair Shachar
-
Hill for 
invaluable belief in and advocacy for me throughout my graduate career. 
The Cell and Molecular 
Biology program has been a wonderful academic home and source of support, and for that I 
thank 
Sue Con
r
a
d, 
Peggy Petroff, 
and Alaina Burghardt
.
 
T
he 

pleasure of working with

Melissa Lehti
-
Shiu, Peipei Wang, Fanrui Meng, Paityn Donaldson, 
Sarah Horan
, Thilanka Ranaweera and Serena Lotreck

 
made each day a delight to come to 
work
 
a
nd contributed to the friendly, supportive atmosphere of the Shiu Lab which I will miss 
very much
.
 
Thank
s are especially due
 
to
 
Christina Azodi and Bethany Moore, 
for 
their
 
willingness to 
lend a hand
 
as I got the hang of Python and machine learning, and 
f
o
r 
their
 
friendship
, which 
made
 
the more challenging aspects of grad s
tudent life
 
much easier
.
 
I will 
always cherish 
our 
Starbucks trips.
 
Finally, 

Ronan, for teaching me how to 
function on less sleep than I ever thought possible
, and 
Geo
f
f
,
 
for 
the 
truly 
relentless 
encouragement 
and 
for 
sticking with me
 
through all the ups and downs
. 

next 80 
years
 
together
.
 
 
v
 
 
TABLE OF CONTENTS
 
 
L
I
S
T
 
O
F
 
FIGURES
 
................................
................................
................................
......................
 
vii
 
 
KEY TO ABBREVIATIONS
 
................................
................................
................................
........
 
ix
 
 
CHAPTER 1: INTRODUCTION: A
 
HISTORY AND FUTURE OF GENETIC 
REDUNDANCY
................................
................................
................................
.............................
 
1
 
1.1 How has the definition and u
nderstand
ing of genetic redundancy changed over time?
 
.......
 
2
 
1.2 How is gene
tic redundancy relevant in genetics studies and how is it
 
addressed?
 
...............
 
4
 
1.3 How can genetic redundancy be maintained over evolut
ionary ti
me?
 
................................
.
 
6
 
1.4 What questions about genetic redundancy remain and h
ow can they be addressed?
 
...........
 
9
 
 
CHAPTER 2
: GENOME
-
WIDE PREDICTIONS OF GENETIC REDUNDANCY IN 
ARABIDOPSIS THALIANA
 
USING MACHINE LEA
RNING
 
................................
....................
 
11
 
2.
1 Abstract
 
................................
................................
................................
...............................
 
12
 
2.2 Introduction
 
................................
................................
................................
.........................
 
13
 
2.3 Results and Discussion
 
................................
................................
................................
.......
 
16
 
2.3.1 Definitions of genetic redundancy
 
................................
................................
...............
 
16
 
2.
3.2 Optimal parameters for prediction of genetic redundancy with machine learning
 
......
 
19
 
2.3.3 Comparison of models built with different redundancy definitio
ns
 
............................
 
22
 
2.3.4 Important evolutionary f
eatures i
n predicting redundant and nonredundant gene pairs
................................
................................
................................
................................
...............
 
24
 
2.3.5 
Im
portant gene expression, functional, and network characteristics
 
...........................
 
27
 
2.3.6 
Redundancy predictions for Arabidopsis gene pairs not
 
in the b
enchmark dataset
 
....
 
31
 
2.3.7 Validation of predictions
................................
................................
..............................
 
37
 
2.4 Conclusions
 
................................
................................
................................
.........................
 
40
 
2.5 Materials and Methods
 
................................
................................
................................
........
 
43
 
2.5.1 Definitions of redunda
nt and no
nredundant gene pairs
 
................................
...............
 
43
 
2.5.2 Feature value generation
 
................................
................................
..............................
 
43
 
2.5.3 Functional annotation and evolutionary property features
 
................................
..........
 
45
 
2.5.4 Gene
 
expression and epigeneti
c modific
ation features
 
................................
................
 
47
 
2.5.5 Protein sequence and network property 
features
 
................................
.........................
 
48
 
2.5.6 Identification of features distinguishing redundant and nonredundant pairs
 
...............
 
49
 
2.
5.7 Redun
dancy prediction model building and optimization with machine learning
 
......
 
50
 
APPENDIX
 
................................
................................
................................
...............................
 
52
 
 
CHAPTER 3: MODELING MUTANT FITNESS AND GENETIC REDUNDANCY IN 
SACCHAROMYCES CEREVISIAE
 
................................
................................
...........................
 
65
 
3.1 Abstract
 
................................
................................
................................
...............................
 
66
 
3.2 Introduction
 
................................
................................
................................
.........................
 
66
 
3.3 Results and Discussion
 
................................
................................
................................
.......
 
69
 
3.3.1 Distribution of fitness scores
 
................................
................................
.......................
 
69
 
3.3.2 Predictions 
o
f single
 
mutant fitness
 
................................
................................
.............
 
71
 
vi
 
 
3.3.3 Feature importance in predicting SM f
itness
 
................................
...............................
 
72
 
3.3.4 Predictions of DM fitness and comparison of important features
 
...............................
 
75
 
3.3.5 Prediction
 
of doubl
e mu
tant fitness using single mutant fitness
................................
..
 
78
 
3.3.6 Predicti
ng degrees of genetic redundancy and comparison of important features
 
......
 
81
 
3.4 Conclusions
 
................................
................................
................................
.........................
 
87
 
3.5 Ma
terials a
nd Methods
 
................................
................................
................................
........
 
89
 
3.5.1 Data source
................................
................................
................................
...................
 
89
 
3.5.2 Features for single mutant fitness predictions
 
................................
.............................
 
90
 
3.5.3 Features for double mutant and genetic redundancy predictions
................................
.
 
91
 
3.5.4 Definition of genetic redundancy
................................
................................
.................
 
92
 
3.5.5 Machine learning models
 
................................
................................
.............................
 
92
 
APPENDIX
 
................................
................................
................................
...............................
 
94
 
 
CHAPTER 4: CONCLUSIONS
 
................................
................................
................................
.
 
106
 
 
RE
FERENCES
 
................................
................................
................................
...........................
 
109
 
 
vii
 
 
L
I
S
T OF FIGURES
 
 
Figure 1.1: Polyploidy as an ancient strategy for survival
 
after protocell division.
 
.......................
 
2
 
 
Figure 1.2: Single/double mutan
t phenoty
pes indicating genetic redundancy.
 
..............................
 
4
 
 
Figure 2.1: Arabidopsis 
phenotype categories and redundancy definitions.
 
................................
 
18
 
 
Figure 2.2:
 
Machine learning pipeline workflow and performance of models b
uilt with
 
different 
redundancy definitions
.
 
................................
................................
................................
.................
 
23
 
 
Figure 2.3: Features si
gnificantly associated with redundancy.
 
................................
...................
 
28
 
 
Figure 2.4: 
Feat
ures associated with mispredictions.
 
................................
................................
...
 
32
 
 
Figure
 
2.5: Predicted redundancy scores for gene pairs throughout the Arabidopsis genome.
 
....
 
35
 
 
Figure 2.6: Model performance on holdout test sets.
................................
................................
....
 
38
 
 
Figure 2.S1: Benchmark gene pair phenotypes.
 
................................
................................
...........
 
53
 
 
Figure 2.S2: Comparison of models built with different algorithms a
nd numbers of features.
 
...
 
54
 
 
Figure 2.S3: Statistical association of features among different feature categories with 
redundancy
.
 
................................
................................
................................
................................
...
 
56
 
 
Figure 2.S4: Important features in predicting redundancy.
 
................................
..........................
 
58
 
 
Figure 2.S5: 
Enrichment of GO terms among redundant gene pairs vs. nonredundant gene pairs 
for each redundancy definition.
 
................................
................................
................................
....
 
60
 
 
Figure 2.S6: Performance of models trained on RD4 and RD9 in cross
-
validat
ion.
 
....................
 
62
 
 
Figure 2.S7
: Distribution of values among features that may contribute to mispredictions.
........
 
63
 
 
Figu
re 3.1: Distribution of mutant fitness values.
................................
................................
.........
 
71
 
 
Figure 3.2:
 
Performance and important features of single mutant fitnes
s model.
 
........................
 
73
 
 
Figure 3.3: Performance and important features of doubl
e mutant 
fitness model.
 
.......................
 
77
 
 
Figure 3.4: Performance and important featu
res of double mutant fitness model including single 
mutant fitness
-
de
rived features.
 
................................
................................
................................
....
 
79
 
 
viii
 
 
Figure 3.5: Performance of model built
 
only wit
h and correlation of fitness values with single 
mutant fitness
-
derived features.
 
................................
................................
................................
....
 
82
 
 
Figure 3.6: Distribution of genetic redundancy scores.
 
................................
................................
 
83
 
 
Figure 3.7: Performance and important feature
s of redu
ndancy model.
 
................................
.......
 
84
 
 
Figure 3.S1: Distribution of gene family siz
es.
 
................................
................................
............
 
95
 
 
Figure 3.
S2: Feature importance scores for SM fitness model.
 
................................
....................
 
96
 
 
Figure 3.S3: Distribution
 
of SM fi
tness values among gene pairs with selected annotations.
 
.....
 
97
 
 
F
igure 3.S4: Feature importance scores vs. number of gene pairs with annotation.
 
....................
 
99
 
 
Figure 3.S5: Distribution of DM fitness val
ues amon
g
 
gene pairs with selected annotations.
 
..
 
100
 
 
Figure 3.S6: Perfo
rmance of DM fitness models built with single features.
 
..............................
 
102
 
 
Figu
re 3.S7:
 
Distribution of redundancy scores among gene pairs wi
th selec
t
e
d annotations.
 
..
 
103
 
 
Figure 3.S8: Feature importance scores vs
. number of gene pairs with annotation and difference 
in redundancy score among genes in a pair.
 
................................
................................
...............
 
105
 
 
ix
 
 
KEY TO ABBREVIATIONS
 
 
AUC
-
ROC
 
 
A
rea under the curve 

 
receiver operating characteristic
 
AU
-
PRC
 
 
Area under the precision r
ecall curve
 
DM
 
 
Double mutant
 
EN
 
 
Elastic net
 
GB
 
 
Gradient boosting
 
GO
 
 
Gene onto
logy
 
MAPK
 
 
Mitogen
-
activated protein kinase
 
ML
 
 
Machine learning
 
MYA
 
 
Millio
n
 
years ago
 
NR
 
 
Nonredundant
 
 
PCC
 
 
PTM
 
 
Post
-
translati
onal modification
 
RD 
 
 
Redundancy 
definition
 
RF
 
 
Random Forest
 
SD
 
 
Standard deviat
ion
 
SGD
 
 
Saccharomyces Genome Database
 
SM
 
 
Single mutant
 
SMF
 
 
Single
 
mutant 
f
i
tness
 
SVM
 
 
Support vector machines
 
SVR
 
 
Support vector regression
 
TAIR
 
 
The Arabidopsis
 
Information Resource
 
x
 
 
WGD
 
 
Whole genome duplication
 
WT
 
 
Wild type
 
 
1
 
 
C
HAPTER
 
1
: 
INT
RODUCTION
: 
A
 
HISTORY AND FUTURE OF GENETIC 
REDUNDANCY
 
 
2
 
 
1.1 
How has the
 
definit
i
o
n and understanding of genetic redundancy changed over time?
 
Conceptually, genetic redund
ancy was originally placed in the context of the origins of 
cellular life 
(Gabriel 1960)
, wherein the shift from free
-
floating organic m
olecules
 
t
o protocells, 
capable of metabolizing, growing, and dividing i
nto functional daughter prot
ocells, would have 
been haphazard. Protocells would not have had carefully controlled
 
systems to ensure even 
distribution of genetic material during divisio
n. It wa
s
 
therefore posited that polyploidy was 
likely a necessary pre
-
c
ondition for cellular life, 
as it would increase the chances that a full set of 
genes would be passed on to each 
new protocell
 
(
Figure 1.1
)
. As a side effect, this would result 
in what
 
was ter
m
e

h multiple copies 
of each ge
ne. It was further hypothesized that the development of increased fidelity in vertica
l 
gene transmission would allow duplicate gene copies to diverge and fo
r some t
o
 
begin to develop 
 
 
Figure 1.1
:
 
Polyploidy as an ancient strate
gy for survival after protoc
ell division
.
 
Illustration of polyploidy as an ancient strategy for survival after pr
otocell division. The top row 
represents a haploid protocell, which doe
s not pro
du
ce viable daughter protocells after unequal 
distribution of genetic material in division.
 
The bottom ro
w represents a diploid protocell, which 
produces viable daughter protoc
ells even after unequal division.  
 
3
 
 
new functions.
 
 
From these hypothet
ical conc
e
ptions of genetic redundancy, there has been a 
considerable shift in how it is defined. Ea
rly research on genetic redundancy was primarily in the 
model organism 
Saccharomyces 
cerevisiae 

 
system 
used for ge
netics an
d
 
biochemistry due to its genetic tractability and ease of growth 
(Botstein
 
and Fink 1988)
. Some studies characterizing biochemical activity found that there were mu
ltiple 
structurally similar enzymes performing the same functions 
(e.g., Hunter and Markert 1957, 
Lacroute et al. 1965)
. In this context of biochemica
l activity, genetic redundancy referred to 
unlinked 
genes encoding enzy
mes capab
l
e of catalyzing the same reaction 
(Mortimer 1969)
. A 
simi
lar definition placed genetic redundancy in terms of functional robustness in biochem
ical 
pathways due to interconnected metabolic networ
ks
 
(Weintraub 1993
)
. In th
e
se definitions, there 
was no requirement for the genes themselves to be related, and thus 
could include non
-
paralogous genes derived from convergent evolution
 
(Pickett and Meeks
-
Wagner 1
995)
. 
 
 
While ma
ny early 
s
tudies on genetic redundancy focused on biochemical activity, some 
distinguished between g
enetic redundancy (caused by paralogous genes with identical protein 
products) and fun
ctional redundancy (caused by genes with 
different protein products th
at 
cataly
z
e the same reaction
)
 
(
Khan and Haynes 1972)
. 
This more genetics
-
based conception of 
genetic redundancy was later expanded to also i
nclude partially redundant paralogs that 
had 
evolved new functions sin
ce the du
p
lication event from which they arose 
(Pickett a
nd Meeks
-
Wagner 1995)

enetic redundancy 
presents experimentally
 
(discussed below), recent st
udies pri
m
arily use this last definition, 
defining genetic redundancy as
 
paralogous genes that maint
ain some of the same functionality 
(e.g.,
 
Kempin et al. 1995)
.
 
4
 
 
1.2 How is ge
netic redundancy relevant in 
genetics 
s
tudies and how is it addressed?
 
 
Experimentally, 
genetic redundancy 
is
 
commonly observed a
s a single gene knockout 
(
referred to here as a 
single mutant) 
that shows no 
abnormal 
phenotype compared with a 
wild
-
type organism, w
hile
 
a double
 
or higher
-
order mu
t
a
nt 
of that same gene does have a
n abnormal
 
phenotype
 
(
Figure 1.2
)
.
 
Studies of organisms 
w
ith genomes 
that have been mutated to the point 
of essential saturation in 
single muta
nt
 
knockouts suggest that 
genetic redundancy is extremely 
common; a ge
nome
-
wide
 
s
ingle mutant
 
knockout study in yeast found that only 15% of mutants 
had decreased growth 
under normal growth conditions 
(Giaever et al. 2002)
, while in 
Arabidopsis 
thaliana
 
singl
e mutants with a loss of function phenotype have been described for 
ab
out 10% o
f
 
the genes in the genome 
(Lloyd and Meinke 2012)
. Becaus
e the paradigm of 
 
reverse genetics (and to some extent, forward 
genetics) relies on observable mutant phenotypes 
 
 
F
igure 1.2
:
 
Single/double mutant phenot
ypes indi
c
a
ting genetic redundancy.
 
 
Simple example of how genetic redundancy may manifest in the co
ntext of a single
-
double 
mutant knockout s
tudy. 
This example depicts mutants of 
Arabidopsis thaliana
. Neither single 
mutant has an abnormal phenotype compar
ed to the
 
w
ildtype, while the double mutant is 
noticeably smaller and chlorotic. The lack of abnorma
l phenotype in either single mutant 
indica
tes th
at the genes can compensate for one another, meaning they may have redundant 
functions.
 
5
 
 
to elucidate gene fu
nctions,
 
t
h
is 
poses a fundamental problem 
in 
the pursuit 
of linking genotypes 
 
to phenotypes.
 
Effort
s have been undertaken in both systems t
o overcome this problem by 
generating large se
ts of double and higher
-
order mutants. The synthetic genetic array met
hod, a 
wa
y
 
of systematically generating all possible combinations of double mutants 
(Tong
 
2001)
, 
h
a
s 
made this a relatively straightforward proposition in yeast, and allowed for exhaustive
 
observation of single and double mutant
 
phenotypes 
(Costanzo et al.
 
2016)
. In
 
p
lants, it is simply 
not feasible with current technologie
s to gene
r
a
te all possible combinations of double mutants to 
exhaustively identify phenotypes among 
gene pairs that may be redundant. Noneth
eless, large 
sets of double mutants have been 
generated in Arabidopsis to enable the study of genetic 
redundancy 
(
e.g., 
Bolle 
et al. 20
1
3
, 
Su and Krysan 2016)
, which has generated valuable 
phenotypic information about 
single
 
mutant
s and corresponding 
double mutant
s in paralogous 
gene pairs in plants. 
 
 
As ment
ioned above, there is not universal agreement about the 
severity of m
utant 
phe
n
ot
yp
es 
necessary for two
 
genes to be considered redundant. In the most striking examples o
f 
genetic redundancy, 
single mutant
s
 
show 
no phenotype
 
while the corresponding 
double m
utant
 
is 
lethal
. Some well
-
studied examples of this in Arabidopsis in
clude the
 
mi
togen activated protei
n 
kinase (MAPK) double mutant 
mpk3
/
mpk6
 
(Wang et al. 20
07)
 
and the plasma membrane H
+
 
pump double mutant 
aha1
/
aha2
 
(Haruta et al. 2010)
 
which are lethal, while the respective si
ngle 
muta
n
ts
 
show no phenotype. It is also possible for the 
double mutant
 
to have a deleterious but 
nonlethal 
phenotype
; for example, the MAPK 
double mutant
 
mpk9
/
mpk12
,
 
which has a
ltered 
response to abscisic acid 
(Jammes et al. 
2009)
, and the MAPK kinase kinase mutant 
anp1
/
anp3
, 
which has reduced growth 
(Krysan et a
l. 2002)
.
 
T
hese two situations, where the 
single mutant
s 
have no phenotype, are considered by some 
to be the only true type of genetic redundancy (e.g
., 
6
 
 
Bro
okfield 1997)
. However, other def
initions also include 
single mutant
s with phenotypes, as 
long as the 
double mu
t
an
t
 
phenotype is more severe. Examples of this include MAPK 
double 
mutant
s such as 
mpk4
/
mp
k11
 
(Kosetsu et al. 2010)
, 
mpk4
/
mpk5
, and 
mpk4
/
mpk13
 
(Su and 
Krys
an 2016)
, where each 
single mutant
 
has a dwarf 
phenotype and the 
dou
ble mutan
t
s 
have 
more severe dwarf phenotypes. Interestingly, the severity of the dwarf phenotype va
ries between 
the double mutants. 
Th
is may be a result of
 
unequal genetic redundancy, wh
ere 
a 
gene product 
A 
can compensate for 
gene pro
duct B more effective
ly than B
 
ca
n compensate for A
. For example, 
two tryptophan synthase 

genes 
(TSB1 and TSB2) 
are expressed at 
very 
different levels
, 
with TSB1 being expressed at levels approximat
ely 15 times higher than TSB2 
(Pruitt and Last 
1993)
. 
As a result, 
a knockout for 
TSB1 
shows a normal phenotype 
due to compensatory 
function of TSB2 
und
er low light conditions
,
 
because
 
the 
plants are 
growing 
relatively slowly
. 
However, 
there is a severe chlorotic pheno
type of t
h
is
 
same mutant under
 
high light conditions, 
because 
when 
pl
ants are growing quickly 
the 
lower 
expression level of TSB2 is not suf
ficient to 
compensate for 
TSB1
 
(Last et al. 1991)
.
 
G
ene
s
 
such as MPK4
,
 
which appears to show different 
degrees of redundancy with 
various 
members of the gene family
, and TSB2, whi
ch has 
conditional compensation for its homolog,
 
sugges
ts that there 
are
 
complicated evolutionary 
mechanisms acting on duplicate genes
, which will be discussed more below
. 
 
 
1.3 How can genetic redundancy be maintained over evolutionary time?
 
The question of how genetic redundancy can be stably maintained is a
 
puzzling one. A 
commonly held idea is that there is an evolutionary advantage to maintaining multiple copies of 
a gene; if one copy was compromised, the presence of the other would provid
e a phenotypic 
buffering effect 
(Zhang 2012)
. However, there are several issues with this hypothesis. One is that 
7
 
 
are more likely 
to be singletons 
(DeSmet et al. 
2013)
, which would l
imit the utility of genetic redundancy as an evolutionary failsafe. A second 
issue is that this hypothesis rests on the capacity of an organism to select for t
raits that are 
beneficial to
 
potential 
future 
circumstances, which is contrary to our current und
erstanding of 
selective pressure. Additionally, there is an energetic cost to maintain every gene in the genome 
(Lynch 
and Marinov 2015)
, and therefore
 
an evolutionary force towards elimination of 
unnecessary material to minimize energetic losses. In fact, the outcome most com
monly seen 
after gene duplication events 
in plants is pseudogenization 
(Panchy et al. 2016)
. 
Thus, while the
 
idea of an organism retaining multiple copies of a gene makes some intuitive sense, genetic 
redundancy in fact represents an evolutionary paradox 
(Nowak et al. 1997)
.
 
 
In spite of this, the literature is replete with examples of genetic redundancy
 
throughout 
the tree of life
, from 
the relatively small genomes of many 
bacteria
 
(Guckes et al. 2019)
 
and the 
model animal 
C
aenorhabditis
 
elegans
 
(Tischler et al. 2006)
 
to the larger genomes of the model 
animal
 
Mus musculus 
(Peters et al. 1999)
, 
monocots
 
(Yao et al. 2
008
,
 
Li et al. 2019)
 
and
 
dicots
 
such as Arabidopsis (discussed above). 
It has been hypoth
esized that genetic 
redundancy is a 
temporary and unstable state which only occurs during the few million years after a duplication 
event, before there has been enough t
ime for one copy to become pseudogenized. However, 
while this may be true in some cases, redundant
 
genes in
 
S. cerevisiae 
and 
C. elegans 
have been 
shown to 
originate from duplication events that happened over 600 million years ago
 
(Vavouri et 
al. 2008)
, and Arabidopsis has genetically redundant paralogs from duplica
tion events that 
happened over 50 million years ago 
(Blanc and Wolfe 2004; Maere et al. 2005)
, far longer than it 
would 
be expected
 
to take for 
one of the 
genes to become pseudogenized or develop a new 
function
. 
 
8
 
 
Several other mechanisms have been proposed to explain
 
how 
long
-
term stable 
maintenance
 
is pos
sibl
e.
 
In the case of paralogs with both overlapping and more recently evolved 
separate functions, both 
genes 
may be retained due to independent positive selection 
(Pickett and 
Meeks
-
Wagner 1995)
. Multiple copies
 
of a gene 
may be retained if their presence provides a 
selective advantage through increased levels of 
the same gene product
. F
or example
, 
some 
bacteria 
with
 
a relatively high
 
copy number of
 
rRNA
 
genes
 
can 
take advantage of nutrient
-
rich 
environments
 
by g
rowing more quickly than bacteria with fewer rRNA genes 
(Klappenbach et 
al. 2000)
; a 
higher copy number allows for co
mparatively rapid generation of 
ribosomes through 
parallel transcription and subsequent
ly for quicker protein synthesis.
 
A selective advantage may 
also 
be co
nferred by maintenance of redundant paralogs 
through the action of several similar 
gene products that act as functionally redundant checkpoints, 
for example in cell cycle 
progression, 
to 
ensur
e
 
that important cellular pr
ocesses are carried out properl
y 
(Thomas 1993)
. 
M
athematical
ly 
model
ling genetic r
edundancy 
showed that s
tabl
e
 
maint
e
n
ance 
over time can 
occur
 
if the efficacy of each duplicate at performing a given function (measured by the fitness of 
single m
utants) is inversely correlated
 
with
 
the mutation rate in each gene 
(Nowak et al. 1997)
. 
Large
-
scale duplication events can r
esult in retention of duplicate
 
gene
s that encode subunits of 
multimeric protein complexes, as proper functioning of the complex is dependent on the 
stoichiometry of the subunits, which would be disrupted by the loss of one duplicate 
(Birchler 
and Veitia 2010)
. Finally, there is recent evidence of 
coordinated 
differential expression among
 
s
ome p
aralogs, such that knocking out one gene triggers compensatory upregulation of another 
(El
-
Brolosy et al. 2019; Ma et al. 2019)
, which would be expected to promote retention of both
. 
Thus, there are many explanations as to how redundancy c
ould
 
be stably ma
intained even over 
9
 
 
millions of 
years
, but fewer answers as to which mechanisms generally allow for prolonged 
retention of genetic redundancy.
 
 
1.4 
What 
questions
 
about genetic redundancy
 
remain 
and how can they be 
addressed
?
 
Genetic redundancy is 
likely overestimated in the literature
. 
Ident
ification of redundant 
genes typically requires the discovery of a 
single mutant
 
with no phenotype; however, 
there are 
many reasons a phenotype in a 
single mutant
 
may not be readily apparent
. For example, 
phenot
ypes may be present but c
onditional
,
 
s
ub
tle
, 
or t
issue
-
specific 
(
discussed by 
Bouché and 
Bouchez 2001; Bolle et al. 2013)
. A more accurate assessment of the prevalence of genetic 
redundancy will require studies 
specifically designed to take these aspects int
o acc
ount. 
Furthermore, while genetic
 
redundancy
 
tends to be discussed as a binary trait (gene pairs are 
classified as redundant or nonredundant), or sometimes as a categorical trait (gene pairs are fully 
redu
ndant, partially redundant, or nonredundant; 
e.g., 
Khan and Haynes 1972)
, phenotypic data 
from gene families suggest that the degree of redundancy exists over a range (as mentioned 
above with MPK4) which c
ould therefore be calculated as a continuous tr
ait b
ased on fine
-
scale 
phenotypic data. As this is likely to be more biologically relevant than a classification scheme, a 
model of genetic redundancy should be built that allows for degrees of genetic redund
ancy on a 
continuum rather than a binary or cat
egori
cal state.
 
Some of the factors associated with long
-
term maintenance of genetic redundancy in 
model organisms have been identified. For example, transcription factors are enriched among 
redundant gene pai
rs in Arabidopsis, suggesting that function of 
the g
ene product influences 
retention 
(Bl
anc and Wolfe 2004)
. We also know that the m
echan
ism(s) of duplication play a 
role; for example, stress related genes derived from tandem duplication are retained at a higher 
10
 
 
rate in Arabidopsis than those derived from other duplication mechanisms 
(Hanada et al. 2008)
. 
In yeas
t, redundant paralogs have a lower nonsynonymou
s sub
stitution rate than nonredundant 
paralogs 
(Li et al. 2010)
. However, there are no obvious patterns that can be reliably used to 
determine 
a priori
 
which gene pairs are likely to be redundant and therefore should be targeted in 
double mutant
 
studies to generate phenotypi
cally informative mutants. 
 
To begin addressing the challenges presented above, the research in this dissertation 
focuses on two maj
or points:
 
1)
 
Building a binary 
(classification) 
model of genetic redundancy in Arabidopsis to predict 
redundancy among paralog
ous gene pairs using several different definitions of genetic 
redundancy, to ultimately aid in future targeted mutant generation. 
 
2)
 
B
uilding a continuous model of genetic redundancy in yeast to i
dentify features associated 
with 
redundancy
, to begin answerin
g evolutionary questions about 
its
 
maintenance.
 
Together, the
se approaches
 
contribute to understanding of the biological basis of ge
netic 
redundancy and provide a foundation for future modeling as more fine
-
scale data become 
available. 
 
 
11
 
 
C
HAPTER
 
2
: 
G
ENOME
-
WIDE PREDICTIONS OF GENETIC REDUNDANCY IN 
ARABIDOPSIS THALIANA
 
USING MACHINE LEARNING
 
 
12
 
 
2.
1 Abstract
 
 
Genetic redundancy refers to a situation where an individual with a loss
-
of
-
function 
mutation in one gene (single mutant) does not show an apparent phen
o
type until one or more 
paralogs are also knocked out (double/higher
-
order mutant). Previous studies hav
e identified 
some characteristics common among redundant gene pairs, but a predictive model of genetic 
redundancy incorporating a wide variety of featur
e
s has not yet been established. In addition, the 
relative importance of these characteristics for genet
ic redundancy remains unclear. Here, we 
establish machine learning models for predicting whether a gene pair is likely redundant or not in 
the model pla
n
t 
Arabidopsis thaliana
. Benchmark gene pairs were classified based on six feature 
categories: functiona
l annotations, evolutionary conservation including duplication patterns and 
mechanisms, epigenetic marks, protein properties including post
-
translationa
l
 
modifications, 
gene expression, and gene network properties. The definition of redundancy, data 
transf
ormations, feature subsets, and machine learning algorithms used affected model 
performance significantly. Among the most important features in predicti
n
g gene pairs as 
redundant were having a paralog(s) from recent duplication events, annotation as a tran
scription 
factor, downregulation during stress conditions, and having similar expression patterns under 
stress conditions.
 
Predictions were then tested 
u
sing phenotype data withheld from model 
building and validated using well
-
characterized, redundant and 
nonredundant gene pairs. This 
genetic redundancy model sheds light on characteristics that may contribute to long
-
term 
maintenance of paralogs that are 
s
eemingly functionally redundant, and will ultimately allow for 
more targeted generation of functionally
 
informative double mutants, advancing functional 
genomic studies.
 
 
13
 
 
2.2 Introduction
 
Genetic redundancy, which refers to multiple genes that perform the
 
same function, has 
been defined in many ways since the 
mid
-
1900s 
(Gabriel 1960)
. A
n early study of genetic 
redundancy in 
Saccharomyces cerevisiae
 
discussed it in the context of unlinked genes encoding 
enzymes catalyzing the same reacti
o
n 
(Mortimer 1969)
. A later study took a broader view of 


n
es with primarily 
regulatory functions 
(Pickett and Meeks
-
Wagner 1995)
. In studies from a number of model 
organisms, multiple examples of what is considered genetic redundancy have been given, 
including: genes d
e
rived from convergent evolution encoding enzymes t
hat perform the same 
function 
(Pickett and Meeks
-
Wagner 1995)
; biochemical pathways that are redundant due to 
interconnected metabolic networks 
(Weintraub 1993)
; and genes from the same family 
(paralogs) that m
aintain some of the same functionality 
(Kempin et al. 1995)
. Discussions of 
genetic redundancy in recent literature mostly encompass this last definition, where a duplication 
event results in mul
t
iple copies of a gene that retain overlapping func
tions (e.g., 
Chen et al. 2010
, 
Bolle et al. 2013
, 
Rutter et al. 2017
). Practically, genetic redundancy is commonly observed as
 
a 
single gene knockout mutant that shows no phenot
ype or a mild phenotype compared with a 
wild
-
type organism, with a double or higher
-
order mutant showing a more severe phenotype. 
 
After a gene is duplicated, selection may be relaxed on each copy, allowin
g
 
accumulation 
of mutations in one copy, which can 
lead to pseudogenization 
(Brookfield 1992)
; thus, the 
presence of 
genetically redundant paralogs long after the duplication event would seem to be an 
evolut
i
onary paradox 
(Nowak et al. 1997)
. In spite of this, the literature is replete with examples 
of genetic redundancy, and many
 
redundant genes in species such as 
S. cerevisiae 
a
nd 
14
 
 
Caenorhabditis elegans 
originated from duplication events that happened over 600 million years 
ago 
(Vavouri et al. 2008)
. 
At least two mechanisms may explai
n
 
how this is possible. Redundant 
copies can be ret
ained for a long time due to the slow pace of genetic drift in large populations. 
Based on a few key assumptions, it is estimated that a mutation deleterious to the function of a 
duplicate copy could take 
0
.75 to 5 million years to be fixed in 
Arabidopsis 
thaliana 
(Panc
hy et 
al. 2016)
. However, this cannot account for the apparent redundancy among paralogs from the 
most recent whole genome duplication (WGD) in the Arabidopsis lineage ~50 million years ago 
(Bowers et al. 2003)
. Another possibility is that genetic redundancy is selected for due to its 
ability to
 
buffer the effect of a deleterious mutation in one
 
paralog 
(Zhang 2012)
. The issue is 
that such a mechanism requires selection based on future needs, which is counter to our 
understanding of evolution. 
A
 
mathematical model has been used to demonstrate th
at 
redundancy can be stably maintained over time 
(Nowak et al. 1997)
. Howe
v
er, the model 
requirement for perfect equivalency 
in gene functions and in mutations between paralogs seems 
unrealistic. Due to the challenges in assessing functions of paralogs, the extent of genetic 
redundancy and the factors contributing to it remain l
a
rgely unclear.  
 
Plants are an excellent resource 
for studying the fate of duplicated genes due to the 
relatively high rate of WGD events. While pseudogenization (loss of gene function) is the most 
common fate of duplicated genes in plants 
(Panchy et al. 2016)
, some
 
duplicates are retained. By 
identifying and compar
ing characteristics of paralogous gene pairs and singleton genes, studies 
have revealed, for example, a lower synonymous substitution rate among retained (i.e., not 
pseudogenized) paralogs 
(Jiang et al. 2013)
, suggesting that these ge
ne pairs are relatively recent 
duplicates or that there is selective pressure to retain the ancestral (or a more recently evolved) 
function. Retention bias is also seen for some gene functions. For example
,
 
paralogous 
15
 
 
transcription factor and signaling gen
es are retained at a higher rate than DNA repair genes 
(Blanc and Wolfe 2004)
. Retent
ion rates of paralogs also vary by duplication mechanism

retained tandem duplicates are more frequently involved in stress responses 
(Hanada et al. 2008)
, 
and genes involved in signaling processes are preferentially retained when
 
derived from WGD 
rather than smaller duplication events 
(Maere et al
.
 
2005)
. While these studies reveal some 
characteristics of genes that are retained after duplication, they do not directly address whether 
these retained paralogs maintain redund
ant functions. A landmark study in 
Arabidopsis
 
addressed this question using
 
machine learning to integrate 43 gene features related to sequence 
similarity and gene expression, and predicted that ~50% genes in the 
Arabidopsis
 
genome have 
at least one redund
ant paralog 
(Chen et al. 2010)
. In 
t
his study, a gene whose single mutant 
showed no abnormal phenotype (or a mild phenotype) and its closest match in the genome based 
on sequence similarity were defined as a redunda
nt pair. The most important features for 
predicting redundancy included diff
e
rences in isoelectric point, molecular weight, and predicted 
protein domains between genes in a pair. While this pioneer study provided insights into the 
prevalence of genetic red
undancy, redundancy was defined in only one way without considering 
the phen
o
types of the corresponding double mutants. Also, in the decade since that study 
substantially more functional genomic data have become available; inclusion of these data in 
additi
on to sequence similarity and gene expression may improve the accuracy of re
d
undancy 
predictions.
 
While the definition of redundancy presented above is prevalent, observation of unequal 
genetic redundancy, where the single mutant for one paralog shows a mu
ch more severe 
phenotype than the other and the double mutant has a still mo
r
e severe phenotype 
(Briggs et al. 
2006)
, promotes the idea 
t
hat redundancy is more accurately conceptualized as a continuum. 
16
 
 
However, the time
-
consuming nature of precise phenotyping required to quantify redundancy in 
this manner means tha
t such data are available for relatively few paralogs, and discussions of 
ge
n
etic redundancy frequently exclude single mutants with severe phenotypes. Here we build 
upon previous work by modeling genetic redundancy using multiple definitions of redundancy 
by including single mutants in multiple phenotypic categories, and incorpora
t
ing over 4,000 gene 
features from six categories, including functional annotations, evolutionary properties, protein 
sequence properties, gene expression patterns, epigenetic modi
fications, and network properties. 
We compared several machine learning algo
r
ithms and feature selection methods to identify 
which of the features have the most predictive power with respect to redundancy
. Independent of 
the model, we additionally performe
d statistical analysis to identify features common among 
redundant gene pair
s
 
using nonredundant gene pairs as a contrast. To estimate the prevalence of 
genetic redundancy throughout the genome, we used two of the best
-
performing genetic 
redundancy definit
ions to predict whether ~18,000 gene pairs in the Arabidopsis genome are 
gen
e
tically redundant. Finally, to assess the accuracy of our model, we validated predictions 

-
characterized gene
 
pairs.  
 
 
2.3 Results
 
and Discussion
 
2
.3.1 
Definitions of genetic redundanc
y
 
The designation of a gene pair as genetically redundant requires phenotype data for 
double mutants and the corresponding single mutants. To define a set of benchmark redundant 
and nonredundant gene pairs, we used phenotype data for 2,400 single and 347 h
i
g
her
-
order 
Arabidopsis mutants (including 271 double mutants) from a previous study 
(Lloyd and Meinke 
2012)
 
in which mutants were classified as having no phenotype, a less severe phenotype (i.e., 
17
 
 
conditional, cellular/bio
c
h
emical, or morphological), or a severe phenotype (i.e., lethal, 
indicating the gene is essential) based on comparison with wild
-
type individuals. We assigned 
these categories phenotype class numbers: 0 (no phenotype), 1 (conditional), 2 
(cellular/biochem
i
c
al), 3 (morphological), and 4 (lethal) (
Figure 
2.
1A
) and applied this same 
phenotype classification to 29 additional gene pairs 
(Bolle et al. 2013)
, resulting in a final 
benchmark set of 300 single and double mutant trios (two single mutants and one corresponding 
double mutant). Note that our 
data are from e
x
p
eriments generally not designed to 
assess genetic
 
redundancy
 
and typically conducted 
in 
one or 
a 
limited number
 
of conditions
 
and environments
. 
Thus, it more straightforward to identify an abnormal phenotype (indicative of nonredundancy) 
than to prove th
e
 
absolute absence of an abnormal phenotype (indicative of redundancy). 
 
Using the benchmark phenotype data and the core idea for defining genetic redundancy 
based on 
phenotype severity of single mutants only in compariso
n to the corresponding double 
mutan
t
, we established nine redundancy definitions (RD1 to 9) (
Figure 
2.
1B
). The most 
inclusive definition of genetic redundancy was RD9. Under this definition, a gene pair
 
was 
considered redundant if the phenotypes of both si
ngle mutants were less severe (i.e.
,
 
assigned a 
lower phenotype class number) compared with that of the double mutant; 190 benchmark gene
 
trios met this definition. The other RDs were subsets of RD9. Us
ing these more stringent 
definitions, only mutants of 
a particular phenotype class were i
n
cluded in the benchmark dataset; 
for example, when using RD4 only single mutants of class 0 (no phenotype) and double mutants 
of class 4 (lethal) were considered. The
 
use of multiple definitions offered insulation agains
t 
errors due to the inherent challe
n
ges of classifying phenotypes into specific categories (e.g., 
some morphological phenotypes are much more severe than others; under specific conditions, 
conditional l
ethal is effectively the same as lethal). For example,
 
while RD4 excluded
 
18
 
 
Fig
ure
 
2.
1
: 
A
rabidopsis phenotype categories and redundancy definitions.
 
(
A
) Phenotype classes from Lloyd and Meinke 2012. (
B
) Definitions of redundancy and 
nonredundancy (NR) bas
ed on phenotype classes of both single mutants (SM1 and SM2) and the 
double mutant (DM) f
o
r each gene pair. The number of gene pairs assigned to each definition is 

are in reference to phenotype category numbers; for 
ex

-
4 combined and RD9 is 
RD1
-
8 combined.
 
 
19
 
 
conditional lethal double mutants, as these belonged to phenotype class 1, both types of mutants 
were included in RD5 and RD9. While we acknowledge that this classification of phenot
ype 
severity has caveats, in the absence of quantitati
v
e phenotype data on a large scale, quantitative 
categories together with our multiple definitions of redundancy allow us to better utilize the 
dataset and begin addressing redundancy more as a continuu
m than as a binary problem.
 
To define nonredundant gen
e
 
pairs, a single definition was used: two genes were 
considered nonredundant if the double mutant was in the same phenotype class as either single 
mutant or in a class with a lower number; that is, at 
least one single mutant had an equal or more 
severe ph
e
notype than the double mutant (
Figure 
2.
1B
). The nonredundant set contained 110 
gene trios. The nearly 2:1 ratio of redundant to nonredundant gene pairs may reflect a bias in the 
reporting of double mutant phenotypes; a positive result supporting the pres
e
nce of a more 
severe phenotype in the doub
le mutant would tend to be reported, with negative results less 
likely to appear in the literature. 
Because comparably fewer gene pairs for which the double 
mutant has no abnormal phenotype have been reported, non
r
edundant gene pairs very likely are 
less c
ommon in our dataset than they are in nature. 
Double mutants with much more dramatic 
phenotypes compared with the single mutants were also overrepresented in our dataset (
Figure 
2.
S1
), likely for similar reasons. 
A
s a result, some definitions that included
 
only double mutants 
with mild or no phenotypes had too few gene pairs (RDs 1, 2, and 6, which had 16, 10, and 13 
gene pairs, respectively) to generate robust models and were therefore excluded from further 
analys
e
s. 
 
 
2.3.2 
Optimal parameters for predictio
n of genetic redundancy with machine learning
 
 
Machine learning allows integration of multiple data types to build a statistical model that 
 
20
 
 
can
 
predict a specific outcome. In our case, we were interested in estab
li
shing a machine 
learning 
model that could p
redict whether a gene pair was redundant or not using six broad categories of 
data: functional annotations, evolutionary properties, protein properties, gene expression 
patterns, epigenetic modifications, and ne
tw
ork properties (
Table S1
). The general appr
oach we 
took is illustrated in 
Figure 
2.
2A
. Here the input for the model consisted of benchmark gene 
pairs (instances), classified as redundant or nonredundant (labels) according to our nine 
definitions, and inf
or
mation about the genes and gene pairs from 
the six categories of data 
(referred to as features). To alleviate the possibility of overfitting our model due to the large 
number of features examined (~4,000) compared with the number of instances (161
-
300 
de
pe
nding on the definition), we used 90% of th
e benchmark gene pairs (training set) to train a 
model for predicting if a gene pair is redundant or not in a 10
-
fold cross
-
validation scheme. 
Performance was measured using the Area Under the Curve
-
Receiver Ope
ra
ting Characteristic 
(AUC
-
ROC); higher score
s indicate a higher true positive rate (proportion of all redundant gene 
pairs correctly predicted) over the range of false positive rates (proportion of gene pairs 
incorrectly predicted as redundant). Performan
ce
 
was additionally measured using the Area 
U
nder the Precision Recall Curve (AU
-
PRC); higher scores here indicate greater precision 
(proportion of gene pair predictions that are correct) over the range of true positive rates 
("recall"). Because we used a 
bi
nary classification scheme (redundant or no
t) for machine 
learning, a model classifying gene pairs at random would have a score of 0.5 for both the AUC
-
ROC and AU
-
PRC measures, while a perfect model would have a score of 1. Comparing three 
commonly used 
ma
chine learning algorithms, Support Vector M
achine (SVM), Random Forest, 
and Gradient Boosting, we found SVM was on average the best
-
performing algorithm when 
21
 
 
using RD9 (
Figure 
2.
S2A
-
B
; ANOVA, 
p
-
value
 
< 2 x 10
-
16

Differen
ce
 
test, 
p
-
values < 0.008). 
 
We next explored
 
how the number of features examined and feature value transformation 
affected model performance. While models using multiple features generally perform better than 
those based on single features, the presence o
f 
uninformative features can decrease model 
p
erformance. We tested two methods, Random Forest and Elastic Net, for selecting the most 
informative features. We looked at the effect of transformation because transforming feature 
values (e.g., taking the squa
re
 
of values) can amplify small differences, 
allowing subtle patterns 
to be more readily identified. We transformed the features four different ways (log, square, 
reciprocal, and binned) and compared model performance using the original, non
-
transformed 
fe
at
ures; the best transformation for each feat
ure (as determined by feature importance scores 
from the trained models); or multiple transformations of the same original feature. We tested 24 
feature combinations (see 
Table S2
)
 
by asking how well the model b
as
ed on each feature 
combination performed in
 
predicting the RD9/nonredundant benchmark genes in cross
-
validation. We found that using 200 features selected with Random Forest, using the best 
transformations of each, led to the best performing model (AUC
-
R
OC
 
= 0.74, 
Figure 
2.
S2C 
and 
AU
-
PRC = 0.72, 
Fi
gure 
2.
S2D
), with a 15% and 18% improvement in performance over a 
model using all of the untransformed features (AUC
-
ROC = 0.64, 
Figure 
2.
S2E
 
and AU
-
PRC = 
0.61, 
Figure
 
2.
S2F
)
. The selected features included many
 
t
hat were different 
representations of 
the same, raw feature. For example, several features related to total synonymous substitution rate 
(
Ks
), namely maximum 
Ks
, minimum 
Ks
, average 
Ks
, difference in 
Ks
, and total (sum) 
Ks
 
for 
genes in a pair (see 
Method
s
)
 
were all among the
 
features selected for RD9, demonstrating that 
22
 
 
representing a characteristic such as 
Ks
 
in a variety of ways provides distinct and useful 
information for building the model.
 
 
2.3.3 
Comparison of models built with different 
redundancy d
ef
i
nitions
 
 
We anticipa
ted that the training sets established using some RDs would result in more 
accurate predictions than others. Therefore, we next identified the RD that resulted in the best 
predictions of redundancy using the optimal algorithm (SVM) a
nd
 
i
nput feature set th
at we 
identified (200 features selected with Random Forest, using only the best transformation of each 
feature). When comparing how well each model performed on the cross
-
validation sets, the 
model built for RD4 (referred to as the R
D4
 
m
odel) had the best 
performance (AUC
-
ROC = 
0.84, 
Figure 
2.
2B
;
 
AU
-
PRC = 0.82, 
Figure
 
2.
2C
; light blue lines
). This RD had the highest 
contrast between single and double mutant phenotype classes (0

no apparent phenotype

and 
4

lethal, respectively). A like
ly
 
r
eason for the bette
r performance of the RD4 model is that it 
was easier to build a model to distinguish between redundant an
d nonredundant gene pairs when 
the phenotype differences were the most extreme. The second
-
best models were the ones with 
the 
la
rg
e
st training sample sizes, RD5 and RD9 (yellow and green lines,
 
respectively, 
Figure 
2.
2B
-
C
). Thus, it appears that phenotype 
class contrast and sample size were the most important 
factors
 
influencing model performance. We therefore focused on models bui
lt
 
with the highest 
phenotype class contrast (RD4) and the largest sample sizes (RD5 and RD9) for further model 
building. 
 
 
Whil
e
 
the RD4 model performed the best in cross
-
validation, the majo
rity of redundant 
gene pairs in the Arabidopsis genome do not ha
ve such a high phenotype class contrast. We
 
 
23
 
 
Figure 2.2
:
 
Machine learning pipeline workflow and performance of models built with 
different redundancy definitions
.
 
 
(
A
) 
Machine l
e
arning pipeline workflow. Input data consisted of instances
 
(gene pairs) wit
h 
labels (redundant or nonredundant) and values of features (characteristics of gene pairs). 
Example features, as shown in the table, include DNA sequence similarity, t
he number o
f
 
genes 
in a pair annotated as having transcription factor (
TF) activity, max
imum gene expression level, 
and the average level of CpG methylation among genes in the pair. The full input data are 
provided in 
S
upplemental Data
. Instances were first split into training
 
and testing sets
. The 
training set was further split into a training subset (90%) and validation subset (10%) in a 10
-
fold 
cross validation scheme. The optimal model after tuning the m
odel parame
t
ers was used to 
provide performance metrics based on cross
-
validation, predi
ct labels in the training set for 
model evaluation purposes, and to obtain feature importance scores. (
B
-
C
) Cross
-
validation 
performance of models built using six of ni
ne RDs base
d
 
on
 
(
B
) Area Under the Curve 
-
 
Receiver Operating Character
istic (AUC
-
ROC) a
nd 
(
C
)
 
Area Under the Precision
-
Recall Curve 
(AU
-
PRC). RD1, 2, and 6 were not included due to small training data sizes. A model classifying 
gene pairs perfectly would 
have AUC
-
RO
C
 
and AU
-
PRC scores of 1.0; black dotted lines 
represent the
 
performance of a
 
model classifying at random, in which AUC
-
ROC and AU
-
PRC 
scores would be 0.5 given that we used balanced data (i.e., equal number of redundant and 
nonredundant instanc
es). (
D
) AU
C
-
ROC and (
E
) AU
-
PRC for a model trained using RD4 gene 
pair
s and half of the
 
nonredundant pairs (randomly selected) then applied to RD9 gene pairs 
(excluding RD4) and nonredundant pairs that did not overlap with those used in training the RD4 
m
odel.
 
 
24
 
 
therefore tested whether the RD4 model would prove useful in predicting redundanc
y b
e
tween 
gene pairs when there were less extreme phenotype differe
nces between the single and double
 
mutants. The RD4 model was applied to a test set composed of RD9 gene pairs (after removing 
RD4 pairs) and a random subset of half the nonredundant gene p
air
s
. While the AUC
-
ROC was 
only 0.62 (
Figure 
2.
2D
), the high AU
-
PR
C score (0.82, 
Figure 
2.
2E
) indicated that, as expected 
from applying a model built with a more conservative definition of redundancy, this model errs 
on the side of having a higher number 
of 
f
alse negatives rather than false positives. Similarly, the 
RD5 
model was applied to a test set composed of RD9 gene pairs (after removing RD5 pairs) 
and a random subset of half the nonredundant gene pairs. The performance of this model was 
significantl
y w
o
rse (AUC
-
ROC = 0.57, 
Figure 
2.
S2G
; AU
-
PRC = 0.59, 
Figure 
2.
S2H
)
. 
Thus
, 
the best
-
performing models for predicting redundancy among gene pairs with all types of 
phenotype contrasts were those trained on RD4 and RD9; therefore, these two models were used 
in 
t
he following analyses. 
 
 
2.3.4 
Important evolutionary features 
in predicting redundant and 
nonredundant
 
gene 
pairs
 
Because the identification of features that are distinct between redundant and 
nonredundant gene pairs can provide insights about the bio
log
i
cal underpinnings of redundancy, 
we next assessed whether the d
istribution of values for each feature among the six feature 
categories was significantly different between redundant and nonredundant gene pairs based on 
the RD4 and RD9 definitions (see 
M
ate
r
ials and 
Methods
). For RD4 and RD9, evolutionary 
properties had
 
the highest percentage of features statistically associated with redundancy (55% 
and 53% respectively, multiple testing
-
adjusted 
p
-
value (
q) <
0.05; 
Figure 
2.
3A
-
B
), and these 
25
 
 
features tende
d t
o
 
be the most significantly correlated with redundancy (median 
q
-
value of 
significant features = 0.0003 and 0.004 respectively; 
Figure 
2.
S3A
-
B
). Overall, a shared set of 
159 features were significantly associated with redundancy in both RD4 and RD9,
 
and
 
th
e
re was 
a correlation between 
-
log(
q
-

0.75, 
p
 
< 2.2 x 10
-
16
; 
Figure 
2.
3C
). This suggested that some features may be significantly 
associated with redundancy regardless of definition. However, a
mon
g
 
the top 200 features 
selected for 
building the RD4 and RD9 models, we found that only 33% and 25%, respectively, 
were significantly associated with redundancy when considered individually (
Figure 
2.
S3C
-
D
), 
highlighting the utility of considering featu
res
 
jointly using machine learning. 
 
We
 
next looked into individual features that distinguished redundant gene pairs defined 
using RD4 and RD9 from nonredundant gene pairs using feature importance scores from the 
trained models (
Table S3)
. In this case, an
 
im
p
ortance score represents the degree
 
to which an 
individual feature contributes to the separation of redundant from nonredundant gene pairs by the 
algorithm, with features with a higher importance score having a larger contribution. In total, 51 
feature
s w
e
re shared between the two models (
T
able S3
) with well correlated importance ranks 
(PCC = 0.63, 
Figure 
2.
S4
A
), suggesting that a core set of features are important for predicting 
redundancy using multiple definitions. However, a shared set of 51 feature
s l
e
aves ~75% of the 
200 features selec
ted for each model as unique, highlighting the significant effect of redundancy 
definition on the models and the types of important features recovered. 
 
The relative importance of the six feature categories ranked fro
m b
e
st to worst based on 
median importa
nce ranks for features in those categories in RD4/RD9
-
based models was as 
follows: functional annotations (32/17), evolutionary properties (63.5/81.5), network properties 
(123/81.5), gene expression patterns (110.5/10
1.5
)
, epigenetic modifications (108/140
), and 
26
 
 
protein properties (139/133.5). 
Note that the importance rank
s
 
do not mirror the findings in 
Figure 
2.
3A
-
B
, indicating that, for example, while the distributions of 
 
>50% of evolutionary 
property
-
based features
 
si
g
nificantly differed between redunda
nt and nonredundant pairs, these 
features were not as important in predicting redundancy as functional annotation features.
 
The 
most important feature in both the RD4 and the RD9 models, as determined by feature 
impor
tan
c
e scores, was whether the gene pair
s were duplicates from the 

-
WGD event (for the 
importance scores of the top 20 features, see 
Figure 
2.
S4B
-
C
), with 

-
WGD
-
derived gene pairs 
more likely to be redundant (
Figure 
2.
3D
).
 
The 

D e
v
ent in the 
Arabidopsis lineage, and
 
despite it having likely occurred ~50 million years ago, the importance 
of this feature suggests that gene pairs derived from this event have not diverged in sequence and 
function sufficiently to appear nonredundant.
 
 
T
w
o other evolutionary property featu
res that were important for both definitions were 
reciprocal best match (rank=7 and 15 for RD4 and RD9, respectively, 
Figure 
2.
S4B
-
C
) and a 
lethality score
-
derived feature (discussed below). Reciprocal best matches ar
e p
a
ralogous gene 
pairs that do not hav
e additional retained paralogs generated since their divergence; gene pairs 
that were reciprocal best matches were more likely to be redundant. As a pair of genes without 
more recent duplicates are themselves likely t
o b
e
 
the product of a relatively recent
 
duplication 
event (
Figure 
2.
S4D
), they are expected to have had less time to diverge in sequence and 
function, explaining their enrichment among redundant gene pairs. Consistent with this, 
Ka 
and 
Ks
-
related features 
ran
k
ed as high as 30 and 32 in the RD4 
and RD9 models, respectively. 
Nonetheless, contrary to our expectations, these evolutionary rate
-
related features were not the 
most informative. Instead, other characteristics confounded with rates of evolution, such 
as 
27
 
 
m
echanism/mode of duplication and, a
s discussed in the following sections, gene functions and 
expression profiles, played more important roles in the model. 
 
The reciprocal difference in lethality score was an important feature in both models 
(rank=2 an
d 9
 
for RD4 and 9, respectively, 
Figure
 
2.
S4B
-
C)
.
 
Lethality score is the likelihood 
that mutation in a gene will lead to a lethal phenotype in Arabidopsis 
(Lloyd et al. 2015)
. Thus, 
we would expect that each
 
ge
n
e in a redundant pair would have a 
low lethality score, and 
therefore a relatively small difference in lethality score for the gene pair. In contrast to our 
expectation, 
we found that redundant gene pairs generally had a smaller difference in reciproca
l 
l
e
thality scores (which equates to a 
larger difference in raw lethality score) compared with 
nonredundant gene pairs, although the difference was not significant (Wilcoxon 
test, 
q
-
value < 
0.11). This unexpected result was likely an artifact of a bias in 
our
 
data

lethality scores were 
predicte
d by 
Lloyd et al. 
(
2015)
 
for genes without kn
own single mutant phenotypes, but 92% of 
the genes included in our benchmark datas
et 
h
ave known (nonlethal) phenotypes. I
n the absence 
of a predicted lethality score, we used a score of 0 for known nonlethal mutants, which likely 
artificially lowered the a
verage lethality scores in our benchmark set. Nonetheless, the lethality 
scores st
ill
 
provided useful information, as ind
icated by the high importance ranking in both RD4 
and RD9. 
 
 
2.3.5 
Important gene expression, functional, and network characteristics
 
F
e
atures related to gene expression made up the largest portion o
f
 
features selecte
d for 
RD4 and RD9 model building, with a
 
total of 126 gene expression features selected for one or
 
both models. The predicted directionality of four features varied between th
e two definitions,
 
meaning that for a given feature, redundant 
g
ene pairs accordi
ng to one RD had higher values
 
28
 
 
Fig
ure
 
2.3
: Features significantly associated with redundancy.
 
(A
-
B) Pe
rcentage of features in each feature category that were significantly associated with
 
redundancy (Wilcoxon rank
-
sum test 
for continuous features; Fishe

features; all multiple
-
test corrected with Benjamini
-
Hochberg method) when usin
g (A) RD4 and 
(B) RD9. (C) Correlation between RD4 and RD9 
-
log(q
-
values) obtained us
i
ng the statistical 
tests as describ
ed in (A) and (B) for each fea
ture. (D) Distribution of values among redundant 
and nonredundant gene pairs for selected features using 
RD4 and RD9 (separated by a dotted 
line). For each model, a feature is shown here if 
t
he importance score ranked between 
1 and 20, 
was the highest in i
ts feature category, and was significantly associated with redundancy using 
the statistical tests describ
ed in (A) and (B). For transformed continuous features, untransformed 
feature values 
a
re shown, with transformed values s
hown as inserts. Abbreviations


-
depen

 
hormone treatments in which a gene 
in the pair is differentially 


n
regulated under biotic 
 
 
29
 
 
Figure 2.3 
cont

d
 

ess co
-
ex

-
expression, k
-

-
expression clusters generated from stress dataset
s with hierarchical (split into 25 clusters) and 
 
k
-
means (k=5) 
c
lustering, respectively; plots indicate whether the genes
 
in a 
pa
ir are in the same 
cluster or different clusters.
 
 
compared 
with nonredundant gene pairs, while the reverse was true for the other RD. For 
example, expression variation in the developmental expre
ssion dataset (after reciprocal average 
transformation) wa
s 
hig
h
er for redundant gene pairs according to RD4 than for non
redundant 
gene pairs, but lower for redundant gene pairs according to RD9. We also found that tissue
-
specific stress responses varied b
y redundancy definition; the mean rank of features related
 
t
o 
a
b
iotic stress response for RD4 was lower for root tissue (
97) than shoot tissue (120), while the 
opposite was true for RD9 (99 and 94, respectively). Features derived from biotic/abiotic stress
 
and hormone treatment data were more consistently informa
ti
ve 
a
cross definitions than those 
from the developmental datas
et; while there were four developmental gene expression features in 
the top 30 for RD9, no such features ranked higher than 54 for RD4.
 
The most important gene 
expression feature for RD9 was th
e 
max
i
mum number of biotic stress conditions under which 
one or
 
both genes in a pair was downregulated, with redundant gene pairs having a lower 
maximum than nonredundant gene pairs (
Figure 
2.
3D
). Thus, redundant gene pairs tend not to 
be downregulated und
er
 
st
r
ess conditions. Previous findings indicated that duplicat
e genes 
involved in stress responses are retained at a higher rate than genes involved in other processes 
(Maere et al. 2005)
. The most important gene expression feature for RD4 was the maximum 
numbe
r of hormone treatments under which one or both genes in a pai
r 
was
 
differentially 
expressed compared with the control, with 
nonredundant gene pairs having a higher maximum 
(
Figure 
2.
3D
).
 
30
 
 
Among 2,627 functional annotation features, 19 and 13 were among the
 
top 200 for the 
RD4 and RD9 models, respectively. While only 
on
e o
f
 
these features was selected for both 
models, given that 
functional enrichment among redundant gene pairs varies by RD (
Figure 
2.
S5
), it was expected that different functional annotation f
eatures would be important for 
predicting redundancy using dif
fe
ren
t
 
redundancy definitions. The most important gene function
 
feature for the RD4 model was the number of genes in a pair (0, 1 or 2) annotated as DNA
-
dependent transcription factors (referred
 
to as transcription factors). In the trained RD4 model, 
gene 
pa
irs
 
in which both genes had this annotation were more frequen
tly predicted as 
nonredundant, consistent with the feature value distributions (
Figure 
2.
3D
). This was somewhat 
unexpected as previ
ous studies have shown that transcription factors are more lik
el
y t
o
 
be 
retained after gene duplication than other types of g
enes 
(Blanc and Wolfe 2
00
4)
.
 
The most 
important functional annotation feature for RD9
 
was the number of genes in the pair having the 

Figure 
2.
3D
). This term, which encompasses
 
a broad 
range of processes including responses to stressors o
r 
hor
m
ones, ion transport, circadian rhythm, 
aging, and cell gr
owth, among many others, was an important predictor of nonredundant gene 
pairs.
 
Finally, while no network properties or protein pro
perties were among the 20 most 
important features in predictin
g 
red
u
ndancy for RD4, two network properties were in the top 20
 
important features for RD9: presence in the same gene co
-
expression clusters (generated using 
biotic and abiotic stress datasets, 
two different clustering algorithms and two different numbers 
of
 
cl
u
sters), with gene pairs in the same cluster more likely t
o be redundant (
Figure 
2.
3D
). 
Consistent with this, 
C
hen et al. 
(
2010)
 
found that gene co
-
expression during path
og
en 
i
nfection 
was one of the most important features for predi
cting redundancy in Arabidopsis. In the earlier 
31
 
 
study by Chen et al. (2010), isoelectric point, overlap in protein domain annotatio
ns, and 
sequence similarity were also among the features found
 
t
o b
e
 
important predictors of 
redundancy. While these features
 
were included in our model building based on RD4 and RD9, 
they ranked between 26 and 166 depending on the redundancy definition (
T
able S3
). Note that 
the redundancy definition used in the Chen
 
e
t a
l
. study was based on solely single mutant 
phenotypes; thi
s likely accounts for the minimal overlap in features found to be important in 
predicting redundancy. 
 
 
We also examined the potenti
al causes of mis
-
predictions by comparing feature values 
betw
ee
n c
or
rectly and incorrectly predicted pairs, generating a scor
e (see 
Materials and 
Methods
)
 
representing whether mis
-
predicted nonredundant pairs had feature values similar to 
RD9 pairs (
Figure 
2.
4A
). We identified several featu
res for which incorrectly 
pr
edi
cted 
nonredundant pairs had values more like RD9 gene 
pairs than correctly predicted nonredundant 
pairs, and that also had high feature importance scores, sugg
esting they may play a role in mis
-
predictions (
Figure 
2.
4B
).
 
Addi
tionally, in a principal c
om
pon
ents analysis of correctly and 
incorrectly predicted n
onredundant pairs (
Figure 
2.
4C
), the top 24 features contributing to the 
first principal component were r
elated to CpG methylation (
Table S4
), implicating it as a major 
co
ntributor to mis
-
predictio
n.
 
 
2.3.6 
Redundancy predictions for Arabidopsis gene pair
s not in the benchmark dataset
 
With the predic
tive model of redundancy in place, we wanted to get an esti
mate of 
genetic redundancy broadly throughout the entire genome. F
or this analysis, we selec
te
d a subset 
of paralogous gene pairs: all of the WGD and t
andem duplicate (TD) pairs in the Arabidopsis 
genome (7,764 total, collectively referred to as the WG/TD 
set; 
Supplementa
l Data
). We also
 
32
 
 
Figure 
2.
4: 
Features associated with mispredictions.
 
Distribution of feature separ
ation scores for features used to build the RD9 model. T
o
 
identify 
features that may contribute to mis
-
predictions, feature values were compared between (1) 
nonredundant ge
ne
 
pairs 
pr
edicted as nonredundant (NR/NR), (2) nonredundant pairs predicted 
as redu
ndant (NR/RD9), and (3) redundant pairs predicted as redu
ndant (RD9/RD9). Using the 
median value (Med) in each class/predicted class category, we calculated a normalized fe
at
ure 
sep
ar
ation score as follows: 
 

. 
For 
each feature, 
the feature separation score represents the difference in feature
 
values between 
c
orrectly and incorrectly predicte
d nonredu
ndant gene pairs, 
with a sc
o
re of 0 meaning that 
correctly and incorrectly predicted pairs had similar values and a score of 1 meani
ng that 
 
33
 
 
Figure 
2.4 

 
incorrectly predicted pairs had values 
more similar to redundant gene pairs. Close to 20
% of th
e 
features had a separation
 
s
c
ore of 1. (
B
) Distribution of values for selected features among the 
three categories of actual and pred
icted redundancy described in (
A
). Horizontal bars indicate the 
me

tissues
 
and developmental stages i
n 
w
hich 

the maximum number of biotic stress conditions in which one or bot
h genes in the pair are 

in. CpG
 

f 
t
he minimum level of CpG 

the reciprocal of 
the difference in CpG methylation level in sperm
 
cells for genes in the pair. These four features
 
had a 
feature separation score cl
os
e
 
to 1 and had feature importance scores in the top 10 for RD9, 
implicating them in mis
-
predictions. (
C
) 
Dimensions 1 and 2 of a principal components analysis 
performed to
 
identify features that were different between co
rrectly
 
and incorrectly predicted 
no
n
redundant pairs. The top 24 features contributing to Dimension 1 were related to CpG 
methylation levels 
(
Table S4
). 
 
 
sought to model redundancy within a gene family; because a gene fami
ly consists of a group of 
paralogs deri
ved fro
m a variety of dupli
ca
t
ion mechanisms and with differing ev
olutionary 
distances, it offers a wide spectrum of 
relatedness among gene pairs. For this analysis, we used 
the protein kinase (Kin) superfamily to gen
erate all possible combinations of gene
 
pairs,
 
then 
randomly selec
te
d
 
10,000 pairs for analysis (
Supplemen
tal Data
). We expected that applying 
our model to both datasets would provide information about gene
tic redundancy at the genome
-
wide scale
 
and at
 
the more fine
-
grain
ed
 
gene family level. While both the RD
4 and RD9 models 
showed a high degree of accuracy i
n predicting redundant gene pairs in cross
-
validation (87% 
and 92% of redundant gene pairs correctly
 
predicted, respectively; 
Figure 
2.
S6A
-
B
), the
 
RD4 
model predicted
 
n
o
nredundant gene pairs with much high
er accuracy than the RD9 model (75% 
and 36%, respec
tively; 
Figure  
2.
S6A
-
B
). Because of the high error rate in predicting 
nonredundant pairs with the R
D9 model, we focused on using the RD4 m
odel to
 
estimate the 
preval
en
c
e of genetic redundancy in the Arabi
dopsis genome.
 
34
 
 
Although we analyzed machine learnin
g results primarily as a binary variable (gene pairs
 
were classified as either redundant or nonredun
dant), these binary predictions were g
enerated
 
from likelihood sc
or
e
s
 
output by the machine learning pi
peline. The likelihood score, referred to 


-
1, with 0 being most likely 
nonredundant and 1 most lik
ely redundant. Using this redundancy s
core, a 
threshold score was
 
d
e
termined that would maximize the harm
onic mean of precision (in this case, the proportio
n of 
true redundant pairs to predicted redundant pairs) and recall (proportion of redundant pairs 
pr
edicted correctly), and this threshold
 
was use
d to generate the b
in
a
ry predictions for the 
WG/TD and Kin 
datasets. Using the RD4 model, the majority of the 
17,764 WG/TD and Kin 
gene pairs were predicted as redundant with redundancy scores well above the th
reshold (
Figure 
2.
5
). Among the WG/TD 
set as a
 
whole, 80% were pr
ed
i
cted as redundant (
Figure 
2.
5A
), with
 
gene pairs derived from the 

-
WGD event more likel
y to be predicted as redundant (83%; 
Figure 
2.
5B

-
WGD event (
71%; 
Figure 
2.
5C

and more a
ncient W
GD events (73%, 
Fig
ur
e
 
2.
5D
). As duplicate pairs evolve ove
r time, it is 
expected that the degree of genetic r
edundancy would continue to decline. While this is true 
when comparing the 

-
WGD to older events, si
milar proportions of duplicate pairs f
rom the 

and more ancient 
ev
e
nts were predict
ed as redundant 
based
 
on RD4. This may be because gene 
pairs derived fro

-

-
WGD in 
terms of 
Ks 
(Maere et al. 2005)
. However, it is surprising that so many redundant gene pairs 
(d
efined based on RD4) that duplicated 5
0 MYA (

-

-
WG
D; 
Edg
er et al. 
2015)
 
or longer would b
e retained. Similarly, 83% of tandem duplicates and 87% of kinas
es 
were predicted as redundant based o
n RD4 (
Figure 
2.
5E 
and 
Figure 
2.
5F
, respectively).   
 
 
35
 
 
Figure 2.5: Predicted redundancy scores for ge
ne pairs througho
ut the Arabidopsis 
genome. 
 
(A) 
Pred
icted redundancy scores from the RD4 model for gene pairs in the
 
genome derived from 
whole genome or
 
tandem duplication (WGD and TD, respectively). These results grouped 
specifically by duplication event
/type are shown i
n the following four sections of thi
s figure. (B) 

-
WGD event, (C) gene 

-
WGD event, (D) 

-
WGD event, (E) gene pairs derived from tandem duplication (TD), 
 
36
 
 
Figure 2.
5 
cont

d
 
and (F) 10,000 randomly
-
selected gen
e
 
p
airs from the kinas
e superfamily (Kin). A majority of 
gene pairs in all of these datasets were predicted as redundant using RD4.
 
 
T
his percentage of redundant pair predictions was hi
gher than previous estima
tes in the 
literature 
(
e.g., 
Chen et al. 2010
)
. It is important to note that in our WG/TD and Kin datasets, 
gene pairs are likely being predicte
d as redundant because th
ey more closely resemble redundant 
 
gene pair
s
 
w
ith respect to featur
es that have the highest weight in our predictive model (e.g., 
WGD event). However, the model is built on experimental data that have much more power 
when calling
 
a gene pair as nonredundant than calling them as redundant; 
demonstra
t
in
g that a 
single mutan
t has an abnormal phenotype (meaning it is nonredundant) is a simpler task than 
definitively stating that a mutant has no abnormal phenotype and therefore is redu
ndant with 
another gene. As previously proposed 
(Bouché and Bouchez 2001; Bolle et al. 2013)
, the lack
 
of 
 
an o
b
se
rved severe phenotype
 
in a single mutant 
may be because phenotypes are conditional, 
tissue
-
specific, and/or subtle rather than masked by genetic redundancy. Many large
-
scale 
phenotypi
ng studies are not able to take these factors into account, a
nd it wou
l
d 
therefore be 
expected
 
that
 
a model built with data from such studies overestimate genetic redundancy in the 
genome. 
 
While the binary classification of gene pairs as redundant or nonr
edundant was possible 
with the available data and straightfor
ward to i
n
te
rpret, it is an over
-
simplification of the 
complex nature of genetic redundancy. The threshold
-
based definition of genetic redundancy 
may be convenient, but the landscape of genetic r
edundancy is far more nuanced, with a 
continuum between gene 
pairs wit
h
 
v
arious degrees of gen
etic redundancy. Nonetheless, these 
data still allowed us to gain valuable insights into the mechanistic underpinnings of genetic 
37
 
 
redundancy by revealing importan
t features as discussed in the earlier sections. In addition,
 
we 
antic
i
pa
te the models can be 
iteratively improved with the future availability of more phenotype 
data, particularly quantitative data.   
 
 
2.3.7 
V
alidation of predictions
 
 
To validate predicti

an
d
 
R
D4, respectively, ran
domly selected and proportionally divided between redundant and 
nonredundant pairs, 
Figure 
2.
2A
) of the benchmark data. This test set was not i
ncluded in the 
model
 
building process and serves to illustrate how the model will perform
 
on
 
new data. 
Applying t
he RD4 and the RD9 models on the test set, we obtained AUC
-
ROC scores of 0.73 
and 0.68, respectively (
Figure 
2.
6A
) and AU
-
PRC scores of 0.62 an
d 0.82, respectively
 
(
Figure 
2
.
6B
)
. 
Although
 
there was a decrease in performance compared
 
wi
th cross
-
validation results 
(
Figure 
2.
2B
-
E
), 80% (4/5) and 68% (13/19) of redundant pairs were predicted correctly based 
on the RD4 a
nd the RD9 models, respectively
, and 36% (4/11) of 
nonredundant pairs were 
predicted correctly by each of these models (
F
ig
ure 
2.
6C
-
D
). Thus, the holdout testing set 
generally supported the utility of the RD4 and RD9 models, but the current threshold score
 
was 
more conservative toward c
alling gene pairs as
 
non
-
redundant.
 
 
F
urther validation was performed by identifying sing
le
 
and double mutants
 
in the 
literature that have specifically been studied as mutant trios and have very well
 
documented and 
characteriz
ed phenotypes. We selected ten of these gene pair
s: five that meet our criteria for 
redundancy under RD9 and five we wo
ul
d classify as nonre
dundant (
Table S5
). Half of the pairs
 
were present in our RD9 benchmark training dataset,
 
while the other half were 
present in the 
38
 
 
WG/TD and/or 
Kin
 
test datasets. We
 
compared the known redundant gene pairs from the
 
literature to our p
re
dictions from cro
ss
-
validation for pairs in the training set and predictions
 
 
Figure 2.6: Model performance on holdout test set
s.
 
(A) 
AUC
-
ROC and (B) AU
-
PRC curv
es for the holdout test 
sets for models built with each RD. 
Performance of the models on test
 
s
ets was lower compared with performanc
e in cross
-
validation, likely due to the small sample sizes of the test sets. (C
-
D) Confu
sion matrix for (C) 
RD4 and (D) RD
9 showing the number of 
correctly and incorrectly predicted redundant and 
nonredundant gene p
ai
rs in the respective test sets. 
result
s showing that the RD9 model tends to 
err on the side of predicting false positives while
 
the RD4 model is much more conser
vative and 
prone to gene
rating false negative predictions. 
 
 
from application of the train
ed
 
m
odel for pairs in the t
est set(s). We found that the RD9 model 
correctly predicted four of five redundant pairs (according to RD9) but mis
-
predicted all five of 
th
e nonredundant pairs 
as redundant. This comparison was repeated for the RD4 model with th
e 
sa
me gene pairs. However,
 
three of the gene pairs classified as redundant using RD9 were 
39
 
 
classified as nonredundant using RD4 because the double mutants were not let
hal. Thus, the 
valida
tion set for RD4 included two redundant and eight nonredundant gene 
pa
ir
s. The RD4 
model correc
tly predicted one out of the two redundant pairs (according to RD4) and four out of 
the eight nonredundant pairs. This was consistent with o
ur expectations and p
rior
 
results showing 
that the RD9 model tends to err on
 
the side of pr
ed
icting false positives 
while the RD4 model is 
much more conservative and prone to generating false negative predictions.
 
To determine why mis
-
predictions may have occurred in these spe
cific cases, we 
revisited features previously identified as likely con
tr
ibutors to mis
-
predicti
on in general in the 
benchmark dataset (e.g., 
Figure 
2.
4A
-
B
). For the RD
9 model, one such feature was reciprocal 
best match. Although this feature was more stron
gly associated with nonredundant gene pairs in 
the benchmark dataset (
Fi
gure 
2.
S7A
), the one RD
9 pair predicted as nonredundant comprised 
paralogs that were not recipr
ocal best matches, making this a likely reason for mis
-
prediction. 
Derivation of paralogs
 
from the 

-
WGD event was another such feature (
Figure 
2.
S7B
); three 
n
on
redundant pairs predict
ed as RD9 (nonredundant/RD9) were derived from the 

-
WGD event, 
indicati
ng that this feature was a likely contributor to their mis
-
prediction. Another important 
feature was related to the number of biotic stress conditions under wh
ic
h genes were 
downregula
ted (referred to as biotic downregulation breadth). For this feature, th
e distribution of 
feature values among the actual/predicted classes demonstrated that all
 
five nonredundant/RD9 
pairs had values more similar to the correctly 
pr
edicted RD9 pairs than 
to the correctly predicted 
nonredundant pairs (
Figure 
2.
S
7
C
). For the RD
4 model, the one RD4 pair that was predicted as 
nonredundant had values for features rela
ted to CpG methylation (
Figure 
2.
S7D
), gene family 
size (
Figure 
2.
S7E
)
 
a
nd CHH methylation (
Fig
ure 
2.
S7F
) that were more similar to those of 
nonredundant pairs. Additionally, all four of the nonredundant pairs predicted as RD4 had CHH 
40
 
 
methylation in embryo
 
tissue values that were more similar to those of RD4 gene pairs (
Figu
re
 
2.
S7F
). 
 
In total, we 
identifie
d several types of features that were likely contributors to 
mispredictions, including duplication event (

-
WGD or not), downregulation under biotic str
ess 
conditions, and gene methylation patterns. Importantly, we were thus able to identify one or 
more fe
atures that likely contributed to each instance of mis
-
prediction of both the RD4 and RD9 
gene pairs used for validation, a important step in improving future iterations of the model; for 
example, depending on the definition being used and the importance o
f the accuracy of 
predictions (precision) compared with the importance of identifying all redundant gene pairs in a 
dataset (recall), certain features could be excluded from the model.
 
 
2.4 
Conclusions
 
In this study, we optimized and utilized a machine lea
rning approach to predict genetic 
redundancy among paralogs in Arabidopsis using multiple definitions of redundancy. We 
identified two biologically relevant and well
-
performing definitions of redundancy and the 
optimal 200 features for each definition that
 
allowed us to best model redundancy. Our models 
performed well on a
 
hold
-
out testing dataset, demonstrating their utility. Several features related 
to evolutionary properties, including lethality score, whether genes in a pair were reciprocal best 
matches
, and the type of duplication event from which a gene pair was derived, were consistently 
ranked as important in generating predictions across redundancy definitions. Interestingly, 
evolutionary rates, such as 
Ka
 
and 
Ks
, were statistically different betwee
n redundant and 
nonredundant gene pairs but not highly ranked in the models, indicating that multiple factors 
contribute to redundancy, as revealed by machine learning models integrating multiple features. 
41
 
 
Analysis of these evolutionary
-
related features de
monstrated that redundant gene pairs tend to be 
more recent duplicates than nonredundant pairs. While it may be tempting to explain redundancy 
as gene pairs having not had enough time to diverge in function, many redundant pairs are 
derived from a WGD even
t estimated to have occurred ~50 million years ago, offering plenty of 
time for pseudogenization. This suggests that there is some selective pressure to maintain 
redundancy. In general, we found feature importance to be highly variable by redundancy 
defini
tion, underscoring the need for testing multiple definitions depending on the biological 
question being addressed. For example, if one is interested in predicting which genes are lethal 
or have severe phenotypes a stricter definition is required than when 
a broader view of 
redundancy is being used, whereby less extreme phenotype contrasts between single and double 
mutants would be appropriate. 
 
While the models provide useful information about gene features related to genetic 
redundancy, there is still room
 
for improvement in terms of prediction accuracy. Performance on 
test gene pairs withheld from model building was generally not as good as the performance 
based on cross
-
validation, which may be due to the small size of the test sets. In addition, our 
more
 
conservative trained model predicted 84% of 17,764 paralogs throughout the genome to be 
redundant, which is a much higher estimate than has been shown previously 
(Chen et al. 2010)
. 
This is likely a result of the u
nderlying data used for model building; our models are expected to 
be biased towards categorization of gene pairs as redundant for the following reasons. We 
classified redundancy using phenotype data from the literature, including experiments that were 
not
 
specifically designed to identify redundancy; there are expected to be substantial differences 
between experiments in how
 
phenotypes were scored. For example, conditional or particularly 
subtle phenotypes may not have been examined. This likely results in
 
misclassification of single 
42
 
 
mutants as not having an abnormal phenotype. Because genetic redundancy was defined as a 
double mutant having more a severe phenotype than the corresponding single mutants, this bias 
will therefore lead to overestimation of gen
etic redundancy. Furthermore, classification of gene 
pairs as redundant or nonredundant, as we were able to do using the broad phenotype categories 
currently available on a large scale, overly simplifies a complex phenomenon. Redundancy as it 
exists in nat
ure is not an all
-
or
-
nothing binary state, but rather a continuum with a wide range of 
biologically relevant states.
 
In our modeling exercise, redundancy score
s
 
derived from the model allow an 
approximation of this continuum, which can be further tested. O
ne approach for testing the 
degree of genetic redundancy is by obtaining lifetime fitness data for single and double mutant 
sets. Because lifetime fitness in a mutant reflects the totality of phenotypic effects due to the 
introduced mutation over the entir
e life cycle of the individual, subtle and conditional 
phenotypes are likely better captured. Importantly, our current model can predict redundancy as 
defined by differences in some phenotypes under some specific conditions. It remains unclear the 
extent t
o which such model is relevant to predicting redundancy when it is defined based on 
single and double mutant fitness, the phenotypic outcome that has the most bearing on the 
evolutionary fate of a gene pair. Thus, in future studies the generation of lifeti
me fitness data 
would allow for a machine learning regression model that more accurately predicts degrees of 
genetic redundancy between genes in a pair rather than simply classifying genes as redundant or 
not. Such a model could be applied to gene pairs wi
thin a large gene family to compare predicted 
redundancy scores and reveal patterns related to redundancy maintenance and loss through 
evolutionary time. Analysis of features important for building the model would be expected to 
yield additional useful ins
ights about mechanisms related to the evolutionary fate of gene 
43
 
 
duplicates and the long
-
term retention of genetic redundancy. Taken together, our results 
demonstrate the utility of machine learning in combining features to generate accurate 
predictions of 
genetic redundancy and identify several evolutionary features that are important in 
predicting genetic redundancy across several definitions. 
 
 
2.
5
 
Materials and 
M
ethods
 
2.
5
.1 
Definitions of redundant and nonredundant gene pairs
 
Arabidopsis mutant phenotyp
e
 
data were collected from 
Lloyd and Meinke 
(
2012
) and
 
Bolle et al. 
(
2013)
. Our benchmark dataset comprised gene trios for which a double mutant 
phenotype and both corresponding single m
utant phenotypes were repor
t
ed, with a total of 300 
gene trios. A numeric phenotype severity value was assigned to each single and double mutant 
(
Figure 
2.
1A)
, with 0 representing no phenotype; 1, a conditional phenotype of any kind; 2, a 
cell or biochemic
al phenotype; 3, a morpholo
g
ical phenotype; and 4, a lethal phenotype. 
Redundancy was classified using nine definitions (RDs) of varying stringency (
Figure 
2.
1B
). 
The least stringent definition was RD9, in which any gene pair for which the double mutant 
ph
enotype severity score was 
h
igher than that of both the single mutants was defined as 
redundant. With this definition, the dataset contained 190 redundant gene pairs. Gene pairs were 
classified as nonredundant if at least one single mutant had a phenotype 
severity score greater 
than
 
or equal to the double mutant score; the dataset contained 110 nonredundant gene pairs. 
 
 
2.5
.
2
 
Feature value generation
 
For predictive modeling, data from six general categories were collected for each gene: 
functional annotations such as GO terms; evolu
t
ionary properties such as synonymous 
44
 
 
substitution rate; protein sequence properties such as posttranslational modificatio
ns; gene 
expression patterns; epigenetic modifications such as histone methylation; and network 
properties such as gene interactions (
T
able S1
). These data were processed to generate feature 
values for each gene pair (
Supplemental Data
), and the method used for processing depended 
on the data type:
 
binary (e.g., whether or not a gene had 
a
 
given protein domain), categorical 
(e.g., all the names of protein domains present in a given gene product) and continuo
us (e.g., 
gene expression level). 
 
Features such as protein domain and functional annotations were treated as binary and/or 
categorica
l
 
input data for feature generation. For processing as binary input data, each gene was 
assigned a score of 0 (does not ha
ve the annotation/property) or 1 (has the annotation/property); 
gene pair feature values were then generated by taking the number of g
e
nes in the pair (0, 1, or 
2) having that annotation or property. For example, if Gene1 was annotated as having DNA 
bindin
g activity but Gene2 was not, the feature value for DNA binding activity for that gene pair 
would be 1. Additional features were gener
a
ted by taking the square, 
-
log
10
, and reciprocal value 
of features processed in this way. For processing as categorical i
nput data, all annotations of a 
specific type (e.g., GOslim terms) were listed for each gene. These were then used to represent 
simila
r


u
ld be 1, the total 
number of unique annotations between the gene pair would be 3, and the percent overlap would 
be 33. Fo
r continuous data, gene pair feature values were generated by calculating the difference, 
average, maximum, minimum, and total of the 
v
alues for the gene pair. For example, if Gene1 
had an isoelectric point of 10 and Gene2 had an isoelectric point of 9, th
e difference would be 1, 
45
 
 
the average 9.5, the maximum 10, the minimum 9, and the total value would be 19. Additional 
features were gen
e
rated by taking the square, log
10
, and reciprocal of features processed as 
categorical and continuous data, and by assign
ing each value to one of four quartile bins 
generated from the untransformed feature data. 
 
 
2.
5
.
3
 
Functional annotation and evolutio
n
a
ry property features
 
Functional annotations included GO biological process, molecular function and cellular 
component annotations 
(The Gene Ontology Consortium et al. 2000; The Gene Ontology 
Consort
i
u
m 2017)
, metabolic p
athway annotations from AraCyc v.15 
(Mueller et al. 2003)
, and 
predicted protein domain a
n
n
otations from Pfam 
(Finn et al. 2016)
. These annotations were 
processed as binary and categorical data as described above. There were 2,627 features related to 
functional annotations after transformati
o
n
s were applied (
Table
 
S1
 
and 
Supplemental Data
). 
 
Broadly, evolutionary properties included duplication mechanism and timing, and 
relationship to other genes in the genome. There were 171 features related
 
t
o evolutionary 
proper
ties after transformations were applied (
Table S1
 
and 
Supplemental Data
). 
 
To get the evolutionary rate for each gene in a pair, protein sequences (collected from 
NCBI; 
Pruitt et al. 2007)
 
of each 
A. th
a
l
iana
 
gene pair were s
earched against protein sequences 
from 
Theobroma cacao
, 
Populus trichocarpa
, 
Glycine max
 
and 
Solanum lycopersicum
, using the 
Basic Local Alignment Search Tool for protein sequences (BLASTP; 
Altschul et al. 1990)
. 
Protein sequences of the gene pair and the best hits in these four species w
e
r
e first aligned using
 
MUSCLE 
(Edgar 2004)
, and then were compared to their coding nucleotide sequences to 
generate the corresponding coding sequence (CDS) alignment. CDS alignments were used to 
build gene trees usin
g
 
RAxML/8.0.6 
(Stamatakis 2014)
 
with parameters: 
-
f a 
-
x 12345 
-
p 12345 
46
 
 
-
# 1000 
-
m PROTGAMMAJTT. 
Ka
, 
Ks
 
and the 
Ka
/
Ks
 
ratio on branches le
a
d
ing to each gene of 
a
 
gene pair were calculated using the free
-
ratio model of the codeml program in PAML v. 4.9d 
(Yang 2007)
. G
ene family size and 
lethality scores
 
were obtained from 
L
l
oyd et al. 
(
2015)
. 
Where lethality scores were not available, a score of 0 was assigned to known nonlethal genes 
and 1 was assigned to known lethal genes. Nucleotide and amino acid se
quence similarity were 
calcu
l
a
ted using EMBOSS Needle 
(McWilliam et a
l
.
 
2013)
. 
Ka
, 
Ks
, 
Ka
/
Ks
, gene family size, 
functional likelihood, lethality scores, and sequence similarity were processed as continuous data 
 
Gene pairs were determined to have been derived from one of four types of gene 
dupli
cation events using MCScanX
-
t
r
ansposed 
(Wang et al. 2013)
: 1) segmental duplicates

paralogs located in corresponding intra
-
species collinear blocks; 2) tandem duplicates

paralogs 
next to each other; 3) proximal duplicat
es

paralogs close to each ot
h
e

non
-
homologous genes; 4) transposed duplicates

one of the paralogs located in inter
-
species 
collinear blocks, the other not. 
Segmental duplicates were additionally noted as being derived or 
not derived
 

-
 

-
WGD events
.
 
 
Protein sequences of 
A. thaliana
 
were searched 
against protein sequences of 
A. thaliana
 
(intra
-
species), 
Arabidopsis lyrata
, 
Brassica rapa
, 
Carica papaya
, 
P. trichocarpa
, and 
Vitis vinifera
 
(inter
-
species) using BLASTP, with 
a cutoff E
-
value of 1x10
-
10
.
 
F
ive different sets of parameters were evaluated for MCScanX
-
transposed: 1) 
-
k 50 
-
s 5 
-
m 25; 2) 
-
k 50 
-
s 2 
-
m 25;  3) 
-
k 25 
-
s 2 
-
m 25; 4) 
-
k 25 
-
s 2 
-
m 50; 5) 
-
k 25 
-
s 5 
-
m 25; 
where 
-
k indicates the cutoff score of collinear
 
blocks, 
-
s specifies the nu
m
b
er of matched genes 
required for the calling of a collinear block, and 
-
m means the maximum number of genes 
allowed for the gap between two genes. The duplication mechanisms inferred using these five 
different sets of paramete
rs were consistent with one 
a
n
other for the majority of gene pairs; 78 
pairs had discrepant results, representing 0.4% of the total dataset. In these cases, the mechanism 
47
 
 
that occurred most frequently in the results for that gene pair was assigned; if ther
e was no 
majority, the mecha
n
i
sm was listed as N/A. Each gene pair was assigned a binary value 
indicating whether or not the genes were reciprocal best matches (i.e., 
they 

best hit based on nucleotide BLAST searches) and whether or not t
hey were derived from each 
t
y
p

-
WGD event would have a 

-
WGD feature, and a value of 0 for all other 
duplication mechanisms). 
 
Retention rate was based on the presence or absence
 
o
f a paralog in 15 species: 
A. lyrata
, 
Capsella rubella
, 
B. rapa
, 
T. cacao
, 
P. trichocarpa
, 
Medicago truncatula
, 
V. vinifera
, 
S. 
lycopersicum
, 
Aquilegia coerulea
, 
Oryza sativa
, 
Amborella 
trichopoda
, 
Picea abies
, 
Selaginella 
moellendorffii
, 
Physcomitrella 
p
a
tens
, and 
Marchantia polymorpha
. The retention rate for each 
gene was calculated as the number of genomes in which a paralog was present divided by the 
total number of genomes analyzed (
16: 
A. thaliana 
plus the 15 additional species). Genome data 
were co
l
l
ected from Phytozome 
(Goodstein et al. 
2012
) for 
P. patens
 
318 v3.3, 
M. polymorpha
 
320 v3.1, 
S. moellend
o
r
ffii
 
91 v1.0, 
A. trichopoda
 
291 v1.0, 
O. sativa
 
323 v7.0, 
B. rapa
 
277 
v1.3, 
C. rubella
 
183 v1.0, 
A. thaliana
 
167 TAIR10, 
A. lyrata
 
v2.1, 
M. truncatula
 
285 Mt4.0 v1, 
V. vinifera
 
145 Genos
cope 12x, 
A. coerulea
 
v3.1, 
P. trichocarpa
 
210 v3.0, and 
T. cacao
 
23
3
 
v1.1; from NCBI for 
S. lycopersicum
 
v2.5; and from PlantGenIE 
(Sundell et al. 2015)
 
for
 
P. 
abies
 
v1.0.
 
 
2.
5
.4 
Gene expression and epigenetic 
modification features
 
 
Processed microarray gene expression datasets were obtai
n
e
d from Moore et al. (2019) 
and contained gene expression levels under biotic 
(Wilson et al. 2012)
 
and abiotic stress 
(Kilian 
48
 
 
et al
.
 
2007; Wilson et al. 2012)
, under hormone treatment 
(Goda et al. 2008)
, at different 
developmental stages 
(Schmid 
et al. 2005)
, and at different times of day 
(Mockler et al. 2007)
. In 
addition 
t
o
 
these gene expression levels, we also considered expression breadth, which represents 
the number of tissues and conditions under which each gene is expressed. Gene expression
 
levels 
and ribosome occupancy from RNA
-
seq and Ribo
-
Seq experiments in root ti
s
s
ue were obtained 
from 
Hsu et al. 
(
2016)
 
and processed along wi
t
h
 
the microarray gene expression data as 
continuous data. 
There were 450 features related to gene expression after transformations were 
applied 
(
Table S1
 
and 
Supplemental Data
)
. 
 
Epigenetic modifications i
n
c
luded DNA methylation, chromatin accessibility, and histone 
modifications. Percent CHH, CHG, and CpG methylation, gene body methylation, and histone 
modification data were obt
ained from 
Lloyd et al. 
(
2015)
. Percent methylation values were 
treated as continuous data, and gene body methylation and histone modification data as binary 
data. Ch
romatin accessibility data were from 
Sullivan et al. 
(
2014)
 
and were also binary, with 
each gene receiving a score of 1 if i
t contained a DNase peak site and a score of 0 if it did not. 
There were 565 fe
a
t
ures related to epigenetic modifications after transformations were applied 
(
Table S1
 
and 
Supplemental Data
). 
 
 
2.
5
.5 
Protein sequence and network property features
 
Protein sequence properties included a
mi
n
o acid length, isoelectric point, and 
posttranslational modifications. Amino acid lengths w
ere obtained from 
Lloyd et al. 
(
2015)
. 
Isoelectric points 
and myristoyla
ti
o
n data were from The Arabidopsis Information Resource 
(Berardini et al. 2015)
. Amino acid length and i
soelectric point were processed as continuous 
data. Acetylation, deamination, formylation, hydroxylation, oxidation, and propionylation data 
49
 
 
were 
obtained from Th
e 
P
lant Proteome Database 
(Sun et al. 2009)
. Posttranslational 
modifications were processed as bi
na
r
y data: whether or not the protein product was predicted or 
known to have the modification
.
 
In total, 93 features were related to protein sequence properties 
after transformations were applied (
Table S1
 
and 
Supplemental Data
). 
 
Network properties were related to known or potential 
interactions of genes or protein 
products. Gene interaction data (AraNet v.1, 
Lee et al. 2010)
 
and 
protein
-
protein interactions 
(
AtPIN, 
Brandão et al. 2009)
 
were 
processed 
as
 
categorical data. 
Gene co
-
expression was 
calculated from the microarray data
sets referenced above using multiple clustering algorithms, 
namely k
-
means, c
-
means and hierarchical clustering at k=5, 10, 25, 50, 100,  200, 300, 400, 
500, 1000, and 2000 as d
es
c
ribed in 
Moore et al. 
(
2019)
. These data were processed as 
categorical data, with each combination of clustering algorithm, datase
t 
a
nd k
-
value included as 
a feature; a gene pair received a value of 1 if both 
genes were in the same cluster and a value of 0 
if they were not. 
There were 205 features related to network properties after transformations 
were applied 
(
Table S1
 
and 
Supplemental Data
)
. 
 
 
2.
5
.6 
Identific
ation of features distinguishing redundant and nonredundant pairs
 
 
To identify features that could distinguish between gene pairs from the redundant and 
nonredundant classes, we
 
a
p
plied statistical tests to determine if feature values were significantly 
different between the classes. Binary gene pair features (e.g., dupli
cation type, presence in a gene 
co
-
expression cluster) were analyzed using two
-

 
m
u
ltiple testing 
correction using the Benjamini
-
Hochberg method 
(Benjamini and Hochberg 1995)
. To determine 
whether feature value transformations 
improved the ability to distinguish between classes, the 
reciprocal, square, and log
10
 
of continuous features 
we
r
e included as separate features. 
50
 
 
Continuous values were also binned into four quartiles of equal size and bin values included as 
features. Tran
sformed and untransformed continuous feature values were analyzed using a 
Wilcoxon rank sum test 
(Wilcoxon 1
945)
 
with multiple testing correction performed using the 
Benjamini
-
Hochberg method. Features were consider
ed
 
to distinguish between redundant and 
nonredundant gene pairs if 
q
 
< 0.05 after multiple testing correction (
Table S1
). Continuous 
feature effec
t sizes are the standardized z statistic (calculated from the 
p
-
values given by the 
Wilcoxon rank sum test) di
vi
d
ed by the square root of the sample size. Binary feature effect sizes 
correspond to the odds ratio calculated from the enrichment table for eac
h feature.    
 
 
2.
5
.7 
Redundancy prediction model building and optimization with machine learning
 
 
Models for
 
pr
e
dicting genetic redundancy between gene pairs were built with Random 
Forest, Gradient Boosting and SVM algorithms implemented in the scikit
-
le
arn machine learning 
package 
(Pedregosa et al. 2011)
 
in Py
tho
n
. For Random Forest and Gradient Boosting, a grid 
search was performed with 10
-
fold cross
-
validation for parameter optimization; gene pairs we
re 
randomly divided into a training set (90%) and a testing set (10%) with proportional division of 
redundant 
and
 
nonredundant gene pairs. This division was repeated 10 times, with 10 replicate 
models built for each iteration, for a total of 100 models. Te
n
-
fold cross
-
validation was also used 
in model building with 100 iterations each for a total of 1000 models, w
ith
 
each being tested on a 

building the model, to assess
 
performance. The trained model was used to predict redundancy 
among all tandem and WGD pairs in Arabidopsis (
Supplemental Data
) and among a random 
sample of Arabidopsis kinase gene pairs. Using kinase fami
ly classifications from 
Lehti
-
Shiu 
51
 
 
and Shiu 
(
2012)
, a
ll 
p
ossible within
-
family combinations of gene pairs were generated. Ten 
thousand of these pairs were then randomly selected for 
predictions (
Supplemental Data
). 
 
 
52
 
 
APPENDIX
 
 
53
 
 
Supplemental Information
 
 
Figure 2.S1
: Benchmark gene pair phenotypes. 
 
Distribution of benchmark gene 
pairs among phenotype severity categories (as defined in Figure 
2.1) for both mutants (SM1 and SM2) and the double mutant for each p
air
.
 
The dataset is biased 
toward double mutants with more severe phenotypes.
 
54
 
 
Figure 2.S2
: 
Comparison
 
of 
models 
built with different algorithms and numbers of 
features.
 
 
55
 
 
F
i
g
u
r
e
 
2
.
S
2
 
c
o
n
t

d
 
 
(
A
)
 
AUC
-
ROC scores and (
B
) AU
-
PRC scores for binary classificatio
n machine learni
ng 
m
odels 
built using RD9 with Gradient Boosting (GB), Random Forest (RF) and Support Vector Machine 
(SVM) algorithms 
and using different numbers of features. Shading indicates 
the standard
 
deviation. Using AUC
-
ROC as a measure, models
 
bu
i
l
t with SVM performed the best (
ANOVA, 
p
-
value
 
< 2 x 10
-
16

, 
p
-
values < 0.008
). 
Using AU
-
PRC, models built with SVM performed significantly better 
compared with those 
built w
ith Gradient Boostin
g (
A
NO
VA, 
p 
< 2 x 10
-
16
; 

p
-
value < 0.0001), but not 

p
-
value = 0.36). (
C
) AUC
-
ROC scores and 
(
D
) AU
-
PRC scores for machine learning models built using different combinations of feature 
numbers (

F
ea


-
ROC as a measure, 
the
 
be
st
-
performing 
combination was 200 features selected with Random Forest and with only the best 
transformation o
f each feature allowed (
ANOVA, 
p 
< 2 x 10
-
16

p
-
values < 
0.0001
). Using AU
-
PRC as a measure, this combination was significantly
 
be
t
te
r than all other 
combinations of parameters (ANOVA, 
p 
< 2 x 10
-
16

 
p
-
values < 2.3 x 10
-
4
)
 
except 
for the following two combinations: 200 features selected with Random Forest, with multiple 
transformations of each feature allowed, and 100
 
fe
a
tu
res selected with Random Forest, with 
multiple transformations of each feature allowed (

p
-
values 
1.00 and 0.13, 
respectively). (
E
) AUC
-
ROC curve and (
F
) AU
-
PRC curve of a model built with all 
untransformed features, demonstrating the imp
rov
e
d 
performance of the optimized model in (
C
) 
and (
D
) with respect to both measures. 
(
G
) AUC
-
ROC and (
H
) AU
-
PRC fo
r a model trained 
using RD5 gene pairs and half of the nonredundant pairs (randomly selected) then applied to 
RD9 gene pairs (excluding RD5)
 
an
d
 
n
onredundant pairs that did not overlap with those used in 
training the RD5 model.
 
Using these performance meas
ures, this model did not perform as well 
as a model trained on RD4 then applied to a test set composed of RD9 gene pairs (after removing 
RD4
 
pa
i
rs
) and a random subset of half the nonredundant gene pairs (
Figure 
2.
2D
-
E
); therefore, 
RD5 was not selected for
 
further analysis.
 
 
56
 
 
Figure 2.S3
: 
Statistical association of features among different feature categories with 
redundancy.
 
(
A
-
B
)
 
Distribu
tio
n
 
of 
-
log(
q
-
values) from tests of feature association with redundancy as defined 
using (
A
) RD4 and (
B
) RD9. Statistical significance was determined with Wilcoxon rank sum 
test for continuous features and two
-

; a
l
l
 
values were 
 
 
57
 
 
corrected for multiple testing with the Benjamini
-
Hochberg method. The dotted lines sh
ow a 
q
-
value of 0.05. All 
p
-
values, 
q
-
values and effect sizes are reported in 
Table S1
.
 
The median effect 
 
sizes (calculated as desc
rib
e
d in 
Methods
) were 0.11 (RD4) and 0.07 (RD9) among continuous 
features, and 1.4 (RD4) and 1.1 (RD9) 
among binary features. 
(
C
-
 
D
)
 
Distribution by feature 
category of the 200 features selected for model building for (
C
)
 
RD4 and (
D
) RD9. Features that 
ha
ve 
a
 
statistically significant association with redundancy as described above are shown in 
orange. Only 
25% of the features selected were significantly associated with redundancy. 
 
 
58
 
 
Figure 2.S4
: 
Important features in predicting redundancy.
 
(
A
) 
Comparis
on 
o
f feature importance ranks of the 51 features included in both the RD4 and 
RD9 models.
 
The feature importance ranks are well correlated between the two models (PCC = 
0.63; 
p 
= 6.0 x 10
-
7
), indicating that a core set of features is important in predicti
ng 
r
edundancy 
 
59
 
 
Figure 2.S4 

 
a
cross definitions. (
B
-
C
) Feature importance scores obtained from machine le
arning models for 
(
B
)
 
RD4 and (
C
)
 
RD9. (
D
) Distribution of 
Ks 
values among gene pairs included in 
the RD4 (left) 
 
and 
RD9 (right) models that are r
ec
i
p
rocal best matches (orange) and not reciprocal best matches 
(green). Dotted lines show the median 
Ks 
value for each group. Gene pairs that are reciprocal 
best matches tend to be more recent duplicates as shown by the lower 
Ks 
values.
 
 
60
 
 
Figure 2.S5
: 
En
r
i
chment of GO terms among redundant gene pairs vs. nonredundant gene 
pairs for each redundancy definition.
 
Blue represents enrichment among nonredundant pairs while red represents enrichment among 
redundant gene pairs; light
er shades show statistically 
we
a
k
er associations (i.e., higher 
q
-
value) 
 
 
61
 
 
Fig

 
and darker shades show statistically stronger associations (lower 
q
-
value). Statistically 
significant enrichment was seen in transcription factor activity among nonredundant pairs 
 
compared with
 
RD
4
 
and RD8 gene pairs, and in DNA
-
dependent transcription factor activity 
among nonredundant gene pairs compared with RD4, RD5 and RD9 gene pairs. In general, 
functional enrichment highly varied by RD.
 
 
62
 
 
Figure 2.S6
: P
erformance of models trained 
on 
R
D4 and RD9 in cross
-
validation.
 
(
A
-
B
) Performance in cross
-
validation; the percentages of (
A
) redundant and (
B
) nonredundant 
gene pairs correctly and incorrectly predicted using different RDs are shown. Gene pairs to the 
left of the threshold (dotted l
ine
)
 
were classified as nonredundant and gene pairs to the right were 
classified as redundant. Models were built u
sing RD4 pairs, RD9 pairs, and RD9 pairs not 
included in RD4 as the redundant instances (RD9 minus RD4) with a randomly
-
selected half of 
nonre
dun
d
ant pairs as the nonredundant instances.
 
 
63
 
 
Figure 2.S7
: Distribution of values among 
features that may contribute to mispredictions.
 
(
A
) Distribution of reciprocal best match gene pairs (RBM) among annotated (Ann.) vs. 
predicted (Pred.) RD9
 
cl
a
sses: true negatives (NR/NR), false negatives (RD9/NR), false 
positives (NR/RD9), and true positi
ves (RD9/RD9), including the benchmark dataset and the 10 
validation pairs identified from the literature (here referred to as validation pairs). Two NR/RD
9 
v
a
lidation pairs were reciprocal best matches, which was observed more often for RD9/RD9 
pairs than
 
NR/NR pairs, while genes in the RD9/NR validation pair were not reciprocal best 
matches, likely explaining these three mis
-
predictions. (
B
)
 
Distribution o
f 

-
whole genome 
duplication (

-
WGD)
-
derived gene pairs among the annotated/predicted classes, includ
ing the 
benchmark and 10 validation pairs. Three validation NR/RD9 pairs were derived from the 

-
WGD event, which was observed more often for RD9/RD9 pairs
 
than NR/NR pairs, potentially 
 
64
 
 
contributing to their mis
-
prediction. (
C
) Distri
bution of feature values among benchmark and 
validation gene pairs for the maximum biotic downregulation breadth between genes in a pair; a 
reciprocal tran
sf
ormation was applied to generate reciprocal maximum biotic downregulation 
breadth. All five of the 
validation NR/RD9 pairs had high values for this feature and looked 
more similar to RD9/RD9 pairs than to NR/NR pairs. (
D
) Distribution of reciprocal minim
um
 
CpG methylation in endosperm cells values among benchmark and validation pairs. (
E
) 
Distribution o
f total gene family size values among benchmark and validation pairs. The 
RD4/NR validation pair had a high value, which was more consistent with the value
s 
of NR/NR 
pairs than RD4/RD4 pairs. (F) Distribution of reciprocal average CHH methylation in embryo
 
tissue values among benchmark and validation pairs. All four of the NR/RD4 validation pairs 
had high values that were more similar to those of RD4/RD4 gen
e 
pairs, while the one RD4/NR 
pair had a low value more similar to those of NR/NR pairs. 
 
 
65
 
 
C
HAPTER
 
3
: 
MODELING MUTANT FITNESS AND 
GENETIC REDUNDANCY IN 
SACCHAROMYCES CEREVISIAE
 
 
66
 
 
3.1 Abstract
 
Genetic redundancy is a phenomenon where more than one gene en
co
des products that 
perform the same function. This frequently 
experimentally 
manifests as a single gene knockout 
mutant 
that
 
does not demonstrate a phenotypic change or a decrease in fitness co
mpared to wild 
type individuals, due to the presence of a para
lo
gous gene that may have retained the same 
function. A phenotype is only observed when one or more paralogs are knocked out in 
combination. While some factors that are associated with genes wit
h retained, potentially 
redundant functions have been identifi
ed
 
in yeast, little is known about factors contributing to 
long
-
term maintenance of genetic redundancy. Here, we leveraged the availability of fitness and 
multi
-
omics data in budding yeast 
Sacch
aromyces cerevisiae
 
to build machine learning models 
for predi
ct
ing genetic redundancy and related phenotypic outcomes (single and double mutant 
fitness) among paralogs. We found that single mutant fitness was particularly informative to 
predict double mut

io
n coefficient, 
PCC, = 0.83). While the 
genetic redundancy 
model did not perform as well (PCC = 0.40), six of 
the top 10 features identified as important in predicting double mutant fitness wer
e shared with 
the redundancy model. Our models provide quantit
at
ive assessments of how well existing data 
allow predictions of fitness and genetic redundancy.  In addition, these models allow the 
identification of 
potentially 
biologically significant featu
res contribut
ing
 
to long
-
term 
maintenance of genetic redundanc
y 
in eukaryotes.
 
 
3.2 Introduction
 
While there has been a
 
rapid 
increase in the number of organisms with sequenced 
genomes as sequencing technology has i
mproved
 
and costs decreased, a key 
remain
ing 
challenge 
67
 
 
is linking genotypes to phenotypes, and especial
ly
 
genes to functions. In service of this goal, 
large
-
scale reverse genetic screens have been undertaken in many organisms (e.g., 
Lu et al. 2011
, 
Matynia et
 
al. 2008
, 
Giaever et al. 2002
, 
Lee et al. 2003
). These experiments have had varying 
levels of success in identifying phenotypes, as a single gene knockout mutant often does not 
demonstrate a phenot
y
pe that is informative with respect to its function (
Bouché and Bouchez 
2001
, 
Th
at
cher et al. 1998
). This can be due to a number of reasons, including subtle
, 
tissue
-
specific
, or environmentally conditional
 
phenotypes, or genetic interactions 
(
discussed in 
Brookfield 1992; Bouché and Bouchez 2001; Bolle et al. 2013)
.
 
Saccharomyces cerevisiae
 
has been a popular 
eukaryotic 
model system for reverse 
genetics 
for decades because of its small size, short generation time, and genetic tractabili
ty
 
(Botstein and Fink 1988)
. This means that knockout studies can be condu
ct
ed at a scale that is 
not feasible with physically larger or physiologically more complicated organisms. The factors 
that can lead to a 
lack of apparent phenotype in sing
le mutants can be overcome in this system: 
the environment can be manipulated with r
el
ative ease (e.g., increasing the incubation 
temperature, addition of antibiotics or limiting essential elements in the growth medi
um
); f
itness 
phenotypes in the form of c
olony size can be precisely quantified; and there are no complications 
of tissue
-
spe
ci
fic function in a unicellular organism. 
 
Importantly, 
an approach 
called synthetic genetic array (SGA) 
has been developed in 
yeast for 
s
tudying double mutants 
(DMs) 
in a 
systematic way 
(Tong 2001)
. 
Utilizing 
haploid 
single
 
m
utants
 
(SMs)
, SGA involves crossin
g 
each 
SM
 
(the query mutant) 
with all other possible 
SMs (array of mutants) 
to gener
at
e a comprehensive library of 
haploid DMs
. 
Normalized c
olony 
size 
(
i.e., fitness) 
of 
DMs 
can be compared to the corresponding 
SMs and to the wild
-
type (WT) 
to identify
 
int
eractions among
 
pairs of genes
. 
These 
can 
include positive 
genetic 
interactions,
 
68
 
 
whe
re
 
the
 
DM 
has greater fitness than expected compared to
 
the SMs
, 
or 
negative interactions,
 
where the DM has 
lower fitness
.
 
There are several notable 
and sometimes overlappi
ng 
subsets of 
negative genetic interactions, including synthetic lethality and g
enet
ic
 
redundancy
. Sy
nthetic 
lethality refers to a situation where a DM is letha
l while 
both of the corresponding
 
SMs are 
viable
 
(Dobzhansky 1946)
.
 
T
he term
 
ge
netic redundancy
 
typically refers to paralogs that have 
retained some or a
ll of the same function following the duplication event from which they are 
derived 
(e.g., 
Liu 
et al. 2008
, 
Mendonca et al. 2
011
, 
Rutter
 
et al. 2017
). 
Genetic redundancy 
manifests experimentally as any ot
he
r type of negative genetic interaction
, 
where 
the DM has a 
more severe 
fit
ness decrease compared with either of the corresponding SMs
 
(this can include 
synthe
tic lethality)
. The diffe
rence is that 
among redundant gene pairs, 
this effect is 
specifically 
s
ee
n due to 
the 
phenotypic buffering
 
effect of 
having a
n evolutionarily
 
relat
ed 

backup

 
gene to 
perform the function when one is knocked out
.
 
 
The idea of 
havin
g multiple genes 
performing the same function as a
n 
evolutionary 
failsafe makes some intuitive s
en
se. 
However, 
r
etention of functionally redundant genes over 
millions of ye
ars is seemingly an evolutionary paradox, because selective pressure would be 
expect
ed to relax on on
e copy, allowing mutations to accumulate and los
s of
 
the ancestral 
function ove
r 
time 
(Brookfield 1992)
. H
owever, there are genetically redundant gene pairs in the 
yeast genome that have bee
n retained for hu
ndreds of millions of years 
(Vavou
ri
 
et al. 2008)
. 
Features common among retained paralogs have been 
relative
ly 
well
-
studied 
(Seoighe and Wolfe 
1999; Scannell et al. 2007)
, but there is less informat
ion about factors
 
associated 
with and 
mechanisms contributing to 
maintained 
genetic redundancy
, 
i.
e., 
retained paralogs with 
overlapping functions
 
(Guan et al. 2007; Li et al. 2010)
. 
 
69
 
 
A previous study used SGA to identify of 
hundreds of th
ousands of positive and negative 
genetic interactions in yeast 
(Costanzo et al. 2016)
, a resource that can be used to better 
understand factors associated with different types of genetic interactions. 
Using 
these 
published 
phenotype data
, we built upon previous efforts 
to understand the biological underpinnings of 
genetic redundancy. This was accomplished using machine learning to 
build 
predict
ive models 
of
 
phenotyp
ic
 
severity in single and double mutants, and 
of 
genetic red
undancy among 
gene 
pairs as a function of singl
e and double mutant fitness. 
Because 
we used 
a large dataset where
 
phenotypes are known, 
we 
were
 
able to 
first 
assess the accuracy of the 
models
 
on the known data 
and 
then 
identify potential biologically impor
tant features identified by the machine learnin
g 
model as important in making those predictions. 
The single mutant fitness model confirmed the 
validity of 
using predictive model building to identify biologically importa
nt features; 
as 
expected, essential (

portant in predicting severe 
phenotypes. Models were then built for double mutant fitness and genetic redundancy, and the 
features that were important in generating those p
redictions analyzed to identify biolo
gically 
relevant factors which may be contribut
ing to double mutant phenotype severity and genetic 
redundancy.
 
 
3.3 
Results and Discussion
 
3.3.1 
Distribution of fitness scores
 
We
 
first 
evaluat
e
d
 
the structure of the single mutant (SM) and double mutant (DM
) 
fitness data. 
Our analyses included 
the 
5327 
SMs
 
having a reported fitness 
value in the 26°C 
datasets of 
Costanzo et al. 
(2016; see 
Methods
).
 
Fitness values for these mutants ranged from 
0.13 
to 1.
14
 
(
Figure 3.1A
)
, with a value of 1 
representing no chang
e from the WT fitness
. The 
70
 
 
data in our set were
 
extremely unbalance
d;
 
the values were clustered tightly around 1, with a 
median SM 
fitness 
value of 0.99 (mean = 0.93, standard deviation [SD] = 0
.16; 
Figure 
3.
1A
). 
These results indicated that under normal g
rowth conditions, the vast majority of genes in
 
the 
yeast genome can be knocked out with little or no fitness penalty
. This is
 
co
nsistent with 
previous findings 
(
e.g., 
Giaever et al. 2002)
. 
 
DMs 
included in our analys
is
 
comprised paralogous gene pairs with reporte
d fitness 
values for the DM and both corresponding SMs in the 26°C datasets of Constanzo et al. (2016; 
see 
Methods
). 
There were 8160 of these DMs, with fitness values ranging from 
-
0.09 to 1.28 
(
Figure 3.1B
). 
Li
ke the SM
 
values
, the 
DM 
fitness values were 
tightly clustered around 1, with a 
median DM fitness value of 0.96 (mean = 0.87, SD = 0.21; 
Figure 
3.
1B
). Together with the SM 
fitness values, this result suggests that the majority of genes in the genome can 
be
 
knocked out 
either alone or in combination w
ith the closest paralogous gene with little or no consequence to 
the fitness of the organism at 
26°C
. There are several possible explanations for this observation: 
many genes may have conditional phenotypes (e
.g
., 
only present under 
stress
 
conditions
); fit
ness 
could be affected in a way that is not captured by colony size
 
(
such
 
as 
a 
difference in cell 
number
)
; or there may be an extremely high degree of genetic redundancy involving three or 
more paralogs. 
This 
last point depends greatly on how paralogs are 
defined; according to our 
definition, 
46
% of the genes 
in our 
DM 
dataset are members of a gene family wit
h three or more 
members
 
(
Figure 
3.S1
). 
However, using only paralogous gene
s
 
which have been 
confirmed in
 
the literature
 
(according to 
Saccharomyces 
Gen
ome Database;
 
Cherry et al. 1998)
, 
our dataset 
contains no gene families with more than three paralogs and only five gene families containing 
three paralogs. Thus, 
it is unlikely that a high degree of redundancy 
among three or more 
paralogs would account for 
the lack of phenotypes observed 
in both SMs and DMs. 
 
71
 
 
Figure 3.1
: 
Distribution of mutant fitness values.
 
 
D
istribution of (A) single and (B) double mutant fitness values in our dataset. Single mutants 
were 
those for which fitness data were available in 
t
he 26°C dataset from 
Costanzo et al. (2016); 
double mutants were those for which fitness data were available in the Costanzo et al. dataset for 
a double mutant and both corresponding single mutants, where the
 
genes in the double mutant 
pair are paralogs.
 
 
3.3.2 
Predictions o
f single mutant fitness
 
 
To ultimately model genetic redundancy in yeast to identify biological factors 
contributing to it, we first tested the validity of this approach on a simpler prob
lem: prediction of 
SM fitness. We applied machi
n
e
 
learning to build a
 
regression model of SM fitness in yeast 
using 11,661 features spanning five categories (
Supplemental Data 
and 
Methods
), including 
functional annotations (e.g., GO terms), evolutionary properties 
(e.g., presence of homologous 
genes in
 
o
ther spe
cies), protein sequence properties (e.g., PTMs), gene and protein expression 
(e.g., stress conditions under which a gene is differ
entially expressed vs. control), and network 
properties (e.g., gene interactio
ns). In the context of machine learnin
g
,
 
each SM
 
is here 
72
 
 
considered an instance, i.e., one observation. Data used in machine learning comprised the 5327 
SM instances, the corresp
onding SM fitness values as the predicted variable, and the gene 
features corr
esponding to each SM. The instances we
r
e
 
divided
 
into a training and a testing set 
(90% and 10% of instances, respectively; see 
Methods
). The support vector regression (SVR) 
algo
rithm was applied to build a predictive model of SM fitness from the training 
set using ten
-
fold cross
-
validation (s
e
e
 
Methods
), and the resulting model was then applied to the testing set. 
Model performance on the training and testing sets was evaluated u

coefficient (PCC
;
 
0 in a random model and 
-
1 or 1 i
n models with perfect negative or posi
t
i
ve 
corre
lation
, respectively
). The model had a PCC = 0.40 
on the training set and
 
0.35 on 
the 
testing 
set
 
(
Figure 
3.
2A)
. To determine whet
her this would be the performance of the model if the 
features were not actual
ly predictive of fitness (i.e., if thi
s
 
result w
as simply due to chance), we 
randomized the fitness values and reran the model. The PCC was significantly lower than with 
the orig
inal data (0.029 and 
-
0.036 on training and testing sets, respectively; 
Figure
 
3.
2B
), 
indicating that the features d
o
 
have pre
dictive value for single mutant fitness
, and therefore some 
likely 
biologically relevant association with this trait
. 
 
 
3.3.3 
Featu
re importance in predicting SM fitness
 
 
Output of our SVR model included featu
re importance scores, which represent
 
t
he 
contrib
ution of each feature to predictions; high importance scores have larger weight. Among 
features used for SM fitness 
predictions, eight of the top ten predictors (determined by 
 
importance score rank) were fea
tures related to
 
specific genes havin
g
 
interactio
ns with other
 
gene
s
 
(
Figure 3.2C
). Gener
a
lly, a feature such as interaction with a specific gene is highly 
 
sparse

on
ly a small number of genes in the dataset would be interacting, and this held true in
 
73
 
 
Figure 3.2
:
 
Performance and important fe
atures
 
of s
ingle mutant fitness model
. 
 
Performance of 
(A) single mutant fitness model (Spearma

fitness model with randomized fitness va

 
-
3). Models were built with 
the Support Vector Regression (SVR) 
algorithm. Randomized fitness values simulated a 
model 
where features and fitness values 
were not related in any way beyond random chance, 
demonstrating that a relat
ionship did exis
t
 
between SM fitness and our features. (C) Features 
with
 
the ten highest
-
ranked importance scores in pred
icting SM fitness. Mitochondrial translation 
(ranked 2nd) w
as a Gene Ontology (GO) term; N
-
terminal acetyltransferas
e domain (ranked 8th) 
is a PANTHER 
p
rotein domain; all others represent interaction with a 
gene which has been given 
 
 
74
 
 
Figure 3.2 cont

d
 
a simplified descriptor
. In order from rank 1
-
10 the gene name
s are: MRPS12, SEN2,
 
MSS51, 
SNF11, CDC40, DST1, PRX1, and OPI1.
 
 
our d
ataset. 
Each of the 
important gene interaction fe
atures was explanatory for less than 0.
3
% of 
gene pairs in the
 
dataset (13 genes or fewer out of our set of 5327 ha
d
 
each of these inter
actions; 
Figure 3.S2
).
 
Thus, while
 
these features were informative for explaining the fitness effects of 
some single mutants 
which had these interactions, the 
p
redictive powe
r of the
 
features was not 
generalizable throughout the genome. 
 
To determine poten
tial biological reasons why these 
specific features were identified as 
the most important in generating predictions, we looke
d at the functional annotations of
 
the 
genes with
 
which i
nteractions were predictive of fitness. In general, the features were rela
ted to 
the levels and integrity of
 
mRNA and proteins. Specifically, they included a splicing factor, 
splicing suppressor, and
 
splicing endonuclease; a transcri
p
tion factor an
d transc
riptional 
regulator; a chromatin remodeling complex; mitochondrial transla
tion function and a 
mitochondrial 
ribosomal small subunit; and N
-
terminal acetyltransferase activity,
 
which typically 
occurs 
at the ribosome 
(Pol
e
vo
da et al. 20
08)
. To
 
determine the directionality of the predictive 
power (e.g., whether knocking out a gene with N
-
terminal acetyltransferase activity is associated 
with increased or decreased fitness), we next compare
d the 
distributions of fitness val
u
es 
between 
SMs
 
tha
t ha
d 
gene 
interaction
s
 
identified by the model as important 
and those that did 
not (
Figure 
3.S3
). 
 
While the difference was more dramatic for some features than 
for 
others, 
SMs
 
for genes 
with the intera
ctions tended to have decreased fi
t
ness compared 
to t
hose
 
that did not have them. 
As mentioned above, these features mainly represent key housekeeping functions without which 
75
 
 
a cell cannot metabolize and/or grow. For example, four of the features were dire
ctly related to 
mitochondrial func
t
ion (
Figure 3.
S
3
A,
 
B, 
D, I
), with an additional feature (
Figure 3.S
3
J
) 
annotated as having disrupted mitochondrial metabolism when knocked out 
(Luévano
-
M
artínez 
et al. 2013)
. It is therefore 
plausible
 
that altering their functions woul
d be related to decreased 
fitness,
 
demonstrating 
the 
biol
ogical validity of our approach.
 
One might hypothesize that the 
prevalence of mitochondrial
-
related genes and functions may be a result of decades of extensive 
study of mitochondria, leading to a muc
h higher number of genes having an
n
otations relat
ed to 
mi
tochondria than to other function
s, making these features less sparse and therefore more 
informative. 
While we knew that these features were 
sparse 
with respect 
to the number of genes 
in the dataset,
 
we hypothesized
 
that
 
genes having
 
annotations re
lated to
 
mitochondria 
may
 
still 
be 
dispro
p
ortiona
tely
 
high
.
 
To determine whether importance 
among 
features 
such as GO terms 
and gene interaction
 
was simply a function of sparsity, we assessed whether there w
as a linear 
relationship between 
t
h
e 
feature impo
rtance 
s
core 
and the number of genes havi
ng the annotation 
or interaction represented by that feature. Using this measure, importance 
scores 
do not seem 
to 
be dependent
 
on the
 
sparsity 
of the feature 
(
PCC
 
= 
-
0
.
02
, 
p 
= 
0.04
; 
Figure 
3.S
4
A
).
 
 
3
.
3.4 
Prediction
s of DM 
fitness and comparison of 
important features
 
 
As the SM fitness model demonstrated the validity of regression
-
based machine learning 
to predict SM fitness and evaluate identified features as biologically significant, we next sought 
t
o
 
determine whether thi
s
 
approach would be useful for a rel
ated and more complex trait. We 
therefore built a model of DM fitness, deriving 17,199 feature values (
Supple
mental Data
) from 
the corresponding SMs for each DM (see 
Methods
). The DM fitness model 
p
erformed much 
better t
h
an the SM model, with PCC = 0.73 and 0.80 
for the training and testing sets, 
76
 
 
respectively (
Figure 3.3A
). Randomization of the fitness values to approximate predictions of 
fitness with uninformative features resulted in a much worse 
m
odel (PCC = 0.01; 
Figu
r
e 3.3B
),
 
demons
trating that 
there is a relationship between the features and fitness that is stronger than 
random chance.
 
While 
feature importance analysis in the SM fitness model served as 
a demonstration of 
the validity 
of our met
h
ods, we here used feat
ure importance s
cores to identify 
potential 
biological factors contributing to DM fitness. In contrast with the SM feature importance scores, 
the ten highest ranked features (
Figure 
3.
3C
) did not include any specific gene i
nteraction
s
, but 
rather included 
GO terms (eight 
features), protein
-
protein interactions (one feature), and protein 
domains (one feature). The GO terms included transcriptional regulation under zinc starvation, 
function in fermentation, snoRNA localization
, localiza
t
ion to the Golgi, RNA 
polymerase 
assem
bly, and both the 
GO and GOslim terms for mitochondrial translation. Also included was 
the number of GO terms 
that 
were shared between genes in a pair
. Similar to the results from the 
SM fitness model, these
 
are key 
c
e
llular 
functions, whic
h may explain th
e association with
 
decreased fitness.
 
Also consistent with
 
the SM fitness model, there was 
no 
correlation between 
feature 
sparsity
 
and importance rank (
PCC
 
= 0.
05,
 
p
-
value 
=
 
4
.
3
 
x 10
-
6
; 
Figure 
3.S
4B
)
, 
meanin
g 
these fe
a
tures were not selecte
d as important s
imply as 
that type of 
art
i
fact of the data structure
.
 
We next looked at the distribution of values for important features in relation to DM 
fitness to determine the directionality of their predictive 
p
ower
 
(
i.e., whet
h
er higher feature
 
valu
es were related 
to higher or lower DM fitness; 
Figure 3.S
5
). For the important GO
 
terms 
identified, we found that DMs in which one gene had the
 
annotation generally had lower fitness 
values, while DMs in which neither
 
had t
he annotation generally h
a
d highe
r fitness values
. For 
the important features representing shared GO terms and protein
-
protein interactions
 
(referred to
 
 
77
 
 
Figure 3.3
: Performance and important features of double mutant fitness model. 
 
 
P
erformance of (
A

= 0.71 and (
B
) DM fitness m
odel with 
fitness scores permuted to represent no association between features and fitne
ss values 

(
C
) 
Features with the ten highest
-
ranked importance scores
 
in predicting 
D
M fitness. In contrast with the S
M fitness mo
d
el, important 
features were primarily GO terms 
rather than interactions with specific genes.
 
 
78
 
 
-

, we
 
expected 
that
 
genes
 
having a hi
gh percentage
 
of the same protein int
eractors and high number of 
shared 
GO terms may be performing similar functions or involved in the same pathway(s), and therefore 
would have a strong fitness effect when both genes were knocked out. Th
is was
 
not the cas
e, 
however; t
h
ere was no significant 
correlation between percent shared PPI or number of shared 

-
0.03 and 
-
0.04, respectively; 
Figure 
3.S
5
E, I
). Taken together,
 
we found that
, consistent 
with t
he SM fitnes
s model, key 
h
ousekeeping functional 
annotations are predictive of reduced fitness, but that due to the sparsity 
of genes having these annotations, the predictions are 
accurate for a small number of genes. 
 
 
3.3.5 
Prediction of double mu
tant fi
tness using 
single mutant
 
fitness 
 
With
 
one DM f
i
tness model in place, we 
wondered whether the high level of gene 
interactions in yeast 
(Co
stanzo et al. 2016)
 
would limit the utility of 
SM fitness 
data in predicting 
DM fitness. 
Where two
 
genes do no
t have epista
t
ic interactions, the f
i
tness of the corresponding 
DM can be modeled as 
the sum of 
the fitness decreases of both
 
SMs
, 
making predi
ction of
 
DM 
fitness based on SM fitness a very straightforward 
proposition
.
 
However, gene interaction
s such 
as ge
netic redunda
n
cy can lead to non
-
add
i
tive fitness effects in a DM compared to the 
corresponding SMs
,
 
which would be expected to complicate the 
problem 
of DM fitness 
prediction 
significantly
. To answer this question, we incorporated features der
ived from SM
 
fitness into
 
the DM fitness model: 
f
or each DM, we 
included
 
the difference, average, maximum, 
minimum and total SM fitness for the genes in ea
ch pair
 
as features
. Inclusion of these
 
SM 
fitness score
-
derived features (SMF features) did improve 
the model (P
CC = 0.80 and 0.84 for 
training and 
t
esting sets, respectively; 
Figure 
3.
4A
-
B
). The improvement in model performance
 
79
 
 
Figure 3.4
:
 
Performance and important features of double mutant fitness 
model including 
single mutant fitness
-
derived features.  
 
Perfor
mance of (
A
) DM fitness model with
 
single muta
nt fitness (SMF) features included 

 
=) and (
B
) DM fitness model with SMF features included and fitness scores 

=). SMF features for each gene pair comprised the maximum, 
minimum, average, difference, and total values for the corresponding SM fitness values.
 
(
C
) 
Featu
res
 
with the ten highest
-
ranked importance scores in predicting
 
DM fitness with SMF 
features include
d. SMF features 
were
 
extremely important predictors of DM fitness.
 
 
80
 
 
indicated that the SMF fea
tures were important in predicting DM f
itness, which was conf
irmed
 
by the SMF features occupying five
 
of the six top feature importance ranks for thi
s
 
model 
(
Figure 
3.
4C
). 
 
As mentioned above, d
ue to the prevalence of genetic interactions in yeast, this 
impor
t
ance
 
of SMF features in predicting 
DM fitness was not exp
ected, but suggests an 
important relationsh
ip between SM and DM fitness. To determine how effective this relationship 
could be in predicting DM fitness, we built another model using only the five SM
F
 
fea
tures 
(SMF feature model). This
 
model performed bette
r still than the previous model, with PCC =
 
0.84 and 0.83 for the training and testing sets, respectively (
Figure 
3.5
A
), and randomization of 
the DM fitness values demonstrated that this performance
 
was 
not achieved by chance (
Figure 
3.5
B
). The higher perf
ormance of the model with five features compared to the model with 
~17,000 features may be counterintuitive, as a strength of machine learning in general is the 
combination of features to achieve mo
r
e ac
curate predictions than could b
e at
tained with any on
e 
feature. However, the presence of many uninformative features can decrease the performance of 
the model. The sparsity of most of the functional annotation data may have made 
many of them 
uninforma
t
ive,
 
explaining the decreased perfo
rman
ce of the model in
corporating them. 
 
To determine the contribution of each individual SMF feature to the DM predictions, we 
calculated the correlation between DM fitness and each of the SMF features. Despite our 
expe
c
tati
on that genetic interactions wo
uld 
equate to a nonlin
ear relationship between the fitness 
of DMs with the corresponding SMs, DM fitness showed a positive linear correlation with 
minimum, average, total, and maximum SM fitness, and a negative
 
linear
 
co
r
rela
tion with 
difference in SM fitn
ess 

um rho = 0.86, 0.88, 0.88, 0.63, and 
-
0.65, 
respectively; 
p
-
values < 2.2 x 10
-
16
; 
Figure 3.5C
-
G
). In general, a very strong linear correlation 
81
 
 
of an individual feature with the predicted variable ap
p
roxi
mates the predictive power of t
he 
f
eature on its own,
 
as was the case here (
Figure 3.S
6
). 
 
The difference in correla
tion strength/direction among the five SMF features in
dicated 
that the way in which SM fitness was represented affected the predictive power. The two features 
most strongl
y correlated with DM f
itness were the average and the total SM fitness, features 
whic
h contain fitness information about both of the SMs wi
th a single number. Maximum and 
minimum SM fitness features by definition represent only one of the two SMs for each 
gene pair, 
and were no
t as well correlated with DM fitness
. While the decrease in cor
relation was very
 
small in 
minimum SM fitness
, m
aximum
 
SM fitness had a comparatively much weaker 
correlation with DM fitness. This suggests a result 
relevant to gene inte
ractions: that SMs wit
h 
lower fitness are less likely to have positive paralogous gen
e interactions than are SMs with high 
fitness. 
 
 
3.3.
6
 
Predicting degrees of genetic redundancy and comparison of important features
 
 
We next 
sought to determine whether w
e could accurately mod
el 
genetic redundancy as a
 
form of negative gene interaction,
 
i
ncorporat
ing
 
both SM and 
DM fitness 
data for a given g
e
ne 
pair
. Redundancy scores were calculated based on the change in fitness between WT and 
single/double mutants, wher
e the score would be 1
 
in the case of full redundancy and 0 would be
 
nonredundant (see 
Methods
). Using this definition of genetic redundanc
y
, the r
edundancy scores 
in our dataset were concentrated around 0, with a median score of 0.0023
 
 
(mean = 0.01; 
standard deviation = 0.12;
 
Figure 
3.
6
)
, suggesting
 
few instances of genetic redundancy
 
are 
present in
 
the yeast genome
. 
 
As with SM and DM fitn
e
s
s, we 
established a regression model to predict redundancy
,
 
82
 
 
Figure 3.5
:
 
Performance of model 
built only with
 
and correlation of fitnes
s values with 
single mutant fitness
-
derived features
.
 
 
83
 
 
F
igure 3.5
 
cont

 
Performance of (
A


0.88) 
and (
B
) 
DM f
itness model with SMF features only and fitness scores 
randomized
 

-
0.008
).
 
(
C
-
G
) 
Correlation of individual SMF features with DM fitness.
 
 
and 
us
ed 
the same gene pairs and features used for the DM fitness model. In addition, five 
principal component scores derived from GO terms and protein
 
domains 
(
as a means of 
dimensional redu
cti
o
n
 
for t
hose sparse feature
 
types)
 
were included as features. The performance 
 
of the redundancy model was comparable to the performance of the SM fitness model, with PCC 
of 0.32 and 0.40 on the training and te
sting sets, respectively (
Figure 
3.
7
A
). Ho
wever
,
 
the mo
del 
with randomized redundancy values here again demonstrated that there was predictive power in
 
Figure 3.6
:
 
Distribution of genetic redundancy scores. 
 
Distribution of genetic redundancy score
s in our dataset. A value of 0 repres
ents nonredun
danc
y
 
and a value of 1 represents full redundancy, with degrees of partial redundancy represented in 
between. 
 
 
84
 
 
our features beyond random correlation (
Figure 3.7B
).
 
To 
identify potential biological factors associate
d with redundancy, we next analyzed th
e 
fea
t
u
re importance scores for the redundancy model. We hypothesized that there would be a 
relatively high degree of overlap in the most important features for predic
ting DM fitness and 
redundancy, as redundancy was d
irectly calculated from fitness scores
. Thi
s
 
did turn out to be 
the case, with six of the ten most important features in the redundancy model shared with the 
DM fitness model (
Figure 3.7C
), and a high degr
ee of correlation in importance score ranks 
between
 
the DM fitness and redundancy models 
(
Figu
r
e
 
3.7D
).
 
 
M
aximum 
SM fitness 
was the most important feature in predicting redundancy
, despite 
being the 
least important 
of the 
SMF feature
s
 
for 
predicting 
DM fit
ness. Given the high feature 
importance scores and
 
correlation of SMF features with DM f
itness
 
i
n that model, it was 
 
 
Figure 3.7
: 
Performance and important features of redundancy model. 
 
 
(B) redundancy model with 

 
85
 
 
Figure 3.7 

d
 
(
C
) 
Features with the ten highest
-
ranked importance scores in predicting
 
redundancy. 
(
D
) 
Correla
ti
on of feature importance scores in the DM fitness and redund
ancy models showing that 
features important in predicting DM fitness also tend to be important in predicting redundancy
 

0.84
)
.
 
 
86
 
 
expected that important SMF features in the 
re
du
ndancy model would also be correlated with 
redundancy. Howev
er, none of the SMF features were very well correlated with redundancy
 
due 
to the presence of many outliers
.
 
PCC was 0.18, 
-
0.16, and 0.22 
for maximum, difference in and 
average SM fitness, resp
ec
tively (
Figure 
3
.S
7
A, F, J
). The high importance 
scores appear to be 
an effect of the skewedness of the data; the vast majority of SM fitness scores were clustered 
tightly around 1, so the maximum and average SM fitness scores were mostly high, and the 
d
if
ference in SM fitness scores was mostly low. As t
he redundancy scores were clustered around 
0, there would appear to be an association of low redundancy scores with high maximum SM 
fitness, high average SM fitness, and low difference in SM fitness. This 
is
 
simply because that
 
i
s 
how the data were distrib
uted, not because there was any correlation or predictive power. 
 
Several of the features with high importance scores did show some real differences 
between high and low redundancy gene pairs. The number o
f 
genes in the pair that had a GO 
annotation of res
ponse to osmotic stress was related to redundancy (
Figure 
3.S
7
G
). Gene pairs 
that had either 0 or 1 gene in the pair with this annotation had low redundancy scores, while 
instances where both genes had the
 
a
nnotation had high redundancy scores. This makes 
biological sense in that genes with the same annotation would be more likely to be redundant 
than gene pairs without the same functional annotations. A similar case was seen with 
other 
GO 
annotation
s, name
ly
 
an annotation
 
of 60S L4 ribosomal protein for bo
th genes in the pair 
co
-
occurring with 
a high redundancy score 
(
Figure 
3.S
7
H
) and
 
an 
annotation of epsin 5 for 
only 
one gene in a pair 
co
-
occurring with a low redundancy score 
(
Figure 
3.S
7
I
). This was once
 
a
gain likely due to the uneven distribution of dat
a

for each feature, there was one gene pair 
with a high redundancy score and the GO annotation, making these features not generally 
predictive of genetic redundancy, but highly effective at accurately pred
ic
ting a very small 
87
 
 
number of gene pairs
. 
In fact, 
for 
binary features such as GO
 
annotations 
in this model
, the 
feature importance score 
was essentially a function of the 
how many
 
gene pairs ha
d 
the 
annotation 
multiplied by
 
the difference in median 
redund
an
cy score
 
among gene pairs with
 
and 
without
 
the annotation (
Figure 
S
8
). Thus, feature importance 
in this case 
appears to be a 
combination of 
the structure of the data (feature sparsity, as discussed in 
Section 
3.3.3
) and 
true
 
differences in 
redundancy 
sco
re
 
among g
ene pairs with specific annotations, such as the 60S L4 
ribosomal protein and epsin 5.
 
Furthermore
, ribosomal genes are known to have many copies in 
yeast 
(Komili et al. 2007)
, and epsin
 
is a family of highly conserved membrane pro
te
ins 
(Sen et 
al. 2012)
, pointing to potential broader biological significance of these results
 
in the context of 
genetic redundancy
. 
 
 
3.4 
Conclusions
 
In this study we 
used machine learning models to predict mutan
t
 
fitness and genetic 
redundancy in yeast. We established several extremely accurate regression models for the 
prediction of DM fitness phenotypes. As DM fitness can easily be found experimentally, the 
importanc
e of this model is not so much for generating
 
p
redictions; that the predictions are so 
good simply shows that the model is working accurately. The biological significance of our 
models was also supported by the SM fitness model, as several features related
 
to mitochondrial 
function were found to be
 
i
m
p
ortant in predicting SM fitness. Thus, our experiments 
demonstrated the validity of using a machine learning approach to model fitness phenotypes and 
identify features that are important contributors to yeast 
SM and DM fitness, which are the basis 
for ca
l
c
ulating genetic redundancy. We also discovered that SM fitness is a very reliable 
predictor of DM fitness, which was not previously known, and identified several GO annotations 
88
 
 
that are highly accurate in pred
icting redundancy scores among a subset of ge
n
e
 
pairs
, 
implicating those processes as potentially important in retained redundancy
. 
 
While this study found some exciting results, there 
certainly 
are
 
further
 
avenues that 
could be explored with respect to pr
ediction of redundancy. The features we used 
f
o
r prediction 
of redundancy among gene pairs were mainly different ways of utilizing the same features used 
in the SM fitness model, but there are many other interesting features specific to pairs of genes 
that
 
may shed more light on the relationship betw
e
e
n genes in a pair. Future work in this area 
would use DM
-
specific features such as the rate of synonymous and nonsynonymous substitution 
and gene co
-
expression values. Additionally, a classification
-
based mode
l could be used to 
predict fitness and redund
a
n
cy as categorical rather than continuous variables; for example, SM 
and DM fitness could be categorized as increased fitness/same fitness/decreased fitness with 
respect to the wild type, and genetic interactio
ns could be represented as negative 
epistasis
/
p
ositive epistasis (redundancy)/no gene interaction. Approaching fitness and 
redundancy as a classification problem would potentially allow for more accurate predictions as 
well as important features that are g
eneralizable to specific, biologically releva
n
t
 
states of fitness 
and genetic interaction, rather than specific features important only to a few gene pairs along a 
continuum of values. It would also be beneficial to incorporate yeast fitness data from addi
tional 
conditions, such as increased temperat
u
r
e or nutrient limitation, to identify redundancy among 
genes that are only conditionally 
observed
.
 
 
On the computational side
, 
additional model 
hyper
-
parameter 
optimization 
should be 
conducted using grid 
search
. Additionally,
 
the algorithm used to build 
t
h
is model 
(SV
R
) i
s 
somewhat limited 
in terms of parsing individual feature contributions 
to the model as a whole 
and to specific gene pairs
.
 
A
 
decision tree
-
based model 
could be built
, which would allow fo
r a 
89
 
 
more nuanced investigation of features that ar
e
 
important in predicting redundancy generally and 
in each 
gene pair 
instance. 
Analysis of features important in 
correctly predicted and 
mis
-
predicted gene pairs would be expected to uncover 
features with b
iological 
significance in 
genetic redundancy and a
l
l
ow us to further optimize the model,
 
for example by removing 
features that consistently contribute to mis
-
predictions. 
 
We here learned that a regression
-
based machine learning approach can be used to 
ide
ntify features that are biologically relevant in p
r
e
dicting fitness phenotypes and genetic 
redundancy in yeast. Single mutant fitness was unexpectedly important in predicting double 
mutant fitness, with a correlation suggesting that single mutants with low
 
fitness are less likely to 
have gene interactions
 
i
n general. Finally, contrary to expectations, we determined that in 
standard yeast growth conditions 
(at 26°C on rich medium) 
it is difficult to predict genetic 
redundancy as a continuous variable; future
 
predictive analyses may benefit from the inclusio
n
 
of fitness data from additional conditions and approaching genetic redundancy as one of several 
categori
cal types
 
of gene interactions.
 
 
3.
5
 
Materials and Methods
 
3.5.1 
Data source
 
 
Saccharomyces cerevisiae
 
fitness data for single and double mutants 
(gro
w
n
 
at 26°C 
on 
Yeast Extract Peptone Dextrose medium, as described by 
Kuzmin
 
et al. 
(
2014
)
)
 
were obtained 
from Costanzo et al. 
(
2016
)
. For predic
t
i
ve modeling, data from five general categories were 
collected for each gene: functional annotations; evolutionary properties; protein sequence 
properties; gene and protein 
expression; and network properties (
Supplemental Data
). 
 
90
 
 
Functional annotation dat
a 
were ob
t
ained from the Saccharomyces Genome Database 
(SGD, 
yeastgenome.org
) and included gene ontology (GO) terms (
v20170913
), GO slims 
(v20170114), protein domains (v201610), and b
iochemical pathways (v. 15). From the 
protein 
domain
 
data, the PANTHER do
ma
in anno
t
ations were used. Evolutionary data from SGD 
comprised the number of orthologs for each gene among bacteria and among genomes 
throughout the domain eukarya, with a particula
r focus on fungi (
v201609). 
Protein properties 
data obtained from SGD inc
lu
ded sub
c
ellular localization 
(Huh et al. 2003)
; molecular weight, 
amino acid length, an
d 
isoelec
t
ric point (
v20160916
)
; and regulatory protein binding levels 
(Venters et al
. 2011)
. Post
-
translational modifications (PTMs) were from iPTMnet 
(Huang et al. 
2018)
 
release 5.0. Expression data from SGD included protein abundance in wild type (WT) 
yeast 
(H
uh et al. 2003)
, protein abundance in WT and mutant yeast under stress conditions 
(Chong et al. 2015)
, mRNA hal
f
-
lives 
(Geisberg et al. 2014)
, and gene over
-
 
or underexpression 
in stress conditions 
(Waern and Snyder 2013)
. Network property data
 
c
omprise
d
 
physical 
interactions (
v20170114
). 
 
 
3.5.2 
Fea
tures for single mutant fitness predictions
 
For single mutant (SM) fitness predictions, the features are listed in 
Supplemental Data
.
 
Most of the data were processed in each of two ways: as binary (p
re
sence/absence) a
nd
 
as 
continuous (total number) data. As an exam
ple, each gene had a binary feature value for each 
GO term, if the gene had that annotation or not, and a continuous feature value for all of the GO 
terms collectively, the number of GO anno
ta
tions for the ge
ne
. Data processed in this manner 
included the f
ollowing categories: GO terms, GO slim terms, PANTHER protein domains, 
biochemical pathways, species with a homologous gene, subcellular localization, PTMs, physical 
91
 
 
interactions, overexpres
si
on in stress con
di
tions, and underexpression in stress condition
s. An 
additional continuous feature, differential expression in stress conditions, combined the over
-
 
and underexpression terms. Each PTM was also represented by a continuous value for each 
ge
ne 
comprising th
e 
number of sites at which the protein product h
ad that PTM. Continuous features 
with no binary feature counterparts included molecular weight, amino acid length, isoelectric 
point, protein abundance, mRNA half
-
life, and level of regulato
ry
 
protein binding
.
 
 
3.5.3 
Features for double mutant and 
genetic redundancy predictions
 
The majority of features for double mutants (DMs) were derived from SM gene pair 
features, and the manner in which these were calculated depended on the type of SM fea
tu
re 
(
Supplemental Data
).
 
C
ate
gories of gene properties processed a
s binary for SMs include GO 
terms, GO slim terms, PANTHER protein domains, biochemical pathways, species with a 
homologous gene, subcellular localization, PTMs, physical interactions, and d
if
ferential 
expression in s
tre
ss conditions. These gene properties 
processed as binary for SMs were used to 
generate both continuous and categorical features for DMs. For each gene property category, 
continuous DM features were generated by determining the
 
p
ercent and number of over
lap
ping 
properties between each gene pai
r. For each gene property within each category, categorical DM 
features were generated by taking the total number of genes in the pair with that property (0, 1 or 
2). Additional DM features
 
w
ere generated from contin
uou
s SM features by taking the 
differenc
e, average, maximum, minimum, and sum of the two SM values in a DM pair. This 
same processing was applied to SM fitness values for use in the DM model as detailed below.  
 
92
 
 
3.5.4 
Definition 
of
 
genetic redundancy
 
 
From
 
th
e 26°C datasets of Constanzo et al. (2016), we identified a subset of paralogous 
gene pairs (defined as gene pairs with a BLASTP Expect
-
value < 1 x 10
-
5
 
; 
Altschul et al. 1990)
 
for which fitness data were available for the DM and both corresponding SMs. DMs in t
his
 
subset were us
ed
 
in modeling DM fitne
ss a
nd 
redundancy scores. We define genetic redundancy 
(
R)
 
as a state where the difference in DM fitness compared to WT is greater than the sum of the 
difference between both corresponding SMs compared to WT, i.e., there is a synergi
st
ic rather 
than additi
ve e
ffe
ct of the double mutations:
 
 
(Eq. 1)
 
 
R 
= (
F
WT
 
-
 
F
DM
) 
-
 
(
F
WT
 
-
 
F
SM1
)
-
 
(
F
WT
 
-
 
F
SM2)
 
 
where 
F
 
represents the fitness value of the indicated genotype. The mutant fitness values were 
normalized (

) relative to the fitness of WT
. 
Equation 1 was thus simpl
ifi
ed to:
 
 
(Eq. 2)
 
 
R 
= (1 
-
 

DM
) 
-
 
(1
-
 

SM1
)
-
 
(1 
-
 

SM2
)  
 
 
R 
was calculated for the DMs i
n our dataset to generate continuous redundancy scores. 
 
 
3.5.5 
Machine learning models
 
 
Predictive models of SM fitness, DM fitness,
 
a
nd genetic redundancy wer
e im
plemented 
in the ScikitLearn machine learning package in Python 
(Pedregosa et al. 2011)
. Support Vector 
Regression 
(
SVR, 
Drucker et al. 1997)
 
was used to bu
il
d regression models to pr
edic
t fitness and 
redundancy scores as continuous values. For building a model, the gene (for the
 
SM fitness 
93
 
 
model) or gene pair (for the DM fitness and redundancy models) instances were randomly 
divided into a training set (90%)
 
f
or model building and a t
esti
ng set (10%) that was withheld 
from model building for independent validation. During trainin
g, 10
-
fold cross
-
validation was 
used to build the models, meaning that a total of 10 models were built with the training data; 
with 
ea
ch model, 90% of the trai
ning
 
set instances were used for model building and the 
other10% to were used to cross
-
validate t
he performance of the model. After cross
-
validation, 
the trained model was then applied to the testing set.
 
Performance results are 
re
ported f
rom
 
cross
-
validat
ion 
and from the testing set
. 
 
 
94
 
 
APPENDIX
 
95
 
 
Supplemental Information
 
 
Figure 3.S1
: 
Distribution of gene family sizes. 
 
Distribution of gene
 
family sizes 
for the paralogs in our dataset
. 
The paralogous gene set 
comprise
s
 
22
87 unique 
genes in
 
571 
gene
 
families
. Fort
y
-
three percent of 
the gene families
 
have 
more than two members
, 
meaning that 
61% of the 
paralogous genes in our dataset 
have three or 
more paralogs. 
This 
conflicts with 
results 
in the Saccharomyces Genome Data
ba
se
, which 
shows 
only thre
e co
nfirmed gene families in our datas
et with more than 
two members, likely due to 
our 
use of a sequence similarity
-
based
 
definition of paralogs.
 
 
96
 
 
Figure 3.S
2
: 
Feature importance scores for SM fitness model.
 
 
Number of genes ha
v
i
ng
 
each of the ten most i
mpor
tant features
 
in predicting SM fitness
. 
Mitochondrial translation (ranked 2
nd
) was a Gene Ontology (GO) term; N
-
terminal 
acetyltransferase domain (ranked 8
th
) is a PANTHER protein domain; all others represent 
interaction with
 
a
 
g
ene which has been give
n a 
simplified de
scriptor. In order from rank 1
-
10 the 
gene names are: MRPS12, SEN2, MSS51, SNF11, CDC40, DST1, PRX1, and OPI1. 
 
Many of the gene interaction features in our dataset are sparse, meaning few 
of the genes have 
these
 
i
nt
eractions
 
and they were
 
not
 
expected to 
be particularly informativ
e in ML. 
 
 
97
 
 
Figure 3.S3
: Distribution of 
SM fitness values among 
gene pairs with selected annotations.
 
D
istribution of SM fitness values among genes with and w
ithou
t 
the annotatio
ns and 
inte
ractions 
show
n 
to be important in predicting SM fitness (see 
Figure 3.
2C
 
and 
Figure 3.S2
)
. In general, 
genes without the annotations and interactions had lower SM fitness values.
 
 
98
 
 
Figure 
3.S3 cont

d
 
D
istribution of SM fitness
 
values among genes with and w
ithou
t 
the annotations and 
inte
ractions 
show
n 
to be important in predicting SM fitness (see 
Figure 3.
2C
 
and 
Figure 3.S2
)
. In general, 
genes without the annotations and interactions had lower SM fitness values.
 
 
99
 
 
Figure 3
.S4
: 
Feature importance scores vs
. num
be
r of gene pairs with
 
ann
otation.
 
 
(
A
)
 
For
 
each binary feature in the SM fitness model (e.g., GO term or protein domain annotation; 
interaction with a specific gene), comparison of the feature importance score with th
e number of 
genes having that ann
otati
on
 
or interaction. (
B
)
 
For
 
features in the DM fitness model 
corresponding to the features shown 
in 
A
, comparison of the feature importance score with the 
number of gene pairs in which both genes have that annotation or
 
interaction. In both cases, ther
e 
is 
no
 
correlation, indica
t
ing
 
that feature importance is not simply a function of the prevalence of 
a particular feature in our dataset. 
 
 
100
 
 
Figure 
3.
S
5
: Distribution of 
DM fitness values among gene pairs
 
with selected annotations.
 
(
A
-
D
, 
F
-
H
,
 
J
)
 
Distri
butio
n 
of 
DM
 
fitness values
 
amo
ng genes with and without the annotations 
and interactions shown 
to be important in predicting DM fitness (see 
Figure 3.
3
C
)
. 
In 
general, 
gene
 
pairs in which one 
gene
 
without the
 
annotations and interactions had lower SM fitne
ss 
 
101
 
 
F
igure 3.S5 

 
v
al
ues.
 
(
E
) Distribution of DM fitness values vs. number of overlapping GO terms and (
I
) 
percent overlapping protein
-
protein interactions among genes in a pair. There was no significa
nt 
relationship between DM fitness and either
 
of tho
s
e 
features, contrar
y to
 
our expectations. 
 
 
102
 
 
Figure 3.S
6
: Performance of 
DM fitness models built with single features.
 
PCC scores representing the performance 
of prediction
 
of DM fitness 
using each of the 
SMF 
features separately. These results ar
e consi
s
te
nt with the corre
lati
on strength between DM fitness
 
and each SMF feature.
 
 
103
 
 
Figu
re 3.S7
:
 
Distribution of redundancy scores among gene pairs with selected ann
otations.
 
(
A
) Distribution of redundancy scores vs. maximum SM 
fi
tness for ge
nes in a pair; 
(
D
) 
vs. 
number
 
of overlapping GO terms
.
 
N
either
 
of the correlations were 
very strong, highlighting the 
importance of a mo
de
l combining 
multiple featur
es
 
w
ith weak p
redictive power 
to generate more 
accurate predictions. 
(
B
) 
and (
C
) 
Distribution of 
redundancy scores
 
among gene
 
pairs
 
with 
0, 1 
or 2 genes in the pair having 
an annotation 
shown 
to be important in predicting 
redundancy
 
(see
 
Figure 3.
7
C
)
. 
 
 
104
 
 
Figure 
3.S7 cont

d
 
(
E
) 
and (
G
-
I
) 
Distribution of 
redundancy scores
 
among gene
 
pairs
 
with 
0, 1 or 2 genes in the pair 
having 
an annotation 
shown 
to be important in predicting 
redundancy
 
(see
 
Figure 3.
7
C
)
. 
There 
were f
ew
 
g
eneralizable patterns, although some
 
features 
(
G
-
I
) 
did show a very small number of 
gene
 
pairs which had the annotation and 
a high redundancy score, likely leading to the high 
imp
ortance value assigned to those features.
 
(
F
) 
Distribution of redundancy scores
 
vs. 
difference 
in SM fitness betw
een genes in a pair
 
and
 
(
J
)
 
vs.
 
average SM fitness for genes in a pair. N
either
 
of the correlations were 
very strong, highlighting the importance of a mo
de
l combining 
multiple 
featur
es
 
w
ith weak predictive power 
to generate more accurate predictions.
 
 
105
 
 
Fi
gure 
3.S8
: 
Fea
tur
e
 
importance scores vs. number of gene pairs with annotation and 
difference in redundancy score among genes in a pair.
 
 
(
A
)
 
For features in the 
redundancy 
model 
derived from 
binary 
features 
(e.g., 
number of genes in 
a pair wit
h a 
GO ter
m o
r protein doma
in 
a
nnotation)
, comparison of the feature importance score 
with the number of gene pairs in which both genes have that annotation or interaction. 
There is 
no correl
ation, indicating that fe
ature importance is not simply a functio
n of the n
umb
er of genes 
in
 
th
e
 
dataset with a particular annotation. (
B
) 
For features shown in (
A
), m
edian redundancy 
score was calculated among gene pairs in which both genes ha
d the annota
tion 
and among gene 
pairs in which one or neither gene had the an
notation
; 
the
 
difference in
 
me
d
ian values was then 
compared with 
f
eature importance scores
. There was some correlation between feature 
importance and the median redundancy score 
difference, indicating that
, as would be expected, 
feature importance was 
affe
cted by 
th
e a
bility of an i
ndi
v
idual feature 
to differentiate between 
high and low redundancy scores. 
 
 
106
 
 
C
HAPTER
 
4: CONCLUSIONS
 
 
107
 
 
In studying genetic redundancy over the last several years, I have come to appreciate it 
for the nuanced and evolutionar
ily fascin
ati
ng phenomenon 
tha
t
 
it is. The Shiu Lab
,
 
together 
with 
Jeffrey Conner at Kellogg Biological Station and 
Patrick Krysan 
at 
Un
iversity of 
Wisconsin
-
Madison, 
is 
currently 
conducting 
experiments to dig deeper into 
these nuances, 
defining 
genetic re
dundancy a
s a
 
continuous 
ra
the
r
 
than binary 
trait
. 
They are accomplishing this
 
by 
measuring 
lifetime fitness of 
hundreds of 
single 
and double mutant trios
 
in Arabidopsis
, both 
in growth chambers and in the field. 
T
he Shiu Lab has 
also
 
collaborated with the
 
Kramer La
b
 
a
t 
MSU, using t
hei
r
 
Center for Advanced Algal and Plant Phenotyping to 
conduct comprehensive 
photosynthetic phenotyping
 
of 
some of 
our mutants, 
generat
ing
 
truly staggering amounts of data
. 
With these quantitative 
phenotypes
, 
degrees of 
genetic 
redundancy
 
am
ong gene pairs
 
ca
n
 
be 
calculated
,
 
producing a veritable treasure trove of data
 
for some lucky student
(s)
 
to analyze
. 
I 
look forward to seeing the next iteration of the Arabidopsis redundancy model, using 
genetic 
redundancy scores derived from 
this proje
ct 
to 
increase th
e a
c
curacy of predi
ctions 
and 
uncover
 
additional insights into the evolutionary underpinnings of redundancy
. 
 
When I joined the Shiu Lab, 
its
 
graduate student population was at something of an
 
inflection point. One had recently f
inished up
 
an
d moved on, an
d t
w
o more would do
 
the same 
within the next six months. I had the privilege to work alongside two other grad students for 
another couple of years before they graduated too
, and 
i
n 
that same year, 
my 
last one
, we gained 
two new s
tudents. 
E
ver
y scientist in
 
ac
a
demia
 
is famili
ar with this

 
frequent turnover
 
as 
students come and go
, with all the 
new ideas and 
feelings of excitement, loss, joy, 
and 
sadness
 
that brings. 
In just a few short years, 
largely 
driven by graduate students, 
th
e lab adde
d 
s
everal 
areas o
f r
e
search to 
its repertoire
 
and 
began mastering
 
new, 
more powerful machine learning 
methods
 
than the ones used and discussed here
. 

108
 
 
at which 
computational biology is moving
, and I 
thank new 
and
 
future gradua
te 
s
tudents 
ev
erywhere for 
show
ing
 
me 
that there will always be 
bright and 
enthusiastic 
new 
people
 
just 
waiting for a chance 
to 
push
 

possible
. 
 
 
109
 
 
REFERENCES
 
 
110
 
 
R
EFERENCES
 
 
A
ltschul SF, Gish 
W, Miller W, Myers EW, Lipman DJ. 1990. Basic Local Alignment Search 
Tool. J Mol Biol. 215(3):403

410.
 
 
Benjamini Y, Hochberg Y. 1995. Controlling the False Discovery Rate: A Practical and 
Powe
rful Approach t
o Multiple Testing
. J R Sta
t
 
So
c Ser B. 57(1):28
9

300.
 
 
Berardini TZ, Reiser L, Li D, Mezheritsky Y, Muller R, Strait E, Huala E. 2015. The 

Reference Plant Gen
ome. Genesis. 5
3(8):474

485.
 
 
Bir
chler JA,
 
Vei
tia RA. 2010. The
 
Gene Balance Hypothesis: implications for gene regulation, 
quantitative traits and evolution. New Phytol. 186(1):54

62.
 
 
Blanc G, Wolfe KH. 2004. Functional Divergence of Duplicated Genes Form
ed by Polyploid
y 
during Arabidops
is Evolut
i
on.
 
Plant Cell. 16(7
):1679

1691.
 
 
Bolle C, Huep G, Kleinbölting N, Haberer G, Mayer K, Leister D, Weisshaar B. 2013. GABI
-
DUPLO: A collection of double mutants to overcome genetic redundancy in Arabidopsis 
thalian
a. Plant J. 75(
1):157

171.
 
 
Botst
ein D, Fi
n
k G
R. 1988. Yeast: A
n experimental organism for modern biology. Science. 
240(4858):1439

1443.
 
 
Bouché N, Bouchez D. 2001. Arabidopsis gene knockout: phenotypes wanted. Curr Opin Plant 
Biol. 4(2):111

117.
 
 
Bowers 
JE, Chapman BA, 
Rong J, Paterson A
H. 2003. 
U
nra
velling angiosper
m genome 
evolution by phylogenetic analysis of chromosomal duplication events. Nature. 422(6930):433

438.
 
 
Brandão MM, Dantas LL, Silva
-
Filho MC. 2009. AtPIN: Arabidopsis thaliana 
Protein 
Interaction Network.
 
BMC Bioinformatic
s. 10(454
)
.
 
 
Briggs GC, Osmont
 
KS, Shindo C, Sibout R, Hardtke CS. 2006. Unequal genetic redundancies 
in Arabidopsis 
-
 
a neglected phenomenon? Trends Plant Sci. 11(10):492

498.
 
 
Brookfield J. 1992. Can genes be
 
truly redundant? Curr Biol.
 
2(10):553

554.
 
 
B
rookfield
 
JFY
. 1997. Genetic r
edundancy. In: Advances in Genetics vol. 36. p. 137

155.
 
 
111
 
 
Chen H
-
W, Bandyopadhyay S, Shasha DE, Birnbaum KD. 2010. Predicting genome
-
wide 
redundancy using machine learning. BMC Ev
ol Biol. 10(357).
 
 
Cherry JM
, Adler C, Ball C,
 
Chervitz
 
SA,
 
Dwight SS, Heste
r ET, Jia Y, Juvik G, Roe T, 
Schroeder M, et al. 1998. SGD: Saccharomyces Genome Database. Nucleic Acids Res. 
26(1):73

79.
 
 
Chong YT, Koh JLY, Friesen H, Duffy K, Cox MJ, Moses A, 
Moffat J, Boone C, Andrews B
J. 
2015. Yeast pro
teome dyn
a
mic
s from single cel
l imaging and automated analysis. Cell. 
161(6):1413

1424.
 
 
Costanzo M, VanderSluis B, Koch EN, Baryshnikova A, Pons C, Tan G, Wang W, Usaj M, 
Hanchard J, Lee SD, et al. 2016. A glo
bal genetic interaction netw
ork maps a wiring 
diagram o
f
 
ce
llular function. 
Science. 353(6306).
 
 
DeSmet R, Adams KL, Vandepoele K, Van Montagu MCE, Maere S, Van De Peer Y. 2013. 
Convergent gene loss following gene and genome duplications creates single
-
cop
y families in 
flowering plan
ts. Proc Natl Acad
 
Sci U S 
A
. 1
10(8):2898

2903.
 
 
Dobzhansky T. 1946. Genetics of natural populations. XIII. recombination and variability in 
populations of Drosophila pseudoobscura. Genetics. 31(May):269

290.
 
 
Drucker H, Surges 
CJC, Kaufman L, Smola A, Vap
nik V. 1997. Suppo
rt vector
 
reg
ression 
machines.
 
Adv Neural Inf Process Syst. 9:155

161.
 
 
Edgar RC. 2004. MUSCLE: Multiple sequence alignment with high accuracy and high 
throughput. Nucleic Acids Res. 32(5):1792

1797.
 
 
Edger PP,
 
Heidel
-
Fischer HM, Bekaert 
M, Rota J, Glöckne
r G, Plat
t
s A
E, Heckel DG, Der
 
JP, 
Wafula EK, Tang M, et al. 2015. The butterfly plant arms
-
race escalated by gene and genome 
duplications. Proc Natl Acad Sci U S A. 112(27):8362

8366.
 
 
El
-
Brolosy MA, 
Kontarakis Z, Rossi A, Kuenne C, Günt
her S, Fukuda N, K
ikhi K, B
o
ezi
o 
GLM, Takacs CM,
 
Lai SL, et al. 2019. Genetic compensation triggered by mutant mRNA 
degradation. Nature. 568(7751):193

197.
 
 
Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell 
AL, Potter SC, Punta M, 
Qureshi M, Sa
ngrador
-
Vegas A, e
t al. 201
6
. T
he Pfam protein f
amilies database: towards a more 
sustainable future. Nucleic Acids Res. 44(D1):D279

D285.
 
 
Gabriel ML. 1960. Primitive Genetic Mechanisms and the Origin of Chromosomes. Am
 
Nat. 
94(877):257

269.
 
 
Geisberg J V.
, Moqtaderi Z, Fan
 
X, Ozsol
a
k F
, Struhl K. 2014.
 
Global analysis of mRNA 
isoform half
-
lives reveals stabilizing and destabilizing elements in yeast. Cell. 156(4):812

824.
 
 
112
 
 
Giaever G, Chu AM, Ni L, Connelly C, Riles L, V
eronneau S, Dow S, Lucau
-
Danila A, 
An
derson K, Andre B,
 
et al. 2
0
02.
 
Functional profi
ling of the Saccharomyces cerevisiae genome. 
Nature. 418:387

391.
 
 
Goda H, Sasaki E, Akiyama K, Maruyama
-
Nakashita A, Nakabayashi K, Li W, Ogawa M, 
Yamauchi Y, Preston J, 
Aoki K, et al. 2008. The AtGenExpress
 
hormone and chemi
cal treatm
ent
 
data set: experi
mental design, data evaluation, model data analysis and data access. Plant J. 
55(3):526

542.
 
 
Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W
, Hellsten 
U, Putnam N, et al. 2012. 
Phytozome: a compa
rative pla
tfo
rm for green plan
t genomics. Nucleic 
Acids Res. 40(D1):D1178

D1186.
 
 
Guan Y, Dunham MJ, Troyanskaya OG. 2007. Functional analysis of gene duplications in 
Saccharomyces cerevisiae. Genetics
. 175(2):933

943.
 
 
Guckes KR, Cecere 
AG, Wasilko NP, Wi
lliams AL,
 
Bu
ltman KM, Mandel 
MJ, Miyashiro T. 
2019. Incompatibility of vibrio fischeri strains during symbiosis establishment depends on two 
functionally redundant hcp genes. J Bacteriol. 201(19):1

14
.
 
 
Hanada K, Zou C, Lehti
-
Shiu MD, Sh
inozaki K, Shiu S
-
H. 2008. I
mpo
rtance of Lineage
-
Specific Expansion of Plant Tandem Duplicates in the Adaptive Response to Environmental 
Stimuli. Plant Physiol. 148(2):993

1003.
 
 
Haruta M, Burch HL, Nelson RB, 
Barrett
-
W
ilt G, Kline KG, Mohsin SB, Young JC, Otegui MS, 
Sussman MR. 2010
. M
olecular Characte
rization of Mutant Arabidopsis Plants with Reduced 
Plasma Membrane Proton Pump Activity. J Biol Chem. 285(23):17918

17929.
 
 
Hsu PY, Calviello L, Wu H
-
YL, Li F
-
W, 
Rothfels 
CJ, Ohler U, Benfey PN. 2016. Super
-
resolution ribosome profiling
 
re
veals unannotated translation events in Arabidopsis. Proc Natl 
Acad Sci U S A. 113(45):E7126

E7135.
 
 
Huang H, Arighi CN, Ross KE, Ren J, Li G, Chen SC, Wang Q, Cowart J, Vijay
-
Sha
nker K, W
u 
CH. 2018. IPTMnet: An integrated resource for protein post
-
transla
tional modification network 
discovery. Nucleic Acids Res. 46(D1):D542

D550.
 
 
Huh, K. W, Falvo, V. J, Gerke, C. L, Carroll, S. A, Howson, W. R, et al. 2003. Global analysis 
of prot
ein local
ization in budding yeast. Nature. 425(6959):686

691.
 
 
Hunter R. L., Markert C. L. 1957. Histochemical Demonstration of Enzymes Separated by Zone 
Electrophoresis in Starch Gels. Science. 125(3261):1294

1295.
 
 
Jammes F, Song C, Shin D, 
Munemasa S, Takeda K, 
Gu D, Cho D, Lee S, Giordo R, Sritubtim 
S, et al. 2009
. MAP kinases MPK9 and MPK12 are preferentially expressed in guard cells and 
positively regulate ROS
-
mediated ABA signaling. Proc Natl Acad Sci U S A. 106(48):20520

20525.
 
113
 
 
Jiang W,
 
Liu Y, Xia E, Gao L. 
2013. Prevalent Role of Gene Features in Determining 
E
volutionary Fates of Whole
-
Genome Duplication Duplicated Genes in Flowering Plants. Plant 
Physiol. 161(4):1844

1861.
 
 
Kempin SA., Savidge B, Yanofsky MF. 1995. Molecular Basis of t
he cauliflower Phenoty
pe in 
Arabidopsis. Science. 267(5197):522

525.
 
 
Khan N
A, Haynes RH. 1972. Genetic redundancy in yeast: Non
-
identical products in a 
polymeric gene system. MGG Mol Gen Genet. 118(3):279

285.
 
 
Kilian J, Whitehead D, Horak J, Wanke D, Wei

gelo C, Bornberg
-
Bauer E, 
Kudla J, Harter K. 2007. The
 
AtGenExpress global stress expression data set: protocols, 
evaluation and model data analysis of UV
-
B light, drought and cold stress responses. Plant J. 
50(2):347

363.
 
 
Klappenbac
h JA, Dunbar JM, Schmi
dt TM. 2000. rRNA operon copy number reflects ecologic
al 
strategies of bacteria. Appl Environ Microbiol. 66(4):1328

1333.
 
 
Komili S, Farny NG, Roth FP, Silver PA. 2007. Functional Specificity among Ribosomal 
Proteins Regulates Gene Ex
pression. Cell. 131(3)
:557

571.
 
 
Kosetsu K, Matsunaga S, Nakagami H, Colcomb
et J, Sasabe M, Soyano T, Takahashi Y, Hirt H, 
Machida Y. 2010. The MAP kinase MPK4 Is Required for Cytokinesis in Arabidopsis thaliana. 
Plant Cell. 22(11):3778

3790.
 
 
Krysan PJ, J
ester PJ, Gottwald JR,
 
Sussman MR. 2002. An Arabidopsis mitogen
-
activated 
pr
otein kinase kinase kinase gene family encodes essential positive regulators of cytokinesis. 
Plant Cell. 14(5):1109

1120.
 
 
Kuzmin E, Sharifpoor S, Baryshnikova A, Costanzo M, Myers
 
CL, Andrews BJ, Boone
 
C. 2014. 
Chapter 10: Synthetic Genetic Array Analysis
 
for Global Mapping of Genetic Networks in 
Yeast. In: Yeast Genetics: Methods and Protocols. Vol. 1205. p. 143

168.
 
 
Lacroute F, Piérard A, Grenson M, Wiame JM. 1965. The biosynthe
sis of carbamoyl phosp
hate 
in Saccharomyces cerevisiae. J Gen Microbiol. 40(
1):127

142.
 
 
Last RL, Bissinger PH, Mahoney DJ, Radwanski ER, Fink GR. 1991. Tryptophan mutants in 

ll. 3(4):345

358.
 
 
Lee
 
I, Ambaru B, Thakkar P, Marcotte EM, Rhee SY. 2010. Rational assoc
iation of genes with 
traits using a genome
-
scale gene network for Arabidopsis thaliana. Nat Biotechnol. 28(2):149

156.
 
 
Lee SS, Lee RYN, Fraser AG, Kamath RS, Ahringer
 
J, Ruvkun G. 2003. A 
systematic RNAi 
screen identifies a critical role for mitochondria 
in C. elegans longevity. Nat Genet. 33(1):40

48.
 
114
 
 
Lehti
-
Shiu MD, Shiu S
-
H. 2012. Diversity, classification and function of the plant protein kinase 
superfamily. Philos Trans R Soc Lond B Bio
l Sci. 367(1602):2619

2639.
 
 
Li C, Lin H, Chen A, Lau M, Jernstedt 
J, Dubcovsky J. 2019. Wheat VRN1, FUL2 and FUL3 
play critical and redundant roles in spikelet development and spike determinacy. Dev. 
146(14):1

11.
 
 
Li J, Yuan Z, Zhang Z. 2010. The cellula
r robustness by genetic redundancy in budding yeast. 
PLoS Genet. 6(
11).
 
 
Liu B, Kanazawa A, Matsumura H, Takahashi R, Harada K, Abe J. 2008. Genetic redundancy in 
soybean photoresponses associated with 
duplication of phytochrome A gene. Genetics. 
180(2):99
5

1007.
 
 
Lloyd J, Meinke D. 2012. A Comprehensive Dataset of Genes 
with a Loss
-
of
-
Function Mutant 
Phenotype in Arabidopsis. Plant Physiol. 158(3):1115

1129.
 
 
Lloyd JP, Seddon AE, Moghe GD, Simenc MC, Sh
iu S
-
H. 2015. Characteristics of Plant 
Essential Genes
 
Allow for within
-
 
and between
-
Species Prediction of Lethal Mutant 
Phenotypes. 
Plant Cell. 27(8):2133

2147.
 
 
Lu Y, Savage LJ, Larson MD, Wilkerson CG, Last RL. 2011. Chloroplast 2010: A database for 
lar
ge
-
scale phenotypic screening of arabidopsis mutants. 
Plant Physiol. 155(4):1589

1600.
 
 
Luévano
-
Martínez LA, Appolinario 
P, Miyamoto S, Uribe
-
Carvajal S, Kowaltowski AJ. 2013. 
Deletion of the transcriptional regulator opi1p decreases cardiolipin content an
d disrupts 
mitochondrial metabolism in Saccharomyces c
erevisiae. Fungal Genet Biol. 60:150

158.
 
 
Lynch M, Marinov GK. 201
5. The bioenergetic costs of a gene. Proc Natl Acad Sci U S A. 
112(51):15690

15695.
 
 
Ma Z, Zhu P, Shi H, Guo L, Zhang Q, Chen Y, Chen S
, Zhang Z, Peng J, Chen J. 2019. PTC
-
bearing mRNA elic
its a genetic compensation response via Upf3a and COMPASS component
s. 
Nature. 568(7751):259

263.
 
 
Maere S, De Bodt S, Raes J, Casneuf T, Van Montagu M, Kuiper M, Van de Peer Y. 2005. 
Modeling gene and g
enome duplications in eukaryotes. Proc Natl Acad Sci U
 
S A. 
102(15):5454

5459.
 
 
Matynia A, Anagnostaras SG, Wiltgen BJ, L
acuesta M, Fanselow MS, Silva AJ. 2008. A high 
through
-
put reverse genetic screen identifies two genes involved in remote memory in mic
e. 
PLoS One. 3(5).
 
 
McWilliam H, Li W, Uludag M, Squiz
zato S, Park YM, Buso N, Cowley AP, Lopez R. 2013. 
Analysis Tool We
b Services from the EMBL
-
EBI. Nucleic Acids Res. 41(Web Server 
Issue):W597

W600.
 
115
 
 
Mendonca AG, Alves RJ, Pereira
-
Leal JB. 2011. Loss of 
Genetic Redundancy in Reductive 
Genome Evolution. PLoS
 
Comput Biol. 7(2).
 
 
Mockler TC, Michael TP, Priest HD, Shen R, Sul
livan CM, Givan SA, Mcentee C, Kay SA, 
Chory J. 2007. The Diurnal Project: Diurnal and Circadian Expression Profiling, Model
-
based 
Patt
ern Matching, and Promoter Analysis. Cold Spring Harb 
Symp Quant Biol. 72:353

363.
 
 
Moore BM, Wang P, Fan P, Leong B, Sch
enck CA, Lloyd JP, Lehti
-
Shiu MD, Last RL, 
Pichersky E, Shiu S
-
H. 2019. Robust predictions of specialized metabolism genes 
through 
machine learning. Proc Natl Acad Sci U S A. 116(6):2344

23
53.
 
 
Mortimer RK. 1969. Genetic Redundancy in Yeast. Genetics. 61(1
):Supplement 329
-
334.
 
 
Mueller LA, Zhang P, Rhee SY. 2003. AraCyc: A Biochemical Pathway Database for 
Arabidopsis. Plant Ph
ysiol. 132(2):453

460.
 
 
Nowak MA, Boerlijst MC, Cooke J, Smith JM.
 
1997. Evolution of genetic redundancy. Nature. 
388(6638):167

171.
 
 
Panchy N, Lehti
-
Shiu M, Shiu S
-
H. 2016. Evolution of Gene Duplication in Plants. Plant 
Physiol. 171(4):2294

2316.
 
 
Pedrego
sa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blon
del M, Prettenhofer 
P, Weiss R, Dubourg V, et al. 2011. Scikit
-
lear
n: Machine Learning in Python. J Mach Learn 
Res. 12:2825

2830.
 
 
Peters H, Wilm B, Sakai N, Imai K, Maas R, Balling R. 1999.
 
Pax1 and Pax9 synergistically 
regulate vertebral column developme
nt. Development. 126(23):5399

5408.
 
 
Pickett FB, Meeks
-
Wagner DR. 1
995. Seeing Double: Appreciating Genetic Redundancy. Plant 
Cell. 7(9):1347

1356.
 
 
Polevoda B, Brown S, Cardillo TS, Rigby S

-
terminal 
acetyltransferases are associ
ated with ribosomes. J Cell Biochem. 103(2):492

508.
 
 
Pruitt KD, La

Arabidopsis thaliana. Plant Physiol. 102(3):10
19

1026.
 
 
Pruitt KD, Tatusova T, Maglott DR. 2007. NCBI reference 
sequences (RefSeq): a curated non
-
redundant sequence database of ge
nomes, transcripts and proteins. Nucleic Acids Res. 
35(D1):D61

D65.
 
 
Rutter MT, Wieckowski YM, Murren CJ, Strand AE. 2017. 
Fitness effects of mutation: testing 
genetic redundancy in Arabido
psis thaliana. J Evol Biol.:1

12.
 
 
116
 
 
Scannell DR, Frank AC, Conant G
C, Byrne KP, Woolfit M, Wolfe KH. 2007. Independent 
sorting
-
out of thousands of duplicated gene pairs in two yeast 
species descended from a whole
-
genome duplication. Proc Natl Acad Sci. 104
(20):1

6.
 
 
Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron
 
M, Schölkopf B, Weigel D, 
Lohmann JU. 2005. A gene expression map of Arabidopsis thaliana development. Nat 
Genet. 
37(5):501

506.
 
 
Sen A, Madhivanan K, Mukherjee D, Claudio Aguilar R. 2012.
 
The epsin protein family: 
Coordinators of endocytosis and signalin
g. Biomol Concepts. 3(2):117

126.
 
 
Seoighe C, Wolfe KH. 1999. Yeast genome evolution in the post
-
genome era.
 
Curr Opin 
Microbiol. 2(5):548

554.
 
 
Stamatakis A. 2014. RAxML version 8: A tool 
for phylogenetic analysis and post
-
analysis of 
large phylogenies. B
ioinformatics. 30(9):1312

1313.
 
 
Su S
-
H, Krysan PJ. 2016. A double
-
mutant collection targeting MAP kinase re
lated genes in 
Arabidopsis for studying genetic interactions. Plant J. 88(5):867

878.
 
 
Sullivan AM, Arsovski AA, Lempe J, Bubb KL, Weirauch MT, Sabo
 
PJ, Sandstrom R, Thurman 
RE, Neph S, Reynolds AP, et al. 2014. Mapping and Dynamics of Regulatory DNA and 
T
ranscription Factor Networks in A. thaliana. Cell Rep. 8(6):2015

2030.
 
 
Sun Q, Zy
bailov B, Majeran W, Friso G, Olinares PDB, van Wijk KJ. 2009. PPDB
, the Plant 
Proteomics Database at Cornell. Nucleic Acids Res. 37(D1):D969

D974.
 
 
Sundell D, Mannapperuma C,
 
Netotea S, Delhomme N, Lin Y
-
C, Sjödin A, Van de Peer Y, 
Jansson S, Hvidsten TR,
 
Street NR. 2015. The Plant Genome Integrative Explorer Resource: 
P
lantGenIE.org. New Phytol. 208(4):1149

1156.
 
 
Thatcher JW, Shaw JM, Dickinson WJ. 1998. Marginal fitness con
tributions of nonessential 
genes in yeast. Proc Natl Acad Sci U S A. 95(1):253

25
7.
 
 
The Gene Ontology Consortium. 2017. Expansion of the Gene Ontol
ogy knowledgebase and 
resources. Nucleic Acids Res. 45(D1):D331

D338.
 
 
The Gene Ontology Consortium, Ashburn
er M, Ball CA, Blake JA, Botstein D, Butler H, Cherry 
JM, Davis AP, Dolinski K, D
wight SS, et al. 2000. Gene Ontology: tool for the unification of 
b
iology. Nat Genet. 25(1):25

29.
 
 
Thomas JH. 1993. Thinking about genetic redundancy. Trends Genet. 9(11):395

399.
 
 
Tischler J, Lehner B, Chen N, Fraser AG. 2006. Combinatorial RNA interfere
nce in 
Caenorhabditis elegans reveals that redundancy between gene 
duplicates can be maintained for 
more than 80 million years of evolution. 7(8):1

13
.
117
 
 
Tong AHY. 2001. Systema
tic Genetic Analysis with Ordered Arrays of Yeast Deletion Mutants. 
Science. 294(
5550):2364

2368.
 
 
Vavouri T, Semple JI, Lehner B. 2008. Widespread 
conservation of genetic redundancy during a 
billion years of eukaryotic evolution. Trends Genet. 24(10):485

488.
 
 
Venters BJ, Wachi S, Mavrich TN, Andersen BE, Jena P, Sinnamon AJ, Jain P, 
Rolleri NS, Jiang 
C, Hemeryck
-
Walsh C, et al. 2011. A Comprehensive
 
Genomic Binding Map of Gene and 
Chromatin Regulatory Proteins in Saccharomyces. Mol Cell. 
41(4):480

492.
 
 
Waern K, Snyder M. 2013. Extensive Transcript Diversity and Novel Upstream Open Re
ading 
Frame Regulation in Yeast. G3; Genes|Genomes|Genetics. 3(2):3
43

352.
 
 
Wang H, Ngwenyama N, Liu Y, Walker JC, Zhang S. 2007. Stomatal Development and 
Pat
terning are Regulated by Environmentally Responsive Mitogen
-
Activated Protein Kinases in 
Arabidops
is. Plant Cell. 19(1):63

73.
 
 
Wang Y, Li J, Paterson AH. 2013. MCSc
anX
-
transposed: detecting transposed gene duplications 
based on multiple colinearity scans.
 
Bioinformatics. 29(11):1458

1460.
 
 
Weintraub H. 1993. The MyoD Family and Myogenesis: Redundancy,
 
Networks, and 
Thresholds. Cell. 75:1241

1244.
 
 
Wilcoxon F. 1945. I
ndividual Comparisons by Ranking Methods. Biometrics Bull. 1(6):80

83.
 
 
Wilson TJ, Lai L, B
an Y, Ge SX. 2012. Identification of metagenes and their Interactions 
through Large
-
scale Analysis of Arabidopsis Gene Expression Data. BMC Genomics. 13(237).
 
 
Yang Z. 2007. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Mol Biol Evol. 
24(8):1586

159
1.
 
 
Yao SG, Ohmori S, Kimizu M, Yoshida H. 2008. Unequal genetic redundancy of rice 
PISTILLATA orthologs, OsMADS2 and OsMADS4, in lodicule and stamen developmen
t. Plant 
Cell Physiol. 49(5):853

857.
 
 
Zhang J. 2012. Genetic Redundancies and Their 
Evolutionar
y Maintenance. In: Advances in 
Experimental Medicine and Biology 751. p. 279

300.