MACHINE LEARNING FOR THE STUDY OF GENE REGULATION AND COMPLEX
TRAITS
By
Anne Sonnenschein

A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Genetics - Doctor of Philosophy
2017

ABSTRACT
MACHINE LEARNING FOR THE STUDY OF GENE REGULATION AND COMPLEX
TRAITS
By
Anne Sonnenschein
Functional elements are found in DNA outside of protein coding regions; an important class of
these elements are 'enhancers', which govern when and where transcription occurs. Predicting the
identity and function of potential enhancers based on DNA sequence remains a major goal of
genomics. A number of features are associated with the enhancer state, but even combinations of
these features in well-studied systems such as Drosophila have limited predictive accuracy. I
have examined the current limits of computational enhancer prediction, and analyzed which
features are most useful for this task, by applying machine-learning methods to an extensive set
of genomic features.
Inferring the genetic underpinning of even well-characterized phenotypes is equally challenging,
although similar analytical methods can be applied. Phenotypes are frequently defined based on a
set of characteristic features; when images are used as specimens, these features are frequently
based on morphometric landmarks, although computational pattern-recognition has been used as
an alternative. I use Drosophila wing shape as a model for a complex phenotype, and use
machine learning to predict underlying genotype using both traditional landmarks and features
extracted using 'computer vision'.

Copyright by
ANNE SONNENSCHEIN
2017

This thesis is dedicated to my parents, Mark Sonnenschein and Gerry Franklin Sonnenschein,
who made it possible through years of love, support, and many cups of coffee. It is also
dedicated to Malvika, who didn't want me to do a PhD, but would be glad to know I'd finished it
(you are missed).

iv

ACKNOWLEDGEMENTS

This dissertation was made possible by the contributions, help and guidance of a number
of mentors, friends, and committee members. First and foremost, my co-advisers Dr. David N.
Arnosti and Dr. Ian Dworkin have been incredible, both as mentors and by acting as examples of
exemplary scientists, who love what they do and share that passion for science with others. I
have been lucky to learn from them. In the first few years of my degree Ian introduced me to
programming, experimental design, and the fundamentals of genetics that I've since gone on to
teach. In the last few years of my degree, David took over most of the tasks of day-to-day
mentorship, and was an endless source of organization, patience career advice, and entertaining
trivia. My committee members, including Dr. Barry Williams, Dr. Patricia Wittkopp, Dr. Cathy
Ernst and Dr. Shin-Han Shiu have all provided a great deal of valuable guidance and
encouragement both in and out of committee meetings. Dr. Terri McElhinny, although not on my
committee, has been a mentor, motivator and friend in the realm of teaching focused careers.
In Lansing I have been lucky to be surrounded by co-workers and classmates who were
also friends. Both the Dworkin lab and Arnosti labs (past and present) have been amazing
communities in which to learn and do science. Over the last few years, Drs. Rewatee Gokhale,
Sandhya Payankaulam and Yiliang Wei, as well Rima Mouawad have provided a great deal of
feedback, advice and short-notice pet-sitting. I especially owe thanks to the last of the 'MichiganDworkinites', Dr. Will Pitchers and Amanda Charbonneau, for their friendship, support, and help
with data analysis. Outside of the lab, Roshan Angoshtari, David Tack, Ann Ripberger, Laura
Hurtig, Michael DeNieu and Sarah Marzac all spent a lot of after-hours time with me in coffee
shops and all-night diners on weekends and evenings, with the sometimes successful goal of

v

getting work done. Acknowledgments are also due to my long-distance friend Dr. Conor D. Cox,
for seven years of encouragement, conversation, and frequent emergency proof-reading.
Lastly, I owe thanks to my family-- my brother Matthew for his sympathy, conversation
and links to funny comics, my sister Susie for putting up with my nonsense during my
comprehensive exams, and my parents, Mark and Gerry Sonnenschein, for everything.

vi

TABLE OF CONTENTS

LIST OF TABLES ................................................................................................................ ix
LIST OF FIGURES ............................................................................................................... x
CHAPTER I: INTRODUCTION ......................................................................................... 1
Introduction ..................................................................................................................................... 2
Identification and interpretation of cis-regulatory elements ........................................................... 3
Predicting the output of enhancers .......................................................................................... 10
Experimental identification of enhancers ................................................................................ 12
Machine learning and enhancer identification ....................................................................... 15
A comparison of morphometrics and computer vision for interpreting phenotype from images . 16
Wing shape as a model system for morphometrics and computer vision................................ 17
A wing shape database .............................................................................................................. 18
REFERENCES .............................................................................................................................. 19

CHAPTER II: THE SHEEP AND THE GOATS: DISTINGUISHING
TRANSCRIPTIONAL ENHANCERS IN A COMPLEX CHROMATIN
LANDSCAPE ....................................................................................................................... 29
Abstract ......................................................................................................................................... 30
Introduction ................................................................................................................................... 31
Materials and Methods .................................................................................................................. 34
Datasets used as predictive features ......................................................................................... 34
Organization and curation of data from Fly Enhancer Resource .......................................... 37
Intersection of features with Fly Enhancer Resource ............................................................. 38
Discrimination between classes of enhancers with Random Forest ....................................... 39
Feature importance analysis..................................................................................................... 40
Identifying potential enhancers ................................................................................................ 40
Validation of predictions ........................................................................................................... 41
Data Availability ........................................................................................................................ 41
Results ........................................................................................................................................... 41
Enhancers are not highly conserved ........................................................................................ 41
Correlations of distinct chromatin immunoprecipitation data ................................................ 42
Transcriptional activity and stage specificity can be classified based on genomic features .. 54
Impact of number of features on predictive accuracy ............................................................. 59
Classifications by stage specificity and expression pattern ..................................................... 62
Testing against validated enhancers ......................................................................................... 68
Discussion ..................................................................................................................................... 69
Features correlating with enhancer activity are not guarantees of enhancer function ......... 70
Zelda a strong but not definitive indicator of early enhancer activity .................................... 71
Potential limitations and stage specificity ................................................................................ 72
Characteristics and Classification of DNA elements with Varying Expression Patterns ....... 73
Identification of enhancers based on genomic features .......................................................... 74
vii

Conclusions ................................................................................................................................... 76
REFERENCES .............................................................................................................................. 79

CHAPTER III: AN IMAGE DATABASE OF DROSOPHILA MELANOGASTER
WINGS FOR PHENOMIC AND BIOMETRIC ANALYSIS ......................................... 86
Abstract ......................................................................................................................................... 87
Introduction ................................................................................................................................... 87
Methods ......................................................................................................................................... 92
Fly genetics and sample preparation........................................................................................ 92
Imaging...................................................................................................................................... 93
Geometric morphometric data acquisition and preparation ................................................... 93
Morphometric analysis ............................................................................................................. 94
BioCAT analysis ........................................................................................................................ 95
Results ........................................................................................................................................... 96
Classification based on geometric morphometric data shows a high degree of accuracy ... 102
Computational feature detection and sub-setting for classification using BioCAT ............. 105
Comparisons between BioCAT and geometric morphometric descriptors for classification105
Comparisons between BioCAT and geometric morphometric descriptors for classification108
Discussion ................................................................................................................................... 109
APPENDIX ............................................................................................................................. 112
IMAGE PROCESSING AND CLASSIFICATION USING BIOCAT .............................. 113
REFERENCES ............................................................................................................................ 127

CHAPTER IV: CONCLUSIONS AND FUTURE PERSPECTIVE ............................. 132
Introduction ................................................................................................................................. 133
From genotype to gene expression .............................................................................................. 134
From wing shape to genotype ..................................................................................................... 135
Conclusions ................................................................................................................................. 136
REFERENCES ............................................................................................................................ 138

viii

LIST OF TABLES

Table 2-1 Features for enhancer classification...............................................................................36
Table 2-2 Overlap of active and inactive reporters with features ..................................................52
Table 3-1 Drosophila allele information ........................................................................................97
Table 3-2 Drosophila wings dissected by sex and genotype .........................................................97
Table 3-3 Classification accuracy of machine learning algorithms using landmarks ..................103
Table 3-4 Classification accuracy of machine learning algorithms compared with BioCAT ......106
Table 3-5 Classification accuracy of machine learning algorithms compared with BioCAT ......109
Table 3-6 Results with BioCAT for sex ....................................................................................... 115
Table 3-7 Results with BioCAT for genotype .............................................................................. 117
Table 3-8 Results with BioCAT across genotype......................................................................... 119
Table 3-9 Results with BioCAT for sex across technical conditions ...........................................124
Table 3-10 Results with BioCAT for genotype across technical conditions ................................126

ix

LIST OF FIGURES

Figure 2-1 Sequence conservation in active and inactive elements ...............................................43
Figure 2-2 Correlations between genomic regulatory features ......................................................44
Figure 2-3 Overlap of features between different datasets ............................................................45
Figure 2-4 ChIP peaks correlate with active elements ...................................................................47
Figure 2-5 Correlation between features for active and inactive elements ....................................49
Figure 2-6 Correlations of feature scores for overlapping or replicate features ............................50
Figure 2-7 Differential ChIP peak or binding motif enrichment values ........................................51
Figure 2-8 Similarity of distribution of feature scores for active vs. inactive elements ................53
Figure 2-9 Active DNA elements can be separated from inactive regions ....................................56
Figure 2-10 LDA Cross-validation ................................................................................................58
Figure 2-11 LDA Exclusion of unbound genomic regions and use of a balanced training set ......60
Figure 2-12 Random forest performance with least and most informative features......................61
Figure 2-13 Distinct classes of elements tested for function in embryos ......................................63
Figure 2-14 Principal component analysis, Random Forest classification, of subsets ..................64
Figure 2-15 Features correlate with tissue specific expression patterns ........................................66
Figure 2-16 Recall relative to false discovery rate when identifying enhancers. ..........................67
Figure 3-1 Wing landmarks and semi-landmarks ..........................................................................89
Figure 3-2 Mutation effects ...........................................................................................................98
Figure 3-3 Heterozygote mutant wings..........................................................................................99
Figure 3-4 Technical replicates conditions ..................................................................................101
Figure 3-5 Separation of specimens using landmark data using LDA ........................................104

x

Figure 3-6 Confusion matrices.....................................................................................................107

xi

CHAPTER I: INTRODUCTION

1

Introduction
Advances in biotechnology and computing have led to an enormous expansion of available data
in the life sciences1,2. These developments have contributed to a movement toward accessibility
of results, public data sharing, and the development of new analytical tools for combining
heterogeneous datasets3,4. Interpreting the available data has become an overarching challenge
for modern biology5. However, a bottleneck exists at making biological inferences across
disciplines; translating findings from molecular biology through to physiology and behavior in
evolving populations, and vice versa. Drawing conclusions about phenotype from genotype (e.g.
determining if a pathogen is resistant to a treatment based on its DNA sequence) and genotype
from phenotype (e.g. which genetic variants are causing a complex disease in a patient) are
equally challenging problems6,7. Machine learning is a common approach to addressing these
questions.
Machine learning algorithms identify the features associated with a concept, and optimize or
'learn' with the addition of new data. They allow incorporation of very heterogeneous data into
models, a feature that is well suited to addressing biological questions. Current tasks that profit
from such an approach include combining DNA sequence and protein binding information to
predict genomic features, or using different metabolic and environmental factors to identify a
disease state. In the life sciences, machine learning approaches have been effectively applied
across a wide range of disciplines and questions8–10. Machine learning can approach two related
goals: identifying potentially informative trends from data by looking at the distributions of
different states (sometimes referred to as generative models), or making classifications or
predictions by finding the most effective way to separate subgroups within data (discriminative
models)9. Depending on the goals of the experiment and the availability of pre-existing data,

2

machine learning algorithms may be 'supervised', that is, optimized on a training set, or 'unsupervised', in which they learn from the patterns that exist within the data9,10. Within these
categories there are a variety of frequently used algorithms, best suited to different kinds of data
and questions8,11.
Since the 1980s, these methods have been extensively applied in the life sciences, in fields
ranging from determining the conditions best suited for plants12 diagnosing and predicting the
development of cancer13,14 and identifying the conditions that might be contributing to
population changes in endangered species15. In molecular biology machine learning has been
used for interpreting data from next generation sequencing16 and extensively applied to the study
of gene regulation. The development of large publicly available datasets focused on features
associated with gene regulation, such as ENCODE and modENCODE, have stimulated studies
that make inferences about genes that are co-regulated in gene regulatory networks17–19. These
genome-wide data sets have also been used directly for genome annotation, identifying the
location of open reading frames, splice sites, and regulatory regions such as enhancers10. The
relatively constrained features of transcription units and exons facilitate their genomic
annotation, but the more flexible nature of enhancers has made their characterization a
challenging problem.
Identification and interpretation of cis-regulatory elements
In 1933, while discussing the transfer of biological information between generations, Thomas
Hunt Morgan observed that it was irrelevant if genes were real or a 'purely
fictitious...hypothetical unit'20. Either way the information was clearly there, localized in
chromosomes, and passed from parent to offspring. Within thirty years, genes had been
established as definitively real, embodied in nucleic acids, expressed through transcription into

3

messenger RNA and translation into protein. However, the sequencing of the human genome
revealed that only ~2% of the information is protein coding21. Regulatory information,
determining when, where and to what extent these protein-coding genes are expressed, is
believed to occupy a much greater percentage of the genome (estimates vary from ~8% to
~80%)22. Changes in genomic regulatory regions are believed to be critically important to the
differences between species, and evolutionary change over time. Countless studies have linked
mutations in regulatory DNA sequences to disease, as well as to both small differences between
species and major reorganizations in body-plan patterning between phylogenetic groups23,24.
Many elements of gene regulation are common to all eukaryotes, and predate the existence of a
common ancestor of plants and animals over a billion years ago. A number of gene families
encoding transcription factors important to cell cycle regulation and multi-cellular bodyplanning, such as E2F and MADs-box genes, originated before this split, though they are
deployed in very different ways across different phyla25,26. All eukaryotic genes are regulated by
a core promoter coincident with the transcriptional initiation site, and generally require distallyacting enhancer elements for more than basal transcription. These enhancers are characterized by
open chromatin, and physically interact with the promoter via the co-activator Mediator protein.
Unlike the simpler regulatory landscape found in yeast, control of animal genes can feature a
complexity of enhancer architecture, with multiple enhancers sometimes acting at great genomic
distances to regulate spatial and temporal expression (reviewed in Shlyueva et al. 2014)27.
Core promoters are comparatively well understood, representing compact regions of DNA
flanking the transcription start site. Sequence elements in the core promoter permit interactions
with the highly conserved general transcription factors necessary for basal transcription in
eukaryotes. There are several common motifs that frequently occur at the core promoter (e.g.

4

TATA, Inr), though the composition and identities of these are variable between genes, and
between species28,29. Promoters have been categorized based on a number of criteria, including
the distribution of dinucleotides, CAGE peaks, histone tail modifications, general transcription
factors, and the presence or absence of bi-directional transcription29,30. A common organization is
differentiating 'dispersed' promoters (weaker transcription, multiple transcription start sites), and
'focused' promoters (stronger transcription, one or few transcription start sites). The former is
functionally associated with housekeeping genes, and is more common in metazoans than in
single-celled eukaryotes like yeast30,31.
By comparison, enhancers are far more variable between and within species, and less clearly
defined. Enhancers typically are stretches of DNA bound by transcription factors, and were
originally differentiated from proximal-acting regulatory sequences by their ability to activate
gene expression from distal locations, in either orientation. Virtually all genes from multi-cellular
eukaryotes require enhancers for activation. They are incredibly diverse in terms of size32,
organization33 and function27,34. Much like Morgan's 'hypothetical' genes, it is not clear what
characteristics are distinct to enhancers, how their unique qualities contribute to their biological
function, and what distinguishes them from background DNA.
In metazoans, enhancers are typically characterized by clusters of sequence motifs associated
with the binding of specific transcription factors21. In the course of development, enhancers can
be initially first bound by pioneer transcription factors, which recruit proteins to promote
chromatin-remodeling, before recruitment of additional transcription factors and co-activators
(Schulz 2015). The degree to which different transcription factors bound to the same enhancer
influence each other and cooperate to regulate gene expression is highly variable. Although a
common repertoire of regulatory mechanisms can be found across many pathways35, how they

5

are specifically deployed within enhancers can be associated with specific regulatory pathways
and functions36.
Several models have been proposed as broad categories for enhancers based on the interactions
of constituent transcription factors. The most highly structured enhancers are classified as
'enhanceosomes', with proteins that exhibit highly interdependent binding, in which single-base
pair changes in the sequence lead to total loss of function. In contrast, some enhancers are more
analogous to 'billboards', wherein transcription factors are more loosely organized, and largely
direct gene expression independently37. Loss of a single transcription factor binding site might
only partially alter the enhancer's function, or have no impact at all due to built-in redundancy.
Such enhancers are not entirely unconstrained in their organization; they can still demonstrate
cis-regulatory grammar, that is the constraints on regulatory DNA sequences as a consequence of
the way that arrangements and affinities of transcription factor binding motifs generate
differential outputs37,38. Interestingly, simulations have suggested that in some cases, such
apparent constraints may be entirely an artifact of neutral sequence evolution leading to
clustering of independent regulatory regions, that are adjacent but otherwise unrelated39. A third
model proposed to describe a category of enhancer organization is the “transcription factor
collective”, in which the binding of transcription factors and co-activators has a strong impact on
the binding of additional proteins, which themselves may not rely on sequence motifs40. There
are numerous biological examples that fit each of these models, but without extensive
experimental manipulation it is difficult to categorize newly discovered enhancers, or make
general statements about the relative frequency of enhanceosomes, billboards and collectives.
In-depth analysis of specific enhancers has also highlighted examples that do not clearly fit
within previously defined models in which enhancers act as small independent modules. In

6

Drosophila melanogaster, a single wing gene was found to have clusters of transcription factor
binding dispersed over a region spanning about 10kb; these clusters direct a single gene in
unique but highly overlapping expression patterns41. Several of these clusters could be removed
without causing any visible phenotypic changes42. With enhancer information distributed semiredundantly over a large area, it is not clear whether each cluster should be viewed as a discrete
enhancer, or as part of a larger regulatory 'billboard'. Similarly, the insulin receptor gene in
Drosophila has complex and partially redundant regulatory information for various
developmental stages extending across over forty kilobases that may represent cis-regulatory
elements that necessarily act in concert rather than independently43. In mammalian genomes,
large regulatory regions with very high levels of transcriptional output have been referred to as
'super-enhancers', and hypothesized to have functionality that is distinct from smaller regulatory
elements, and a role in defining cell-type specificity. In some cases, these regions have been
shown to act as a cooperative unit, in which subsections do not function independently44. Even
for smaller enhancers, the exact borders of the functional region are not readily defined;
sequences with no known protein binding have been shown to influence or modify enhancer
function45. In some cases, large regions of AT-rich DNA may modify chromatin structure to act
as a booster sequence for a single enhancer, without containing any transcription factor binding
motifs46.
Evolutionary conservation is of limited use for predicting the functional boundaries of enhancers,
or categorizing different classes of enhancer types. There are a number of well-studied enhancers
that have substantially diverged in sequence over evolutionary time, while the timing and
location of transcriptional output are maintained47,48. However, there are also some cases where
regulatory elements are conserved across very long phylogenetic distances49,50; it has been

7

speculated that these may represent enhancers that conform to 'enhanceosome' type architecture,
where small changes in sequence are highly detrimental to function51. However, this type of
distinction is difficult to make in Drosophila, which possesses a genome in which both intergenic
and intronic regions exhibit a high degree of conservation52. Conservation of regulatory elements
tends to be more common in organisms with greater biological complexity or larger genomes
such as mammals, but even in humans the percentage of the genome that is predicted to have
regulatory importance is much greater than the portion that is highly conserved53. Changes in
sequence also do not consistently correspond to proportional differences in expression54–56. In
some cases, transcription-factor binding sites for a single protein have been found to be under
different levels of purifying selection throughout the genome; some sites are unchanged over
long periods of time while others are subject to rapid turnover. It has been speculated that this is
due to these transcription factors having different functions (with correspondingly different
specificity requirements) for these sites51.
Within populations, variation in regulatory elements is common in the form of single nucleotide
polymorphisms, insertions and deletions57. Entire regulatory elements are sometimes present in
subsets of a population58. Some of this variation is likely due to genetic drift; in humans, this
variation likely plays a role in the inheritance of complex diseases57. In other cases, it may
contribute to standing cryptic genetic variation, and only influence phenotype under certain
conditions59. Many regulatory elements have a certain amount of internal redundancy, in the
form of extra potential binding sites60,61, or backup 'shadow' enhancers, which can contribute to
buffering of transcriptional precision62,63. Even highly conserved regulatory elements that are
clearly under strong purifying selection appear to have some level of redundancy; removal of
classical 'enhanceosomes' does not always show a visible phenotypic effect64.

8

Enhancer variation can also come from directional selection. Gene regulatory elements are a
major source of adaptation, ranging from subtle shifts in phenotype65 to rewiring of entire gene
networks66. It is frequently suggested that changes in cis-regulatory elements are in fact the most
important drivers of evolutionary change23. In Drosophila, the changes necessary to lead to
pigmentation in the adult wing (which has a marked effect on Drosophila behavior and
courtship) can be reached with modifications of a few nucleotides24. At the other end of the
spectrum, extensive changes in regulatory interactions may be consistent with conserved
function of genetic interactions. For instance, the HOX genes that pattern appendages include
genes that have largely conserved function across all metazoans23. The human eye development
gene Pax6 can be substituted for its fruit fly ortholog, eyeless, and drive eye formation in
Drosophila, despite the profound differences in these organs across this evolutionary distance67–
69

. While primary function is conserved, there have clearly been substantial changes in how these

genes are deployed in their respective networks68.
There are many models of regulatory evolution70, and the relationships between the structure,
function, and evolution of regulatory elements are not fully understood. These complexities
make it difficult to accurately predict how changes in regulatory DNA leads to changes in
function, or how to measure different types of selective pressure71. In several studies, the precise
changes that led to phenotypic changes have been mapped out. In the case of the Drosophila
enhancer sparkling, rapid sequence change over time was directly connected to cis-regulatory
architecture. The enhancer produced the correct level of expression when it included several
weak binding sites, as opposed to fewer strong binding sites. Weak binding sites require a less
specific consensus sequence72. Also in Drosophila, an enhancer that drives expression of a gene
in the optic lobe was compared between two very closely related species. Although there had

9

been a number of changes, the function was conserved; however, not all paths to sequence
divergence were equally neutral. Many mutations in this regulatory element would have led to
ectopic expression55. Presumably, purifying selection was playing a role despite the overall high
levels of sequence turnover. It is plausible that similar mechanisms have been at work in the
enhancer for the even-skipped gene in Drosophila. The enhancer that drives the 'stripe two'
expression pattern has conserved function between different species of Drosophila, despite
substantial sequence divergence. However, chimeric fusions of the two enhancers do not produce
a normal expression pattern, suggesting that compensatory mutations have accompanied the
turnover of binding sites73,74.
Predicting the output of enhancers
Experimental results like those from the even-skipped enhancer have led to another approach for
studying the rules governing enhancer function, that is, building it from first principles, either
using quantitative models or synthetic biology approaches (reviewed in Ay and Arnosti 201175).
Modeling approaches can generally be grouped into several categories. These include boolean
approaches, simple decision trees with binary outcomes, very analogous to the 'genetic switch'
terminology, thermodynamic or fractional occupancy models that model that treat gene activity
as a statistical scenario based on the probability of individual transcription factors associating
with the available binding sites, and interacting with each other and key co-activators, and
differential equations models that predict output of a gene as a function. Dynamic parameters
representing time and transcription factor availability are estimated75. A limitation to these
approaches is that we don't know all the essential inputs even for well-studied of regulatory
regions76. Furthermore, the more biologically accurate methods are extremely computationally
intensive, and require a very solid biological understanding of what proteins are involved in a

10

given regulon (a question which is complicated by extensive cross-binding between different
regulatory networks, and the key role of co-activators that do not themselves bind to the DNA as
illustrated by the 'transcription factor collective' described in Spitz and Furlong 201240). It is also
challenging to apply the rules governing expression for one gene to another gene or even another
regulatory element, as the degree to which transcription factors interact with each other in
different enhancers is highly variable77,78. It has been suggested that enhancers can be roughly
grouped into three categories based on the biological function of the genes they regulate-- simple
binary switches for housekeeping genes, multiple homotypic binding sites for genes regulated by
a gradient, and complex heterotypic regulatory architecture for genes associated with cell
differentiation, that are highly tissue or time specific79. However, experimental evidence has
suggested that even ostensible housekeeping genes like that encoding the Insulin-like receptor
protein have regulatory architecture that is highly complex, and not easy to categorize43.
It would be easier to make inferences about which changes in DNA sequence will lead to
changes in enhancer function if it were better understood how common different types of
enhancers are throughout the genome. Are the majority of enhancers like sparkling, with a
number of weak binding sites72? Are enhancers of a certain size more likely to be highly
interdependent 'super-enhancers'32 or 'enhanceosomes'37? Does cooperativity between different
transcription factors within binding sites influence evolutionary rates within binding sites80? Can
enhancers be described as a single group with some traits in common, or even broken down into
sub-categories, or are enhancers highly individual? Regulatory elements identified through the
same methods will share some features that may not be universal; some methods for
experimentally identifying enhancers can lead to a false appearance of uniformity with respect to
size, conservation, and chromatin state33. The availability of new molecular biology and

11

genomics tools has contributed to the protean understanding of what constitutes an enhancer79.
Experimental identification of enhancers
Gene regulation was first worked out in bacterial and phage systems, with the lytic 'genetic
switch' in phage lambda and the operons defined by Jacob and Monod in bacterial systems.
Mutations in regulatory DNA produced characteristic phenotypes, such as constitutive
expression of LacZ. There, the concept of regulatory effects operating in cis and trans were first
worked out21. Regulatory mutations have also been the basis of classic genetic research in
eukaryotes as well, including studies of homeotic transformations in Drosophila and maize
kernel coloration81. Modern genome-wide approaches to identify regulatory alleles that impact
expression in cis or trans have been widely deployed in model systems such as yeast and
Drosophila, allowing estimation of the frequency of such regulatory alleles. At a finer scale,
unique complementation has been taken advantage of in experiments designed to find the impact
of cis-regulatory change on species divergence82.
Molecular biological approaches were the first to identify enhancers as such. Reporter assays are
generally considered the 'gold standard' of enhancer identification. In these, coding sequence for
an easily identifiable transgenic protein (lacZ, firefly luciferase, or green fluorescent protein)
along with a basal promoter is fused to a putative enhancer sequence. These are most commonly
incorporated into plasmids for transfection into cultured cells or allowed to integrate into the host
genome to generate transgenic organisms83. Sequences that enable the production of the
transgenic gene or protein are likely enhancers, and the quantity or conditions under which they
direct expression can provide insight into their in-vivo function. A disadvantage of this method is
that it requires prior knowledge about where regulatory sequences are located, although recently
high throughput methods have been employed to randomly test large fractions of the genome for

12

enhancer activity84,85. Additionally, many transgenic approaches involve the gene of interest
being randomly integrated into the genome of the organism, which can produce variable
outcomes depending on where it lands. The use of site-specific integrases to target specific
docking sites in the target organism can avoid these position effects86. Another limitation is that
molecular cloning is best suited to sequences under a specific size; it has been proposed that the
general perception of enhancers as being between 500 and 2000 base pairs is an artifact of what
sequences sizes can be amplified most efficiently using molecular cloning33.
Genetics defines elements based on what is necessary and sufficient-- with reporters, this
frequently focuses on 'sufficient'-- what minimal element can reconstitute an entire expression
pattern for a gene. 'Necessary' can be tested with rescue experiments, where an organism that is
null for an enhancer or for the gene itself has a copy of the enhancer or enhancer and gene reintroduced, either through standard molecular cloning or BAC recombineering87. A rescue
construct that is both necessary and sufficient can be further dissected to see how individual
binding sites or the order of binding sites contribute to the overall function, and how changes in
sequence are associated with changes in phenotype88. More recently, in-vivo gene editing
techniques like CRISPR-Cas9 have been used to alter sequences without the use of transgenics89.
As genomic engineering becomes more efficient, it may provide a great deal of information on
how very high resolution changes to regulatory elements can influence gene function.
While genetics methods target the regulatory elements for individual genes, much of enhancer
identification has moved on to genomics methods, where regulatory elements are identified
without knowledge of the genes they control, based on features associated with enhancer activity
throughout the genome17. DNase hypersensitivity identifies regions of open chromatin, which is
a prerequisite for active enhancers90. Chromatin-immunopreciptation has also been extensively

13

used for identifying regulatory elements, both by locating genomic regions bound by
transcription factors or combinations of transcription factors91, and by identifying histone
modifications that are associated with open chromatin that has either enhancer and/or promoter
activity90. Programs like ChromHMM92 can synthesize data from multiple chromatin signatures
to identify the genomic regions that are most likely to exhibit enhancer activity. These methods
have been used by community efforts like the ENCODE and MODENCODE projects to broadly
map all regulatory regions throughout the genome18,19,93.
Although many features correlate with enhancer activity, none of them are universal or
exclusive94. There are numerous active enhancers that are not characterized by the typical histone
modifications, and chromatin immunoprecipitation shows that transcription factors bind widely
throughout the genome, including in many regions that are not active enhancers95. Even among
the 'true-positives' that can be identified through these methods, they neither provide certain
insight into what their specific function may be (although the identities of bound activators and
repressors can provide clues about gene regulatory networks, as in Zeitlinger et al. 200791), nor
indicate which specific genes the enhancers are activating96. A common approach is to assign
putative enhancers to the nearest transcription start site, or if RNA-sequencing data is available,
to the nearest active transcription start site96. There is some biological basis for this; active
enhancers do tend to influence the activity of the nearest gene85, although there are numerous
cases where they have been found to regulate genes that are tens or hundreds of kilobases and
many genes away97,98. Chromosome conformation capture methods (4C, 5C, HiC, Chia-Pet and
others) can provide physical proof of interaction between a regulatory element and a specific
promoter99,100, although obtaining high-resolution data is not easily or economically obtained
from genome-wide applications of this method.

14

Exhaustive testing of all genomic sequences in mass reporter assays, such as that used in
STARR-seq, provides a more direct indication of enhancer locations, by using enhancers to drive
their own expression. This approach has the advantage of being very direct, because the readout
can provide both location and strength of the enhancer. It has the disadvantage of only working if
the enhancer is modular, distance and orientation independent, and is tied to the specific core
promoter used in the cloning vector. In addition, readout is limited to function in specific
cultured cells, rather than the whole organism84.
Enhancers are also predicted from DNA sequence alone. Deeply conserved DNA can indicate
enhancer activity. Very rapid change can indicate directional selection, and enhancers that are
associated with a phenotypic shift. A broadly applicable approach to identification of regulatory
regions is the combination of phenotypic information and DNA sequence information101,102. In
eQTL, genomic regions associated with changes in levels of gene expression can be used to
pinpoint enhancers. In the related approach of GWAS, comparisons of populations can be used to
identify variable regions that may be associated with variable gene expression, and may be in
linkage disequilibrium with relevant enhancers (reviewed in Albert and Kruglyak 2015103).
Machine learning and enhancer identification
Machine learning is frequently used as a supplement to experimental methods for predicting
enhancers. In addition to being a tool for predicting enhancers, it can potentially provide insight
into what features best define enhancers, and potential limiting factors. This has been
accomplished using both supervised methods104, and unsupervised models (e.g. ChromHMM92).
They have based these predictions on features ranging from motif presence to transcription factor
occupancy and chromatin modifications, although approaches that incorporate multiple diverse
datasets are generally the most successful105. In the model system Drosophila melanogaster,

15

computational enhancer identification has been studied extensively by the lab of Dr. Eileen
Furlong, who have used both combinations of transcription factors, and chromatin marks as
predictors for enhancers in embryonic mesoderm development91,106. Experimental validation of
predicted enhancers demonstrated that predictions of enhancer activity were highly accurate,
although determining the spatial expression of that activity was more challenging. From these
methods they were able to make genome-wide predictions about the frequency and distribution
of mesoderm enhancers, and the levels of enhancer redundancy107.
In Kvon et al. 2014 the lab of Dr. Alexander Stark published a database of in-vivo reporters
covering 15% of the Drosophila genome, and the individual spatial and temporal expression of
those that showed enhancer activity. Since these were identified as reporters, with relatively
consistent sizes, instead of based on histone marks or other features associated with enhancers,
there are ad-hoc fewer assumptions about what defines an enhancer85. This data set is thus
uniquely suited to serving as a training set for determining which characteristics are associated
with enhancer activity. Moreover, since approximately half of the reporters they tested exhibited
no expression, this study provides an unprecedented resource of 'negative results'. In Chapter 2,
we use this database and combined it with a wide range of features associated with enhancer
activity, to analyze how well different combinations of features could be used to train effective
machine learning classifications.
A comparison of morphometrics and computer vision for interpreting phenotype from
images
Enhancer prediction is arguably an effort to go from genotype to phenotype. Going from
phenotype to genotype is an equally difficult problem. This is sometimes referred to as
'phenomics', as a complement to 'genomics': the study how different genotypes interact with the

16

environment to produce a distribution of phenotype7. Phenotype can be described as anything
from how genes are translated into proteins, to how genes are translated into behavior. However,
images are frequently used for recording phenotype across the entire spectrum. These images can
then be quantitatively analyzed, and interpreted. For complex traits, the interpretation can rely on
machine learning. This has been applied to a variety of situations, ranging estimating the
prognosis of cancer based on images of the tumor, to automatically identifying a plant species
based photographs of its leaves. Some success has even been achieved using 2D and 3D cranialfacial phenotypes to infer the genetic cause of disease108–110, and to track the progress of
disease111.
Wing shape as a model system for morphometrics and computer vision
As with using machine learning for genome annotations, the features chosen for training a
classifier have a substantial impact on the performance. Analysis of images is usually done using
morphometrics methods, which identify corresponding landmarks across images, and use them
for quantitative comparison. These landmarks, or analysis based on them, can then be used for
classifications. Wing shape in Drosophila is a major model system for morophometric methods.
It is typically measured using the intersections of veins as landmarks; more recent techniques
generate splines between landmarks to automate quantification of both size and shape. These
methods can extract information from wings to provide extremely in-depth insight into the
underlying genotype and characteristics of the fly from which it came112.
However, morphometric methods do have some inherent limitations. Even very sophisticated
morphometrics reduce the phenotype to essentially one dimension (in wings) or two dimensions
(if images include three dimensional shape information, as with skull shape). Moreover,
morphometric landmarks are sometimes selected primarily because they are easily measured

17

rather than biologically relevant, especially in cases where the basis of a phenotype is not fully
understood. It is very possible that there are other aspects of image data that are not visually
obvious, but do provide relevant biological information. Computer vision, in which software
extracts information about images without prior information about which features are relevant,
has been proposed as an alternative or complement to morphometrics methods. It essentially
takes feature selection from images out of the hands of the biologist. Features generated from
computer vision technology have been extensively applied in facial recognition. Some
preliminary studies have shown that it can also be used to classify Drosophila wing shape113,114,
but it has not been a subject of extensive research.

A wing shape database
Facial recognition technology has greatly improved over the last decade115. This is partly due to
tremendous interest in the subject due to its use in technology developed for social media and
national security. However, it has also been encouraged by the existence of databases providing
training data, allowing researchers to determine how best to identify individuals from images
across a range of conditions116–118. In Chapter 3, we discuss the development of a database of
Drosophila wing images, with variation incorporated for sex and genotype. We also analyze
some initial comparisons of how well existing morphometric features compare with features
extracted through alternative methods can be used to identify sex and genotype.

18

REFERENCES

19

REFERENCES

1.

Marx V. Biology: The big challenges of big data. Nature. 2013;498(7453):255-260.

2.

Stephens ZD, Lee SY, Faghri F, et al. Big data: Astronomical or genomical? PLoS Biol.
2015;13(7):7.

3.

Goecks J, Nekrutenko A, Taylor J, Team TG. Galaxy: a comprehensive approach for
supporting accessible, reproducible, and transparent computational research in the life
sciences. Genome Biol. 2010;11.

4.

Joseph H. The Open Access Movement Grows Up: Taking Stock of a Revolution. PloS
Biol. 2013;11:10.

5.

Schatz MC. Biological data sciences in genome research: Figure 1. Genome Res.
2015;25(10):1417-1422.

6.

Dowell RD, Ryan O, Jansen A, et al. Genotype to Phenotype: A Complex Problem.
Science (80- ). 2010;328(5977):469.

7.

Houle D, Govindaraju DR, Omholt S. Phenomics: the next challenge. Nat Rev Genet.
2010;11(12):855-866.

8.

Larrañaga P, Calvo B, Santana R, et al. Machine learning in bioinformatics. Brief
Bioinform. 2006;7(1):86-112.

9.

Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nat
Rev Genet. 2015;16(6):321-332.

10.

Yip KY, Cheng C, Gerstein M. Machine learning and genome annotation: a match meant
to be? Genome Biol. 2013;14(5):205.

11.

Wolpert DH, MacCready WG. No Free Lunch Theorems for Optimization. IEEE Trans
Evol Comput. 1997;1(1):67-82.

12.

Ma C, Zhang HH, Wang X. Machine learning for Big Data analytics in plants. Trends
Plant Sci. 2014;19(12):798-808.

13.

Cruz JA, Wishart DS. Applications of machine learning in cancer prediction and
prognosis. Cancer Inform. 2006;2:59-77.

14.

Kourou K, Exarchos TP, Exarchos KP, Karamouzis M V., Fotiadis DI. Machine learning
applications in cancer prognosis and prediction. Comput Struct Biotechnol J. 2015;13:817.

15.

Kampichler C, Wieland R, Calmé S, Weissenberger H, Arriaga-Weiss S. Classification in

20

conservation biology: A comparison of five machine-learning methods. Ecol Inform.
2010;5(6):441-450.
16.

Kircher M, Stenzel U, Kelso J. Improved base calling for the Illumina Genome Analyzer
using machine learning strategies. Genome Biol. 2009;10(8):R83.

17.

Consortium TEP. An integrated encyclopedia of DNA elements in the human genome.
Nature. 2012;489(7414):57-74.

18.

Marbach D, Roy S, Ay F, et al. Predictive regulatory models in Drosophila melanogaster
by integrative inference of transcriptional networks. Genome Res . 2012;22(7):1334-1349.

19.

The modENCODE Consortium, Roy S, Ernst J, et al. Identification of Functional
Elements and Regulatory Circuits by Drosophila modENCODE. Science (80- ).
2010;330(6012):1787 LP-1797.

20.

Morgan TH. The Nobel Prize in Physiology or Medicine 1933. Nobelprize.org; 1933.

21.

Pierce BA. Genetics: A Conceptual Approach, 5th Edition. NY: W. H. Freeman; 2014.

22.

Chi KR. The dark side of the human genome. Nature. 2016;538:275-277.

23.

Carroll SB. Evo-Devo and an Expanding Evolutionary Synthesis: A Genetic Theory of
Morphological Evolution. Cell. 2008;134(1):25-36.

24.

Prud’homme B, Gompel N, Rokas a, et al. Repeated morphological evolution through
cis-regulatory changes in a pleiotropic gene. Nature. 2006;440(April):1050-1053.

25.

Gramzow L, Ritz MS, Theißen G. On the origin of MADS-domain transcription factors.
Trends Genet. 2010;26(4):149-153.

26.

Nitta KR, Jolma A, Yin Y, et al. Conservation of transcription factor binding specificities
across 600 million years of bilateria evolution. Ren B, ed. Elife. 2015;4:e04837.

27.

Shlyueva D, Stampfel G, Stark A. Transcriptional enhancers: from properties to genomewide predictions. Nat Rev Genet. 2014;15(4):272-86.

28.

Niemann UO, Heinrich. Identification and analysis of eukaryotic promotors: recent
computational approaches. Trends Genet. 2001;17(2):56-60.

29.

Danino YM, Even D, Ideses D, Juven-gershon T. The core promoter : At the heart of gene
expression. Biochim Biophys Acta. 2015;1849(2014):1116-1131.

30.

Rach EA, Winter DR, Benjamin AM, et al. Transcription initiation patterns indicate
divergent strategies for gene regulation at the chromatin level. PLoS Genet. 2011;7(1):1.

31.

Lenhard B, Sandelin A, Carninci P. Metazoan promoters: emerging characteristics and
insights into transcriptional regulation. Nat Rev Genet. 2012;13(4):233-245.

21

32.

Pott S, Lieb JD. What are super-enhancers? Nat Genet. 2015;47(1):8-12.

33.

Barolo S. Shadow enhancers: Frequently asked questions about distributed cis-regulatory
information and enhancer redundancy. BioEssays. 2012;34(2):135-141.

34.

Wittkopp PJ, Kalay G. Cis-regulatory elements: molecular mechanisms and evolutionary
processes underlying divergence. Nat Rev Genet. 2012;13(1):59-69.

35.

Barolo S, Posakony JW. Three habits of highly effective signaling pathways: principles of
transcriptional control by developmental cell signaling. Genes Dev. 2002;16(10):11671181.

36.

Erives A, Levine M. Coordinate enhancers share common organizational features in the
Drosophila genome. Proc Natl Acad Sci U S A. 2004;101(11):3851-6.

37.

Arnosti DN, Kulkarni MM. Transcriptional enhancers: Intelligent enhanceosomes or
flexible billboards? J Cell Biochem. 2005;94(5):890-898.

38.

Hewitt GF, Strunk BS, Margulies C, et al. Transcriptional repression by the Drosophila
giant protein: cis element positioning provides an alternative means of interpreting an
effector gradient. Development. 1999;126(6):1201-10.

39.

Lusk RW, Eisen MB. Evolutionary mirages: Selection on binding site composition creates
the illusion of conserved grammars in Drosophila enhancers. PLoS Genet. 2010;6(1):1.

40.

Spitz F, Furlong EEM. Transcription factors : from enhancer binding to developmental
control. Nat Rev Genet. 2012;13(9):613-626.

41.

Yao L-C, Phin S, Cho J, Rushlow C, Arora K, Warrior R. Multiple modular promoter
elements drive graded brinker expression in response to the Dpp morphogen gradient.
Development. 2008;135(12):2183-2192.

42.

Gafner L, Dalessi S, Escher E, Pyrowolakis G, Bergmann S, Basler K. Manipulating the
Sensitivity of Signal-Induced Repression: Quantification and Consequences of Altered
Brinker Gradients. PLoS One. 2013;8(8):8.

43.

Wei Y, Gokhale RH, Sonnenschein A, Montgomery KM, Ingersoll A, Arnosti DN.
Complex cis-regulatory landscape of the insulin receptor gene underlies the broad
expression of a central signaling regulator. Development. 2016;143(19):3591 LP-3603.

44.

Hnisz D, Shrinivas K, Young RA, Chakraborty AK, Sharp PA. A Phase Separation Model
for Transcriptional Control. Cell. 2017;169(1):13-23.

45.

Bickel RD, Kopp A, Nuzhdin S V. Composite effects of polymorphisms near multiple
regulatory elements create a major-effect QTL. PLoS Genet. 2011;7(1):1.

46.

Barrière A, Gordon KL, Ruvinsky I. Distinct Functional Constraints Partition Sequence
Conservation in a cis-Regulatory Element. PLoS Genet. 2011;7(6):e1002095.

22

47.

Hare EE, Peterson BK, Iyer VN, Meier R, Eisen MB. Sepsid even-skipped enhancers are
functionally conserved in Drosophila despite lack of sequence conservation. PLoS Genet.
2008;4(6):6.

48.

Kalay G, Wittkopp PJ. Nomadic enhancers: Tissue-specific cis-regulatory elements of
yellow have divergent genomic positions among Drosophila species. PLoS Genet.
2010;6(11):441-450.

49.

Siepel A, Bejerano G, Pedersen JS, et al. Evolutionarily conserved elements in vertebrate,
insect, worm, and yeast genomes. Genome Res . 2005;15(8):1034-1050.

50.

Makunin I V., Shloma V V., Stephen SJ, Pheasant M, Belyakin SN. Comparison of ultraconserved elements in drosophilids and vertebrates. PLoS One. 2013;8(12):1334-1349.

51.

Rebeiz M, Castro B, Liu F, Yue F, Posakony JW. Ancestral and conserved cis-regulatory
architectures in developmental control genes. Dev Biol. 2012;362(2):282-294.

52.

Halligan DL, Eyre-Walker A, Andolfatto P, Keightley PD. Patterns of evolutionary
constraints in intronic and intergenic DNA of Drosophila. Genome Res. 2004;14(2):273279.

53.

Bejerano G, Haussler D, Blanchette M. Into the heart of darkness: Large-scale clustering
of human non-coding DNA. Bioinformatics. 2004;20(SUPPL. 1):i40-i48.

54.

Bradley RK, Li XY, Trapnell C, et al. Binding site turnover produces pervasive
quantitative changes in transcription factor binding between closely related drosophila
species. PLoS Biol. 2010;8(3):3.

55.

Glassford WJ, Rebeiz M. Assessing constraints on the path of regulatory sequence
evolution. Philos Trans R Soc Lond B Biol Sci. 2013;368(1632):20130026.

56.

Ramos AI, Barolo S. Low-affinity transcription factor binding sites shape morphogen
responses and enhancer evolution. Philos Trans R Soc Lond B Biol Sci.
2013;368(20130):20130018.

57.

Haraksingh RR, Snyder MP. Impacts of variation in the human genome on gene
regulation. J Mol Biol. 2013;425(21):3970-3977.

58.

Balhoff JP, Wray GA. Evolutionary analysis of the well characterized endo16 promoter
reveals substantial variation within functional sites. Proc Natl Acad Sci U S A.
2005;102(24):8591-8596.

59.

Gibson G, Dworkin I. Uncovering cryptic genetic variation. Nat Rev Genet.
2004;5(9):681-690.

60.

Doniger SW, Fay JC. Frequent gain and loss of functional transcription factor binding
sites. PLoS Comput Biol 3 e99. 2007;3(e99):5.

23

61.

Gotea V, Visel A, Westlund JM, et al. Homotypic clusters of transcription factor binding
sites are a key component of human promoters and enhancers. Genome Res. 2010;20:565577.

62.

Frankel N, Davis GK, Vargas D, Wang S, Payre F, Stern DL. Phenotypic robustness
conferred
by
apparently
redundant
transcriptional
enhancers.
Nature.
2010;466(7305):490-493.

63.

Perry MW, Boettiger AN, Bothma JP, Levine M. Shadow Enhancers Foster Robustness of
Drosophila Gastrulation. Curr Biol. 2010;20(17):1562-1567.

64.

Xiong N, Kang C, Raulet DH. Redundant and unique roles of two enhancer elements in
the TCRgamma locus in gene regulation and gammadelta T cell development. Immunity.
2002;16(3):453-463.

65.

Gompel N, Prud’homme B, Wittkopp PJ, Kassner VA, Carroll SB. Chance caught on the
wing: cis-regulatory evolution and the origin of pigment patterns in Drosophila. Nature.
2005;433(7025):481-487.

66.

Erwin DH, Davidson EH. The evolution of hierarchical gene regulatory networks. Nat Rev
Genet. 2009;10(2):141-148.

67.

Halder G, Callaerts P, Gehring WJ. Induction of ectopic eyes by targeted expression of the
eyeless gene in Drosophila. Science. 1995;267(5205):1788-1792.

68.

Weasner B, Anderson J, Kumar JP. The Eye Specification Network in Drosophila. Proc
Indian Natl Sci Acad Part B, Biol Sci. 2004;B70(5-6):517-530.

69.

Weasner BM, Weasner B, DeYoung SM, Michaels SD, Kumar JP. Transcriptional
activities of the Pax6 gene eyeless regulate tissue specificity of ectopic eye formation in
Drosophila. Dev Biol. 2009;334(2):492-502.

70.

Duque T, Samee MAH, Kazemian M, Pham HN, Brodsky MH, Sinha S. Simulations of
Enhancer Evolution Provide Mechanistic Insights into Gene Regulation. Mol Biol Evol.
2014;31(1):184-200.

71.

Yáñez-Cuna JO, Kvon EZ, Stark A. Deciphering the transcriptional cis-regulatory code.
Trends Genet. 2013;29(1):11-22.

72.

Swanson CI, Schwimmer DB, Barolo S. Rapid evolutionary rewiring of a structurally
constrained eye enhancer. Curr Biol. 2011;21(14):1186-1196.

73.

Ludwig MZ, Kreitman M. Evolutionary dynamics of the enhancer region of even-skipped
in Drosophila. Mol Biol Evol. 1995;12(6):1002-11.

74.

Ludwig MZ, Bergman C, Patel NH, Kreitman M. Evidence for stabilizing selection in a
eukaryotic enhancer element. Nature. 2000;403(6769):564-567.

24

75.

Ay A, Arnosti DN. Mathematical modeling of gene expression: a guide for the perplexed
biologist. Crit Rev Biochem Mol Biol. 2011;46(2):137-51.

76.

Crocker J, Tsai A, Stern DL. A Fully Synthetic Transcriptional Platform for a Multicellular
Eukaryote. Cell Rep. 2017;18(1):287-296.

77.

Fakhouri W, Ay A, Sayal R. Deciphering a transcriptional regulatory code: modeling
short-range repression in the Drosophila embryo. Mol Syst …. 2010;6(341):341.

78.

Janssens H, Hou S, Jaeger J, et al. Quantitative and predictive model of transcriptional
control of the Drosophila melanogaster even skipped gene. Nat Genet. 2006;38(10):11591165.

79.

Levo M, Segal E. In pursuit of design principles of regulatory sequences. Nat Rev Genet.
2014;15(7):453-68.

80.

Narasimhan K, Pillay S, Huang YH, et al. DNA-mediated cooperativity facilitates the coselection of cryptic enhancer sequences by SOX2 and PAX6 transcription factors. Nucleic
Acids Res. 2015;43(3):1513-1528.

81.

Forms: CSBE. The Evolution of Gene Regulation and Morphological Diversity. Cell.
2000;101:577-580.

82.

Wittkopp PJ, Haerum BK, Clark A. Genetic basis of regulatory variation within and
between Drosophila species. Nat Genet. 2008;40:346-350.

83.

Ipe J, Swart M, Burgess KS, Skaar TC. High-Throughput Assays to Assess the Functional
Impact of Genetic Variants: A Road Towards Genomic-Driven Medicine. Clin Transl Sci.
2017;10(2):67-77.

84.

Arnold CD, Gerlach D, Stelzer C, Boryn LM, Rath M, Stark A. Genome-Wide
Quantitative Enhancer Activity Maps Identified by STARR-seq. Science (80- ).
2013;339(6123):1074-1077.

85.

Kvon EZ, Kazmar T, Stampfel G, et al. Genome-scale functional characterization of
Drosophila developmental enhancers in vivo. Nature. 2014;512(7512):91-5.

86.

Groth AC, Fish M, Nusse R, Calos MP. Construction of Transgenic Drosophila by Using
the Site-Specific Integrase from Phage qC31. Genetics. 2004;166(4):1775-1782.

87.

Venken KJT, He Y, Hoskins RA, Bellen HJ. Pacman: a BAC transgenic platform for
targeted insertion of large DNA fragments in D. melanogaster. Science (80- ).
2006;314(5806):1747-1751.

88.

Sharan SK, Thomason LC, Kuznetsov SG, Court DL. Recombineering: a homologous
recombination-based method of genetic engineering. Nat Protoc. 2009;4(2):206-223.

89.

Fulco CP, Munschauer M, Anyoha R, et al. Systematic mapping of functional enhancer–

25

promoter connections with CRISPR interference. Science (80- ). 2016;354(6313):1-8.
90.

Heintzman ND, Stuart RK, Hon G, et al. Distinct and predictive chromatin signatures of
transcriptional promoters and enhancers in the human genome. Nat Genet.
2007;39(3):311-8.

91.

Zeitlinger J, Zinzen RP, Stark A, et al. Whole-genome ChIP-chip analysis of Dorsal,
Twist, and Snail suggests integration of diverse patterning processes in the Drosophila
embryo. Genes Dev. 2007;21(4):385-390.

92.

Ernst J, Kellis M. ChromHMM: automating
characterization. Nat Methods. 2012;9(3):215-6.

93.

Devailly G, Mantsoki A, Michoel T, Joshi A. Variable reproducibility in genome-scale
public data: A case study using ENCODE ChIP sequencing resource. FEBS Lett.
2015;589(24):3866-3870.

94.

Henikoff S, Shilatifard A. Histone modification: Cause or cog? Trends Genet.
2011;27(10):389-396.

95.

Spivakov M. Spurious transcription factor binding: Non-functional or genetically
redundant? BioEssays. 2014;36(8):798-806.

96.

Whitaker JW, Nguyen TT, Zhu Y, Wildberg A, Wang W. Computational schemes for the
prediction and annotation of enhancers from epigenomic assays. Methods. 2015;72(C):8694.

97.

Claussnitzer M, Dankel SN, Kim K-H, et al. FTO Obesity Variant Circuitry and Adipocyte
Browning in Humans. N Engl J Med. 2015;373(10):895-907.

98.

Williamson I, Lettice LA, Hill RE, Bickmore WA. Shh and ZRS enhancer colocalisation is
specific to the zone of polarising activity. Development. 2016;143(16):2994-3001.

99.

Barutcu AR, Fritz AJ, Sayyed KZ, et al. C-ing the genome: A compendium of
chromosome conformation capture methods to study higher-order chromatin organization.
J Cell Physiol. 2015;231(1):1097-4652.

chromatin-state

discovery

and

100. Yao L, Berman BP, Farnham PJ. Demystifying the secret mission of enhancers: linking
distal regulatory elements to target genes. Crit Rev Biochem Mol Biol. 2015;50:550-573.
101. Gittelman RM, Hun E, Ay F, et al. Comprehensive identification and analysis of human
accelerated regulatory DNA. Genome Res. 2015;25(9):1245-1255.
102. Visel A, Bristow J, Pennacchio LA. Enhancer identification through comparative
genomics. Semin Cell Dev Biol. 2007;18(1):140-152.
103. Albert FW, Kruglyak L. The role of regulatory variation in complex traits and disease. Nat
Rev Genet. 2015;16(4):197-212.

26

104. Narlikar L, Ovcharenko I. Identifying regulatory elements in eukaryotic genomes.
Briefings Funct Genomics Proteomics. 2009;8(4):215-230.
105. Erwin GD, Oksenberg N, Truty RM, et al. Integrating Diverse Datasets Improves
Developmental Enhancer Prediction. PLoS Comput Biol. 2014;10(6):6.
106. Wilczynski B, Liu YH, Yeo ZX, Furlong EEM. Predicting Spatial and Temporal Gene
Expression Using an Integrative Model of Transcription Factor Occupancy and Chromatin
State. PLoS Comput Biol. 2012;8:12.
107. Cannavo E, Khoueiry P, Garfield DA, Geeleher P, Zichner T, et al. Shadow enhancers are
pervasive features of developmental regulatory networks. Curr Biol. 2016;26:1-14.
108. Allanson JE, Bohring A, Dorr HG, Dufke A, Gillessen-Kaesbach G, Horn D, Konig R,
Kratz CP, Kutsche K, Pauli S, et al.. The Face of Noonan Syndrome: Does Phenotype
Predict Genotype. Am J Med Genet A. 2010;152:1960-1966.
109. Boehringer S, van der Lijn F, Liu F, et al. Genetic determination of human facial
morphology: links between cleft-lips and normal variation. Eur J Hum Genet.
2011;19(11):1192-1197.
110. Ferry Q, Steinberg J, Webber C, et al. Diagnostically relevant facial gestalt information
from ordinary photos. Elife. 2014;3:e02020.
111. Zou L, Adegun OK, Willis A, Fortune F. Facial biometrics of peri-oral changes in Crohn’s
disease. Lasers Med Sci. 2014;29(3):869-874.
112. Zelditch M. Geometric Morphometrics for Biologists : A Primer. Elsevier Academic
Press; 2004.
113. Ahmad F, Roy K, O’Connor B, Shelton J, Dozier G, Dworkin I. Fly Wing Biometrics
Using Modified Local Binary Pattern, SVMs and Random Forest. Int J Mach Learn
Comput. 2014;4(3):279-285.
114. Payne M, Turner J, Shelton J, et al. Fly wing biometrics. IEEE Work Comput Intell
Biometrics Identity Manag CIBIM. 2013:42-46.
115. Sagonas C, Antonakos E, Tzimiropoulos G, Zafeiriou S, Pantic M. 300 Faces In-The-Wild
Challenge: database and results. Image Vis Comput. 2016;47:3-18.
116. Steve IK, Daniel S, Dec C V. The MegaFace Benchmark : 1 Million Faces for Recognition
at Scale. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. ; 2016:4873--4882.
117. Shelton J, Bryant K, Abrams S, et al. Genetic & Evolutionary Biometric Security:
Disposable Feature Extractors for Mitigating Biometrics Replay Attacks. Procedia
Comput Sci. 2012;8:351-360.

27

118. Schroff F, Kalenichenko D, Philbin J. FaceNet: A Unified Embedding for Face
Recognition and Clustering. In: The IEEE Conference on Computer Vision and Pattern
Recognition. ; 2015:815-823.

28

CHAPTER II: THE SHEEP AND THE GOATS: DISTINGUISHING
TRANSCRIPTIONAL ENHANCERS IN A COMPLEX CHROMATIN
LANDSCAPE

29

Abstract
Predicting regulatory function of non-coding DNA using genomic information remains a major
goal in genomics, and an important step in interpreting cis-regulatory code. Regulatory capacity
can be partially inferred from transcription factor occupancy, histone modifications, motif
enrichment, and evolutionary conservation. However, even combinations of these features in
well-studied systems such as Drosophila have limited predictive accuracy. Here we examine the
current limits of computational enhancer prediction by applying machine-learning methods to an
extensive set of genomic features, validating predictions with the Fly Enhancer Resource, which
characterized the transcriptional activity of approximately 15 percent of the genome. Supervised
machine learning trained on a range of genomic features identify active elements with a high
degree of accuracy, but are less successful at distinguishing tissue-specific expression patterns.
Consistent with previous observations of their widespread genomic interactions, many
transcription factors were associated with enhancers not known to be direct functional targets.
Interestingly, no single factor was necessary for enhancer identification, although binding by the
'pioneer' transcription factor Zelda was the most predictive feature for enhancer activity. Using
an increasing number of predictive features improved classification with diminishing returns.
Thus, additional single-timepoint ChIP data may have only marginal utility for discerning true
regulatory regions. On the other hand, spatially- and temporally-differentiated genomic features
may provide more power for this type of computational enhancer identification. Inclusion of
new types of information distinct from current chromatin-immunoprecipitation data may enable
more precise identification of enhancers, and further insight into the features that distinguish
their biological functions.

30

Introduction
Enhancers, cis-regulatory elements that coordinate the input of transcriptional activators and
repressors, are the primary determinants of eukaryotic gene expression. Enhancers can regulate
genes from distances of hundreds to hundreds of thousands of base pairs, and genes are
frequently regulated by multiple enhancers to provide complex spatial and temporal regulation 1–
3

. Changes in enhancers play pivotal roles in the evolution of multi-cellular life

4–6

, and

disruption in enhancer function has been linked in many cases to disease3,7–9.
Identifying the complete repertoire of enhancer(s) for a specific gene is difficult, although there
are a number of features that can provide insight. Enhancers are frequently characterized by
enrichment of binding motifs for various transcription factors10–12. This may rely on prior
knowledge of binding affinities for specific factors and curated position weight matrices
(PWMs)13,14, or by searching for over-represented motif clusters in sequences adjacent to coexpressed genes15,16. However the presence of motifs does not indicate when an enhancer would
be active in development, and motifs are not a perfect indicator of transcription factor binding17.
Genome-wide studies have used features such as chromatin immunoprecipitation data for
transcription factor occupancy and histone modifications to infer function of non-coding DNA;
however, features associated with enhancers are also found in regions of open chromatin that are
not necessarily active18–20. Many enhancers will have features of regulatory DNA even while
inactive21 and transcription factors and polymerases will frequently bind in preparation for future
activity22.
Going beyond simple predictions of overall activity, several studies have attempted to predict
spatial and temporal aspects of enhancer function. These have had some success at classifying
enhancers into broad categories23–25, and making inferences about the frequency and distribution

31

of specific types of enhancers (e.g. 'shadow enhancers')25. However, results from thermodynamic
models used to predict specific expression patterns based on transcription factor binding have
shown that achieving that level of predictive specificity is a challenging problem26,27.
After identifying a prospective enhancer, an additional challenge is determining which gene or
genes it regulates22. Enhancers can be very distal to the target gene, which may not be the
adjacent transcription unit

28–30

and a single enhancer can influence multiple genes25,28,31. Most

computational approaches do rely on proximity, as a substantial percentage of enhancers control
the nearest gene32–34. However, since many enhancers direct distal genes, this heuristic is far
from universally applicable35.
Chromatin conformation data can provide more direct information about which genes are
regulated by enhancers6,34,36. Many of these experiments have been conducted in cell culture, but
increasingly datasets from specific organ systems are also becoming available37. However,
identifying tissue specific enhancers from conformation data is complicated by the large number
of non-transcription interactions between genomic regions38. Combinatorial protein binding can
also provide insight into enhancer locations; clusters of transcription factor binding can indicate
enhancer activity. Genes known to be involved in a developmental network are likely to be
associated with enhancers bound by common transcription factors regulating that network39,40.
In Drosophila melanogaster enhancer identification and characterization has traditionally relied
on genetic analysis41 and in-vivo reporter assays1. However, even in organisms with relatively
small genomes like Drosophila, this is a daunting task on a genome-wide scale42. Highthroughput methods that assay the entire genome, such as STARR-seq, offer a more
comprehensive perspective, but have thus far been confined to analysis of potential enhancers in
cultured cells, and only provide insight on transcriptional activity, not function in control of

32

specific genes. Most of these assays are also limited to showing enhancer readout on specific
basal promoters (which can have a substantial impact on enhancer responsiveness), and neglect
the influence of genomic neighborhood43–45. Evolutionary conservation has been successfully
used to identify Drosophila enhancers where functional regions of non-coding DNA are more
highly conserved than background46–48, although this method is complicated by the relatively
uniform conservation across most upstream and intronic regions49. Furthermore, in many cases
enhancers have conserved function despite sequences being highly diverged50, and non-coding
regions with functions unrelated to enhancer activity can also be conserved above
background47,51,52.
Ultimately, no one genomic feature is an exclusive or universal indicator of enhancer location or
activity42. The increasing availability of genomics datasets for enhancer correlates, such as
nucleosome occupancy and transcription factor binding53, as well as spatial information from
techniques like HiC54 and ChIA-Pet55, has opened up the door for compound approaches wherein
multiple features are used to infer the activity of enhancers56,57. Machine learning approaches
combining multiple genomics datasets have employed a broad range of supervised and
unsupervised methods, as well as different combinations of enhancer features58. Some studies
have also successfully assigned enhancers to genes using a probabilistic approach, where both
proximity and biological information are considered59,60. Taken together, these studies suggest
that compound approaches, using multiple features and diverse types of features as predictors are
the most successful60–62. However, distinguishing between functional binding from background
represents a major challenge to using genome-wide datasets to predict enhancer activity, and the
accuracy of enhancer predictions is not entirely known. Many efforts at estimating the accuracy
of enhancer predictions have been limited to testing a handful of representative cases63, or

33

correlations in place of broad-scale verification. Enhancers are identified based on correlating
features (e.g. motifs, transcription factor occupancy), and the accuracy of these calls are
estimated from different correlates (e.g. DNase hypersensitivity61,64, and

eRNAs65). Until

recently, more direct methods for estimating accuracy have not been available62,66.
The recently published Fly Enhancer Resource is a database of information for nearly 8000
Drosophila genomic regions, showing the activity of in-vivo reporters with these regions at
various developmental stages. It is also an ideal resource for determining both the predictive
power of different features associated with enhancers. Importantly, the database provides detailed
information about active enhancers through development, as well as many genomic regions that
are not transcriptionally active. This data can be used as a training set for distinguishing active
enhancers from inactive regions, and determining which features are most effective for
distinguishing these groups. Here, we exploit the possibilities for validation provided by the
extensive functional survey by Kvon et al. 201467, and use extant data on transcription factor
occupancy, chromatin marks, and DNA sequence to push the limits in enhancer identification
and classification, and explore what conditions could best be used to distinguish this state.

Materials and Methods
Datasets used as predictive features
Predictive features were obtained from a variety of publicly available datasets (see Table 2-1).
For each dataset, information that overlapped with the coordinates of DNA elements were
categorized as features of those DNA elements. These include conservation information between
D. melanogaster and other species in the Drosophila genus, motif scores based on position
weight matrices (PWMs) associated with transcription factors involved in dorsal-ventral
embryonic patterning, and chromatin immunopreciptation (ChIP) data for a number of

34

transcription factors and histone modifications. ChIP data all came from stage 4-6 embryos; we
chose to focus on early embryonic development, as the regulatory networks governing this stage
of development are extremely well understood, and many genes are expressed in simple, easily
categorized expression patterns. This stage also has the fewest active genes and the smallest
number of active DNA elements (as indicated in Kvon et al. 201467), which simplifies the task of
assigning enhancers to genes.
Pairwise alignments were downloaded in axt file format from the University of California SantaCruz Genome Browser68 (http://hgdownload.soe.ucsc.edu/downloads.html). Alignments used
were D. melanogaster version dm3 to D. grimshawi droGri2, D. willistoni droWil1, D.
ananassae droAna3, D. pseudoobscura dp4, D. erecta droEre2, D. yakuba droYak2, D. sechellia
droSec1, and D. simulans droSim1. These were chosen to reflect a range of phylogenetic
distances from D. melanogaster. Summary lines from each chromosome axt file were used to
create a single summary file for each genome. A custom python script was used to determine the
average BLASTZ score per 100 base pairs for a region, which was plotted by species on a log10
scale. Perfect conservation over 100 base pairs would yield a score near 10,000. Scripts and
intermediate files are available on github at https://github.com/asonnens/crm_analysis. BLASTZ
scores were used for pairwise comparisons69.
Position probability matrices for the transcription factors Dorsal, Snail, Twist and Zelda were
obtained from the database Fly Factor Survey70. These values were formatted using a custom
python script to enable them to be input into the MEME Suite program MAST71. Motif matches
throughout the genome with a p-value of less than 0.0001 were obtained using MAST, and the
release 5.37 version of the Drosophila melanogaster genome. Output files, containing
coordinates of qualifying matches and their scores, were converted to bedgraph forma

35

Table 2-1 Features for enhancer classification
ID
Dorsal 2015 ChIP-seq
Dorsal 2009 ChIP-chip
Snail 2014 ChIP-chip
Snail 2009 ChIP-chip
Twist 2014 ChIP-chip
Twist 2011 ChIP-seq
Twist 2009 ChIP-chip
Bicoid 2013 ChIP-seq
Bicoid 2009 ChIP-chip
Caudal 2010 ChIP-seq
Caudal 2009 ChIP-chip
Hunchback 2013 ChIP-seq
Hunchback 2009 ChIP-chip
Giant 2013 ChIP-seq
Giant 2009 ChIP-chip
Hairy 2009 ChIP-chip
Knirps 2010 ChIP-seq
Kruppel 2013 ChIP-seq
Zelda 2011 ChIP-seq
H3K27ac 2015 ChIP-seq
H3K27ac 2010 ChIP-seq
H3K4me1 2015 ChIP-seq
p300 2010 ChIP-seq
Zelda motif sanger
Zelda motif solexa
Dorsal motif FlyReg
Dorsal motif NBT
Snail motif FlyReg
Snail motif sanger
Snail motif solexa
Twist motif da
Twist motif FlyReg
UCSC BlastZ scores

Short ID
Dl 15 s
Dl 09 c
Sna 14 c
Sna 09 c
Twi 14 c
Twi 11 s
Twi 09 c
Bcd 13 s
Bcd 09 c
Cad 10 s
Cad 09 c
Hb 13 s
Hb 09 c
Gt 13 s
Gt 09 c
Hry 09 c
Kni 10 s
Kr 13 s
Zld 11 s
H3K27ac 15
H3K27ac 10
H3K4me1 15
p300 10
Zld m1
Zld m2
Dl m1
Dl m2
Sna m1
Sna m2
Sna m3
Twi m1
Twi m2
NA

Features
1748
9357
7735
596, 2800
8629
8797
6684, 7414
2061
619, 702
4045
1590
4986
1831, 1717
4194
1069
1704, 2729
505
5309
9432
3055
3658
1934
3296
17179
19271
71087
26159
38221
31059
42341
29656
61676
NA

Reference
Sun et al. 2015
Macarthur et al. 2009
Rembold et al. 2014
Macarthur et al. 2009
Rembold et al. 2014
He et al. 2011
Macarthur et al. 2009
Paris et al. 2013
Macarthur et al. 2009
Bradley et al. 2010
Macarthur et al. 2009
Paris et al. 2013
Macarthur et al. 2009
Paris et al. 2013
Macarthur et al. 2009
Macarthur et al. 2009
Bradley et al. 2010
Paris et al. 2013
Harrison et al. 2011
Kok et al. 2015
Roy 2010
Kok et al. 2015
Roy et al. 2010
Fly Factor Survey
Fly Factor Survey
Fly Factor Survey
Fly Factor Survey
Fly Factor Survey
Fly Factor Survey
Fly Factor Survey
Fly Factor Survey
Fly Factor Survey
Rosenbloom et al. 2015

To obtain insight into factors driving enhancer activity, we used ChIP-chip and ChIP-seq data.
Peak scores were obtained from public databases including ModEncode, the Berkeley
Drosophila Genome Project, and other publications with relevant ChIP data (see Table 2-1 in
Methods). In each case terminal data files were converted to bedgraphs. When necessary,

36

coordinates were converted to the release 5 genome using the Flybase coordinate conversion
tool. As these datasets were analyzed using different pipelines, they are somewhat variable
regarding distributions of scores, and thresholds for what was considered a significant peak.
Intermediate files are available on github at https://github.com/asonnens/crm_analysis. The
overlap (of at least one nucleotide) between peaks called by different datasets, and correlation
between scores in overlapping peaks, was analyzed using a custom python script, and visualized
using gplots (version 3.0.1) in R (version 3.3.2).
Organization and curation of data from Fly Enhancer Resource
DNA elements from the Stark Lab Fly Enhancer Resource were downloaded from the Kvon et al.
2014 supplementary tables. For comparisons made in stage 4-6 embryos, we used regions that
were active in stage 4-6 embryos, filtered to include only those with activity scores of 3 or higher
(as annotated by Kvon et al. 2014), and confirmed (by Kvon et al. 2014) inactive regions in stage
4-6. For examination of expression patterns, the highly active stage 4-6 DNA elements were reannotated based on photographs of the slides, to identify those that exclusively fit characteristic
Anterior/Posterior and Dorsal/Ventral expression patterns. 114 DNA elements were found to
have strong expression distinctly in the Anterior region of the embryo, while 78 were only
expressed in the Central or Posterior regions. 22 DNA elements were identified that had
stereotypical mesodermal or neurogenic ectoderm expression patterns (as in enhancers for snail
or short gastrulation respectively), and 192 had strong expression patterns that did not overlap
with these two regions. All active and inactive DNA elements that Kvon et al. 2014 identified as
'verified' were used for classifications based on temporal specificity (e.g. ubiquitously active, or
stage specific). DNA elements were grouped by whether they were active in all embryonic
stages, only in stages 4-8, stages 9-12, stages 13-16 (Figure 2-13).

37

Intersection of features with Fly Enhancer Resource
To determine if these features could be used to distinguish between spurious binding and
functional binding, we compared the distribution of peak scores for features that overlapped (by
at least one base pair) with genomic regions that drive strong expression in stage 4-6 embryos (as
reported by the Fly Enhancer Resource) and regions that do not drive expression in stage 4-6
embryos. Every DNA element tested by the Fly Enhancer Resource was assigned a score for
each feature included. Scores from bedgraph items (peak scores in the case of ChIP datasets,
MAST motif scores in the case of motif datasets, and BLASTZ scores for conservation
information) that overlapped with a DNA element were assigned to form that DNA element's
scores (multiple peaks within an element were combined; scores were only combined within a
single dataset, thus normalization was not necessary between datasets). If a DNA element did not
overlap with any features for a given dataset, its score for that feature was set to a numerical
value one standard-deviation lower than the lowest score that indicated a hit for that dataset,
based on the distribution of values for that feature. We calculated fractional overlap of these
genomic features, both to determine whether individual features were reproducibly identified in
different studies, and to see if there was overlap of features known to be functionally related.
Each dataset was compared pair-wise with every other dataset; the degree of feature overlap was
calculated as a percentage (the percentage of the features in the smaller dataset that overlap with
one or more features in the larger dataset). In Fig. 2-11, the Fly Enhancer Resource DNA
elements were filtered to only include those that fit a minimal threshold for a potential active
enhancer, based on binding of transcription factors from at least one of the datasets described in
Table 2-1. Subsets of this set of potential enhancers were created to allow comparisons between
active DNA elements and inactive regions that resemble active DNA elements with regards to

38

transcription factor occupancy. This was done by restricting the data to include only elements
bound by two or more transcription factors, or only DNA elements occupied by three or more
transcription factors (see Results.) All sets of DNA elements were under-sampled from the larger
category to balance the number of active and inactive DNA elements included in these
prospective training sets.
The DNA elements with features assigned were scaled and visualized using a Principal
Components Analysis using the stats package (version 3.2.3) in R (version 3.3.2).The percentage
of variation captured is displayed in a scree plot (Figure 2-b).
Discrimination between classes of enhancers with Random Forest
To determine if it was possible to distinguish between regions that are transcriptionally active
and regions that are soon-to-be active, we also classified the Fly Enhancer Resource DNA
elements into several categories of activity. This included ubiquitous activity (active in every
developmental stage measured Kvon et al.), total inactivity (never active in any measured stage),
and stage specific regions that are active only in early, middle, or later embryonic stages. To
facilitate comparisons of different sorts of data, values for ChIP, conservation, and motif scores
were normalized using the 'scale' function in R. In Fig. 2-9, highly active (defined by a score of 3
or higher in stage 4-6) and inactive DNA elements were classified using a random forest
algorithm (randomForest package, version 4.6-12) with 500 trees in R (version 3.2.3), using two
thirds of DNA elements for training and one third for testing. Receiver operating characteristic
(ROC), Precision Recall and the area under these curves was analyzed using ROCR package in R
( version 1.0-7, Sing et al. 2005). In this and later figures, the DNA elements used for training
were randomly re-sampled ten times. In Figure 2-11, 2-12 and Figure 2-16 where balanced
datasets were used equal numbers of active and inactive elements were randomly selected using

39

the package dplyr (version 0.5.0). For subsets filtered for DNA elements containing the indicated
ChIP binding protein or chromatin mark, the number of "active" elements ranged from ~100300, and for balanced datasets, the same number of inactive elements was used. The same
classification was attempted with datasets that were reduced to only include inactive regions that
at least superficially resembled active enhancers (regions occupied by two or more transcription
factors, or regions occupied by a combination of factors associated with development).
Feature importance analysis
Feature importance was measured using the mean decrease in Gini index within the
randomForest function, which estimates variable importance during the training of the random
forest. To determine if redundancy or correlation between features influenced importance scores,
features were iteratively left out of analyses, so after each set of predictions, the most important
or least important feature was dropped. In Fig. 12 where successive features are omitted from the
analysis, the most and least important features were re-calculated after each run with subsets of
the features. After exclusion of each feature, the order of feature importance frequently changed,
as the removal of partially correlated data increased the relative importance of remaining
features.
Identifying potential enhancers
In Fig. 10, to comprehensively assess regions that contain bound transcription factors and
chromatin signatures associated with enhancers, we generated a list of potential enhancers denovo based on clustering of transcription factors adjacent to specific genes of interest, which
were selected based on annotated anterior/posterior, mesodermal, or neurogenic ectodermal
expression patterns. Prospective enhancers were defined by clusters of overlapping features
occurring within 50kb windows centered on the +1 position of genes of interest, using a custom

40

python script available on github at https://github.com/asonnens/crm_analysis. Genes of interest
brinker (brk), ventral nervous system defective (vnd), short gastrulation (sog), snail (sna), evenskipped (eve), Kruppel (Kr), hunchback (hb), and knirps (kni), were selected based on their well
studied regulation in early stage embryos. To locate clusters, we defined regions containing at
minimum Zelda and one other ChIP signal (p300, H3K4me, H3K27ac, or any TF), or H3K4me1
and one other ChIP signal (as indicated in Fig. 10). The final enhancer was defined as a region
around the center of the cluster or clusters of features, and set to a fixed size of 500-2000 bp.
These regions were then classified as 'putative enhancers' or 'background binding' by random
forest algorithms trained on regions from the Fly Enhancer Resource.
Validation of predictions
Annotated enhancers were defined based on verified regulatory regions in the RedFly database 72.
Each random forest trained on the Fly Enhancer Resource was used to classify predicted
enhancers around genes of interest ten times, using different samples of the training set. A
custom python script was used to determine the frequency that each model classified a cluster
that overlapped with an annotated enhancer as active (defined as percentage overlap), and the
frequency that each model classified non-overlapping segments as active (likely off target). The
results of this were graphed with ggplot2.
Data Availability
Scripts and final files are available on github at https://github.com/asonnens/crm_analysis.

Results
Enhancers are not highly conserved
Sequence similarity in related species has been extensively used to identify functional noncoding DNA. Highly conserved regions may indicate purifying selection on regulatory
elements69, and conversely highly diverged sequences can be a signature of enhancers associated

41

with phenotype differences between species73. We looked at sequence conservation between
Drosophila melanogaster and a range of other species in the Drosophila genus for DNA elements
that were either active in every stage of embryonic development or inactive across all stages, to
see if there were any visible trends. DNA elements do not show increased sequence conservation
over inactive genomic regions as measured by BlastZ scores (Figure 2-1). This was true for all
species comparisons.
Correlations of distinct chromatin immunoprecipitation data
We collected a range of ChIP datasets to use for studying marks of enhancers. These datasets
were produced using different methods (ChIP-chip vs. ChIP-seq) and a range of experimental
conditions. Accordingly, the number of features contained within different datasets was highly
variable (Table 2-1). We assessed the consistency and overlap of each dataset to determine how
to most effectively combine them. Different datasets measuring binding of the same transcription
factor typically do not completely overlap, but do to a greater degree than is seen in comparisons
between different transcription factors (Figure 2-2, Figure 2-3).

42

Figure 2-1 Sequence conservation in active and inactive elements

We compared the sequence conservation for reporters that drove expression in every stage of
embryonic development relative to reporters that showed no activity at any stage. Levels of
sequence conservation for these ~3,800 tested reporter elements from D. melanogaster versus
D.simulans (red), D.yakuba (yellow), D.pseudoobscura (green), or D.grimshawi (blue), are
averaged per 100bp, across the 2kb reporter elements.The average conservation scores closely
match evolutionary distances-- sequences are consistently more similar to Drosophila
melanogaster in more closely related species regardless of activity; but reporters with strong
activity (shown in hatched areas) are not distinguishable from reporters that are inactive
throughout embryonic development (rest of points, aligned by chromosomal location). BlastZ
score of ~10,000 would indicate perfect sequence identity over 100 base pairs.

43

Figure 2-2 Correlations between genomic regulatory features

Similarities between ChIP datasets were assessed by determining percentage overlap of ChIP
peaks between different datasets, ranging from full overlap (blue) to no overlap (red). Boxes on
the diagonal highlight independent measurements of specific transcription factors; Twist
occupancy shows a high degree of correlation, whereas Hunchback is lower. Three Twist datasets
and one Snail dataset showed greater degrees of cross-correlation to other factors; lower levels of
correlation were noted for complete ChIP peak datasets. In the reported bound regions with peak
scores one standard deviation above average, overlap averages at 57 percent in datasets
measuring the same transcription factor as reported by different labs, and 38 percent between
different transcription factors.

44

Figure 2-3 Overlap of features between different datasets

In many cases less than half of the peaks reported for a given transcription factor are
reproducible between labs. This is more evident when examining all reported peaks, instead of
the highest-scoring peaks (Figure 2-2). When all reported peaks are included, measurements of
the same transcription factor between labs overlap on average 52 percent, compared with 49
percent overlap between different transcription factors.

45

Separate datasets produced by the same lab (either at different times, or using different
antibodies for the same protein- e.g. Twist, Bicoid, and Hunchback) overlapped more than
datasets produced by different research groups, although excluding lower scoring peaks
increased this trend (Fig. 2-2 vs. Fig. 2-3). MacArthur et al. (2009)74 reported a 91% overlap in
ChIP results using different antibodies for the same transcription factor. In comparison, the
average peak overlap for ChIP results of specific transcription factors from different laboratories
ranges from 52% for all peaks to 57% for highest scoring peaks (Fig. 2-2 and Fig. 2-3). By
contrast, overlap between different transcription factors ranged from 49 percent (high scoring
peaks) to 38 percent (all peaks). Variability between ChIP datasets, even for duplicate
measurements of the same factor, has been noted in previous studies, but overall trends indicate
measurable differences in measured genomic occupancy of these factors.
Although ChIP datasets provide partially inconsistent information about occupancy, in
combination they are an important guide to the regulatory potential of specific DNA elements. In
some cases, loci that are inactive can be readily distinguished from loci around active genes
based on ChIP occupancy alone. Sparse occupancy around weakly transcribed or inactive genes
(Figures 2-4 A, 2-4 B) contrasts with abundant binding of factors near loci adjacent to robustly
expressed genes (Figure 2-4 C, 2-4 D). However, around active genes, some bound regions may
not be associated with known enhancers. We found many instances of transcription factor
occupancy of tested, inactive regions in the Fly Enhancer Resource (e.g. circled region in 2-4 D).
These regions are difficult to visually distinguish from active enhancers.

46

Figure 2-4 ChIP peaks correlate with active elements

Low levels of binding by Dorsal, Twist and Zelda at A weakly expressed sgl locus and B inactive
kar locus. C Numerous bound regions associated with actively transcribed vnd and D brk loci.
However, some regions found to be inactive in Kvon et al. 2014 are bound by multiple
transcription factors (circled region near brk). Gray bars indicate portions of genome assessed by
Kvon et al. 2014. Blue bars indicate transcription units for genes.

47

Reproducibility between measurements of features was limited, but higher for stronger signals.
To determine if these reproducible strong signals were more likely to be associated with active
elements, we used the pairwise comparisons of features from Figure 2-2 to measure the
correlation in scores for overlapping features in both DNA elements that were strongly active and
inactive in stage 4-6 embryos. In the case of Zelda and Dorsal (transcription factors associated
with a large number of ChIP-peaks, Table 2-1) and histone marks associated with enhancer
status, occupancy scores for different datasets measuring the same transcription factors within a
given DNA element generally correlate with each other, as do datasets measuring different but
functionally related features, although the difference in this correlation between active and
inactive regions is minimal (Figure 2-5). The same trend is largely true for other transcription
factors (Fig. 2-6). This suggests that the intensity of peak scores are fairly reproducible between
datasets, but that functional binding is not more reproducible or consistent than background
binding. Motif scores generally do not correlate at all with occupancy scores for transcription
factors that overlap with the motifs, except in the case of Zelda (Figure 2-5, Figure 2-6). This
correlation seems to be much stronger in active DNA elements than in inactive elements (Figure
2-4 C and 2-4 D).
The trends shown in Figure 2-4 suggested that there may be qualitative differences between
binding or chromatin modification that could be used to distinguish them. Therefore, we plotted
the distribution of peak or motif scores for the same set of active or inactive DNA elements for
individual ChIP datasets and motifs that were used in Fig. 2-5 (Figure 2-7). For some ChIP
datasets, the higher-scoring peaks were associated with active DNA elements (e.g. Snail 2014
chip). This trend was not consistent for a given transcription factor, suggesting that different
thresholds or levels of background binding in these datasets may obscure relevant signals.

48

Figure 2-5 Correlation between features for active and inactive elements

A,B: Levels of Dorsal and Zelda were assessed on active and inactive regions. Zelda and Dorsal
peaks were correlated on both classes of elements, but little correlation is noted between protein
occupancy and the respective binding motifs. C,D: For chromatin marks and pioneer factor Zelda
protein and motifs, higher correlations were noted on active vs. inactive elements. Results from
392 highly active elements (left) and 6,858 stage 4-6 inactive elements (right). Comparisons for
additional factors shown in Figure 2-6.

49

Figure 2-6 Correlations of feature scores for overlapping or replicate features

Most functionally related features have scores that partially correlate. The correlation is not
noticeably weaker in inactive regions, which might indicate that the correlation is not due to the
functional relationship.

50

Figure 2-7 Differential ChIP peak or binding motif enrichment values

Histograms showing distributions of ChIP peak or binding motifs were plotted for DNA elements
active (blue) and inactive (red) in stage 4-6 embryos. DNA elements from both categories that
did not overlap with the signal were excluded from the comparison, and the histograms from
both categories were scaled to integrate to 1. In most cases, there were a larger number of
inactive regions bound by transcription factors than active regions, as there are 6,858 inactive
regions vs. 392 active regions (Table 2-2). Some datasets showed substantial differences in the
distributions between active and inactive elements e.g. Snail 2014 chip and H3K27ac 2015 seq.
No large differences in enrichment of specific regulatory motifs were observed between active or
inactive elements.

51

Table 2-2 Overlap of active and inactive reporters with features
Feature
All reporters
Dorsal 2015 ChIP-seq
Dorsal 2009 ChIP-chip
Snail 2014 ChIP-chip
Snail 2009 ChIP-chip
Twist 2014 ChIP-chip
Twist 2011 ChIP-seq
Twist 2009 ChIP-chip
Bicoid 2013 ChIP-seq
Bicoid 2009 ChIP-chip
Caudal 2010 ChIP-seq
Caudal 2009 ChIP-chip
Hunchback 2013 ChIP-seq
Hunchback 2009 ChIP-chip
Giant 2013 ChIP-seq
Giant 2009 ChIP-chip
Hairy 2009 ChIP-chip
Knirps 2010 ChIP-seq
Kruppel 2013 ChIP-seq
Zelda 2011 ChIP-seq
H3K27ac 2015 ChIP-seq
H3K27ac 2010 ChIP-seq
H3K4me1 2015 ChIP
p300 2010 ChIP-seq
Zelda motif sanger
Zelda motif solexa
Dorsal motif FlyReg
Dorsal motif NBT
Snail motif FlyReg
Snail motif sanger
Snail motif solexa
Twist motif da
Twist motif FlyReg

Overlap Active
392
118
326
279
168
257
240
320
118
138
250
175
192
201
190
169
210
111
204
292
98
123
167
168
165
148
249
149
138
119
156
134
178

52

Overlap Inactive
6858
222
2081
1083
413
1115
885
1528
253
130
780
275
777
424
661
193
525
96
842
940
335
537
421
291
1073
1208
3983
1959
2632
2013
2720
1952
3154

Figure 2-8 Similarity of distribution of feature scores for active vs. inactive elements

ChIP datasets show greater differences, while conservation scores are indistinguishable on active
and inactive elements. Datasets from within the same lab, using different antibodies for the same
transcription factor, were merged.

53

Transcriptional activity and stage specificity can be classified based on genomic features
Our analysis indicated that no single feature conclusively correlates with regulatory activity,
therefore, we asked if combinations of these features might distinguish active from inactive DNA
elements. To judge the feasibility of this approach, we implemented a Principal Components
Analysis (PCA) of the forty-one datasets. PCA of all features for all DNA elements (392 active
and 6858 inactive) showed some differences in their distribution. (Figure 2-9 A). The largest
contributors to PC1 (the greatest sources of variation in the dataset) are mostly occupancy
information for transcription factors associated with Anterior-Posterior patterning, like Caudal
and Hunchback (Data not shown). Interestingly, the largest contributors to PC2 are primarily the
degree of evolutionary conservation in each DNA element, based on various species
comparisons. Almost all variation in the dataset was captured in the first ten principal
components (Figure 2-9 B).
Although active and inactive regions overlap, the visible partial separation based on the first
principal component suggested it may be possible to classify activity using machine learning.
Linear discriminant analysis of all DNA elements showed a degree of separation, although there
was overlap in the distribution of where they fell on LD axis 1 (Figure 2-9 C, cross validation in
Figure 2-10). Next, we used a random forest (chosen for its amenability to biological
interpretation) to classify all active and inactive DNA elements based on the same 41 features
used for PCA and LDA. A random forest trained on two thirds of the active and inactive DNA
elements was able to classify the remaining one third with 96 percent accuracy (measured by
total percentage of true positives and true negatives) over 10 trials. Receiver Operator
Characteristic Curve (ROC) and Precision recall curves are shown in Figure 2-9 D,E.
Classifications are much more successful as measured by ROC than by Precision-recall, as

54

indicated by their respective measurements of 'area under curve' (AUC). Zelda and Bicoid
datasets were found to be the most important features (Data not shown).
Many of the inactive regions used in the PCA and classifications in Figure 2-9 C-E are relatively
sparse in terms of transcription factor occupancy, like the regions shown in Figures 2-4 A, B.
These unoccupied regions can be trivially distinguished from potential enhancers. A more
challenging question is how to separate active regions from inactive regions that superficially
resemble them, containing many of the same marks, as shown in Figures 2-4 D. Chromatin
signatures are frequently used as a proxy for active elements, therefore we selected loci that are
associated with H3K4 mono-methylation for training and testing, reducing the datasets to 167
highly active regions and 421 inactive regions. These are much less clearly separated by
Principal Components (Figure 2-9 F) and LDA (Fig. 2-9 G). Turning to random forests, the
AUC for ROC is nearly as high as before (compare Fig. 2-9 D, H), and Precision Recall is
dramatically improved (compare Fig. 2-9 E,I). However, the reduced dataset is much closer to
having a balanced number of samples for active and inactive elements, so the probability of
correct classification due to random chance was greatly increased. Neither ROC nor Precision
recall showed substantial improvements in AUC relative to random chance. Restricting the
analysis to loci that were bound by one or more transcription factors similarly improved overall
Precision Recall, but not relative to background (discussed in Fig. 2-11).

55

Figure 2-9 Active DNA elements can be separated from inactive regions

A: DNA elements with and without regulatory activity can be partially visually distinguished
based on Principal Component 1; additional principal components showed little separation.
There is a sharp boundary on the PC1 axis; this is probably due to PC1 variation primarily
coming from ChIP occupancy scores, and the majority of DNA elements having no occupancy,
and thus share an identical 'low score' numerical stand-in. The separation is far more visible
using supervised machine learning. B: Scree plot for PCA in (A: black) and (F: blue). C:
Separation between active and inactive regions with linear discriminant analysis, using all DNA
elements (cross validation in Figure 2-10). D: Random forest trained on 2/3 of both unrestricted
and restricted datasets (using all 41 features) was able to achieve 96% accuracy of total calls, and
a high ratio of true positives relative to false positives, as shown in ROC plots (yellow), and a

56

difference of 0.37 in the area under the curve relative to classifications made on random data
with the same sample size (whose ROC are shown in gray). However, precision recall (E, green
plots) was much lower, although even higher relative to background (gray plots), with a relative
AUC difference of 0.46. F: Using only active and inactive DNA elements that are distinguished
by H3K4 mono-methylation reduces the sample size from 392 highly active regions and 6858
inactive regions, to 167 highly active and 421 inactive regions. Separation was substantially less
visible when DNA elements were restricted to only include those marked by H3K4me1. This
also lowered the degree of visible separation based on LDA (G) and decreased overall accuracy
to 83% and area under the ROC curve (H) It led to substantially higher precision recall (I).
However, it led to equal increases in the AUC of predictions made on random data of the same
sample size, shown in gray.

57

Figure 2-10 LDA Cross validation

Classifications based on Linear Discriminant Analysis cross-validation, corresponding to
analysis in Figure 2-9. Dark blue indicates active elements, light blue indicate active elements,
columns indicate how these elements were classified. LDA was performed on training sets that
included all active and inactive DNA elements from stage 4-6 in the Fly Enhancer resource, and
also only those that were marked by H3K4me1. The numbers are provided that correspond to the
True Positive Rate (TPR), False Positive Rate (FPR), and False Discovery Rate (FDR) for both
classifications.

58

Impact of number of features on predictive accuracy
To determine if a specific combination of features would be the most effective for predictions
(e.g. only using the most predictive features, or excluding the least predictive features), we reperformed random forest classification using the same training data, with different combinations
of the 41 predictors. We iteratively left out individual features (Figure 2-12). After each run of
ten random forests, the features were ranked using Gini Impurity Index. The next run then
excluded either the most or least important feature based on the previous run, and performance
was compared between each. We found that dropping the most important feature, or even top
three features had little impact on precision recall; loss of additional features had some negative
impact on classifications. After dropping the eighteen most important features (largely ChIP
data), precision recall and ROC curves are depressed. However, dropping the least important
features had no noticeable impact, until over thirty features are dropped. Analyses using only the
top twelve most useful features were equivalent to employing all features. These results suggest
that the features must contain a significant amount of redundancy, but the redundancy is not
interfering with accurate predictions.

59

Figure 2-11 LDA Exclusion of unbound genomic regions and use of a balanced training set

Accuracy/Precision scores for random forest models trained on different subsets of the Fly
Enhancer Resource DNA elements. Top: Area under precision recall curves. Middle: Area under
ROC curves. Bottom: Accuracy, calculated as total percentage of correct calls. Unbalanced
datasets are indicated in black, blue indicates models trained on datasets balanced through
undersampling the majority category (inactive reporters). "Random" represents classifications
with elements for which all feature scores were randomly generated, with values 0-1000; for
other marks indicated, DNA elements were selected that contained that mark, or those two
marks, and at least one other.

60

Figure 2-12 Random forest performance with least and most informative features

Predictive success of models with differing numbers of features, using unbalanced (black)
datasets using all active and inactive elements, and balanced datasets (blue), in which 200 active
and inactive elements are used for training, and 100 for testing. Predictions were made by
iteratively dropping the most informative feature to least informative feature (+), or least
informative to most informative (-). The scores running from left to right represent predictions
made using 41 features, ranging to predictions made using only two features. The furthermost
right scores are predictions made based on 41 randomly generated features. Note that balanced
datasets have substantially higher 'background', or scores based on random datasets of the same
size (Fig. 2-9)

61

Classifications by stage specificity and expression pattern
Some studies have shown that genes that are active throughout development have features that
make them distinct from genes and regulatory regions that are tissue or stage specific75, or that
separate poised enhancers from those that are actively expressed76. However, a comparison of
DNA elements that are active in all stages with those that are specific to early embryonic
development do not suggest that they can be as easily distinguished as active and inactive
elements (Figure 2-14 A,B). A possible mitigating factor is that regulatory regions often gain
binding of transcription factors and chromatin marks in a successive fashion during
development, prior to becoming active. Thus, some of the elements considered above as
"inactive" may actually be false negatives. Here, some DNA elements are exclusively active in
embryonic stages 4-8, whereas are not active in this earlier period, but become active in stages 912 or 13-16 (Figure 2-13). We tested if the 41 features would distinguish such elements that
become active only in later stages based on their features measured at an earlier time point.
Compared to the results with active vs. constitutively inactive, PCA and LDA analysis did not
show much separation (Figures 2-14 B,D). Consistent with these results, random forest
classification was only slightly more successful than would be expected by chance (data not
shown). The low accuracy may stem in part from the relatively smaller training set, however, it is
also possible that the considered genomic features are simply not informative about early vs. late
active DNA elements. Moreover, most of the features used for this classification were from
datasets collected at the early embryonic stage of development; additional information about
chromatin conformation or later transcription factor occupancy would likely improve predictions
for later enhancers.

62

Figure 2-13 Distinct classes of elements tested for function in embryos

To assess predictions of different classes of active elements, DNA elements (Kvon et al. 2014)
were assigned to one of six classifications. "Mixed" refers to elements that exhibit activity in two
of the three indicated stages.

63

Figure 2-14 Principal component analysis, Random Forest classification, of subsets

Principal components analysis shows very little separation based on the first two principal
components of DNA elements that are active or inactive at specific developmental times. (A)
This is true when they are separated based on continuous activation (identified as 'always' in Fig
2-13) vs. stage-specific activation (identified as 'Early' in Fig 2-13), or (B) early stage specific
activity from activity specific to later stages (later stages combining groups 'Middle' and 'Late'
from Fig 2-13). Linear discriminant analysis also shows very little separation along LD axis 1 for
either comparison (C-D)

64

Predicting expression patterns is a more difficult problem than predicting activity. Although
Kvon et al. 2014 annotated expression patterns for all DNA elements at each stage of embryonic
development, their annotations were in many cases non-exclusive, permitting a DNA element
with distinct A/P stripe activity to be annotated "mesoderm" because of its partial expression in
that region. We re-annotated expression patterns based on the images available at the Fly
Enhancer Resource, grouping patterns into exclusive and non-overlapping categories. We
categorized 114 elements as "anterior", and 78 as "central or posterior", and 22 with
"mesodermal or neurogenic ectoderm" expression patterns. Interestingly, a substantial fraction of
Anterior/Posterior and Dorsal/Ventral enhancers were ubiquitously active throughout embryonic
development. Occupancy of certain transcription factors associated with Anterior vs. Central or
Posterior expression patterns correlated well with the corresponding categories of DNA
elements: Bicoid peak scores were higher on Anterior elements, and Caudal peak scores were
higher on Central/Posterior elements (Figure 2-15 A and 2-15 B). Considering all 41 features,
however, PCA and LDA did not show much separation between these two expression patterns
(Figure 2-15 C, D). A random forest trained on all features performed better than predictions
using randomized data, but not as well as previously observed for differentiation of active vs.
inactive DNA elements (Fig. 2-15). Excluding Bicoid and Caudal related data had a dramatic
effect on predictions. Small training sets such as those used here are generally not compatible
with large numbers of predictors, therefore, we tested dimensionally reduced datasets, to see if
this would improve precision. Reducing the number of predictors using principal components did
not improve performance (Figure 2-15). Because Bicoid and Caudal play such key roles in these
predictions, Anterior vs. Central/Posterior expression may be predicted well if a larger data set of
specifically Anterior/Central-Posterior elements were available for training.

65

Figure 2-15 Features correlate with tissue specific expression patterns

Although some features clearly correlate with Anterior vs. Central/Posterior expression patterns
(A,B), and these categories are largely indistinguishable when viewed by PCA (C) and show
only slight separation based on linear discriminant analysis D: Random forests can classify these
active elements by expression pattern with greater success than when using random data, but not
with sufficiently high precision or accuracy to make useful predictions. Reducing the numbers of
features does little to improve accuracy of classifications (E).

66

Figure 2-16 Recall relative to false discovery rate when identifying enhancers.

Clusters of transcription factors around Zelda binding OR around H3K4me1 marks, that
involved two or more overlapping peaks, were used to define putative enhancers. These were
defined as 500, 1000, 1500, or 2000 base pairs centered around the clustered binding. Random
Forests were trained on subsets of the Fly Enhancer Resource, using all strongly active and
inactive DNA elements, or only active and inactive elements that overlap with Bicoid ChIP-seq
peaks, or only active and inactive elements that overlap with Twist ChIP-chip peaks (two
conditions that performed very well on the Fly Enhancer Resource in Fig 2-11). Random forests
were also trained on randomized DNA elements, where no features distinguish active and
inactive elements.

67

Testing against validated enhancers
Enhancer classification described above depends on the extensive data from the Fly Enhancer
Resource, which randomly samples ~15% of the entire genome, and has limited information
about which genes the putative enhancers regulate. Several intensively studied developmental
genes in stage 4-6 embryos have more detailed information about which regions are important
for expression, as found in the RedFly database72. To use these loci as independent guides for
enhancer classification efforts, we tested eight different models derived from random forest
analysis of balanced or unbalanced data. These models were used to predict active DNA
elements within a 50kb window encompassing the eight target genes (expressed in
Anterior/Posterior, Mesoderm, and Neuroectoderm patterns). Our assumption is that previously
annotated regulatory regions around these genes are effective measures of true-positives, and that
other regions lack enhancer activity. Thus, any additional enhancers identified within this
window that do not overlap with curated enhancers are likely to be false positives.
To ensure that our random forest models evaluate all relevant regions around these genes, we
needed to include additional DNA elements, as the Fly Enhancer Resource does not cover the
entire genome. Therefore, we identified all putative enhancers by presence of bound transcription
factors. In Figure 2-16, putative enhancers are defined as any locus that is binding of Zelda, as
well as one or more additional transcription factors, or alternatively any locus binding H3K4me1
as well as one or more transcription factors. The 'borders' of each putative enhancer were set to
2000 base pairs (approximately the same as the window size used in the Fly Enhancer
Resources). There are 23 curated enhancers for these genes total. However, there are an
additional 31 clusters of enhancer-like binding around the protein Zelda. Almost every model
trained on the Fly Enhancer Resource had high recall, and was able to identify the majority of the

68

23 'true positives' (2-16 A). However, many conditions led to an almost equal number of 'false
positives', or a high false discovery rate (FDR) (2-16 B). Random forests trained on balanced
datasets had high false discovery rates, identifying many presumably non-functional clusters as
active enhancers. Models trained on unbalanced data were far more successful at rejecting the
non-functional binding, although the balanced datasets did performed better at training and
testing within the Fly Enhancer Resource. Unbalanced datasets may more accurately reflect
natural conditions; the frequency of enhancer-like binding in stage 4-6 embryos appears to be far
more common than actual enhancers. We performed the same test again, setting the putative
enhancers to window sizes of 500, 1000 and 1500 base pairs, and the same trends were observed.
There are exactly 23 clusters of binding in 2kb regions around H3K4me1 marks, which do not
encompass every known enhancer. As a result, both Recall (percentage of true positives
identified) and FDR were much lower.

Discussion
Many genomic databases include information about regulatory regions. However, many
primarily collate 'true positives' that show activity, or simply display chromatin features that are
positively correlated with activity. From these resources, it is difficult to know how many real
regulatory are missed, or the frequency of false positives. The Fly Enhancer Resource is the first
publicly available database that includes a large set of reporters that are inactive at various
developmental stages. We used features associated with enhancer activity and cross-listed them
with the coordinates defining active and inactive regions in the Fly Enhancer Resource. From
this we were able to draw conclusions about which features are the most indicative of enhancer
activity, and how combinations influence classification.

69

Features correlating with enhancer activity are not guarantees of enhancer function
Many features have been associated with enhancer activity, including sequence conservation,
transcription factor occupancy, chromatin marks, and motif enrichment. Sequence conservation
has been observed in some Drosophila regulatory elements, but using it as a predictor is
complicated by the generally rapid rate of divergence in regulatory sequences50, and the compact
and largely conserved nature of the Drosophila genome49. Although ChIP-seq data correlates
with specific enhancer functions, it is challenging to use it as a predictor without a priori
knowledge of the genes being controlled. For example, previous studies have observed that the
transcription factors responsible for Dorsal-Ventral patterning (Dorsal, Snail and Twist) are also
observed at DNA elements for genes controlling Anterior-Posterior patterning40, although not
generally considered canonical players in A-P development77.Some features (e.g. Zelda ChIPseq) have many low scores that overlap with inactive regions, probably indicating a
bioinformatics threshold (where scores below a certain minimum were excluded as probable
background binding, and the reported peaks only represent a fraction of the original distribution).
It is possible that using an even higher threshold for ChIP peak scores in these cases could
remove a large amount of potentially spurious features. However, this would almost certainly
leave out many true positives. Motif enrichment is largely indistinguishable between active and
inactive DNA elements. Consistently across all classes of features, there are strong signals that
overlap with regions that do not drive gene expression, possibly due to regions like those
observed in Figure 4-C and 4-D, which share many of the characteristics of active DNA
elements. However, most genomic regions that did not drive expression overlapped with far
fewer features associated with enhancers (data not shown). This likely reflects regions like those
shown in Figures 4-A and 4-B, where inactive or weakly active genes are in loci with little

70

protein occupancy.
We also found that datasets measuring the same feature in some cases were highly variable. This
is not surprising; differences in resolution between ChIP-chip and ChIP-seq are well
documented78, as are the many factors that can lead to higher reproducibility within labs than
between labs79. However, not all results were inconsistent. In three out of four datasets
measuring Twist binding, the transcriptional activator visibly correlates with all transcription
factors, suggesting that Twist can play a role in diverse transcriptional networks, or that it binds
easily to all regions of open chromatin (Figure 2-2). It is plausible that further increasing the
number of features used would improve results; overlapping features would be more likely to be
non-spurious, and consequently to have high peak scores (in the case of ChIP data) or motif
scores (in the case of transcription factor PWMs), and greater reproducibility. However, the
degree to which classification accuracy plateaued suggests inclusion of more data may have
marginal returns.
Zelda a strong but not definitive indicator of early enhancer activity
Zelda occupancy is by far the most informative feature (Data not shown), which may indicate
that Zelda, more than other factors, plays a determinate role in establishing active biological
enhancers during early embryonic development. Interestingly, conservation scores, especially in
the more distantly related species, are among of the more important features for this
classification indicated by measures of feature importance from random forests using all
available data (data not shown). This is surprising given the lack of visible correlation in Figure
2-2. It is possible that the importance of conservation information measured by Gini Impurity
Index is inflated by the high level of variation contributed by this data, as shown in PCA. It is
also possible that other features are providing redundant or correlated information, elevating the

71

importance scores for conservation. Leave-one-out analysis (Figure 2-12) suggest although some
features are substantially more informative than others, no one feature is essential for predicting
active DNA elements, and many combinations of features can produce equally accurate
predictions of activity. Even classifications made using the least informative features still
outperform classifications based on random data. The largest drops in classification accuracy
were after dropping most features based on ChIP-seq and ChIP-chip of transcription factor
proteins. ChIP-seq of histone modifications was less informative than transcription factor
occupancy, but still more valuable for classifications than motif information or evolutionary
conservation.
Potential limitations and stage specificity
A possible complication for predictive accuracy is that many regulatory regions assemble much
of the transcriptional machinery well before the developmental time in which they are active21.
We used a random forest to classify DNA elements as stage-specific or ubiquitous, using the
same 41 features from previous classifications. Separating stage-specific vs. ubiquitously active
DNA elements has substantially lower precision relative to recall, which is unsurprising given
the smaller size of the dataset (only 82 of the DNA elements from the Fly Enhancer Resource are
exclusively specific to early development), that both stage specific and ubiquitous DNA elements
are active in stage 4-6, and many of the predictive features are specific to that time point.
Interestingly, most tissue specific specific expression patterns (e.g. DV stripes at the Mesoderm,
or AP pair-rule type stripes) were not associated with either category-- approximately half of
these DNA elements were stage specific, and the other half were active driving other expression
patterns later in development (data not shown). Inclusion of predictive features that are taken
from later developmental time-points may allow more effective pinpointing of which (if any)

72

features can be used to identify enhancers that are active throughout development.
Characteristics and Classification of DNA elements with Varying Expression Patterns
Predicting expression patterns is a more difficult problem than predicting activity. We used the
Fly Enhancer Resource for this as well. Although Kvon et al. 2014 annotated expression patterns
for all elements at each stage of embryonic development, their annotations were in many cases
overlapping-- any expression in developing germ layers, or in the anterior or posterior halves of
the embryo, were associated with dorsal-ventral and anterior-posterior patterns. We manually reannotated expression patterns based on the photos of embryos available at the Fly Enhancer
Resource, to group patterns into exclusive and non-overlapping categories. A majority of the
expression patterns are not easily categorized; they either have multiple expression patterns
evident, high variability between different embryos imaged, or very amorphous or faint
distribution of protein. We categorized 114 elements as driving expression only in the Anterior
part of the embryo, and 78 that were exclusive to the Central or Posterior ends of the embryo. We
identified 22 regions that displayed distinct Mesodermal or Neuroectoderm expression patterns,
as characterized by those for the Dorsal-Ventral patterning genes snail, brinker, and similar. This
expression pattern is strongly associated with a well understood biological function, and may
correspond to a unique combination of features. However, the very small sample size did not
make it ideal for machine learning. for comparison. Interestingly, neither Anterior/Posterior nor
Dorsal/Ventral DNA elements displayed a tendency toward ubiquitous activity throughout
development or toward stage specificity. Approximately half of each category was ubiquitous-elements that drive expression at every stage of embryonic development, while an equal number
showed some level of stage specificity.
Transcription factors known to be associated with Anterior vs. Central or Posterior expression

73

patterns did show expected correlations (e.g. Figure 2-15 A,B-- Bicoid peak scores were higher
DNA elements driving Anterior activity, and Caudal peak scores were higher on
Central/Posterior elements), although PCA based on all 41 features did not show separation
between these two expression patterns (Figure 2-15 C). A random forest trained on all features
was able to perform better than predictions based on randomized data, but not as well as
classifications of general activity. Although random forest algorithms are generally robust to colinearity between features these predictions were trained on a much smaller sample size, and
small training sets are not as compatible with large numbers of predictors. We created secondary
datasets wherein the first ten or first twenty-five Principal Components were used instead of
scores for individual features. Reducing the number of predictors using principal components did
not hurt the accuracy of classification, but also did not engender substantial improvement (Figure
2-15 E). However, dropping individual features known to be important for defining Anterior
expression patterns (Bicoid and Caudal binding) brought ROC and Precision Recall closer to
what would be expected based on random chance (10e). Given the elevated importance of
Bicoid and Caudal for these classifications, it seems likely that anterior expression could be
predicted with reasonably high accuracy given a larger data set of specifically Anterior/CentralPosterior DNA elements for training.
Identification of enhancers based on genomic features
For testing on experimentally identified enhancers from other sources than the Fly Enhancer
Resource, we assembled a list of putative enhancers by clustering of transcription factors, and
classified them with random forests trained on the Fly Enhancer Resource. Random forests were
trained on all active DNA elements from the Fly Enhancer Resource (as shown in Figure 2-9),
and also on all regions that had been balanced to equal numbers of active and inactive elements

74

through under-sampling. We used several different criteria for identifying clusters to use as
putative enhancers: Zelda binding that overlapped with at least one other transcription factor or
chromatin mark, or H3K4 mono-methlyation that overlapped with at least one transcription
factor or chromatin mark. There are far more Zelda ChIP-seq peaks than H3K4me1 peaks (Table
2-1), so the former criterion identified far more putative enhancers. Surprisingly, using either
criteria, random forests trained on a balanced dataset identified 100 percent of clusters as active,
which in some cases led to a very high false discovery rate (Figure 2-16). As the primary goal is
to identify active enhancers, low precision and high false-discovery was a significant concern. A
potential factor contributing to low precision of classification is the large fraction of inactive
regions in stage 4-6 embryos with very little transcription factor occupancy. If enhancers are
predicted based on the presence of clustered transcription factors, these unbound regions are less
useful for distinguishing between cases where background binding is also clustered, and
resembles truly active regions. Balancing by random under-sampling runs the risk of excluding
most if not all of the ambiguous, occupied-but-inactive DNA elements. We used several other
forms of under-sampling, only training on DNA elements that were bound by Bicoid, or bound
by Twist (two of the more successful approaches in Fig. 2-11). When H3K4me1 marks were used
as the standard for identifying putative enhancers, recall was capped at the number of H3K4me1
marks that overlapped with curated enhancers (about 60%), whereas when the more
promiscuously binding Zelda was used, recall could reach over 90% (Figure 2-16 A).
Correspondingly, false discovery rate was much lower across the board when H3K4me1
modification was used as the minimum standard for putative enhancers (2-16 B). The most
overall successful combinations (high recall, low FDR) seemed to be those that combined a
permissive setting with a restrictive setting, with regards to defining putative enhancers and what

75

data was used for training.
A potential source of inaccuracy with regards to classifying clustered binding as active or
inactive is that the Fly Enhancer Resource is used as the sole source of training data. Although it
covers a substantial portion of the genome, the included reporters may bisect biological
enhancers, and not represent in-vivo expression patterns due to loss of essential activators or
repressors. The reporters also cannot account for position specific effects, or the potential
influence of promoter specificity. This necessarily means that some of the reporters will be
incorrectly classified as active or inactive for specific expression patterns, and introduces some
minimal levels of error. When predicting enhancers around well-studied genes, it is possible that
some clusters of transcription factor binding that strongly resemble enhancers might be
regulating other genes. The assumption that all putative enhancers within a window around an
active gene are acting on that gene introduces another source of potential mis-classification.
It is also possible that elements currently classified as false positives are actually regulatory
elements that are active at an earlier or a later time in development. However, based on these
features there is no combination that appears to separate DNA elements active later in
development from ubiquitously inactive regions. Although some regions that are classified as
false positives may be active at later stages, many of the regions active later in development do
not share these characteristics. Presumably there are additional features that are not currently
known that allow biological systems to make this distinction.

Conclusions
Genomic features can be used to predict overall enhancer activity with a high degree of accuracy.
The goals of enhancer prediction usually entail identifying as many true positives as possible,
while also minimizing false-positives. Using a conservative approach when establishing the

76

training set is key to high-precision results. A wide range of features can also lead to the best
accuracy, but ChIP-seq and ChIP-chip data for a range of transcription factors and chromatin
modifications seem to be far and away the most essential information. However, no single
feature is a guarantee of enhancer function, and datasets measuring the same feature frequently
have limited reproducibility. This has been previously observed with genome-wide studies, but it
is not clear how this influences inference of function. Moreover, many features associated with a
specific function (e.g. Bicoid occupancy with Anterior-Posterior patterning) can be found
binding at genes associated with other regulatory networks. This may indicate functional crossbinding and integration of regulatory networks, or it could suggest promiscuous binding at
regions of open chromatin. Zelda, the pioneer transcription factor, is consistently the single most
important predictor of enhancer activity. However, Zelda binding is not necessary for accurate
predictions. Excluding Zelda has minimal effect on overall prediction accuracy; overall accuracy
appears to asymptotically level off after either a small number of highly informative features are
used, or a large number of less informative features. It seems unlikely that inclusion of further
ChIP-chip and ChIP-seq data would lead to meaningful improvements over the current level of
performance. It also is not clear to what extent all ChIP data is serving as a proxy for DNA
accessibility in these predictions, as opposed to each protein providing unique information.
Although in this study motif enrichment and evolutionary conservation were not the most
informative features, it is possible that more in-depth analysis in these directions, or other
features that are equally orthogonal to Chromatin Immunoprecipitation data, would be necessary
to reach higher levels of precision. High accuracy at predicting activity also did not translate to
equally accurate identification of expression patterns. The majority of the expression patterns
driven by DNA elements from the Fly Enhancer Resource were not easily categorized into clear-

77

cut categories; the paradigm of enhancers being specific to a single gene regulatory network, and
therefore to a distinct spatial expression pattern, may be biologically inaccurate.

78

REFERENCES

79

REFERENCES

1.

Small S, Arnosti DN, Levine M. Spacing ensures autonomous expression of different
stripe enhancers in the even-skipped promoter. Development. 1993;119(3):762-72.

2.

Yao L-C, Phin S, Cho J, Rushlow C, Arora K, Warrior R. Multiple modular promoter
elements drive graded brinker expression in response to the Dpp morphogen gradient.
Development. 2008;135(12):2183-2192.

3.

Pennacchio LA, Bickmore W, Dean A, Nobrega MA, Bejerano G. Enhancers: five
essential questions. Nat Rev Genet. 2013;14(4):288-295.

4.

Carroll SB. Evo-Devo and an Expanding Evolutionary Synthesis: A Genetic Theory of
Morphological Evolution. Cell. 2008;134(1):25-36.

5.

Wittkopp PJ, Kalay G. Cis-regulatory elements: molecular mechanisms and evolutionary
processes underlying divergence. Nat Rev Genet. 2012;13(1):59-69.

6.

Shlyueva D, Stampfel G, Stark A. Transcriptional enhancers: from properties to genomewide predictions. Nat Rev Genet. 2014;15(4):272-86.

7.

Emilsson V, Thorleifsson G, Zhang B, et al. Genetics of gene expression and its effect on
disease. Nature. 2008;452(7186):423-428.

8.

Epstein DJ. Cis-regulatory mutations in human disease. Briefings Funct Genomics
Proteomics. 2009;8(4):310-316.

9.

Corradin O, Saiakhova A, Akhtar-Zaidi B, et al. Combinatorial effects of multiple
enhancer variants in linkage disequilibrium dictate levels of gene expression to confer
susceptibility to common traits. Genome Res . 2014;24(1):1-13.

10.

Halfon MS, Grad Y, Church GM, Michelson AM. Computation-Based Discovery of
Related Transcriptional Regulatory Modules and Motifs Using an Experimentally
Validated Combinatorial Model. Genome Res . 2002;12(7):1019-1028.

11.

Stathopoulos A, Drenth M, Erives A, Markstein M, Levine M. Whole-genome analysis of
dorsal-ventral patterning in the Drosophila embryo. Cell. 2002;111.

12.

Narlikar L, Ovcharenko I. Identifying regulatory elements in eukaryotic genomes.
Briefings Funct Genomics Proteomics. 2009;8(4):215-230.

13.

Bailey TL, Elkan C. Unsupervised learning of multiple motifs in biopolymers using
expectation maximization. Mach Learn. 1995;21(1):51-80.

14.

Stormo GD. DNA binding sites: representation and discovery . Bioinformatics.
2000;16(1):16-23.

80

15.

Sinha S, Blanchette M, Tompa M. PhyME: A probabilistic algorithm for finding motifs in
sets of orthologous sequences. BMC Bioinformatics. 2004;5(1):170.

16.

Elemento O, Slonim N, Tavazoie S. A Universal Framework for Regulatory Element
Discovery across All Genomes and Data Types. Mol Cell. 2017;28(2):337-350.

17.

Junion G, Spivakov M, Girardot C, et al. A Transcription Factor Collective Defines
Cardiac Cell Fate and Reflects Lineage History. Cell. 2017;148(3):473-486.

18.

Fisher WW, Li JJ, Hammonds AS, et al. DNA regions bound at low occupancy by
transcription factors do not drive patterned reporter gene expression in Drosophila. Proc
Natl Acad Sci . 2012;109(52):21330-21335.

19.

Graur D, Zheng Y, Price N, Azevedo RBR, Zufall RA, Elhaik E. On the Immortality of
Television Sets: “Function” in the Human Genome According to the Evolution-Free
Gospel of ENCODE. Genome Biol Evol. 2013;5(3):578-590.

20.

Spivakov M. Spurious transcription factor binding: Non-functional or genetically
redundant? BioEssays. 2014;36(8):798-806.

21.

Ghavi-Helm Y, Klein FA, Pakozdi T, et al. Enhancer loops appear stable during
development and are associated with paused polymerase. Nature. 2014;512(7512):96-100.

22.

Kleftogiannis D, Kalnis P, Bajic VB. Progress and challenges in bioinformatics
approaches for enhancer identification. Brief Bioinform. 2016;17(6):967-979.

23.

Zinzen RP, Girardot C, Gagneur J, Braun M, Furlong EEM. Combinatorial binding
predicts spatio-temporal cis-regulatory activity. Nature. 2009;462(7269):65-70.

24.

Wilczynski B, Liu YH., Yeo ZX, Furlong EEM. Predicting Spatial and Temporal Gene
Expression Using an Integrative Model of Transcription Factor Occupancy and Chromatin
State. PLoS Comput Biol. 2012;8:12.

25.

Cannavo E, Khoueiry P, Garfield DA, Geeleher P, Zichner T, et al. Shadow enhancers are
pervasive features of developmental regulatory networks. Curr Biol. 2016;26:1-14.

26.

Dresch JM, Arnosti DN. The Wisdom of Crowds: Can Mathematical Models Crack the cis
Regulatory Code? Cell Syst. 2017;1(6):379-380.

27.

Sayal R, Dresch JM, Pushel I, Taylor BR, Arnosti DN. Quantitative perturbation-based
analysis of gene expression predicts enhancer activity in early Drosophila embryo. Barkai
N, ed. Elife. 2016;5:e08445.

28.

Frankel N, Davis GK, Vargas D, Wang S, Payre F, Stern DL. Phenotypic robustness
conferred by apparently redundant transcriptional enhancers. Nature.
2010;466(7305):490-493.

81

29.

Montavon T, Soshnikova N, Mascrez B, et al. A Regulatory Archipelago Controls Hox
Genes Transcription in Digits. Cell. 2017;147(5):1132-1145.

30.

Claussnitzer M, Dankel SN, Kim K-H, et al. FTO Obesity Variant Circuitry and
Adipocyte Browning in Humans. N Engl J Med. 2015;373(10):895-907.

31.

Perry MW, Boettiger AN, Bothma JP, Levine M. Shadow Enhancers Foster Robustness of
Drosophila Gastrulation. Curr Biol. 2010;20(17):1562-1567.

32.

Sanyal A, Lajoie BR, Jain G, Dekker J. The long-range interaction landscape of gene
promoters. Nature. 2012;489(7414):109-113.

33.

Shen Y, Yue F, McCleary DF, et al. A map of the cis-regulatory sequences in the mouse
genome. Nature. 2012;488(7409):116-120.

34.

Whitaker JW, Nguyen TT, Zhu Y, Wildberg A, Wang W. Computational schemes for the
prediction and annotation of enhancers from epigenomic assays. Methods. 2015;72(C):8694.

35.

Whalen S, Truty RM, Pollard KS. Enhancer-promoter interactions are encoded by
complex genomic signatures on looping chromatin. Nat Genet. 2016;48(5):488-496.

36.

Ernst J, Kellis M. ChromHMM: automating chromatin-state discovery and
characterization. Nat Methods. 2012;9(3):215-6.

37.

Williamson I, Berlivet S, Eskeland R, et al. Spatial genome organization: contrasting
views from chromosome conformation capture and fluorescence in situ hybridization.
Genes Dev . 2014;28(24):2778-2791.

38.

Mora A, Sandve GK, Gabrielsen OS, Eskeland R. In the loop: promoter–enhancer
interactions and bioinformatics. Brief Bioinform. 2016;17(6):980-995.

39.

Markstein M, Zinzen R, Markstein P, Yee KP, Erives A, Stathopoulos A. A regulatory
code for neurogenic gene expression in the Drosophila embryo. Development. 2004;131.

40.

Zeitlinger J, Zinzen RP, Stark A, et al. Whole-genome ChIP-chip analysis of Dorsal,
Twist, and Snail suggests integration of diverse patterning processes in the Drosophila
embryo. Genes Dev. 2007;21(4):385-390.

41.

Fu W, Duan H, Frei E, Noll M. shaven and sparkling are mutations in separate enhancers
of the Drosophila Pax2 homolog. Development. 1998;125(15):2943-50.

42.

Wang C, Zhang MQ, Zhang Z. Computational Identification of Active Enhancers in
Model Organisms. Genomics Proteomics Bioinformatics. 2013;11(3):142-150.

43.

Arnold CD, Gerlach D, Stelzer C, Boryn LM, Rath M, Stark A. Genome-Wide
Quantitative Enhancer Activity Maps Identified by STARR-seq. Science (80- ).
2013;339(6123):1074-1077.

82

44.

Arnold CD, Zabidi MA, Pagani M, et al. Genome-wide assessment of sequence-intrinsic
enhancer responsiveness at single-base-pair resolution. Nat Biotech. 2017;35(2):136-144.

45.

Zabidi MA, Arnold CD, Schernhuber K, et al. Enhancer-core-promoter specificity
separates developmental and housekeeping gene regulation. Nature. 2015;518(7540):556559.

46.

Emberly E, Rajewsky N, Siggia ED. Conservation of regulatory elements between two
species of Drosophila. BMC Bioinformatics. 2003;4(1):57.

47.

Glazov EA, Pheasant M, McGraw EA, Bejerano G, Mattick JS. Ultraconserved elements
in insect genomes: A highly conserved intronic sequence implicated in the control of
homothorax mRNA splicing. Genome Res . 2005;15(6):800-808.

48.

Marcovitz A, Jia R, Bejerano G. “Reverse Genomics” Predicts Function of Human
Conserved Noncoding Elements. Mol Biol Evol. 2016;33(5):1358-1369.

49.

Halligan DL, Keightley PD. Ubiquitous selective constraints in the Drosophila genome
revealed by a genome-wide interspecies comparison. Genome Res. 2006;16(7):875-884.

50.

Hare EE, Peterson BK, Iyer VN, Meier R, Eisen MB. Sepsid even-skipped enhancers are
functionally conserved in Drosophila despite lack of sequence conservation. PLoS Genet.
2008;4(6):6.

51.

Zhang B, Pan X, Cannon CH, Cobb GP, Anderson TA. Conservation and divergence of
plant microRNA genes. Plant J. 2006;46(2):243-259.

52.

Ahituv N, Zhu Y, Visel A, et al. Deletion of Ultraconserved Elements Yields Viable Mice.
PLOS Biol. 2007;5(9):e234.

53.

The modENCODE Consortium, Roy S, Ernst J, et al. Identification of Functional
Elements and Regulatory Circuits by Drosophila modENCODE. Science (80- ).
2010;330(6012):1787 LP-1797.

54.

Lieberman-Aiden E, van Berkum NL, Williams L, et al. Comprehensive Mapping of
Long-Range Interactions Reveals Folding Principles of the Human Genome. Science (80). 2009;326(5950):289 LP-293.

55.

Li G, Fullwood MJ, Xu H, et al. ChIA-PET tool for comprehensive chromatin interaction
analysis with paired-end tag sequencing. Genome Biol. 2010;11(2):R22.

56.

Erwin GD, Oksenberg N, Truty RM, et al. Integrating Diverse Datasets Improves
Developmental Enhancer Prediction. PLoS Comput Biol. 2014;10(6):6.

57.

Capra JA. Extrapolating histone marks across developmental stages, tissues, and species:
an enhancer prediction case study. BMC Genomics. 2015;16(1):104.

83

58.

Li Y, Shi W, Wasserman WW. Genome-Wide Prediction of cis-Regulatory Regions Using
Supervised Deep Learning Methods. bioRxiv. February 2016.

59.

He B, Chen C, Teng L, Tan K. Global view of enhancer–promoter interactome in human
cells. Proc Natl Acad Sci . 2014;111(21):E2191-E2199.

60.

Kleftogiannis D, Kalnis P, Bajic VB. DEEP: a general computational framework for
predicting enhancers. Nucleic Acids Res. 2015;43(1):e6-e6.

61.

Griffon A, Barbier Q, Dalino J, van Helden J, Spicuglia S, Ballester B. Integrative
analysis of public ChIP-seq experiments reveals a complex multi-cell regulatory
landscape. Nucleic Acids Res. 2015;43(4):e27-e27.

62.

van Duijvenboden K, de Boer BA, Capon N, Ruijter JM, Christoffels VM. EMERGE: a
flexible modelling framework to predict genomic regulatory elements from genomic
signatures. Nucleic Acids Res. 2016;44(5):e42-e42.

63.

Gurdziel K, Vogt KR, Schneider G, Richards N, Gumucio DL. Computational prediction
and experimental validation of novel Hedgehog-responsive enhancers linked to genes of
the Hedgehog pathway. BMC Dev Biol. 2016;16(1):4.

64.

Fletez-Brant C, Lee D, McCallion AS, Beer MA. kmer-SVM: a web server for identifying
predictive regulatory sequence features in genomic data sets. Nucleic Acids Res. 2013;41.

65.

Chae M, Danko CG, Kraus WL. groHMM: a computational tool for identifying
unannotated and cell type-specific transcription units from global run-on sequencing data.
BMC Bioinformatics. 2015;16(1):222.

66.

Coppola CJ, C. Ramaker R, Mendenhall EM. Identification and function of enhancers in
the human genome. Hum Mol Genet. 2016;25(R2):R190-R197.

67.

Kvon EZ, Kazmar T, Stampfel G, et al. Genome-scale functional characterization of
Drosophila developmental enhancers in vivo. Nature. 2014;512(7512):91-5.

68.

Rosenbloom KR, Armstrong J, Barber GP, et al. The UCSC Genome Browser database:
2015 update. Nucleic Acids Res. 2015;43(D1):D670-D681.

69.

Visel A, Bristow J, Pennacchio LA. Enhancer identification through comparative
genomics. Semin Cell Dev Biol. 2007;18(1):140-152.

70.

Zhu LJ, Christensen RG, Kazemian M, Hull CJ, Enuameh MS, Basciotta MD.
FlyFactorSurvey: a database of Drosophila transcription factor binding specificities
determined using the bacterial one-hybrid system. Nucleic Acids Res. 2011;39.

71.

Bailey TL, Gribskov M. Combining evidence using p-values: application to sequence
homology searches. Bioinformatics. 1998;14(1):48-54.

84

72.

Gallo SM, Gerrard DT, Miner D, et al. REDfly v3.0: toward a comprehensive database of
transcriptional regulatory elements in Drosophila. Nucleic Acids Res.
2011;39(suppl_1):D118-D123.

73.

Gittelman RM, Hun E, Ay F, et al. Comprehensive identification and analysis of human
accelerated regulatory DNA. Genome Res. 2015;25(9):1245-1255.

74.

MacArthur S, Li XY, Li J, Brown JB, Chu HC, Zeng L. Developmental roles of 21
Drosophila transcription factors are determined by quantitative differences in binding to
an overlapping set of thousands of genomic regions. Genome Biol. 2009;10.

75.

Liu Y, Pelham-Webb B, Di Giammartino DC, et al. Widespread Mitotic Bookmarking by
Histone Marks and Transcription Factors in Pluripotent Stem Cells. Cell Rep.
2017;19(7):1283-1293.

76.

Creyghton MP, Cheng AW, Welstead GG, Kooistra T, Carey BW, Steine EJ. Histone
H3K27ac separates active from poised enhancers and predicts developmental state. Proc
Natl Acad Sci U S A. 2010;107.

77.

St Johnston D, Nusslein-Volhard C. The origin of pattern and polarity in the Drosophila
embryo. Cell. 1992;68.

78.

Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet.
2009;10.

79.

Devailly G, Mantsoki A, Michoel T, Joshi A. Variable reproducibility in genome-scale
public data: A case study using ENCODE ChIP sequencing resource. FEBS Lett.
2015;589(24):3866-3870.

85

CHAPTER III: AN IMAGE DATABASE OF DROSOPHILA MELANOGASTER
WINGS FOR PHENOMIC AND BIOMETRIC ANALYSIS
The work described in this chapter was published in the following manuscript: Sonnenschein, A.,
VanderZee, D., Pitchers, W.R., Chari, S., and Dworkin, I. (2015). An image database of
Drosophila melanogaster wings for phenomic and biometric analysis. GigaScience, 4(25).

86

Abstract
Extracting important descriptors and features from images of biological specimens is an ongoing
challenge. Features are often defined using landmarks and semi-landmarks that are determined a
priori based on criteria such as homology or some other measure of biological significance. An
alternative, widely used strategy uses computational pattern recognition, in which features are
acquired from the image de novo. Subsets of these features are then selected based on objective
criteria. Computational pattern recognition has been extensively developed primarily for the
classification of samples into groups, whereas landmark methods have been broadly applied to
biological inference. To compare these approaches and to provide a general community resource,
we have constructed an image database of Drosophila melanogaster wings - individually
identifiable and organized by sex, genotype and replicate imaging system - for the development
and testing of measurement and classification tools for biological images. We have used this
database to evaluate the relative performance of current classification strategies. Several
supervised parametric and nonparametric machine learning algorithms were used on principal
components extracted from geometric morphometric shape data (landmarks and semilandmarks). For comparison, we also classified phenotypes based on de novo features extracted
from wing images using several computer vision and pattern recognition methods as
implemented in the Bioimage Classification and Annotation Tool (BioCAT). Because we were
able to thoroughly evaluate these strategies using the publicly available Drosophila wing
database, we believe that this resource will facilitate the development and testing of new tools for
the measurement and classification of complex biological phenotypes.

Introduction
Understanding the causes and consequences of phenotypic variation is a unifying goal across

87

many biological disciplines. One aim of phenomics is to comprehensively measure this variation.
However, biological traits are complex and multidimensional and this presents challenges for
both measurement and analysis1. The complete ‘phenome’ of an individual includes more
phenotypes than can realistically be measured and the most informative subset of measurable
features is not necessarily known, even for specific traits2. Manually selected features benefit
from prior knowledge of the biological system, whereas computationally selected image
properties are generally optimized for discrimination between groups. However, it is not clear
how these strategies compare in their classification of images into groups (sex, genotype,
species) or in their potential to derive broader biological inferences.
Geometric morphometrics and computational pattern recognition represent very different
strategies for extracting and quantifying phenotypes from image data. Geometric morphometrics
measures shape by using homologous landmarks (or curves) across specimens as features 3,4.
Methodologically, these landmarks are determined a prioribased on biological considerations of
both homology and potential informativeness. Information about the shape of the specimen is
extracted from the configuration by removing variation in size, location and orientation of the
specimen, resulting in an explicit geometric representation of shape (Fig. 3-1)5- 7.

88

Figure 3-1 Wing landmarks and semi-landmarks

Example wing image from D. melanogaster that has been splined using WINGMACHINE. After
landmark and semi-landmark data is extracted, data is translated (centered to origin), scaled by
centroid size and superimposed (Procrustes superimposition for landmarks) data all lies in a
common subspace. Image represents 50 individual configurations from specimens to
demonstrate some of the variation among individuals.

89

Computational pattern recognition represents a school of alternative approaches, in which
features are extracted from image data with computer vision tools8,9. Pattern recognition uses
features such as the statistical distribution of pixels, or descriptions of texture or edges. A subset
of informative features is generally selected based on an objective function, such as the
classification of samples into groups, often using machine learning techniques9. The degree of
informativeness of these features is usually assessed by cross-validation. Whereas geometric
morphometrics requires a comprehensive understanding of the biological relevance and
evolutionary history of the feature, computational pattern recognition can be applied without
prior knowledge and can also detect informative patterns that are not visually perceptible9.
Both geometric morphometrics and computational pattern recognition have practical applications
in biological research. There have been varying levels of success using two-dimensional and
three-dimensional cranial-facial morphometric phenotypes to infer the genetic causes of
disease10-,12 and to track disease progression13. Morphometrics and computational pattern
recognition have also been successfully used with machine learning algorithms to classify
complex morphological phenotypes by species14-,16. Similarly, computer vision and pattern
recognition have been crucial in the development of tools for the related field of biometrics 17,
which uses phenotypes to distinguish individuals. Biometrics tools may be useful for interpreting
phenomics data, thereby extending the amount of informative variation that can be extracted
from biological images.
A potential biological application for biometrics is the interpretation of Drosophila wing shape.
Wing shape is an established model system for phenomics1,18, the genetic basis of shape19- 21 and
for phenotypic evolution22,23. Although Drosophila wings can be evaluated qualitatively24 or by
metrics such as length and surface area25, they are often measured within a geometric

90

morphometric framework14,19,26,27. Landmarks are based on vein intersections26,27 with semilandmarks defining curves (Fig. 1)14,23. Biometric facial recognition tools have had some success
at classifying images of Drosophila wings into biological categories28,29. The ‘eigenface’ method,
which is a classic technique for facial recognition, has been modified into ‘eigenwings’ using
features extracted from Drosophila wings to classify individuals by their sex29. Another facial
recognition method that uses a genetic algorithm to select texture features30 has also been used
with similar goals28- 30, with up to a 94 % successful classification rate29.
The success of facial recognition programs that rely on texture features instead of vein
positioning raises the questions of what other features might also be useful for
classifying Drosophila wings and how tools that are already used in biometrics may be applied to
phenomics datasets. However, our ability to evaluate different approaches − whether for
classification or biometric identification, as well as for long-term goals of further biological
inference − remains limited by the lack of open databases of images for comparison.
In this article, we describe the creation and implementation of a database of wing images
from Drosophila melanogaster for the development and testing of such methods. The database
was designed to include multiple levels of replication encompassing both biological and
technical variation. It allows the assessment of variation and classification by genotype, sex and
individual identity (right and left wings from the same fly). To introduce sources of technical
noise common to biological images, it includes several images of each wing, captured on various
microscopes and at multiple magnifications.
Using landmark and semi-landmark measurements extracted from images in this database, we
have analyzed the relative success of a number of machine learning algorithms at
classifying Drosophila genotype and sex. We compare the success of these methods with the

91

performance of the same classifier algorithms using features extracted by the Bioimage
Classification and Annotation Tool (BioCAT), a pattern-recognition program designed for image
analysis31. The database of images, landmark data and all source code have been made publicly
available to serve as a resource for the testing and development of biometrics tools.

Methods
Fly genetics and sample preparation
Fly stocks were obtained from the Bloomington Stock Center. These lines include wing
mutations in the genes Epidermal growth factor receptor

(Egfr),

mastermind (mam),

thickveins (tkv) and Star (S; see Table 3-1 for allele and stock information). All four mutations
are caused by insertions of P-element transposable elements, each marked with a miniw +resulting in partial rescue of wild-type red eyes. The wild-type strain used as a background
was an isogenic Samarkand (SAM), marked with a w − mutation to enable identification of the
mutant alleles38.
Each P-element-bearing strain was initially introgressed into SAM by repeatedly backcrossing
into the SAM background genotype (as described in27). These have since been maintained
heterozygous balanced over a CyO (also in the SAM background) with the exception of
the tkv mutant, which was maintained as a homozygote. Before initiating the experiment, these
flies were maintained for one generation in an incubator (Percival Model : I41VLC8 set to 24 °C,
65 % relative humidity and a 12-hour light/dark cycle) to acclimatize them to the environment.
Under these same growth conditions, the lines carrying mutant alleles were then backcrossed for
two additional generations into the SAM wild-type background prior to rearing flies for data
collection for the database. Because of the extensive back-crossing, each mutant-bearing strain is
close to co-isogenic to the SAM wild type, with the exception of the focal allele and a small

92

genomic fragment in linkage disequilibrium to that allele.
For each mutant strain, populations were expanded in five replicate bottles. Each bottle
contained 10 mutant males (red-eyed) crossed to 20 SAM virgin females (white-eyed). This
experimental design was also applied to the SAM control, with 10 SAM males and 20 SAM
virgin females. The flies were allowed to lay eggs for 4 days, after which the adults were
discarded. After 7 days, paper towel was added to the bottles to soak up excess moisture and
provide additional substrate for pupation. From days 14-18, emerging flies were phenotyped
(based on the w + marker) and sexed. They were stored separately by sex and genotype in
microtubes containing 70 % ethanol at room temperature for wing dissections. Fly wings were
dissected in phosphate buffered saline (PBS) and mounted on slides in a solution of 70 %
glycerol and 30 % PBS. All wings dissections were performed by the same person (AS). If one
wing was torn or damaged, both wings from that fly were discarded.
Imaging
Each wing was imaged at 20× and 40× magnification on both an Olympus BX51 and Leica
M125 microscope, using the DP controller (V.3,1,1208) and Leica App Suite (V.3) imaging
software respectively. Two individuals imaged wings, one using the Olympus microscope (AS)
and the other the Leica (DV) microscope. All images have unique names, using the format
‘genotype_sex_side_microscope_magnification_fly-number’. If images contained tears or folds
in the wing or indicated errors in dissection or mounting, all images from that fly were discarded.
Geometric morphometric data acquisition and preparation
Images were first converted to grayscale and cropped with the Gnu Image Manipulation Program
(GIMP, version 2.839) in batches using the David’s Batch Processor plugin (version 1.1.840). Two
starting landmarks were manually labeled at the humeral break and alula notch, using tpsDig

93

(version 2.1741). For more details on re-sizing and cropping, see Additional file 1: Supplementary
Methods. WINGMACHINE (Wings version 3.7.214) was used to generate wing splines, which
were manually reviewed and adjusted as necessary. CPR (version 1.01r 42) was used to scale
wings by centroid size, perform a Procrustes superimposition and extract landmark and semilandmark coordinates. Further details on processing of this data are available in Appendix:
Supplementary Methods. All further statistical analysis was done in R (version 3.1.035) on
images of wings taken at 40× magnification on the Olympus BX51 microscope. Scripts can be
found at the Dworkin Lab github page43 and together with the data at GigaDB44.
Procrustes coordinate values and centroid for left and right wings from the same fly were
averaged using the R plyr package (V.1.8.1). In total, this included 12 two-dimensional
landmarks and 36 semi-landmarks. However, because of image registration, scaling and
Procrustes superimposition, four dimensions do not contain any information. Furthermore, the
semi-landmarks are constrained to slide along a curve and therefore have approximately one
degree of freedom. This results in approximately 58 dimensions of potential data. Thus, the first
58 principal components contributing to shape (excluding centroid) were extracted and used for
all further analyses.

Morphometric analysis
Two-thirds of the samples from each genotype were defined as the training set and one-third as a
testing set. These were used to train and test ‘lda’ and ‘qda’ functions from the MASS package
(V. 7.3-33), ‘mda’ and ‘fda’ functions from the mda package (V. 0.4-4), the ‘bagging’ function
from the adabag package (V.3.2), random forest from the randomForestpackage (using 500
trees, version 4.6-7), the ‘svm’ function from the e1071 package (version 1.6-3) and a neural
network from the nnet package (V. 7.3-8). K-nearest neighbors (KNN) from the class package

94

(V. 7.3-10) was also tested, using k values from 1 to 100. Confidence intervals were
approximated by re-sampling the training and testing sets over 1,000 repetitions. All functions
were used with default arguments, with the exception of ‘svm’, ‘knn’ and random forest. ‘svm’
was optimized for kernel function shape and ‘knn’ for the value of k. Random forests were tested
over a range of 10-1,000 trees.
SVM and random forest (using 10 trees and 1,000 trees) were repeated on several subsets of the
original dataset, using only left wings (prior to averaging) and left-female wings, to facilitate
comparisons with the BioCAT results. The same analysis was also used to classify wings by sex,
using only wings from the SAM (wild-type) genotype.
BioCAT analysis
For the BioCAT analysis14, we used a Fisher feature selector to identify 50 eigenvalues of the
Hessian matrix37 from 144 left male and female wings (77 each) from the SAM genotype.
BioCAT applied these features to train an SVM classifier and two random forest classifiers: one
with 10 trees and one with 1,000 trees. These models were used to annotate 30 male and 30
female left SAM wings that were not included in the training set. Annotation accuracy was
determined by counting the number of correct and incorrect classifications. Although BioCAT
allows for cross-validation during combined training/testing, the quantity and size of our data
made re-sampling for confidence intervals infeasible. This process was repeated for classification
by genotype using 70 left-female wings from each of the five genotypes as a training set for
classification by genotype. The genotype classification models were tested on 30 left-female
wings from each genotype. Images used as training and testing sets have been organized with
their respective models at the database44.

95

Results
The Drosophila wing database comprises a large number of high-quality wing images and
contains both biological and technical variation. Sources of biological variation include genotype
(there are four mutant genotypes (listed in Table 3-1) in the wild-type background of Samarkand
(SAM), as well as the SAM wild-type background itself). In addition, sex and within-individual
(left and right wings) variation is included. There are 100-130 individual samples for each
combination of biological variables (Table 3-2). The mutant genotypes included in the database
are heterozygous loss-of-function mutations for the genes that encode the Epidermal growth
factor receptor (Egfr), mastermind (mam), Star (S) and thickveins (tkv) (see Table 3-1 for allele
information, Fig. 3-2 for the relative impact of each mutation on phenotype and Methods for
additional details). As heterozygotes, these mutations all have quantitative effects on shape,
although they are qualitatively indistinguishable from the wild-type background (Fig. 3-3). The
mutations represent perturbations of multiple signaling pathways: for example, tkv is a receptor
kinase in the Transforming growth factor-β (TGF-β) pathway32 and mam is a transcription factor
in the Notch signaling pathway33. Egfr and Star genetically interact as Star modulates signaling
through the Egfr pathway32, 34. These specific mutations were selected because previous studies
have shown that when heterozygous, they have a range of quantitative effects on wing shape27.

96

Table 3-1 Drosophila allele information
Bloomington # Gene name
10385
Epidermal growth factor receptor

Gene symbol
Egfr

14189

mastermind

mam

10418
14403

Star
thickveins

S
tkv

Allele name
P{lacW}Egfr k05115
P{SUPorP}mamkG02641
P{lacW}S k09530
P{SUPor-P}tkv KG01923

Table 3-2 Drosophila wings dissected by sex and genotype
Genotype
Egfr
mam
Samarkand (SAM)
Star
tkv

Female
116
106
107
115
116

97

Male
118
130
100
111
116

Figure 3-2 Mutation effects

Magnitude and direction of the effect of each mutation (red) relative to Samarkand wild type
(black). Magnitudes are in units of Procrustes distance (PD), which for this (tangent
approximation) is equivalent to the Euclidean distance between the mean vector of each mutant
and the Samarkand (SAM) wild type. The vectors of shape differences are magnified three-fold.

98

Figure 3-3 Heterozygote mutant wings

Representative

images

from

the

clockwise: mastermind, Epidermal

database.
growth

From
factor

right

top

corner

counter-

receptor, Star and thickveins.

mastermind, Egfr and Starmutations are all homozygous lethal and thickveins has a qualitative
defect as a homozygote. As heterozygotes, they are qualitatively indistinguishable from the
Samarkand (SAM) wild type (center)

99

To allow researchers to compare various classification algorithms across a range of technical
conditions, the measurements were subject to technical variation including the microscope and
software used to capture images and the magnification setting of the microscope. Each wing in
the database was imaged on two different microscope models, at both 40× and 20×
magnification, so each wing in the database was imaged a total of four times (see Fig.3-4: left
and right wings from the same fly imaged under all four technical variation conditions). Also
included in the database is landmark and semi-landmark coordinate information extracted from
all images in the database using WINGMACHINE software14, so information extracted from
these images can be compared with existing standards for wing analysis. We also repeated the
morphometrics analysis (landmarking, fitting curves (splining), editing splines and
superimposition) for a small subset of the wing images (50 left and right female wings from two
genotypes), to provide information on the technical variation in this process.

100

Figure 3-4 Technical replicates conditions

Left and right wings from the same female (SAM) fly, imaged four times. Top left are images
taken on Olympus BX51 microscope at 40× magnification, top right are taken on Leica M125 at
40× magnification. Bottom left and right are images taken at 20× magnification

101

Classification based on geometric morphometric data shows a high degree of accuracy
Although the primary goal of this project was to develop the image database, we also wished to
provide future users with some baseline data to evaluate classifiers. We used a wide range of
supervised machine learning algorithms in R (version 3.1.035) for classification based on the
landmark and semi-landmark data extracted from the images. All data for this analysis was from
images taken on an Olympus BX51 microscope at 40× magnification. The data analyzed
included the 58 principal components of shape generated from the landmark and semi-landmark
coordinates (representing all non-zero eigenvalues). We chose algorithms to represent a wide
range of models; standard errors were estimated by re-sampling training and testing sets. When
classifying wings within a common genotype (SAM) by sex, all algorithms except for quadratic
discriminant analysis (QDA) were able to predict the sex of a test set with more than 95 %
accuracy (Table 3). When classifying wings by both genotype and sex, linear discriminant
analysis (LDA), flexible discriminant analysis (FDA) and mixture discriminant analysis (MDA)
were all able to correctly categorize test data with 85 % accuracy or higher. Support vector
machines (SVM) and neural networks were also accurate with over 80 % of wings (Table 3-3).
The high accuracy of most methods, especially of LDA (Fig. 3-5), suggests that classifications
using this data, based on both sex and genotype, are robust to assumptions of linearity and
common covariance matrices between factors (genotype and sex)36.

102

Table 3-3 Classification accuracy of machine learning algorithms using landmarks
Algorithm
LDA
QDA
FDA
MDA
Bagging
Random forest
SVM
Neural network (size 10)
KNN

Sex (± Standard error)
98.2 % (±1.6)
81.5 % (±6.4)
98.2 % (±1.6)
98.1 % (±1.6)
93.3 % (±2.9)
94.6 % (±2.7) 100 trees
96.8 % (±2.1) sigmoid
98.3 % (±1.6)
98.3 % (±1.5) k = 4

103

Genotype (± Standard error)
86.1 % (±1.5)
68.7 % (±2.2)
86.0 % (±1.5)
84.8 % (±1.6)
57.6 % (±2.9)
74.9 % (±2.1) 1,000 trees
83.8 % (±1.6) radial
81.2 % (±2.2)
59.3 % (±2.1) k = 32

Figure 3-5 Separation of specimens using landmark data using LDA

Separation of specimens for each of the five genotype by linear discriminant analysis (LDA) in
training set (left panel) and testing set (right panel), plotting the first discriminant function by the
second discriminant function. This includes data for both males and females, but averaged (left
and right wings) per specimen

104

Computational feature detection and sub-setting for classification using BioCAT
We tested several methods of classification using the image analysis software BioCAT31, which
allows combinations of feature selectors, extractors and classifiers. Using the Fisher feature
selection criterion, we tested several combinations of features and classification algorithms
(Appendix: Supplementary Methods). After training a random forest classifier with 50 FeatureJ
Hessians37 extracted from a training set of wing images, we were able to classify individuals by
sex (in a common genotype) in a test set of wing images with 85 % accuracy (Table 3-4).
Classification of wing images by genotype (within sex) had an accuracy of only up to 52 %,
although this is higher than the 20 % success rate that would be expected for random
classification.
Comparisons between BioCAT and geometric morphometric descriptors for classification
BioCAT feature selectors act on raw images and therefore had access to both shape and size
information for wings, whereas morphometric analyses were performed after scaling by centroid.
When the parameter for centroid size was included with landmark and semi-landmark
coordinates in morphometric analysis, the relative effectiveness of different algorithms was
largely the same (Table 3-4), although classification accuracy generally increased for both sex
and genotype.
Classification based on features extracted by BioCAT and those using landmarks and semilandmark coordinates differed in the distribution of classification errors between genotypes.
BioCAT classified some genotypes far more consistently than others and errors frequently
skewed towards a particular genotype (Fig. 3-6). Notably, mastermind was misclassified
as Star 90 % of the time (27/30 mis-identifications in the test set). There is no similar trend
of Star and mam phenotype mis-identification in classifications based on landmarks (Fig. 3-6).

105

Table 3-4 Classification accuracy of machine learning algorithms compared with BioCAT
Algorithm
Hessian
Shape
Shape + Size
Random forest (10)
85.0 %
92.3 % (±3.7) 94.7 % (±2.6)
Random forest (1,000)
85.0 %
96.1 % (±2.2) 95.9 % (±2.1)
SVM
81.7 %
99.0 % (±1.2) 99.0 % (±1.2)
Genotype
Random forest (10)
52.0 %
43.3 % (±3.5) 44.7 % (±3.7)
Random forest (1,000)
46.7 %
69.1 % (±3.4) 70.2 % (±2.8)
SVM
43.3 %
75.1 % (±2.8) 75.8 % (±2.7)
Hessian column represents accuracy of classifications based on Hessian features extracted with
Classification
Sex

BioCAT. Shape column represents classification accuracy based on landmarks and semilandmarks, not including centroid. Shape + size represents classification accuracy based on
landmarks and semi-landmarks, including centroid

106

Figure 3-6 Confusion matrices

Heatmap of confusion matrices from classification (random forest) using features extracted using
BioCAT (a) compared with landmark and semi-landmark data (b). The data in (a) and (b) is
shown together in (c) to facilitate comparison. Numbers represent percentage of correct
classifications. lm_* represent the landmark/semi-landmark data. BioCAT features were misclassified more consistently as some genotypes, e.g. mis-classification of mam mutants
as Star (a). This pattern is not evident in the classification using the landmark data (b). Scale
represents frequency of classification

107

Comparisons between BioCAT and geometric morphometric descriptors for classification
Both geometric morphometric methods and BioCAT were able to classify images by sex across
sources of technical variation (i.e. images taken on different microscopes). Geometric
morphometric methods showed very little loss in accuracy when classifying wings at the same
magnification across microscopes (where an LDA was trained on images taken on the Olympus
at 40× magnification and tested on images from the Leica microscope at the same
magnification). However, accuracy dropped substantially from 98.2 % to 81.2 % when the LDA
was trained on images taken on the Olympus at 40× magnification and tested on images taken on
the same microscope at 20× magnification (Table 3-5). Images from each microscope and
magnification were superimposed separately and simultaneous superimposition might
substantially increase the accuracy of classification across datasets.
Machine learning algorithms using landmark and semi-landmark features, compared with
Hessian features extracted by BioCAT, trained and tested across microscopes and magnifications
was not able to make accurate classifications across datasets using unedited images. When
trained on images taken on the Olympus at 40×, it uniformly classified wings in images taken on
the Olympus at 20× magnification as males (using Hessian features and a random forest
classifier).

108

Table 3-5 Classification accuracy of machine learning algorithms compared with BioCAT
Method
BioCAT

Training images
Olympus 40×
Olympus 40×
Olympus 40× cropped
Leica 40× cropped
Olympus 40× cropped
Olympus & Leica 40× cropped
Olympus & Leica 40× cropped
Landmarks Olympus 40× landmarks
Olympus 40× landmarks
Leica 40× landmarks
Olympus 40× landmarks

Testing images
Olympus 40×
Olympus 20×
Olympus 20× cropped
Leica 40× cropped
Leica 40× cropped
Olympus 40× cropped
Leica 40× cropped
Olympus 40× landmarks
Olympus 20× landmarks
Leica 40× landmarks
Leica 40× landmarks

Sex (± SE)
85.0 %
50.0 %
50.0 %
93.0 %
73.7 %
73.3 %
86.0 %
98.2 % (±1.6)
81.2 % (±1.4)
97.8 % (±0.69)
79.1 % (±1.3)

Using images cropped to the same dimensions as used for measuring splines (cropping images
was also necessary for geometric morphometric analysis), BioCAT still had difficulty classifying
across magnifications, but was able to correctly identify sex from wing images taken at the same
magnification on the Leica microscope with 73.7 % accuracy (Table 5). Interestingly, BioCAT
performed better on images taken on the Leica - when both trained and tested on images taken on
the Leica at a common magnification, it classified images by sex with 93 % accuracy, relative to
85 % accuracy when classifying images from the Olympus microscope (Table 3-5).

Discussion
Although this database is primarily intended to serve as a resource for the development and
testing of measurement tools, we also investigated whether the image collection could provide
insights into the comparative effectiveness of existing pattern recognition and morphometrics
methods. In particular, we compared a priori biologically informed landmark data as features
analyzed within a geometric morphometrics framework with de novo feature extraction,

109

identification and optimization. Using both types of features, we evaluated the classification of
wings by genotype and by sex and the accuracy of various statistical learning methods. The
performance of the classifiers based on landmark data was generally superior with respect to
classifying test data. For a number of reasons, this success must be considered within the context
of the methods examined in this study. The availability of the database now provides a test bed
for further refinement.
Computational pattern recognition and morphometrics software are likely to extract different
features. In addition to considering how well geometric morphometric approaches compare to
‘computer vision’ de novo feature extraction (see below), it is also worth comparing the
efficiency with which the feature data can be obtained. The WINGMACHINE pipeline was
designed with a single goal and has been optimized for extracting landmark and semi-landmark
data from Drosophila wings. By contrast, BioCAT was designed (and therefore chosen for this
study) for its accessibility and flexibility, which allow it to be immediately applied to raw wing
images. The FeatureJ features extracted with BioCAT describe image texture31, whereas a
geometric morphometric approach uses biologically defined, homologous landmarks and curves
as features, defined by vein intersections and outlines14. Extracting the large number of
landmarks and semi-landmarks used in this study is laborious for most biological systems. Even
using the WINGMACHINE pipeline requires multiple stages of image processing, some manual
landmark acquisition and manual correction of splines after automated fitting. By contrast, both
feature extraction and classification using BioCAT were performed without a priori annotation
or editing. Thus, despite the overall success in classification using the landmark and semilandmark data, the efficiency of acquiring the data must also be considered for other studies.
Perhaps unsurprising given the different nature of the features extracted, the machine learning

110

algorithms that were most able to classify wings (for sex and genotype) using BioCAT’s Hessian
features differed from those that were most able to classify wings using landmark and semilandmark data. Whereas SVMs consistently performed better than random forests for
classification using morphometric data, the reverse was true using features extracted with
BioCAT (Table 3-4). Classifications based on landmarks and semi-landmarks were also
substantially improved by increasing the number of trees in the random forest from 10 (the
BioCAT default) to 1,000 trees. BioCAT classification success was unaffected or slightly lowered
by an increase in the number of trees.
The Drosophila wing database contains large numbers of wing images representing multiple
genotypes. It also includes several built-in controls for technical variation that should make it
amenable to the development of biometric classification tools. Using the landmark and semilandmark data extracted with WINGMACHINE, wings can be classified by sex and genotype
with high levels of accuracy. We were also able to classify wings by sex and genotype with
relatively good accuracy using texture features extracted by the computer vision software
BioCAT. We hope that this database will serve as a resource for research into the sources of
variation contributing to wing shape and for the development and testing of measurement tools
for image-based phenomics.

111

APPENDIX

112

IMAGE PROCESSING AND CLASSIFICATION USING BIOCAT
Supplementary Methods
1. Image processing for WINGMACHINE and CPR
1.1 Cropping wing images
Wing images taken at 2x on the Olympus BX51 microscope and at both 2x and 4x on the Leica
M125 microscope were cropped (using the David's Batch Processor plugin in GIMP 1) for
analysis with WINGMACHINE. All images taken on the Olympus were initially 1360 x 1024
pixels (px). 2x images were cropped to 750 x 555px. All images on the Leica were initially 1392
x 1040px. Leica 4x images were cropped to 960 x 718px, except for a small subset in which the
orientation of the wing required larger dimensions and these were cropped to 1200 x 898px.
Leica 2x images were primarily cropped to 500 x 374px, except for a small subset that were
cropped to 600 x 449px. The dimensions that were used for each image are indicated in the files
'Leica_2X_coords.tsv' and 'Leica_2X_coords.tsv' in the GigaScience GigaDB repository2.
1.2 Landmarking wing images with tpsUtil and tpsDig
Wing images were resized for landmarking to the dimensions 632 x 480 pixels. A.tps file was
created using tpsUtil (version 1.583), and landmarked in tpsDig23. The.tps file was converted to
an.asc file using a custom python script.
1.3 Splining wing images
Images for splining were resized to the dimensions 316 x 240px. All WINGMACHINE splines
were manually checked and edited.
1.4 Superimposition in CPR
WINGMACHINE output was superimposed in CPR. After superimposition and sliding of semilandmarks, the most proximal semi-landmark on vein 4 was removed as described in4. Data for

113

the first five principal components were visually checked as a scatterplot and outliers caused by
incorrect spline alignment were manually corrected.
1.5 Identifying scale in cropped images
Scale mm/px values were manually calibrated in ImageJ version 1.485.
2. Supplementary methods for BioCAT
This section includes a subset of informative results from testing various BioCAT settings with
different subsets of wings. These subsets of the wing images, and the classification models, are
available at the GigasScience GigaDB repository2. For these trials, training was done using
BioCAT's 'Training only' option, followed by annotation using the 'test set'. In all cases, only left
wings were used from the Olympus 40x dataset. Unless otherwise indicated, default settings
were used. Data that was entered into the confusion matrices for Figure 6 are indicated (see code
in2).
2.1 Classifying by sex within Samarkand genotype, using approximately equal numbers of
each class for training and testing. Overall, the best results were with Hessian features
combined with Random forest classifiers. F, female; M, male; RF, random forest; SAM,
Samarkand; SVM, support vector machine.

114

Table 3-6 Results with BioCAT for sex
Features

Classifie Training set
r

Testing set

Success rate

Structure
30

RF
10 trees

68 M SAM wings 15 M SAM wings 14/15 F correct
71 F SAM wings
15 F SAM wings 11/15 M correct
(Sex training set 1) (Sex test set 1)

Structure
50

RF
10 trees

68 M SAM wings 15 M SAM wings 14/15 F correct
71 F SAM wings
15 F SAM wings 11/15 M correct
(Sex training set 1) (Sex test set 1)

Laplacian
30

RF
10 trees

68 M SAM wings 15 M SAM wings 12/15 F correct
71 F SAM wings
15 F SAM wings 8/15 M correct
(Sex training set 1) (Sex test set 1)

Laplacian
30

SVM
linear

68 M SAM wings 15 M SAM wings Very low
71 F SAM wings
15 F SAM wings
(Sex training set 1) (Sex test set 1)

Stats
30

RF
10 trees

68 M SAM wings 15 M SAM wings 13/15 F correct
71 F SAM wings
15 F SAM wings 12/15 M correct
(Sex training set 1) (Sex test set 1)

Derivatives SVM
30
linear

68 M SAM wings 15 M SAM wings 11/15 F correct
71 F SAM wings
15 F SAM wings 14/15 M correct
(Sex training set 1) (Sex test set 1)

Hessian 30

68 M SAM wings 15 M SAM wings
71 F SAM wings
15 F SAM wings
(Sex training set 1) (Sex test set 1
and Sex test set 2)

Test 1:
13/15 F correct
14/15 M correct
Test 2:
8/15 F correct
15/15 M correct

Structure 50 RF
10 trees

68 M SAM wings 15 M SAM wings
71 F SAM wings
15 F SAM wings
(Sex training set 1) (Sex test set 1
and Sex test set 2)

Test 1:
12/15 F correct
11/15 M correct
Test 2:
10/15 F correct
15/15 M correct

Structure 50 SVM
linear

68 M SAM wings 15 M SAM wings
71 F SAM wings
15 F SAM wings
(Sex training set 1) (Sex test set 1
and Sex test set 2)

Test 1:
11/15 F correct
13/15 M correct
Test 2:
8/15 F correct
15/15 M correct

SVM
linear

115

Table 3-6 (cont’d)

Hessian 50

RF
10 trees

68 M SAM wings 15 M SAM wings
71 F SAM wings
15 F SAM wings
(Sex training set 1) (Sex test set 1
and Sex test set 2)

Test 1:
13/15 F correct
13/15 M correct
Test 2:
10/15 F correct
15/15 M correct

Hessian 50

SVM
linear

68 M SAM wings 15 M SAM wings
71 F SAM wings
15 F SAM wings
(Sex training set 1) (Sex test set 1
and Sex test set 2)

Test 1:
11/15 F correct
14/15 M correct
Test 2:
9/15 F correct
15/15 M correct

Hessian 50

RF
1,000
trees

68 M SAM wings 15 M SAM wings
71 F SAM wings
15 F SAM wings
(Sex training set 1) (Sex test set 1
and Sex test set 2)

Test 1:
13/15 F correct
13/15 M correct
Test 2:
10/15 F correct
15/15 M correct

2.2 Classifying by genotype - Samarkand vs. mutant phenotypes (Egfr, mam, S, tkv) with
varying numbers of trees. The best results were again achieved with Hessian combined with
Random forest. SVM linear performed much better than SVM with alternate kernel shapes. RF,
random forest; SAM, Samarkand; SVM, support vector machine.

116

Table 3-7 Results with BioCAT for genotype
Features Classifier

Training set

Testing set

Success rate

Hessian
30

RF 10 trees

Genotype training 1 Genotype test 1 Identifies
327 mutant females 15
of
each wings as
77 SAM females
mutant
Mutant
15 SAM

Hessian
30

SVM linear

Genotype training 1 Genotype test 1
327 mutant females 15
of
each
77 SAM females
mutant
15 SAM

Hessian
30

SVM radial

Genotype training 1 Genotype test 1 Identifies
327 mutant females 15
of
each wings as
77 SAM females
mutant
Mutant
15 SAM

Hessian
30

SVM
sigmoid

Genotype training 1 Genotype test 1 Very
low
327 mutant females 15
of
each accuracy
77 SAM females
mutant
15 SAM

all

9/15
SAM
wrong
6/60 mutants
wrong
all

Structure RF 10 trees
30

Genotype training 1 Genotype test 1 Very
low
327 mutant females 15
of
each accuracy
77 SAM females
mutant
15 SAM

Structure SVM linear
30

Genotype training 1 Genotype test 1
327 mutant females 15
of
each
77 SAM females
mutant
15 SAM

Structure SVM radial
30

Genotype training 1 Genotype test 1 Very
low
327 mutant females 15
of
each accuracy
77 SAM females
mutant
15 SAM

Hessian
50

Genotype training 2 Genotype test 2
77 mutant females 15 mutants total
77 SAM females
(3-4 of each)
15 SAM

SVM linear

117

Notes

2
mutants
wrong
9/15
SAM
wrong

Test 1:
6/15
SAM
correct
13/15 mutants
correct
Test 2:
2/15
SAM
correct
15/15 mutants
correct

Equal
representati
on
in
train/test
sets
greatly
improved
accuracy

Table 3-7 (cont’d)

Structure SVM linear
50

Genotype training 2 Genotype test 2 Test 1:
77 mutant females 15 mutants total 12/15
77 SAM females
15 SAM
correct
7/15
correct
Test 2:
13/15
correct
10/15
correct

SAM
mutant

SAM
mutant

Structure RF 10 trees
50

Genotype training 2 Genotype test 2 Test 1:
77 mutant females 15 mutants total 10/15 SAM
77 SAM females
15 SAM
correct
11/15 mutants
correct
Test 2:
9/15
SAM
correct
12/15 mutants
correct

Hessian
50

Genotype training 2 Genotype test 2 Test1:
77 mutant females 15 mutants total 11/15 SAM
77 SAM females
15 SAM
correct
13/15 mutants
correct
Test 2:
11/15 SAM
correct
12/15 mutants
correct

RF 10 trees

2.3 Classifying by genotype - all genotypes (females). Egfr, epidermal growth factor receptor;
mam, mastermind; RF, random forest; SAM, Samarkand; SVM, support vector machine; tkv,
thickveins.

118

Table 3-8 Results with BioCAT across genotype
Features

Classifier

Hessian 50 RF 10 trees

Training set

Testing set

Genotype training 3 Genotype test 3
70 wings of each
15 wings of each
genotype, females genotype,
females
x2 test sets

Success rate

EgfrT1: 20%; 3 52%
correct; 2 SAM; 10 comb
tkv
ined
EgfrT2: 33.3%; 5 Table
correct; 3 SAM; 7 4
tkv
Figur
e6
MamT1: 20%; 3
correct; 12 star
MamT2: 33.3%; 5
correct; 10 star
SAMT1: 86.7%; 1
mam; 1 Egfr; 13
correct
SAMT2: 66.7%; 3
mam; 10 correct; 2
tkv
StarT1: 100%; 15
correct
Star T2: 73.3%; 11
correct; 4 mam
TkvT1: 26.7%; 4
correct; 4 mam; 5
Egfr; 2 SAM
TkvT2: 60%; 9
correct; 2 Egfr; 1
mam; 3 SAM

119

Note
s

Table 3-8 (cont’d)

Hessian 50 SVM linear

Genotype training 3 Genotype test 3 EgfrT1: 1 correct; 1
70 wings of each
15 wings of each SAM; 13 tkv
genotype, females genotype,
females
EgfrT2: 1 correct; 2
mam; 1 SAM; 11
tkv
MamT1: 1 correct;
14 Star
MamT2: 2 correct;
13 Star
SAMT1: 11 correct;
1 Egfr; 3 mam
SAMT2: 4 correct;
6 Egfr; 4 mam; 1
tkv
StarT1: 14 correct;
1 mam
StarT2: 13 correct;
1 mam; 1 tkv
TkvT1: 6 correct; 3
Egfr; 6 SAM
TkvT2: 13 correct;
2 mam

120

44%
overa
ll
Table
4

Table 3-8 (cont’d)

Structure
50

SVM linear

Genotype training 3 Genotype test 3 EgfrT1: 3 correct; 2
70 wings of each
15 wings of each SAM; 10 tkv
genotype, females genotype,
females
EgfrT2: 1 correct; 1
mam; 5 SAM; 8 tkv
MamT1: 15 correct
MamT2: 1 correct;
14 Star
SAMT1: 11 correct;
1 Egfr; 3 mam
SAMT2: 9 correct;
5 Egfr; 1 mam
StarT1: 14 correct;
1 mam
StarT2: 14 correct;
1 mam
TkvT1: 6 correct; 2
Egfr; 7 SAM
TkvT2: 9 correct; 1
Egfr; 5 SAM

121

45.3
%
overa
ll

Table 3-8 (cont’d)

Structure
50

RF 10 trees

Genotype training 3 Genotype test 3 EgfrT1: 5 correct; 3
70 wings of each
15 wings of each SAM; 7 tkv
genotype, females genotype,
females
EgfrT2: 3 correct; 1
mam; 3 SAM; 8 tkv
MamT1: 3 correct;
12 Star
MamT2: 3 correct;
12 Star
SAMT1:
13
correct; 1 Egfr; 1
mam
SAMT2:
10
correct; 3 Egfr; 2
tkv
StarT1: 14 correct;
1 mam
StarT2: 13 correct;
2 mam
TkvT1: 1 correct; 6
Egfr; 4 mam; 1
SAM
TkvT2: 6 correct; 5
Egfr; 4 SAM

122

47.3
%
overa
ll

Table 3-8 (cont’d)
Hessian 50 RF
trees

1,000 Genotype training 3 Genotype test 3 EgfrT1: 0 correct; 2 47%
70 wings of each
15 wings of each SAM; 13 tkv
overa
genotype, females genotype,
ll
females
EgfrT2: 2 correct; 4
SAM; 9 tkv
Table
4
MamT1: 1 correct;
14 Star
MamT2: 2 correct;
13 Star
SAMT1:
12
correct; 1 Egfr; 1
mam
SAMT2:
10
correct; 1 Egfr; 2
mam; 2 tkv
StarT1: 15 correct
StarT2: 13 correct;
2 mam
TkvT1: 6 correct; 2
Egfr; 5 mam; 2
SAM
TkvT2: 9 correct; 1
Egfr; 1 mam; 4
SAM

2.4 Classifying by sex across microscopes/magnifications. Egfr, epidermal growth factor
receptor; F, female; M, male; mam, mastermind; RF, random forest; SAM, Samarkand; SVM,
support vector machine; tkv, thickveins.

123

Table 3-9 Results with BioCAT for sex across technical conditions
Features Classifier

Training set

Testing set

Hessian
50

RF 10 trees

68 M SAM Test 1
wings
15 M SAM wings
71
F
SAM 15 F SAM wings
wings
Test 2
Olympus, 40x 15 M SAM wings
mag
15 F SAM wings
uncropped
Olympus 20x
uncropped

Identifies 100% Table 5
as male
50% overall

Hessian
50

RF 10 trees

68 M SAM Test 1
wings
15 M SAM wings
71
F
SAM 15 F SAM wings
wings
Test 2
Olympus, 20x 15 M SAM wings
mag
15 F SAM wings
uncropped
Olympus 20X
uncropped

Test 1:
13/15 F correct
10/15 M correct
Test 2:
12/15 F correct
15/15 M correct
83.3% overall

Hessian
50

RF 10 trees

68 M SAM
wings
71
F
SAM
wings
Olympus, 40x
mag cropped for
splining

Test 1
15M SAM wings
15 F SAM wings
Test 2
15M SAM wings
15 F SAM wings
Olympus 40x cropped
for splining

Test 1
12/15 F correct
13/15 M correct
Test 2
10/15 F correct
15/15 M correct
83.3% overall

Hessian
50

RF 10 trees

68 M SAM
wings
71
F
SAM
wings
Olympus, 40x
mag cropped for
splining

Test 1
15M SAM wings
15 F SAM wings
Test 2
15M SAM wings
15 F SAM wings
Olympus 20x cropped
for splining

Test 1
Table 5
0/15 F correct
14/15 M correct
Test 2
1/15 F correct
15/15 M correct
50% overall

124

Success rate

Notes

Table 3-9 (cont’d)
Hessian
50

RF 10 trees

68 M SAM
wings
71
F
SAM
wings
Olympus, 40x
mag cropped for
splining

Test 1
15 M SAM wings
15 F SAM wings
Test 2
15M SAM wings
12 F SAM wings
Leica 40x
cropped for splining

Test 1
Table 5
7/15 F correct
14/15 M correct
Test 2
7/12 F correct
14/15 M correct
73.7% overall

Hessian
50

RF 10 trees

68 M SAM
wings
71
F
SAM
wings
Leica, 40x mag
cropped
for
splining

Test 1
15 M SAM wings
15 F SAM wings
Test 2
15 M SAM wings
12 F SAM wings
Leica 40x
cropped for splining

Test 1
Table 5
15/15 F correct
15/15 M correct
Test 2
12/12 F correct
11/15 M correct
93.0% overall

Hessian
50

RF 10 trees

68 M SAM
wings
71
F
SAM
wings
Leica, 40x mag
cropped
for
splining

Test 1
15M SAM wings
15 F SAM wings
Test 2
15M SAM wings
15 F SAM wings
Olympus 40x cropped
for splining

Test 1
14/15 F correct
0/15 M correct
Test 2
14/15 F correct
0/15 M correct

Hessian
50

RF 10 trees

Mixed
microscopes
15 males Oly 4x
15 males Lei 4x
15 females Oly
4x
15 females Lei
4x

Test 1
15M SAM wings
15 F SAM wings
Test 2
15M SAM wings
15 F SAM wings
Olympus 40x

Test 1
13/15 F correct
8/15 M correct
Test 2
15/15 F correct
8/15 M correct

Hessian
50

RF 10 trees

Mixed
microscopes
15 males Oly 4x
15 males Lei 4x
15 females Oly
4x
15 females Lei
4x

Test 1
15M SAM wings
15 F SAM wings
Test 2
15M SAM wings
15 F SAM wings
Leica 40x

Test 1
Table 5
13/15 F correct 86.0%
14/15 M correct
Test 2
9/12 F correct
13/15 M correct

Table 5
73.3%

2.5 Classifying by genotype using Leica 4x images. Egfr, epidermal growth factor; mam,

125

mastermind; RF, Random forest; SAM, Samarkand; tkv thickveins.
Table 3-10 Results with BioCAT for genotype across technical conditions
Features

Classifier

Hessian 50 RF 10 trees

Training set

Testing set

Success rate

Notes

Genotype
train
Leica
64 wings from each
genotype
Leica 4x
Cropped prior to
landmark, splining
960 x 718 pixels

Genotype test
Leica
All from Leica
4x
Test set 1
15 wings from
each genotype
Test set 2
15 wings from
each genotype

EgfrT1:
9/15 correct
6/15 SAM
EgfrT2:
12/15 correct
3/15 SAM
MamT1:
1/15 correct
10/15 Star
4/15 tkv
MamT2:
2/15 correct
2/15 Egfr
3/15 SAM
4/15 Star
4/15 tkv
SAMwT1:
14/15 correct
1/15 mam
SAMwT2:
8/15 correct
7/15 mam
StarT1:
14/15 correct
1/15 mam
StarT2:
15/15 correct
TkvT1:
2/15 correct
4/15 mam
2/15 SAMw
7/15 Star
TkvT2:
3/15 correct
3/15 mam
9/15 Star

53.3
%
accur
ate
overa
ll.
This
isn't
subst
antial
ly
highe
r than
the
accur
acy
using
the
Olym
pus
image
s
(52%
overa
ll),
unlik
e sex
predi
ctions
.

126

REFERENCES

127

REFERENCES

1.

Houle D, Govindaraju DR, Omholt S. Phenomics: the next challenge. Nat Rev Genet.
2010;11:855–66.

2.

Gerlai R. Phenomics: fiction or the future? Trends Neurosci. 2002;25:506–9.

3.

Adams DC, Rohlf FJ, Slice DE. Geometric morphometrics: Ten years of progress after
the ‘revolution’. Ital J Zool. 2004;71:5–16.

4.

Slice DE. Geometric Morphometrics. Ann Rev Anthropol. 2007;36:261–81.

5.

Klingenberg CP. MorphoJ: an integrated software package for geometric morphometrics.
Mol Ecol Resour. 2011;11:353–7.

6.

Mitteroecker P, Gunz P. Advances in geometric morphometrics. Evol Biol. 2009;36:235–
41.

7.

Zelditch ML, Swiderski DL, Sheets HD. Geometric Morphometrics for Biologists: A
Primer. 2nd ed. San Diego: Elsevier; 2012.

8.

Danuser G. Computer Vision in Cell Biology Cell. 2011;147:973–8.

9.

PubMedShamir L, Delaney JD, Orlov N, Eckley DM, Goldberg IG. Pattern Recognition
Software and Techniques for Biological Image Analysis. PLoS Comput Biol.
2010;6:e1000974.

10.

Allanson JE, Bohring A, Dorr H, Dufke A, Gillessen-Kaesbach G, Horn D, et al. The face
of Noonan syndrome: Does phenotype predict genotype. Am J Med Genet A.
2010;152A:1960–6.

11.

Boehringer S, Guenther M, Wurtz RP, Horsthemke B, Wieczorek D. Automated
syndrome detection in a set of clinical facial photographs. Am J Med Genet A.
2011;155A:2161–9.

12.

Ferry Q, Steinberg J, FitzPatrick DR, Ponting CP, Zisserman A, Nellaker C.
Diagnostically relevant facial gestalt information from ordinary photos. Elife.
2014;3:e02020.

13.

Zou L, Adegun OK, Willis A, Fortune F. Facial biometrics of peri-oral changes in
Crohn’s disease. Lasers Med Sci. 2014;29:869–74.

14.

Houle D, Mezey J, Galpern P, Carter A. Automated measurement of Drosophila wings.
BMC Evol Biol. 2003;3:25.

15.

Kumar N, Belhumeur PN, Biswas A, Jacobs DW, Kress WJ, Lopez IC, et al. Leafsnap: A

128

Computer Vision System for Automatic Plant Species Identification. In: Fitzgibbon A,
Lazebnik S, Perona P, Sato Y, Schmid C, editors. Computer Vision - ECCV 2012. Berlin:
Springer; 2012. p. 502–16.
16.

Perrard A, Baylac M, Carpenter JM. Evolution of wing shape in hornets: why is the wing
venation efficient for species identification? J Evol Biol. 2014;27:2665–75.

17.

Unar JA, Seng WC, Abbasi A. A review of biometric technology along with trends and
prospects. Pattern Recogn. 2014;47:2673–88.

18.

Houle D. Numbering the hairs on our heads: the shared challenge and promise of
phenomics. Proc Natl Acad Sci U S A. 2010;107:1793–9.

19.

Debat V, Debelle A, Dworkin I. Plasticity, canalization and developmental stability of the
Drosophila wing: joint effects of mutations and developmental temperature. Evolution.
2009;63:2864–76.

20.

Klingenberg CP. Morphometric integration and modularity in configurations of
landmarks: tools for evaluating a priori hypotheses. Evol Dev. 2009;11:405–21.

21.

Paloniswamy S, Thacker NA, Klingenberg CP. Automatic identification of landmarks in
digital images. IET Comput Vis. 2010;4:247–60.

22.

Gidaszewski NA, Baylac M, Klingenberg CP. Evolution of sexual dimorphism of wing
shape in the Drosophila melanogaster subgroup. BMC Evol Biol. 2009;9:110.

23.

Pitchers W, Pool JE, Dworkin I. Altitudinal clinal variation in wing size and shape in
African Drosophila melanogaster: one cline or many? Evolution. 2013;67:38–42.

24.

Miles WO, Korenjak M, Griffiths LM, Dyer MA, Provero P, Dyson NJ. Posttranscriptional gene expression control by NANOS is up-regulated and functionally
important in pRb-deficient cells. EMBO J. 2014;33:2201–15.

25.

Gafner L, Dalessi S, Escher E, Pyrowolakis G, Bergmann S, Basler K. Manipulating the
sensitivity of signal-induced repression: quantification and consequences of altered
brinker gradients. PLoS One. 2013;8:e71224.

26.

Birdsall K, Zimmerman E, Teeter K, Gibson G. Genetic variation for the positioning of
wing veins in Drosophila melanogaster. Evol Dev. 2000;2:16–24.

27.

Dworkin I, Gibson G. Epidermal Growth Factor Receptor and Transforming Growth
Factor-β signaling contributes to variation for wing shape in Drosophila melanogaster.
Genetics. 2006;173:1417–31.

28.

Ahmad F, Roy K, O’Connor B, Shelton J, Dozier G, Dworkin I. Fly Wing Biometrics
Using Modified Local Binary Pattern SVMs and Random Forest. International Journal of
Machine Learning and Computing. 2014;4:279–85.

129

29.

Payne M, Turner J, Shelton J, Adams J, Carter J, Williams H, et al. Fly wing biometrics.
IEEE Symposium Series on Computational Intelligence and Biometrics. 2013;42-6.
doi:10.1109/CIBIM.2013.6607912.

30.

Shelton J, Bryant K, Abrams S, Small L, Adams J, Leflore D, et al. Genetic &
Evolutionary Biometric Security: Disposable Feature Extractors for Mitigating
Biometrics Replay Attacks. Procedia Comput Sci. 2012;8:351–60.

31.

Zhou J, Lamichhane S, Sterne G, Ye B, Peng H. BIOCAT: a pattern recognition platform
for customizable biological image classification and annotation. BMC Bioinformatics.
2013;14:291.

32.

Blair SS. Wing vein patterning in Drosophila and the analysis of intercellular signaling.
Annu Rev Cell Dev Biol. 2007;23:293–319.

33.

Bray SJ. Notch signaling: a simple pathway becomes complex. Nat Rev Mol Cell Biol.
2006;7:678–89.

34.

Guichard A, Biehs B, Sturtevant MA, Wickline L, Chacko J, Howard K, et al. Rhomboid
and Star interact synergistically to promote EGFR/MAPK signaling during Drosophila
wing vein development. Development. 1999;126:2663–76.

35.

PubMedR Development Core Team. R: A language and environment for statistical
computing. Vienna, Austria: the R Foundation for Statistical Computing. 2011.
http://www.R-project.org. Accessed 21 April 2015.

36.

James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning: with
applications in R. New York: Springer; 2013.

37.

Meijering E, Jacob M, Sarria JCF, Steiner P, Hirling HL, Unser M. Design and validation
of a tool for neurite tracking and analysis in fluorescence microscopy images. Cytom Part
A. 2004;58:167–76.

38.

Chandler C, Chari S, Tack D, Dworkin I. Causes and consequences of genetic
background effects illuminated by integrative genomic analysis. Genetics.
2014;196:1321–36.

39.

Gnu Image Manipulation Program, version 2.8. 2015. http://www.gimp.org. Accessed 21
April 2015.

40.

David’s Batch Processor version 1.1.8. 2015.
http://members.ozemail.com.au/~hodsond/dbp.html. Accessed 21 April 2015.

41.

Rohlf, FJ. tpsDig version 2.17. 2015. http://life.bio.sunysb.edu/morph/soft-dataacq.html.
Accessed 21 April 2015.

42.

CPR. 2015. Houle Lab: Software. http://bio.fsu.edu/~dhoule/wings.html. Accessed 29
April 2015.

130

43.

DworkinLab Github account: Wing_Biometrics_2015.
https://github.com/DworkinLab/Wing_Biometrics_2015. Accessed 29 April 2015.

44.

David's Batch Processor version 1.1.8. 2015.
http://members.ozemail.com.au/~hodsond/dbp.html. Accessed 24 April 2015.

45.

tpsUtil. 2015. http://life.bio.sunysb.edu/morph/soft-utility.html. Accessed 24 April 2015

46.

Pitchers W, Pool JE, Dworkin I. Altitudinal clinical variation in wing size and shape in
African Drosophila melanogaster: one cline or many? Evolution. 2013;67:38-42

47.

Image J. 2015. http://imagej.nih.gov. Accessed 24 April 2015.

131

CHAPTER IV: CONCLUSIONS AND FUTURE
PERSPECTIVE

132

Introduction
A major goal of modem biology is to use newly available diverse sets of quantitative ‘omic data
to make useful predictions relevant to health, agriculture, the environment, and basic biological
research1. This objective applies in particular to 'precision medicine'. The same genetic or
clinical properties in different people can mean very different things depending on individual
factors and environmental context-- being able to take the relevant features and interpret their
meaning without direct experimental testing would be invaluable in a medical or scientific
setting2. However, predictions tend to be most accurate when they have been produced from an
individually tailored dataset, which is not feasible for all the circumstances where predictions
might be applied3. Experimentally, testing all treatments on all possible combination of inputs is
not feasible, so despite the above limitations of prediction, computational methods using these
new and growing data sets must be developed. Technological breakthroughs have provided
modern biologists with an ever-increasing treasure trove of publicly available biologically
relevant data necessary for this task; the appropriate matched computational resources are still
being developed.
Experimental approaches are changing more quickly than the analytical methods for interpreting
them. DNA sequencing technology, for example, has been increasing in speed and decreasing in
cost at an exponential rate4. In just the last ten years, new techniques like STARR-Seq, RNA-seq,
ChIP-seq and chromatin conformation capture (3C) analysis have replaced older methods like
RT-PCR and microarrays5–9. In the last five years, CRISPR has gone from a niche subject to the
“Swiss army knife” of molecular biology10. Despite this data explosion, exhaustively measuring
how cells, organisms or ecosystems with complex genetic architectures would generate
differential responses to diverse environmental circumstances, is still impossible. Thus,

133

computational methods are the only way forward. However, the machine learning methods that
are used to make predictions and classifications from this data are largely unchanged in the last
twenty years11,12.

From genotype to gene expression
In an effort to address this deficiency in one important area of genomic research, in Chapter 2, I
describe my efforts to use random forests to distinguish active cis-regulatory DNA from
background, using the Fly Enhancer Resource developed by Alexander Stark's lab as a training
set13. The existence of such a large database of in-vivo reporters allowed an in-depth analysis of
how effective these predictions were, and how well they generalize. I found random forests made
good predictions when tested on reporters set aside for testing from the Fly Enhancer Resource,
but produced substantially higher false-discovery rate when applied to putative enhancers
identified around well-characterized genes. I also found that attempting to classify by more
specific categories (expression pattern, or temporal activity) was much less accurate. However,
these are both dynamic categories that depend on the developmental window being considered,
thus trying to classify elements into discrete groups may be a fundamentally flawed approach.
Probabilistic models, like what is used in some software like Manolis Kellis ChromHMM, may
be more computationally effective and biologically accurate, as they characterize distributions
rather than classify based on discrete groups14,15.
It seems likely that the currently existing datasets should be able to yield features that show
differences between poised and currently active enhancers (as experimentally shown for H3K27
acetylation in Koenecke et al. 201616) , even if we are not currently able to concretely classify
them. Based on the degree to which classification accuracy of my random forest classification
levels off asymptotically as more datasets are incorporated, it seems likely that including more

134

information from chromatin immunoprecipitation of transcription factors would have marginal
returns. Future efforts along these lines may have more success using data along different lines,
such as RNA-seq information for the relevant locus, chromatin conformation data, histone marks
that are not explicitly associated with enhancer function (which may help parse out nonenhancers that share some similar features, like promoters) and information about proximity to
prospective transcription start sites. More information about the binding of cofactors may also
be relevant for regulatory elements that fit into the category of 'transcription factor collectives'
and are less reliant on the underlying DNA sequence, or longer enhancer regions that do not
necessarily fall into a neat 2kb window17,18.
A more nuanced use of sequence conservation data than applied in my study would be helpful;
information about variation between species and within populations is a rich resource that is
abundantly available for Drosophila, but measuring average sequence change over large
windows is not likely to yield a result19,20. Using cross-species ChIP data to look for regions that
appear to be under purifying or directional selection may be more fruitful, as would using
programs like EMMA from Saurabh Sinha's laboratory that are explicitly intended for making
comparisons of enhancers between species by modeling turnover of transcription factor binding
sites21,22.

From wing shape to genotype
In Chapter 3, I describe a database we developed for the testing of alternative methods for
extracting and analyzing phenotype features from images of Drosophila wings. I also discuss
some preliminary analysis of how existing methods for classifying Drosophila into sex and
genotype using standard morphometric features compares with classification based on features
extracted by Biocat, an open source computer vision tool23. We found that morphometric

135

methods (which have been in development for decades) were substantially more successful, but
features extracted from images using Biocat were quite successful at classifying wings by the sex
of their origin fly, and by genotype24.
In the future, it would be interesting to see how well computer vision methods could be applied
to more complex classifications, like looking for the interactions between genes, recognizing a
given genotype that has been influenced by different environmental conditions, or identifying a
common mutation in different species. It would also be useful to see whether combination
approaches that take advantage of both morphometric features and features extracted using
computer vision methods can outperform either or both methods alone.
New methods of measuring phenotype and correlating it with underlying genetic state may lead
to a more nuanced view of how subtle genetic perturbations influence phenotype. For example,
the phenotypic effects seen in the removal of 'shadow' enhancers may be evident under a range of
conditions, given sufficiently sophisticated measurement25,26.

Conclusions
Although there are still limitations on using computational predictions in place of experimental
work, it is a promising future direction. Experimental technology is constantly advancing, but the
relevant features of the phenome that biologists are interested in exploring appears to be too large
to ever exhaustively explore it through experiments-- it will always be necessary to draw
conclusions from unique combinations of inputs. We used a large number of publicly available
features to determine how accurately enhancers could be distinguished from background. Our
results suggest that based on the currently existing data, there are limits to the success of these
predictions-- surpassing these limits will either require a new method for analysis, a better
understanding of the enhancer state, or a new way of interpreting the existing features. In image

136

analysis, computer vision with machine learning has been used as an alternative to traditional
human noticeable features. Automatic extraction of features allows identifying important
elements that may not appear to have relevance at first glance. Based on our initial analysis,
however, automatic analysis alone is not comparable to the well-developed traditional methods.
We have created a database of wing images in the hopes that it will be used for the creation of
novel tools, which may find new features to use as the basis of predictions. When our ability to
analyze data catches up with our current rate of accumulation, we may end up with a much
stronger understanding of how genotype actually relates to phenotype.

137

REFERENCES

138

REFERENCES

1.

Stephens ZD, Lee SY, Faghri F, et al. Big data: Astronomical or genomical? PLoS Biol.
2015;13(7):7.

2.

Mirnezami R, Nicholson J, Darzi A. Preparing for Precision Medicine. N Engl J Med.
2012;366(6):489-491.

3.

Xue Y, Lameijer E-W, Ye K, et al. Precision Medicine: What Challenges Are We Facing?
Genomics Proteomics Bioinformatics. 2016;14(5):253-261.

4.

Mardis ER. A decade’s perspective on DNA sequencing technology. Nature.
2011;470(7333):198-203.

5.

Arnold CD, Gerlach D, Stelzer C, Boryn LM, Rath M, Stark A. Genome-Wide
Quantitative Enhancer Activity Maps Identified by STARR-seq. Science (80- ).
2013;339(6123):1074-1077. doi:10.1126/science.1232542.

6.

Bumgarner R. DNA microarrays: Types, Applications and their future. Curr Protoc Mol
Biol. 2013;0 22:Unit-22.1.

7.

Denker A, de Laat W. The second decade of 3C technologies: detailed insights into
nuclear organization. Genes Dev . 2016;30(12):1357-1382.

8.

Ho JWK, Bishop E, Karchenko P V, Nègre N, White KP, Park PJ. ChIP-chip versus ChIPseq: Lessons for experimental design and data analysis. BMC Genomics. 2011;12(1):134.

9.

Tachibana C. Transcriptomics today: Microarrays, RNA-seq, and more. Science (80- ).
July 2015.

10.

Sternberg SH, Doudna JA. Expanding the Biologist’s Toolkit with CRISPR-Cas9. Mol
Cell. 2015;58(4):568-574.

11.

Kourou K, Exarchos TP, Exarchos KP, Karamouzis M V., Fotiadis DI. Machine learning
applications in cancer prognosis and prediction. Comput Struct Biotechnol J. 2015;13:817.

12.

Cruz JA, Wishart DS. Applications of machine learning in cancer prediction and
prognosis. Cancer Inform. 2006;2:59-77.

13.

Kvon EZ, Kazmar T, Stampfel G, et al. Genome-scale functional characterization of
Drosophila developmental enhancers in vivo. Nature. 2014;512(7512):91-95.

14.

Ernst J, Kellis M. ChromHMM: automating
characterization. Nat Methods. 2012;9(3):215-216.

139

chromatin-state

discovery

and

15.

Ernst J, Vainas O, Harbison CT, Simon I, Bar‐Joseph Z. Reconstructing dynamic
regulatory maps. Mol Syst Biol. 2007;3(1).

16.

Koenecke N, Johnston J, Gaertner B, Natarajan M, Zeitlinger J. Genome-wide
identification of Drosophila dorso-ventral enhancers by differential histone acetylation
analysis. Genome Biol. 2016;17(1):196.

17.

Junion G, Spivakov M, Girardot C, et al. A Transcription Factor Collective Defines
Cardiac Cell Fate and Reflects Lineage History. Cell. 2017;148(3):473-486.

18.

Pott S, Lieb JD. What are super-enhancers? Nat Genet. 2015;47(1):8-12.

19.

Lack JB, Lange JD, Tang AD, Corbett-Detig RB, Pool JE. A Thousand Fly Genomes: An
Expanded Drosophila Genome Nexus. Mol Biol Evol. 2016;33(12):3308-3313.

20.

Halligan DL, Keightley PD. Ubiquitous selective constraints in the Drosophila genome
revealed by a genome-wide interspecies comparison. Genome Res. 2006;16(7):875-884.

21.

He Q, Bardet AF, Patton B, Purvis J, Johnston J, Paulson A. High conservation of
transcription factor binding and evidence for combinatorial regulation across six
Drosophila species. Nat Genet. 2011;43..

22.

He X, Ling X, Sinha S. Alignment and Prediction of cis-Regulatory Modules Based on a
Probabilistic Model of Evolution. PLOS Comput Biol. 2009;5(3):e1000299.

23.

Zhou J, Lamichhane S, Sterne G, Ye B, H. P. Biocat: a pattern recognition platform for
customizable biological image classification and annotation. BMC Bioinformatics.
2013;14:291.

24.

Zelditch M. Geometric Morphometrics for Biologists : A Primer. Elsevier Academic
Press; 2004.

25.

Cannavo E, Khoueiry P, Garfield DA, Geeleher P, Zichner T, et al. Shadow enhancers are
pervasive features of developmental regulatory networks. Curr Biol. 2016;26:1-14.

26.

Gafner L, Dalessi S, Escher E, Pyrowolakis G, Bergmann S, Basler K. Manipulating the
Sensitivity of Signal-Induced Repression: Quantification and Consequences of Altered
Brinker Gradients. PLoS One. 2013;8(8):8.

140