HYBRID STRUCTURAL AND BEHAVIORAL DIVERSITY TECHNIQUES FOR EFFECTIVE GENETIC PROGRAMMING

By

Armand Rashad Burks

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science — Doctor of Philosophy

2017

ABSTRACT

HYBRID STRUCTURAL AND BEHAVIORAL DIVERSITY TECHNIQUES FOR EFFECTIVE GENETIC PROGRAMMING

By

Armand Rashad Burks

Sustaining the diversity of evolving populations is a fundamental issue in genetic programming. We describe a novel measure of structural diversity for tree-based genetic programming, and we demonstrate its utility compared to other diversity techniques. We demonstrate our technique on the real-world application of tuberculosis screening from X-ray images. We then introduce a new paradigm of genetic programming that involves simultaneously maintaining structural and behavioral diversity in order to further improve the efficiency of genetic programming.

Our results show that simultaneously promoting structural and behavioral diversity improves genetic programming by leveraging the benefits of both aspects of diversity while overcoming the shortcomings of either technique in isolation. The hybridization increases the behavioral diversity of our structural diversity technique, and increases the structural diversity of the behavioral diversity techniques. This increased diversity leads to performance gains compared to either technique in isolation. We found that in many cases, our structural diversity technique provides significant performance improvement compared to other state-of-the-art techniques. Our results from the experiments comparing the hybrid techniques indicate that the largest performance gain was typically attributable to our structural diversity technique; the incorporation of the behavioral diversity techniques provides additional improvement in many cases.

This dissertation is dedicated to you, Mom. Thank you for always believing in me.

ACKNOWLEDGMENTS

This material is based in part upon work supported by the National Science Foundation under Cooperative Agreement No. DBI-0939454. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. This work was supported in part by Michigan State University through computational resources provided by the Institute for Cyber-Enabled Research.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1  Introduction
  1.1 Genetic Programming
  1.2 Premature Convergence
  1.3 Contributions of this Work

Chapter 2  Existing Diversity Techniques in Tree-Based GP
  2.1 Genotypic Diversity Techniques
    2.1.1 Structured Populations
    2.1.2 Multi-objective Diversity Maintenance
  2.2 Behavioral Diversity Techniques
    2.2.1 Semantics in GP
    2.2.2 Semantic-Aware Selection Methods
    2.2.3 Semantic-Aware Operators

Chapter 3  The Genetic Marker Diversity Algorithm for GP
  3.1 Motivation behind GMD-GP
  3.2 Genetic Markers
  3.3 GMD-GP Structural Diversity Preservation

Chapter 4  Experimental Validation of GMD-GP
  4.1 Benchmark Problem Suite
  4.2 Experimental Settings
  4.3 GMD-GP Performance Comparison
  4.4 GMD-GP Diversity Comparison

Chapter 5  GP for Tuberculosis Screening: A Case Study
  5.1 The Tuberculosis Epidemic
  5.2 Related Work
    5.2.1 GP for Image Classification
    5.2.2 Tuberculosis Screening from Chest X-rays
  5.3 Experimental Settings
    5.3.1 Tuberculosis Dataset
  5.4 Results

Chapter 6  Hybridizing GMD-GP with Behavioral Diversity
  6.1 Hybridizing GMD-GP with Behavioral Diversity Selection
    6.1.1 Lexicase Selection in GMD-GP
    6.1.2 Other Selection Techniques in GMD-GP
  6.2 Experimental Validation of Hybrid Selection Techniques
    6.2.1 Performance of Hybrid Selection Techniques
    6.2.2 Effect of Hybridization on Diversity
  6.3 Hybridizing GMD-GP with Behavior-Aware Crossover
    6.3.1 Performance of GMD-GP and LGX Hybrid
    6.3.2 Effect of LGX Hybridization on Diversity

Chapter 7  Conclusions
  7.1 Conclusions on Structural Diversity
  7.2 Conclusions on Hybrid Structural and Behavioral Diversity Techniques
  7.3 Conclusions on GP for Tuberculosis Screening
  7.4 Future Directions

BIBLIOGRAPHY

LIST OF TABLES

Table 4.1: Symbolic Regression Benchmark Problems.
Table 4.2: Input-output Mappings for the Operators A1 - A5 Used in the Finite Algebras Problems.
Table 4.3: General GP Settings.
Table 4.4: List of GP Techniques Used in the GMD-GP Performance and Diversity Comparison.
Table 4.5: Symbolic Regression Benchmarks - Mean Best Ending Fitness.
Table 5.1: GP Settings Used for the TB Experiments.
Table 5.2: Image Classification GP Primitive Set.
Table 5.3: Accuracy Comparison of GP and GMD-GP to Previous Work.
Table 5.4: Training Accuracy on the Shenzhen Dataset.
Table 5.5: Testing Accuracy on the Shenzhen Dataset.
Table 5.6: GP Best Classifier Confusion Matrix.
Table 5.7: GMD-GP Best Classifier Confusion Matrix.
Table 5.8: Time Taken (in seconds) for Best Evolved GP Classifier to Classify an X-ray Image.
Table 5.9: Time Taken (in seconds) for Best Evolved GMD-GP Classifier to Classify an X-ray Image.
Table 5.10: Size of the Best-of-run Classifiers.
Table 6.1: List of GP Techniques Used in the Comparison of the Hybrid Selection Techniques.
Table 6.2: Ending Fitness of the Behavioral Diversity Selection Techniques Compared to the Hybrids.
Table 6.3: Ending Fitness of GMD-GP Compared to the Hybrids.

LIST OF FIGURES

Figure 1.1: The Conventional Evolutionary Process of GP.
Figure 2.1: An Example Set of Pareto-optimal Solutions in a 2D Objective Space.
Figure 2.2: Semantics of a GP Tree.
Figure 2.3: Semantic Geometric Crossover for Real-valued Expressions.
Figure 2.4: Semantic Geometric Crossover for Boolean Expressions.
Figure 3.1: A Rooted Genetic Marker.
Figure 3.2: A Non-rooted Genetic Marker.
Figure 3.3: Rooted Genetic Marker vs. Rooted Tree Schema.
Figure 3.4: Non-rooted Genetic Marker vs. Hyperschema.
Figure 3.5: The Main Evolutionary Loop of GMD-GP.
Figure 4.1: Symbolic Regression Benchmarks - Fitness over Time.
Figure 4.2: Pairwise Comparison of Convergence Rate.
Figure 4.3: Pairwise Comparison of Success Rate.
Figure 4.4: Pairwise Comparison of Genetic Marker Density.
Figure 4.5: Pairwise Comparison of Behavioral Diversity.
Figure 4.6: Pairwise Comparison of Fitness Standard Deviation.
Figure 5.1: Regions of the X-ray Used by the Best Evolved GP Classifier.
Figure 5.2: Regions of the X-ray Used by the Best Evolved GMD-GP Classifier.
Figure 6.1: Pairwise Comparison of the Convergence Rate of the Hybrids.
Figure 6.2: Pairwise Comparison of the Success Rate of the Hybrids.
Figure 6.3: Pairwise Comparison of the Ending Fitness of the Hybrids.
Figure 6.4: Pairwise Comparison of the Genetic Marker Density of the Hybrids.
Figure 6.5: Pairwise Comparison of the Structural Diversity of the Other Hybrids.
Figure 6.6: Pairwise Comparison of the Behavioral Diversity of the Hybrids.
Figure 6.7: Pairwise Comparison of the Behavioral Diversity of the Other Hybrids.
Figure 6.8: Pairwise Comparison of the Fitness Standard Deviation of the Hybrids.
Figure 6.9: Pairwise Comparison of the Fitness Standard Deviation of the Other Hybrids.
Figure 6.10: Pairwise Comparison of the Ending Fitness of the GMD-GP, LGX Hybrid.
Figure 6.11: Pairwise Comparison of the Structural Diversity of the GMD-GP, LGX Hybrid.
Figure 6.12: Pairwise Comparison of the Behavioral Diversity of the GMD-GP, LGX Hybrid.
Figure 6.13: Pairwise Comparison of the Fitness Standard Deviation of the GMD-GP, LGX Hybrid.
Chapter 1

Introduction

1.1 Genetic Programming

Genetic programming (GP) [39] refers to a class of biologically-inspired techniques for the automatic synthesis of computer programs. Mimicking the evolutionary process, GP applies the concepts of natural selection, recombination, inheritance, and variation to a population of computer programs with the goal of automatically assembling a program with a certain desired behavior and characteristics. One of the long-standing and primary goals of GP is as follows. Given a programming task, typically defined by (1) a set of training examples (input cases and desired outputs), and (2) a set of primitives that make up a computer program (i.e. operands that correspond to the input cases and operators that perform some action on the given operands), automatically assemble a computer program that correctly performs the task (i.e. produces the desired outputs for all of the input cases).

Figure 1.1 provides an illustration of the conventional evolutionary cycle in GP. First, a population of programs is randomly generated. Next, each program (referred to as an individual) is evaluated using a problem-specific fitness function to determine its fitness, which is typically a measure of its proximity to the optimal solution. After the population has been evaluated, selection pressure is applied by choosing fitter individuals. The selected individuals undergo genetic operators such as crossover and mutation to produce a new population. The process of fitness evaluation, selection, and breeding repeats until either a solution is found or a predefined number of cycles (commonly measured in generations or total fitness evaluations) has been reached.

Figure 1.1: The Conventional Evolutionary Process of GP.

In tree-based GP, which is the form of GP we consider in this work, programs are represented by syntax trees, where the internal (function) nodes are typically operators and the terminal nodes are typically operands. This representation provides great flexibility in expressing and evolving functions. The traditional crossover operator creates new offspring programs by swapping randomly selected subtrees between two parent programs. Similarly, the traditional mutation operators replace a subtree (or node) in a parent program with a randomly generated subtree (or node) using the primitive set.
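To make the cycle in Figure 1.1 concrete, the following minimal sketch (in Python) implements one version of the conventional generational loop for a toy symbolic regression task. It is an illustrative toy, not the implementation used in this work: the function set, parameters, and target are arbitrary, and it has no bloat control, mutation, or elitism.

    import random, operator

    # Trees are nested tuples: ('add', left, right), or a terminal 'x' / constant.
    FUNCS = {'add': operator.add, 'sub': operator.sub, 'mul': operator.mul}
    CASES = [(float(x), x * x + x) for x in range(-5, 6)]   # target: x^2 + x

    def random_tree(depth=3):
        if depth == 0 or random.random() < 0.3:
            return random.choice(['x', 1.0])
        f = random.choice(list(FUNCS))
        return (f, random_tree(depth - 1), random_tree(depth - 1))

    def run(tree, x):
        if tree == 'x': return x
        if isinstance(tree, float): return tree
        f, a, b = tree
        return FUNCS[f](run(a, x), run(b, x))

    def fitness(tree):  # normalized cumulative absolute error, as in Equation 4.1
        error = sum(abs(run(tree, x) - y) for x, y in CASES)
        return 1.0 / (1.0 + error)

    def nodes(tree, path=()):  # enumerate (position, subtree) pairs
        yield path, tree
        if isinstance(tree, tuple):
            for i, child in enumerate(tree[1:], 1):
                yield from nodes(child, path + (i,))

    def replace(tree, path, sub):
        if not path: return sub
        t = list(tree)
        t[path[0]] = replace(t[path[0]], path[1:], sub)
        return tuple(t)

    def crossover(p1, p2):  # swap a random subtree of p2 into p1
        path, _ = random.choice(list(nodes(p1)))
        _, donor = random.choice(list(nodes(p2)))
        return replace(p1, path, donor)

    def tournament(pop, k=3):
        return max(random.sample(pop, k), key=fitness)

    pop = [random_tree() for _ in range(200)]
    for gen in range(30):  # generate, evaluate, select, breed
        best = max(pop, key=fitness)
        if fitness(best) == 1.0: break
        pop = [crossover(tournament(pop), tournament(pop)) for _ in range(len(pop))]
    print(gen, fitness(best), best)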
1.2 Premature Convergence

Premature convergence is a serious problem that is known to hamper the performance of GP, as well as other evolutionary algorithms (EAs). We consider premature convergence to have occurred during the evolutionary process when the population becomes largely homogeneous before an optimal solution has been discovered. This often occurs during the early stages of the evolutionary process, when sub-optimal solutions dominate the population and the typical genetic operators are unable to explore different regions of the search space. As a result, the likelihood of discovering an optimal solution dramatically decreases.

Premature convergence is an even more challenging problem for conventional GP because of the variable-length tree representation of programs. In tree-based GP, populations converge much differently than in EAs that use fixed-length representations, such as the binary-encoded genetic algorithm (GA). In a binary-encoded GA, when two genetically identical individuals undergo single-point crossover, the resulting offspring will also be identical to the parents. However, given that GP trees typically encode functions, performing the conventional random subtree crossover operator on two identical trees can still create functionally different offspring when different subtrees are exchanged between the trees. Therefore, even if a GP population is largely genetically homogeneous, it is still possible for the traditional genetic operators to yield different phenotypes.

While GP populations converge differently than other EAs, it is known that traditional GP populations are still greatly affected by premature convergence. An early analysis of population diversity revealed that GP populations often rapidly lose diversity during the early stages of the evolutionary process [53]. In that study, the authors tracked the propagation of genetic material through the population over time and showed that the population at the final generation tended to inherit all of its genetic material from a very small number of individuals from the initial generation. Inspecting the top four levels of the trees in the population revealed that near the onset of the evolutionary process, a large proportion of the population (over 70%) became genetically identical to a single individual from the initial generation [53]. Since the rooted portion of a tree defines its functional context, changes made near the root have a much greater impact on its overall functionality. Therefore, prematurely converging to a single rooted structure can limit the variety of phenotypes that can be evolved.

Other studies that examined the properties of the conventional random subtree crossover operator revealed that it also contributes to the loss of diversity [24, 64]. Random subtree crossover operates on two trees and uniformly, randomly selects a node in each tree. The subtrees rooted at the chosen nodes are swapped between the two trees to create two offspring. It was shown that nodes deeper in the trees are more frequently chosen for crossover, while the upper regions of the trees are largely unaltered [24]. This is largely due to the fact that, given the branching factors of trees, the number of nodes lower in the tree is disproportionately larger than the number of nodes at higher levels. A common method for addressing this is to force the crossover operator to probabilistically select internal nodes more often than leaves. However, Langdon and Poli showed that random subtree crossover tends to exchange little genetic material between trees, which further exacerbates genotypic convergence [64].

In addition to premature convergence in the genotypic space, premature convergence in the phenotypic (behavioral) space is a serious, yet complex, issue in GP [35, 36, 46]. Because many genotypes (trees) can encode the same phenotype (the behavior expressed by a tree when it is executed), the genetic diversity of a GP population does not directly correlate with the phenotypic diversity of the population. Therefore, a genetically diverse population is not guaranteed to also be behaviorally diverse. Likewise, a genetically homogeneous GP population is not guaranteed to be behaviorally homogeneous, since it is still possible for new phenotypes to arise, as we discussed earlier. However, GP populations often experience a loss of behavioral diversity [35, 36]. Due to strong selection pressure, GP populations tend to rapidly converge to a few clusters of individuals that express the same behavior [36, 46].
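The distinction between the two spaces can be made concrete with a few lines of code. The snippet below (reusing the nested-tuple trees and run() evaluator from the sketch in Section 1.1) contrasts genotypic diversity, counted over unique tree structures, with behavioral diversity, counted over unique output vectors; the diversity measures here are illustrative, not the ones used later in this work.

    def genotypic_diversity(pop):
        # fraction of structurally unique trees in the population
        return len(set(pop)) / len(pop)

    def behavioral_diversity(pop, inputs):
        # fraction of unique output vectors; distinct trees that compute
        # the same function collapse to a single entry
        semantics = {tuple(run(t, x) for x in inputs) for t in pop}
        return len(semantics) / len(pop)

    # Three genotypically distinct trees that all compute the identity function:
    pop = [('mul', 'x', 1.0), ('mul', 1.0, 'x'), ('add', 'x', ('sub', 1.0, 1.0))]
    print(genotypic_diversity(pop))             # 1.0  (all trees differ)
    print(behavioral_diversity(pop, range(5)))  # 0.33 (one shared behavior)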
1.3 Contributions of this Work

There are multiple contributions of this research. We begin with a survey of several preexisting genotypic and phenotypic diversity techniques for GP. Next, we introduce a simple yet powerful novel measure of structural diversity that is free of the notion of genotypic distance. We then present a multi-objective technique that incorporates our structural diversity measure to prevent premature convergence while accelerating search. We validate our technique by demonstrating its utility and performance benefit compared to state-of-the-art and conventional GP approaches, as well as by applying our technique to the real-world problem of tuberculosis screening from X-ray images. We also provide a detailed analysis of the search properties and behavior of our technique.

Next, we introduce hybrid methods for simultaneously sustaining structural and behavioral diversity to achieve more effective GP search than either method in isolation. Empirical analyses reveal that simultaneously maintaining structural and behavioral diversity can improve the search performance of GP by leveraging both aspects of diversity and overcoming some of the challenges faced by either method in isolation. These findings also motivate the design of new GP operators that utilize structural and behavioral diversity to achieve more efficient search.

Chapter 2

Existing Diversity Techniques in Tree-Based GP

Premature convergence has received considerable attention, and many techniques have been developed in order to prevent it in GP as well as in EAs in general. Many such techniques focus on maintaining population diversity as a means of combating premature convergence. This chapter reviews several existing diversity techniques.

2.1 Genotypic Diversity Techniques

The traditional search operators such as crossover and mutation work at the genotypic level, making genetic modifications to the individuals in the evolving population in order to evolve new phenotypes and explore different regions of the search space. Therefore, the genotypes in the evolving population play a vital role in the effectiveness of EAs. As a result, several existing diversity techniques focus on sustaining genotypic diversity in the population in order to prevent premature convergence.

Several genotypic diversity techniques rely on a distance metric to determine the similarity between genotypes in the population. Given a distance metric, a population with greater genotypic distance between individuals is said to be more genetically diverse. Therefore, maximizing the pairwise distance between individuals of the population has been used as a means of promoting diversity [18].

A type of tree edit distance has been used to promote diversity in GP [18]. To determine the distance between two trees, the trees are first overlaid, and then the total number of nodes that differ in the overlapping region of the two trees is calculated. The distance is normalized by dividing it by the size of the smaller tree. Genotypic diversity is then measured as the average squared distance of each individual to the rest of the population.
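The following sketch shows one plausible reading of that overlaid-tree distance, again using the nested-tuple trees from Chapter 1. Details such as how children beyond the common arity are treated are our assumption here, not necessarily those of [18].

    def tree_size(t):
        return 1 if not isinstance(t, tuple) else 1 + sum(tree_size(c) for c in t[1:])

    def overlap_distance(t1, t2):
        # Overlay both trees from the root, count differing nodes in the
        # region where they overlap, and normalize by the smaller tree size.
        def diff(a, b):
            label = lambda t: t[0] if isinstance(t, tuple) else t
            d = 0 if label(a) == label(b) else 1
            if isinstance(a, tuple) and isinstance(b, tuple):
                d += sum(diff(ca, cb) for ca, cb in zip(a[1:], b[1:]))
            return d
        return diff(t1, t2) / min(tree_size(t1), tree_size(t2))

Averaging the squared overlap_distance of an individual to every other member of the population then yields the diversity value described above.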
Although such genotypic diversity metrics can be used to increase population diversity, there are limitations of such measures [11]. One of the main issues is that, depending on how diversity is defined, high diversity may not necessarily imply good fitness or efficient search [11]. Genotypic diversity is undoubtedly important to the success of EAs because it increases the possible ways of assembling a solution. However, it is less clear how a diversity metric should be defined, how much diversity is needed, and how diversity should be enforced. Moreover, genotypic distance may not accurately reflect phenotypic distance in GP, due to the complicated genotype-to-phenotype mapping. Therefore, many approaches focus on maintaining diversity without the use of an explicit distance metric.

Structure Fitness Sharing (SFS) [33] promotes genotypic diversity based on tree structure. However, instead of explicitly calculating the distance between trees, SFS labels each unique tree structure and tracks the number of individuals and the fitness for each structure. The main idea is to promote the search for unique structures and prevent a single structure from dominating the population. This is done by adopting a fitness sharing scheme [25, 26], which reduces the fitness of structures that are over-represented in the population.

Lineage selection [12] was proposed as a means of promoting the search for the "fit and diverse" rather than solely selecting individuals based on fitness. Lineage selection is based on the concept of genetic lineages, where the lineage of an individual can be determined by tracing the path from its root parent (the parent from which it inherited its root node) to the individuals from which it was created through crossover [12, 53]. Lineage selection groups individuals based on genetic lineage, and then modifies the traditional tournament selection operator to hold tournaments among random individuals selected from different lineages.

In a similar fashion to lineage selection, Hereditary Repulsion (HR) [56] uses genetic lineage information to promote diversity. When selecting parents to create offspring, HR selects the first parent randomly and then selects a random subset of candidate individuals for the second parent. Next, the hereditary overlap between the first parent and each of the candidates is determined, where the hereditary overlap of two individuals is the number of common ancestors the individuals have in their lineages. In this case, both parents of an individual are considered in the lineage, instead of only the root parent. The candidate with the smallest hereditary overlap with the first parent is chosen to undergo crossover.
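A minimal sketch of HR-style partner selection is shown below, under the assumption that each individual carries a set of ancestor identifiers; the attribute name and parameters are illustrative, not taken from [56].

    import random

    def hr_select_partner(population, first_parent, num_candidates=5):
        candidates = random.sample(population, num_candidates)
        # hereditary overlap = number of shared ancestors in the two lineages
        overlap = lambda a, b: len(a.ancestors & b.ancestors)
        return min(candidates, key=lambda c: overlap(first_parent, c))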
2.1.1 Structured Populations

Other methods have been introduced for enforcing genotypic diversity by imposing a structure on the population instead of relying on an explicit diversity measure [30, 32]. The main motivation behind such techniques is that fitter genotypes tend to rapidly dominate the population, which accelerates premature convergence. By restricting competition through a structured population, and by regularly introducing new genetic material into the population, these methods attempt to prevent premature convergence and promote better exploration of the search space.

The Age-Layered Population Structure (ALPS) uses a hierarchical population to counteract premature convergence [8, 30, 31]. ALPS uses genotypic age to segregate the population into several "age layers" in order to prevent older genotypes from dominating the population. Genotypic age is a measure of the amount of time that has passed since an individual's oldest ancestor was created. The age layers restrict competition and breeding to individuals of similar genotypic age. Each layer has an age limit, which can be determined by different numerical aging schemes [30]. As time progresses, the genotypic age of individuals reaches the age limit while the individuals in the layers become fitter. Individuals above the age limit are forced to move out of their layers and compete for membership in the next highest layer by displacing a less-fit individual or an individual that is too old for that layer.

In addition to using age layers to localize competition, ALPS also periodically introduces new randomly generated individuals into the population at the lowest layer. Since individuals only compete against other individuals within a certain range of their genotypic age, new randomly generated individuals do not face the risk of immediately being dominated by older, fitter individuals. Therefore, the lowest layer serves as a source of new genetic material and of new building blocks that propagate up the hierarchy and help the population escape local optima.
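The genotypic-age bookkeeping behind ALPS is simple to express in code. The sketch below follows the common convention that an offspring inherits the age of its oldest genetic material and that fresh random individuals start at age 0; it illustrates the idea rather than reproducing any particular ALPS implementation.

    def offspring_age(parent_ages):
        # the offspring contains genetic material from all parents, so its
        # genotypic age is that of the oldest material it inherits
        return max(parent_ages)

    def advance_generation(ages):
        # every individual's genetic material grows one generation older
        return [a + 1 for a in ages]

    def layer_index(age, age_gap=5):
        # a simple linear aging scheme: layer i admits ages below (i+1)*age_gap;
        # other numerical aging schemes are described in [30]
        return age // age_gap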
The Hierarchical Fair Competition Framework (HFC) and its variants work on the idea of "fitness gradients," into which the population is organized to make competition between individuals more balanced [32]. HFC organizes the population into several fitness levels, where each level contains one or more subpopulations. By dividing the range of possible fitness values among the different fitness levels, HFC creates a fitness gradient through which evolutionary search progresses. This effectively decreases local selection pressure at each fitness level, because individuals must only compete with other individuals in their own fitness level. This is a major advantage over the traditional EA because it protects "stepping stones," which are individuals of intermediate fitness that potentially contain valuable genetic material [32]. This way, stepping stones have more of a chance to spread potentially valuable genetic material through the population without the risk of being dominated by fitter individuals.

To provide a constant source of new genetic material, HFC continuously adds new randomly generated individuals at the lowest fitness level. The new individuals are less likely to be immediately dominated in fitness, since competition occurs between individuals of similar fitness. However, HFC still maintains a strong global selection pressure, since fitness increases toward the upper levels of the hierarchy [32]. This incorporation of new genetic material is not only a way of increasing genetic diversity; it can also increase the likelihood of reaching previously unexplored areas of the search space and prevent the population from prematurely converging to local optima.

2.1.2 Multi-objective Diversity Maintenance

As we discussed in Section 1.1, the conventional fitness function is a measure of an individual's proximity to an optimal solution. For many problems, fitness is defined based on a single measure. However, other problems contain multiple criteria that must be optimized, and there are several existing multi-objective techniques (for example, the popular NSGA-II [19]) that have been developed for such problems. Leveraging the power of such multi-objective techniques, diversity maintenance techniques have been designed by incorporating a diversity measure as an additional, artificial objective of the problem alongside fitness [18, 69]. This multiobjectivization [38] allows diversity to be maximized while fitness is simultaneously optimized, and allows the use of the many techniques that were originally developed for multi-objective problems.

Figure 2.1: An Example Set of Pareto-optimal Solutions in a 2D Objective Space. Both objectives are maximized. No single point dominates any other point in both objective values.

Such multi-objective methods are based on the Pareto criterion, which considers the trade-off curve of the candidate solutions [68]. The key idea is that given multiple objectives, there may exist many combinations of objective values that are equally optimal (i.e. Pareto-optimal, as in Figure 2.1). Therefore, the Pareto criterion compares individuals on all objectives, and considers an individual A to dominate individual B if the following holds:

∀i Ai ≥ Bi  ∧  ∃i | Ai > Bi     (2.1)

where Ai and Bi are the objective values of A and B, respectively, for each objective i. This assumes maximization of all objective values; the same holds for minimization of any objective, with ≥ and > replaced by ≤ and <, respectively.
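A direct translation of Equation 2.1 into code, assuming every objective is to be maximized (flip the comparisons for minimized objectives):

    def dominates(a, b):
        # a and b are equal-length sequences of objective values
        return all(ai >= bi for ai, bi in zip(a, b)) and \
               any(ai > bi for ai, bi in zip(a, b))

    # Example with (fitness, diversity) pairs, both maximized:
    print(dominates((0.9, 0.5), (0.8, 0.5)))  # True: better in one, equal in the other
    print(dominates((0.9, 0.4), (0.8, 0.5)))  # False: a trade-off, so neither dominates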
The FOCUS (Find Only and Complete Undominated Sets) method [18] uses a multi-objective scheme to promote genetic diversity while optimizing fitness. Based on the overlapping tree distance metric that we discussed earlier, FOCUS calculates for each individual the average squared distance of that individual to every other individual in the population. This value becomes the diversity objective value for each individual, which is maximized in the multi-objective scheme. Each generation, FOCUS uses the Pareto criterion to remove all individuals that are dominated, considering diversity, fitness, and tree size as objectives. The tree size objective, which is minimized, adds parsimony pressure to reduce the problem of bloat, wherein GP trees typically grow dramatically over time. However, it has been shown that parsimony pressure can be detrimental to GP performance [72].

Age-Fitness Pareto Optimization [69] was introduced as a means of using genotypic age, as in ALPS, to localize competition and maintain a constant source of new genetic material without the overhead of managing a structured population. Age-Fitness Pareto Optimization adopts a multi-objective scheme in which genotypic age is minimized as one objective while fitness is optimized as the second objective. The main goal is to promote the search for newer, fitter genotypes while preventing older genotypes from dominating newer ones. Age-Fitness Pareto Optimization adds a new randomly-generated individual to the population each generation as a means of maintaining a constant source of new genetic material. Since the new individual will be non-dominated in the age objective, it has more of a chance to propagate potentially valuable genetic material through the population.

2.2 Behavioral Diversity Techniques

As Section 1.2 discusses, genotypically different trees can express the same behavior due to the complex genotype-to-phenotype mapping in tree-based GP. Therefore, enforcing diversity in the genotypic space does not ensure diversity in the phenotypic (behavioral) space. Fortunately, recent research has focused on maintaining a diverse set of behaviors by operating explicitly in the phenotypic space, avoiding the problem of genotype-to-phenotype mapping.

2.2.1 Semantics in GP

Before discussing the behavioral diversity techniques that have been introduced for GP, it is important to first define behavior in GP. Since the goal of GP is to automatically synthesize programs for a given programming task, problems in GP are typically defined with a set of training examples that contain inputs and desired outputs. More formally, a problem can be defined by several input cases ki, each of which has a corresponding desired output, yi. When a program p is evaluated by the fitness function, the program is typically executed on each input case ki, for which it produces an output p(ki). The semantics or behavior, s(p), of p is commonly defined as the vector of outputs that p produces when executed on all of the input cases [54, 76]. Figure 2.2 demonstrates how the semantics of a tree is obtained by executing the tree on the input cases.

Figure 2.2: Semantics of a GP Tree. The tree p (the tree shown computes p(x) = x * x) is executed on the input k = [1, 2, 3, 4, 5] and produces a corresponding output p(ki) for each input case ki. The semantics s(p) is the vector of outputs [1, 4, 9, 16, 25] produced by executing p on k.
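Computing the semantics of Figure 2.2 is a one-liner, reusing the nested-tuple trees and run() evaluator from Chapter 1's sketch:

    def semantics(tree, inputs):
        return tuple(run(tree, x) for x in inputs)

    p = ('mul', 'x', 'x')                  # p(x) = x * x
    print(semantics(p, [1, 2, 3, 4, 5]))   # (1, 4, 9, 16, 25)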
2.2.2 Semantic-Aware Selection Methods

Several selection methods have been proposed for sustaining behavioral diversity rather than focusing solely on genotypic diversity. The conventional selection methods, such as tournament selection and fitness-proportionate selection, only consider an individual's fitness and are blind to its actual behavior. On the other hand, semantic-aware selection methods consider some aspect of the individual's behavior when selecting individuals to produce offspring.

Since GP is often applied to problems that consist of multiple training examples (fitness cases), Lexicase Selection [28] considers the performance of an individual on each fitness case. The main goal of Lexicase Selection is to breed individuals that perform well on different sets of fitness cases, in order to increase the likelihood that their offspring will inherit complementary traits from the parents. Before selecting each parent from the population, Lexicase Selection chooses a random ordering of the fitness cases and then, in that order, eliminates parent candidates by retaining only those that achieve the exact best error on each successive fitness case, until only a single individual remains to be selected.

Although Lexicase Selection has been shown to increase behavioral diversity as well as improve search performance in different problem domains, it is known to perform poorly for problems wherein the error for each fitness case is real-valued. Since we used real-valued problems in our test suite, which we describe in detail in Chapter 4, we used Epsilon Lexicase Selection [45], a recent enhancement that performs well for real-valued problems. Rather than using the exact current best error value obtained on each fitness case, Epsilon Lexicase Selection eliminates the individuals with error values outside some epsilon of the best error. This alleviates the problem of only one fitness case being used to eliminate candidates, which arises from the nature of real-valued errors. Epsilon Lexicase Selection can also eliminate the need for a pre-defined epsilon threshold by adapting epsilon based on the current population.

Similar to Lexicase Selection, Comparative Partner Selection [17] considers an individual's performance on the fitness cases. Comparative Partner Selection assigns each individual a binary string fitness characterization (BSFC), which captures the individual's relative performance on each fitness case with respect to its performance on the other fitness cases. Comparative Partner Selection attempts to breed individuals with different BSFCs from each other in order to take advantage of their strengths and improve upon their weaknesses. However, rather than focusing directly on behavioral diversity, Comparative Partner Selection attempts to accelerate convergence to the solution by reducing phenotype variance [17].

Complementary Phenotype Selection [21] is another selection technique that considers the performance of an individual on the fitness cases. Similar to the goal of Lexicase Selection, Complementary Phenotype Selection attempts to breed individuals that perform well on complementary sets of fitness cases, in an attempt to create offspring that inherit the strengths of the parents and improve upon their weaknesses. To accomplish this, the first parent is selected with a traditional selection method (we used tournament selection in our experiments). Next, for every other individual in the population, an imaginary output vector, BF,M, is created by taking the better error between the individual and the first parent on each fitness case. BF,M represents the output vector of an offspring that would perform as well as the better-performing parent on each fitness case. The individual that creates the fittest BF,M is selected as the second parent.

Interleaved Sampling (IS) [50] also uses fitness case performance to select individuals for breeding. In contrast to Lexicase Selection, IS alternates between using all of the fitness cases for calculating fitness in certain generations, and a single, randomly-selected fitness case in other generations. Similarly, Keep-Worst Interleaved Sampling [50] selects some number of the currently most difficult fitness cases, rather than selecting a single fitness case randomly, in the generations in which the entire set of fitness cases is not used.

Implicit fitness sharing (IFS) [52] incorporates fitness case performance into the fitness calculation. Based on the original fitness sharing concept [25, 26], IFS reduces the fitness reward given by each fitness case based on how many individuals solve that fitness case. Since individuals are selected based on fitness, IFS effectively reduces the likelihood of selecting individuals that perform similarly.

Rather than directly considering fitness case performance, Semantic Sharing [58] compares the semantic similarity of individuals in order to reduce the fitness of semantically similar individuals. To determine semantic similarity, Sampling Semantics Distance [58] compares the output vectors of individuals and uses a threshold of dissimilarity. Semantic Sharing was shown to improve the performance of GP, while the compared syntactic sharing method did not improve GP performance [58].

Semantics in Selection (SiS) [23] attempts to promote behavioral diversity while still maintaining strong selection pressure for improved fitness. Based on the commonly used tournament selection operator, SiS is used to breed fit individuals that are also behaviorally different from each other. SiS selects the first parent based solely on fitness, i.e. using the original tournament selection method. To select the second parent, SiS holds another tournament to find the fittest individual that is also behaviorally different from the first parent. Unlike other approaches, such as Semantic Sharing, that employ a measure of behavioral similarity [58], SiS considers two individuals to be behaviorally unique if their output vectors differ by at least one element.
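The core filtering loop shared by Lexicase and Epsilon Lexicase Selection is sketched below. Each individual is assumed to carry an errors list with one (minimized) error per fitness case; that attribute name is illustrative. With epsilon = 0.0 the sketch reduces to standard Lexicase Selection.

    import random

    def epsilon_lexicase_select(population, num_cases, epsilon=0.0):
        candidates = list(population)
        cases = random.sample(range(num_cases), num_cases)  # random ordering
        for c in cases:
            if len(candidates) == 1:
                break
            best = min(ind.errors[c] for ind in candidates)
            # retain only candidates within epsilon of the best error on case c
            candidates = [ind for ind in candidates if ind.errors[c] <= best + epsilon]
        return random.choice(candidates)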
2.2.3 Semantic-Aware Operators

In recent years, several operators that utilize the semantics of individuals during crossover have been introduced [61, 76]. Although some of the methods we discuss in this section do not focus explicitly on diversity, they affect behavioral diversity, since they consider the behavior of trees instead of modifying trees completely at random.

A major limitation of the widely-used random subtree crossover is that it makes no consideration of the context in which the selected subtrees occur, or of the semantics of the selected subtrees. Therefore, random subtree crossover has been considered destructive, because swapping subtrees into random contexts, irrespective of the semantic effects, can potentially destroy good solutions [49]. In an effort to overcome the destructive nature of random subtree crossover, context-aware crossover (CAC) [49] exhaustively locates the best possible context at which to insert the subtree from the donating tree into the receiving tree. This is done by randomly selecting a subtree in one parent and swapping the subtree into all possible crossover points in the other parent, excluding the root. Each resulting offspring is evaluated, and the offspring with the highest fitness is chosen. Although this does not directly utilize the semantics of the subtree to be swapped, it considers the semantic effect of each possible crossover by determining the change in fitness. Despite increasing the number of fitness evaluations required per generation, CAC was shown to perform better than random subtree crossover [49].

Other approaches that generate several offspring (referred to as a brood) from two parents have been proposed. Semantically Driven Crossover (SDC) [6] performs random subtree crossover between two parents and only allows offspring to enter the population if they are semantically different from the parents. This is similar to offspring selection [1] in that the process continues until it creates an offspring that satisfies the criteria for acceptance (being semantically different, in the case of SDC). The approximately geometric crossover [40] generates n offspring and selects the offspring whose semantics are as close as possible to the linear combination of the semantics of the two parents. Other crossover operators have been proposed that consider the semantic similarity of the subtrees exchanged between parents, as well as the semantic equivalence of the children [57, 74].

Semantic Geometric Crossover (SGX) [55] was designed to ensure that the semantics of an offspring lie somewhere on the segment connecting the semantics of its parents in the semantic space. SGX creates an offspring tree as a convex geometric combination of the two parent trees (with respect to the semantic space) by combining both trees with either a random constant or a random expression. Figure 2.3 and Figure 2.4 demonstrate how SGX combines two parent trees t1 and t2 to create the offspring for the real-valued and Boolean domains using Equation 2.2 and Equation 2.3, respectively, where tr is a randomly-generated expression (or possibly a random constant in the case of the real-valued domain):

t1 * tr + (1 − tr) * t2     (2.2)

(t1 ∧ tr) ∨ (¬tr ∧ t2)     (2.3)

Figure 2.3: Semantic Geometric Crossover for Real-valued Expressions. (The figure depicts Parent 1, Parent 2, the random expression tr, and the offspring tree that combines them according to Equation 2.2.)
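Equation 2.2 can be realized purely syntactically, as in the sketch below (using the nested-tuple trees from earlier sketches). For the offspring to be a convex combination, tr's outputs must lie in [0, 1], for example a random constant in that range; we omit that detail here.

    def sgx_offspring(t1, t2, tr):
        # t1*tr + (1 - tr)*t2: the offspring embeds both parents whole plus
        # the random expression, which is why tree size grows so quickly
        return ('add', ('mul', t1, tr), ('mul', ('sub', 1.0, tr), t2))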
SGX has many appealing properties that improve upon random subtree crossover. First, SGX transforms the fitness landscape traversed by GP into a conical, unimodal landscape, since the fitness of an individual is based on its semantic distance to the solution [55, 61]. Furthermore, a geometric crossover guarantees that the fitness of the offspring is no worse than that of its worse parent [61].

Figure 2.4: Semantic Geometric Crossover for Boolean Expressions. (The figure depicts Parent 1, Parent 2, the random Boolean expression tr, and the offspring tree that combines them according to Equation 2.3.)

Although SGX has very attractive properties, the main drawback of its original description [55] is that, as Figure 2.3 and Figure 2.4 show, the offspring contains the entirety of both parents' trees in addition to a randomly-generated subtree (or constant), which means that tree size grows exponentially over time. However, extensions have been proposed for alleviating this issue [59, 75].

Subtree Semantic Geometric Crossover (SSGX) [59] was recently proposed as an extension of SGX that attempts to alleviate the exponential tree growth of the original operator. Instead of using the entire parent tree, SSGX selects from the parent the subtree whose semantics are closest to the semantics of the root of the tree. Furthermore, to prevent selecting a large subtree, SSGX only selects subtrees within a pre-defined size range. This way, SSGX chooses a subtree that is smaller than the parent and approximates the original semantics of the parent. In a set of experiments in the symbolic regression problem domain, SSGX performed similarly to or better than SGX, and significantly reduced tree growth [59].

One issue with current semantic geometric crossovers such as SGX is that although the created offspring are geometric (i.e. the semantics of the offspring lie on the segment connecting the semantics of the parents), crossover may not always produce offspring that are semantically effective (different from the parents). This is because, by definition, the child can be semantically equivalent to one of the parents and still be geometric [61]. However, this has been addressed to varying extents [42, 43, 44, 62].

Locally Geometric Semantic Crossover (LGX) [42, 44] improves upon random subtree crossover by attempting to find a semantically medial subtree to place into both parents, in order to increase the likelihood that the behavior of the resulting offspring will lie between that of the parents in the behavioral space. Semantic mediality of offspring was shown to be very valuable in reaching the desired solution behavior [42, 44]. Rather than attempting to directly ensure that the entire offspring program is semantically medial with respect to its parents, LGX attempts to find a subtree that is semantically medial to the subtrees at the selected crossover points in each parent. The intuition behind this is that finding a semantically medial subtree may be easier than creating an offspring that is perfectly semantically medial. LGX first selects the subtrees to be exchanged between the parents. Next, LGX calculates the midpoint, m, of the semantics of the parents' subtrees, and sets m as the desired semantics for the crossover. A library of subprograms is then searched to find a subprogram with semantics closest to m. LGX also considers the context of the subtrees by only allowing crossover to occur at a homologous locus in each parent (i.e. a locus at which the surrounding structure of the trees is the same).
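The library lookup at the heart of LGX can be sketched as follows, assuming the library is a pre-built list of (subtree, semantics) pairs and that closeness is measured by Euclidean distance (both assumptions are ours for illustration).

    import math

    def lgx_medial_subtree(sem1, sem2, library):
        # midpoint m of the semantics of the two parents' chosen subtrees
        midpoint = [(a + b) / 2.0 for a, b in zip(sem1, sem2)]
        distance = lambda s: math.sqrt(sum((si - mi) ** 2
                                           for si, mi in zip(s, midpoint)))
        # return the library subprogram whose semantics are closest to m
        best_tree, _ = min(library, key=lambda entry: distance(entry[1]))
        return best_tree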
In addition to the inception of semantic geometric crossover operators, semantic geometric mutation operators were also introduced [55]. In this case, the semantics of the generated offspring are guaranteed to lie within a geometric ball (in the semantic space) centered on the parent's semantics, with a radius within some epsilon. Since SGX is incapable of creating offspring that lie outside the convex hull of the population, a solution will not be found if it is not contained within the convex hull of the population [61]. Using semantic geometric mutation alongside SGX can help overcome this issue [61].

Chapter 3

The Genetic Marker Diversity Algorithm for GP

We recently introduced the Genetic Marker Diversity Algorithm for GP (GMD-GP) as a structural diversity maintenance technique for tree-based GP [13]. First, we discuss the motivation behind GMD-GP. Next, we present our novel structural diversity measure and discuss how GMD-GP uses this measure for sustaining structural diversity.

3.1 Motivation behind GMD-GP

Two major motivations behind the design of GMD-GP are as follows. First, GP populations have a tendency to rapidly converge to a single rooted structure [53]. Second, certain tree structures (sizes and shapes) are much more difficult for GP to evolve [16, 27]. The rooted structures of the trees are important because they define the overall context of the programs, and thus modifications to a tree nearer the root have more of an impact on the functionality of the tree. This is especially important in early generations, because large changes to program behavior can be seen as exploratory steps in the search space. Although it is generally accepted that less exploration is needed in later generations as good programs are further refined (referred to as exploitation), exploration is an invaluable tool for helping the population escape local optima and avoid premature convergence.

Furthermore, since certain structures are more difficult for GP populations to reach, if an optimal solution has such a structure, the likelihood of finding the solution is very low, if not zero. Since the upper portions of the trees tend to remain untouched by standard crossover [24], if the population does not converge to the correct rooted structure, it is very unlikely that the structure will be found in later generations (albeit for problem domains such as symbolic regression, there could be many structures that encode an optimal solution). Maintaining a diverse set of structures in the population may alleviate this issue.

GMD-GP attempts to prevent the population from converging to a single rooted structure by imposing selection pressure to maintain a diverse set of rooted structures in the population throughout the evolutionary process. In order to achieve this, we use the rooted fragments of the trees as genetic markers and select against any single genetic marker dominating the population. GMD-GP is fundamentally different from some of the other genotypic diversity techniques that we discussed in Chapter 2 in that GMD-GP only considers a fragment of the tree rather than the entire genotype, and GMD-GP does not require an explicit genotypic distance metric.
3.2 Genetic Markers

In order to promote structural diversity, GMD-GP uses genetic markers, which are tree fragments that capture the structure of the trees in the population. Genetic markers are constructed by traversing a tree in depth-first order from a starting level, Li, down to an ending level, Lj. All the nodes in the tree from Li to Lj are contained in the genetic marker, which is a partial Lisp-style expression. We chose to use Lisp-style expressions, as this is a common way of representing GP trees as character strings [39].

Figure 3.1 and Figure 3.2 illustrate how genetic markers are constructed. In Figure 3.1, the genetic marker is constructed from the root of the tree (level 1) down to level 2. Since the genetic marker does not contain the entire tree, the resulting partial Lisp-style expression simply adds closing parentheses around the function nodes whose children do not belong to the genetic marker. This way, only the fragment of interest is preserved. Figure 3.2 illustrates that a genetic marker can begin at an arbitrary level in the tree. Since the fragment of interest is non-rooted, the resulting genetic marker is a list of partial expressions in left-to-right order. The ordering allows the genetic marker to retain the structure of the tree even though the root does not belong to the genetic marker.

Figure 3.1: A Rooted Genetic Marker. A depth-first traversal from the root to level 2 is performed to create the genetic marker, m = (+ (*) (SIN)).

Figure 3.2: A Non-rooted Genetic Marker. A depth-first traversal from level 2 to level 3 is performed to create the genetic marker, m = (* x (+)), (SIN (+)).

Genetic markers can be used to provide an estimate of the structural diversity of a given population. To do so, GMD-GP groups members of the population by their genetic markers and considers the density of each genetic marker. This measure allows GMD-GP to determine how widespread a particular structure is in the population. The density, ρm, of a genetic marker m in population P is calculated as in Equation 3.1, where m(p) is a function that determines the genetic marker of individual p as described above:

ρm = |Pm| / |P|,  where Pm = {p ∈ P | m(p) = m}     (3.1)

Using Equation 3.1, the densities of all the genetic markers in a population of size n can be calculated in O(n) time, since each individual only needs to be considered once. Furthermore, GMD-GP only uses the top portion of a tree for promoting structural diversity, which means that only a small fragment of the tree needs to be traversed in order to construct the genetic marker. Therefore, a negligible amount of overhead is imposed upon the system in order to construct genetic markers and to determine the density of each genetic marker. While the fitness evaluation is usually the most computationally expensive phase of an EA on complex problems, imposing only a negligible amount of additional expense is an attractive property of genetic markers. It is also possible to further improve efficiency by constructing the genetic marker when the tree is newly generated, or when the nodes are traversed during fitness evaluation.
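The sketch below extracts rooted genetic markers and computes their densities, using the nested-tuple trees from earlier sketches; markers are rendered as partial Lisp-style strings in the spirit of Figure 3.1. It is an illustration of the definitions above, not the dissertation's implementation.

    from collections import Counter

    def marker(tree, levels=2):
        # depth-first traversal of the top `levels` levels of the tree
        if not isinstance(tree, tuple):
            return str(tree)
        if levels == 1:
            return '(' + str(tree[0]) + ')'
        children = ' '.join(marker(c, levels - 1) for c in tree[1:])
        return '(' + str(tree[0]) + ' ' + children + ')'

    def marker_densities(population, levels=2):
        counts = Counter(marker(t, levels) for t in population)  # Equation 3.1
        return {m: c / len(population) for m, c in counts.items()}

    pop = [('add', ('mul', 'x', 'x'), ('sin', 'x')),
           ('add', ('mul', 'x', 1.0), ('sin', ('add', 'x', 1.0))),
           ('mul', 'x', 'x')]
    print(marker_densities(pop))
    # {'(add (mul) (sin))': 0.667, '(mul x x)': 0.333}  (approximately)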
We can see from the above definition of genetic markers and the corresponding density measure that genetic markers share some properties with some of the schema definitions for GP [63, 65]. Schemata, which are essentially fragments of the genotype, are used to identify common genetic regions shared by individuals in the population. The schema theorem [29] states that short, low-order schemata known as building blocks are combined to form higher-order building blocks, which eventually lead to solution discovery. The schema theorem allows us to reason about the utility of such building blocks, as well as to predict the number of instances of a particular schema in future generations. Although schemata are easier to define in fixed-length representations such as binary GAs, many schema definitions have been proposed for GP as well.

The rooted tree schema [65] is perhaps the schema definition most closely related to genetic markers. However, there are some key differences worth discussing. Figure 3.3 shows that, similar to rooted genetic markers, rooted tree schemata [65] begin at the root of the tree and span a contiguous fragment of the tree. However, unlike genetic markers, rooted tree schemata are not strictly required to span an entire level of the tree, as Figure 3.3 illustrates.

Figure 3.3: Rooted Genetic Marker vs. Rooted Tree Schema. A rooted genetic marker containing the top two levels of a tree compared to a rooted tree schema matching a different portion of the same tree.

Similarly, a non-rooted genetic marker can also be considered an instance of a hyperschema [63]. However, because hyperschemata employ a "don't-care" symbol in order to provide flexibility and expressiveness, not all hyperschemata can be considered non-rooted genetic markers, since genetic markers must span an entire level of the tree. Figure 3.4 shows the major difference between a hyperschema and a non-rooted genetic marker.

Figure 3.4: Non-rooted Genetic Marker vs. Hyperschema. A non-rooted genetic marker spanning the second and third levels of a tree compared to a hyperschema of the same tree.

While genetic markers undoubtedly share properties with GP schemata, they are defined more strictly and serve a different purpose. The main purpose of genetic markers is not to provide a new schema definition, but to provide an inexpensive means of estimating the structural diversity of the population, namely among the top portions of the trees. In addition to providing an estimate of structural diversity, genetic markers are used in GMD-GP to actively maintain structural diversity.

3.3 GMD-GP Structural Diversity Preservation

GMD-GP uses the density values of the genetic markers in a population in order to prevent a single structure from becoming too widespread. To achieve this, GMD-GP incorporates a multi-objective optimization scheme that simultaneously minimizes genetic marker density while optimizing fitness. This promotes structural diversity by rewarding less-represented structures (and thus preventing a single structure from dominating the population) while still maintaining global selection pressure to improve fitness.

The main evolutionary loop of GMD-GP, which is based on the multi-objective implementation of the Age-Fitness Pareto Optimization scheme [69], is presented in Figure 3.5. Every generation, parents are randomly selected to produce offspring via the typical crossover, reproduction, and mutation operators. In a traditional GP setup, offspring immediately replace their parents and elitism is used to retain the best individuals. Instead, in GMD-GP both the parent population and the offspring are temporarily kept, which increases the population size to 2n. Next, genetic marker density and fitness are used with Pareto tournament selection [69, 71] to remove dominated individuals from the population in order to reduce the population size to n − 1.

Figure 3.5: The Main Evolutionary Loop of GMD-GP.
The target population size is n − 1 instead of n because the final step of the evolutionary loop adds a new randomly-generated individual to the population as a means of providing a constant source of new genetic material.

Pareto tournament selection with genetic marker density and fitness provides selection pressure for less-represented structures that are also fitter. As its name suggests, Pareto tournament selection holds a tournament among a randomly chosen subset of individuals in the population, and keeps only the individuals that satisfy the Pareto selection criteria. Since multiple objectives are used, only the individuals that are not dominated on all objectives by another individual in the tournament survive. In GMD-GP, each individual is assigned the density value corresponding to its genetic marker, and then density and fitness are used by Pareto tournament selection to determine domination. In this case, individual A dominates individual B if A is at least equal to B in density and fitness, and A is strictly better than B in either density or fitness. If two individuals are strictly equal on both objectives, GMD-GP keeps the smaller of the two, or chooses randomly if they are also the same size.

In order to reduce the population size from the temporary size of 2n to n − 1, Pareto tournament selection is repeatedly applied to remove dominated individuals from the population. Therefore, only the current set of non-dominated individuals is guaranteed to survive to the next generation. It is theoretically possible for the set of non-dominated individuals to grow to cover the entire population [69], causing the population to grow beyond n, since a new individual is added each generation. However, we have not observed this phenomenon in any of our experiments using genetic marker density and fitness as the Pareto objectives, and it was also not observed in the Age-Fitness Pareto Optimization algorithm [69]. Furthermore, it is possible to restrict the growth of the non-dominated front by modifying the domination criteria [18].

As Figure 3.5 shows, the final step of the evolutionary loop of GMD-GP introduces a new randomly-generated individual into the population. This provides a constant source of new genetic material, which may contain essential building blocks that can later propagate throughout the population. Since one of the goals of GMD-GP is to sustain high diversity, the genetic marker density values tend to be very low. Therefore, the newly generated individual, which would likely contain a unique genetic marker, would likely be less fit than many members of the population while having the same density value of 1/n. In order to prevent the new individual from being immediately dominated by a fitter individual with the same density value, we simply add the new individual to the population after the Pareto tournament selection phase, which differs slightly from the Age-Fitness Pareto Optimization algorithm.
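The following sketch summarizes one generation of the loop in Figure 3.5 as we have described it. It assumes individuals with illustrative fitness, size, and tree attributes, the marker() and marker_densities() helpers from Section 3.2's sketch, and hypothetical breed() and random_individual() operators; it is our reading of the algorithm, not its actual implementation.

    import random

    def gmd_dominates(a, b, density):
        da, db = density[marker(a.tree)], density[marker(b.tree)]
        if (da, a.fitness) == (db, b.fitness):      # strictly equal on both
            return a.size < b.size                  # objectives: keep the smaller
        return da <= db and a.fitness >= b.fitness and \
               (da < db or a.fitness > b.fitness)   # lower density, higher fitness

    def gmd_generation(pop, n, tournament_size=8, max_rounds=10000):
        pool = pop + [breed(pop) for _ in range(n)]     # parents + offspring: 2n
        density = marker_densities([ind.tree for ind in pool])
        rounds = 0
        while len(pool) > n - 1 and rounds < max_rounds:
            rounds += 1                                 # repeated Pareto tournaments
            tour = random.sample(pool, min(tournament_size, len(pool)))
            dominated = [b for b in tour for a in tour
                         if a is not b and gmd_dominates(a, b, density)]
            for b in set(dominated):                    # remove dominated individuals
                pool.remove(b)
        # a real implementation would stop at exactly n - 1 survivors; finally,
        # add one new random individual after the selection phase
        return pool + [random_individual()]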
4.1 Benchmark Problem Suite

Before delving into the experimental results, we first describe the benchmark problem suite that we used to compare the different techniques. In order to provide a thorough comparison and analysis of GMD-GP versus other state-of-the-art techniques, we included a large set of problems of varying difficulty. The problems fall into one of two domains: symbolic regression (a total of 15) and finite algebras (a total of 10). (For the experiments comparing GMD-GP to SSGX, we did not include the finite algebras problems, since SSGX was not designed for finite values.)

The symbolic regression problems, which were chosen from the set of recommended GP benchmark problems in previous work [51], include 8 univariate and 7 bivariate problems of varying difficulty. Symbolic regression is a commonly used problem domain for GP because the tree-based representation provides a flexible means of expressing mathematical functions. Symbolic regression is also a challenging benchmark because GP often spends much of the computational budget making only incremental improvements to sub-optimal solutions, which can severely limit its ability to discover an optimal solution [47]. In the symbolic regression task, GP must evolve a mathematical expression that fits a target function given a set of input points. Fitness is calculated based on how close an individual's output is to that of the target function on all the input points. We used the cumulative absolute error, normalized between 0 and 1 with the following equation:

f = 1 / (1 + error)    (4.1)

Table 4.1 lists all the symbolic regression problems, along with the training and testing data that we used. We used the corrected versions of the training and testing domains originally given in [51] (available at https://cs.gmu.edu/~sean/papers/gecco12benchmarks3.pdf). The function set {+, -, *, %, SIN, COS, LOG, EXP, −x} was used for all the symbolic regression problems. The terminal set includes a single variable x for all the univariate problems, and x0 and x1 for the bivariate problems. We did not include random constants in the terminal set, similar to previous work [41].

Table 4.1: Symbolic Regression Benchmark Problems. U[a, b, c] means that c points are uniformly randomly selected between a and b, inclusive. E[a, b, c] means that the points are deterministically chosen on an evenly-spaced grid from a to b, inclusive, with an interval of c.

Name     Target Function                                  Training        Testing
SEXT     x^6 − 2x^4 + x^2                                 U[-1, 1, 20]    E[-1, 1, 0.001]
NGUY3    x^5 + x^4 + x^3 + x^2 + x                        U[-1, 1, 20]    E[-1, 1, 0.001]
NGUY4    x^6 + x^5 + x^4 + x^3 + x^2 + x                  U[-1, 1, 20]    E[-1, 1, 0.001]
NGUY5    sin(x^2) cos(x) − 1                              U[-1, 1, 20]    E[-1, 1, 0.001]
NGUY6    sin(x) + sin(x + x^2)                            U[-1, 1, 20]    E[-1, 1, 0.001]
NGUY7    ln(x + 1) + ln(x^2 + 1)                          U[-1, 1, 20]    E[-1, 1, 0.001]
KEIJ1    0.3 x sin(2πx)                                   E[-1, 1, 0.1]   E[-1, 1, 0.001]
KEIJ4    x^3 e^(−x) cos(x) sin(x) (sin^2(x) cos(x) − 1)   E[0, 10, 0.05]  E[0.05, 10.05, 0.05]
KEIJ11   xy + sin((x − 1)(y − 1))                         U[-3, 3, 20]    E[-3, 3, 0.01]
KEIJ12   x^4 − x^3 + y^2/2 − y                            U[-3, 3, 20]    E[-3, 3, 0.01]
KEIJ13   6 sin(x) cos(y)                                  U[-3, 3, 20]    E[-3, 3, 0.01]
KEIJ14   8 / (2 + x^2 + y^2)                              U[-3, 3, 20]    E[-3, 3, 0.01]
NGUY9    sin(x) + sin(y^2)                                U[-1, 1, 100]   E[-1, 1, 0.001]
NGUY10   2 sin(x) cos(y)                                  U[-1, 1, 100]   E[-1, 1, 0.001]
NGUY12   x^4 − x^3 + y^2/2 − y                            U[0, 1, 20]     E[0, 1, 0.001]

The second class of problems in our benchmark suite was selected from [73], in which GP produced human-competitive solutions to finite algebras problems.
Each of these problems involves a single binary operator whose underlying algebra is ternary: the algebras have defined outputs for the finite set of inputs {0, 1, 2}. The algebras A1 - A5, which are defined in Table 4.2, were included in the benchmark suite.

Table 4.2: Input-output Mappings for the Operators A1 - A5 Used in the Finite Algebras Problems.

A1   0 1 2      A2   0 1 2      A3   0 1 2      A4   0 1 2      A5   0 1 2
0    2 1 2      0    2 0 2      0    1 0 1      0    1 0 1      0    1 0 2
1    1 0 0      1    1 0 2      1    1 2 0      1    0 2 0      1    1 2 0
2    0 0 1      2    1 2 1      2    0 0 0      2    0 1 0      2    0 1 0

For each algebra there are two tasks, which yields a total of 10 finite algebras problems. The first task is to evolve a discriminator term, which must return x if x ≠ y, and z for all other cases. The fitness cases for this task are simply all combinations of the three inputs {0, 1, 2}, which results in 3^3 = 27 total fitness cases. The second task is to evolve a Mal'cev term, which is defined by m(x, x, y) = m(y, x, x) = y. Given this definition, only the specified input combinations have a defined output (the cases where all of the inputs differ from each other are not handled), which yields a total of 15 fitness cases. For both tasks, the function set contained the single operator Ai of the algebra used in the problem, and the terminal set {x, y, z} contained one terminal node for each input. Fitness was calculated as a function of the total error committed on the fitness cases. We also included a penalty for single-node solutions, as in [73].
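As an illustration of how small these fitness-case sets are, the following is a minimal Java sketch of the error count for the discriminator task over its 3^3 = 27 cases; the Term interface is an illustrative assumption about how a candidate tree is evaluated.

// A candidate term over the ternary domain {0, 1, 2}.
interface Term {
    int apply(int x, int y, int z);
}

final class DiscriminatorFitness {
    // Counts errors over all 27 input combinations: the target discriminator
    // returns x when x != y, and z in all other cases.
    static int errors(Term candidate) {
        int errors = 0;
        for (int x = 0; x < 3; x++)
            for (int y = 0; y < 3; y++)
                for (int z = 0; z < 3; z++) {
                    int expected = (x != y) ? x : z;
                    if (candidate.apply(x, y, z) != expected) errors++;
                }
        return errors;
    }
}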
4.2 Experimental Settings

We conducted two sets of experiments in order to compare and analyze the search behavior and performance of the different techniques we considered. In the first set of experiments, we compared the performance of our approach to several different methods in terms of the speed of convergence to a solution and the percentage of trials in which a solution was found. The second set of experiments was designed to analyze and compare different aspects of population diversity between the different techniques.

Table 4.3 lists the general GP settings that we used for our experiments. For the performance comparisons, we allocated a computational budget of 1 million fitness evaluations for each technique, and terminated a trial when a fitness of 1.0 was reached. All of the results were gathered from 100 independent trials of each algorithm on each problem. For the diversity experiments, we allowed each trial to last for 250,000 fitness evaluations, and we collected the different metrics that we discuss below throughout the duration of the trial. For the metrics that are population-based, we collected the data at the end of the generation closest to each multiple of 1,000 fitness evaluations (since the trial is bounded by fitness evaluations rather than generations).

Table 4.3: General GP Settings.

Parameter                      Value
Random initialization          Ramped half & half
Maximum evaluations            1 million
Total independent trials       100
Population size                500
Tournament size                2
Crossover probability          0.90
Reproduction probability       0.10
Max. tree depth                17
Max. tree size                 512
Elites (GP, Lex, SiS, SSGX)    50 (10%)
ALPS age layers                10
ALPS age gap                   20
ALPS elites                    2 per layer
GMD-GP genetic marker start    Level 1 (root)
GMD-GP genetic marker depth    2
SSGX min subtree size          1
SSGX max subtree size          90
SSGX max trials                20

The algorithm-specific settings for ALPS were chosen in an attempt to balance the number of individuals per age layer and the frequency of migration and reinitialization of the initial layer, given the population size. For SSGX, we used the settings that were used by its authors [59]. We used the top three tree levels for the genetic marker density objective in GMD-GP, as the preliminary experiments in previous work suggested that this performed best on a symbolic regression problem [13]. For the Lexicase Selection extensions that we discuss in Chapter 6, we used the common tournament size of 7 individuals, although we plan to analyze their performance using different tournament sizes in the future.

4.3 GMD-GP Performance Comparison

We begin with a comparison of the search performance of GMD-GP versus other GP methods. We used a standard GP setup as a baseline, and compared GMD-GP to the techniques listed in Table 4.4. We refer to a standard GP setup as the traditional generation-based GP using tournament selection, random subtree crossover, reproduction, and no mutation. For all of the metrics we consider, we used a pairwise Mann-Whitney U test to determine statistical significance with α = 0.05, except where specified otherwise.

Table 4.4: List of GP Techniques Used in the GMD-GP Performance and Diversity Comparison.

Technique                              Abbreviation
Standard GP                            GP
Age-Layered Population Structure       ALPS
Age-Fitness Pareto Optimization        A/F
Lexicase Selection                     Lex
Semantics in Selection                 SiS
Subtree Semantic Geometric Crossover   SSGX

Figure 4.1 shows the mean best fitness over time for each technique on each of the symbolic regression benchmark problems. In order to provide a fair comparison, we used fitness evaluations as the measure of time instead of generations, since the algorithms consume a different number of fitness evaluations per generation. Figure 4.1 shows that GMD-GP was consistently either the top-performing method, or among the top methods, in terms of reaching the higher fitness values significantly faster than the other approaches. This demonstrates that GMD-GP is able to reach high fitness relatively fast while focusing on maintaining diversity. This is often not the case for diversity-promoting techniques, since avoiding premature convergence can come at the cost of taking longer to converge to a solution. The plot for the KEIJ13 problem in Figure 4.1 illustrates how GMD-GP tended not only to reach higher fitness values significantly faster, but also to end the trial with a higher best fitness than the other approaches. This is further confirmed by Table 4.5, which shows the mean best overall fitness on each of the symbolic regression problems. In 10 out of 15 problems, GMD-GP reached the highest fitness value at the end of the trial, on average.
Furthermore, the largest number of problems in which any other single algorithm achieved the highest ending fitness while GMD-GP did not was only 4 out of 15.

We further compared the convergence rate of each technique across all the symbolic regression problems. The convergence rate measures the mean number of fitness evaluations taken to find a perfect-fitness solution, over all of the trials in which a solution was found.

Figure 4.1: Symbolic Regression Benchmarks - Fitness over Time. One panel per benchmark problem; the x-axis shows the total elapsed fitness evaluations, and the y-axis shows the mean best fitness. For clarity, error bars are not shown due to the number of methods plotted.

Table 4.5: Symbolic Regression Benchmarks - Mean Best Ending Fitness. The highest value for each problem is marked with an asterisk.

Problem   GP       ALPS     A/F      Lex      SiS      SSGX     GMD-GP
SEXT      1.0*     0.985    0.969    0.989    0.999    1.0*     0.996
KEIJ1     0.942    0.982    0.964    0.834    0.937    0.986    0.998*
KEIJ4     0.820    0.985    0.961    0.512    0.796    0.937    0.996*
KEIJ11    0.357    0.615    0.438    0.253    0.394    0.513    0.626*
KEIJ12    0.184    0.237    0.157    0.040    0.148    0.255    0.382*
KEIJ13    0.433    0.684    0.498    0.088    0.467    0.442    0.849*
KEIJ14    0.414    0.487    0.467    0.211    0.438    0.806*   0.717
NGUY3     0.967    0.983    0.971    0.864    0.973    0.992*   0.992*
NGUY4     0.953    0.978    0.957    0.846    0.954    0.973    0.998*
NGUY5     1.0*     0.977    0.949    0.981    0.999    0.999    0.996
NGUY6     0.997*   0.984    0.951    0.944    0.996    0.997*   0.994
NGUY7     0.868    0.980    0.968    0.767    0.853    0.977    0.995*
NGUY9     0.980    1.0*     0.981    0.990    0.926    0.943    1.0*
NGUY10    0.918    1.0*     0.947    0.980    0.844    0.943    1.0*
NGUY12    0.849    0.934    0.888    0.790    0.852    0.982*   0.951

While this measure is influenced by the computational budget that we set for each trial, it still provides an indication of the speed (in terms of fitness evaluations) at which the different techniques discover a solution within the given budget. Furthermore, since the fitness evaluation phase is known to be the most time-consuming step of the evolutionary process for complex problems, being able to find a solution in fewer fitness evaluations is highly desirable.

Figure 4.2 shows a pairwise comparison of the convergence rates of each technique. Each plot shows the percentage of problems in which the technique listed above the plot found a solution in significantly fewer fitness evaluations than the other techniques shown in the plot. GMD-GP fared well against the other approaches: the highest percentage of problems in which any other algorithm found a solution significantly faster than GMD-GP was 20% (for GP and SiS), while, most notably, GMD-GP was significantly faster than ALPS and SSGX in 86% and 60% of the problems, respectively.

Figure 4.2: Pairwise Comparison of Convergence Rate.
Each plot shows the percentage of problems in which the method above the plot found a solution in significantly fewer fitness evaluations, on average, in the successful trials.

Next, we compared the success rates of each method, where the success rate is the proportion of trials (out of 100) in which a fitness of 1.0 was reached within the computational budget. We used Fisher's exact test with α = 0.05 to determine the statistical significance of the differences in success rates.

Figure 4.3: Pairwise Comparison of Success Rate. Each plot shows the percentage of problems in which the method above the plot had a significantly higher success rate than each other method.

Figure 4.3 clearly demonstrates that GMD-GP outperformed all of the other approaches in terms of success rate. GMD-GP had a significantly higher success rate than every other algorithm in at least 60% of the problems (except for SSGX, against which GMD-GP had a significantly higher success rate in 53% of the problems). Combined with the fitness and convergence results discussed earlier, this shows that GMD-GP often finds solutions faster than the other techniques, and finds them at a higher rate of success.

4.4 GMD-GP Diversity Comparison

Next, we conducted a separate set of experiments to analyze the effects GMD-GP has on different aspects of population diversity compared to the other approaches. We conducted another set of 100 independent trials on each problem for each algorithm, using the same experimental settings as in Table 4.3 except where we specify differently. We allowed each trial to last for 250,000 fitness evaluations, which is roughly equivalent to 500 generations without elitism and replication, given the population size of 500 individuals. The general trend we observed for each measure we discuss is that its value tended to remain fairly constant after relatively few fitness evaluations. Therefore, we compared the value of each metric at the end of the trial. We used pairwise Mann-Whitney U tests to determine the statistical significance of the differences for each measure.

Since GMD-GP focuses on using genetic markers to promote structural diversity, we first compared the mean maximum genetic marker density of the final populations of the different approaches. Similar to our previous study [13], we analyzed the top six tree levels by creating genetic markers composed of two levels at a time. Using multiple genetic markers made of two levels at a time is more informative than a single genetic marker that spans the top six levels: under a single six-level marker, two trees that are identical in the top six levels except for a single node would be treated as having entirely different markers, although they are otherwise nearly the same. Figure 4.4 shows the percentage of problems in which each method had significantly lower mean maximum genetic marker density, for genetic markers composed of the top two tree levels, at the end of 250,000 fitness evaluations.
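Concretely, the maximum marker density compared in Figure 4.4 can be computed along the lines of the following Java sketch; the Node type and the breadth-first serialization of the top levels are illustrative assumptions about the tree representation, not the dissertation's implementation.

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

final class Node {
    String symbol;
    List<Node> children; // leaves are assumed to hold empty lists
}

final class MarkerDensity {
    // Serializes the top `depth` levels of a tree into a marker key.
    static String marker(Node root, int depth) {
        StringBuilder key = new StringBuilder();
        Queue<Node> frontier = new ArrayDeque<>();
        frontier.add(root);
        for (int level = 0; level < depth && !frontier.isEmpty(); level++) {
            int width = frontier.size();
            for (int i = 0; i < width; i++) {
                Node n = frontier.poll();
                key.append(n.symbol).append(',');
                frontier.addAll(n.children);
            }
            key.append('|'); // level separator
        }
        return key.toString();
    }

    // Maximum marker density: the largest share of the population whose
    // top levels collapse to the same marker.
    static double maxDensity(List<Node> population, int depth) {
        Map<String, Integer> counts = new HashMap<>();
        int max = 0;
        for (Node root : population) {
            int c = counts.merge(marker(root, depth), 1, Integer::sum);
            max = Math.max(max, c);
        }
        return (double) max / population.size();
    }
}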
We report the comparison for the top two tree levels, as the results across the top six levels are generally the same. In all of the problems, compared to all the other algorithms, GMD-GP had significantly lower genetic marker density, which reflects significantly higher structural diversity.

Figure 4.4: Pairwise Comparison of Genetic Marker Density. Each plot shows the percentage of problems in which the method above the plot had significantly lower genetic marker density (of genetic markers composed of the top 2 tree levels) than each other method after 250,000 fitness evaluations.

This can be expected, since GMD-GP actively focuses on minimizing genetic marker density. However, since this is a major difference between GMD-GP and the other techniques, it suggests that the higher structural diversity contributes to the performance gains that we discussed in Section 4.3.

Next, we compared the behavioral diversity of the populations of the different approaches, where an individual's behavior is defined as the vector of outputs produced on the fitness cases, as described in Section 2.2.1. The behavioral diversity of a population is defined as the proportion of unique behaviors, where two behaviors are considered different if they differ by at least one output [36]. Since the outputs for the symbolic regression problems are real-valued, we used a threshold of 0.01 for considering two outputs to be different.

Figure 4.5: Pairwise Comparison of Behavioral Diversity. Each plot shows the percentage of problems in which the method above the plot had significantly higher behavioral diversity than each other method after 250,000 fitness evaluations.

Figure 4.5 shows the pairwise comparison of the behavioral diversity of the populations after 250,000 fitness evaluations. Interestingly, while GMD-GP populations had significantly higher structural diversity than all the other approaches, they had significantly lower behavioral diversity than all of the other approaches except A/F. This clearly demonstrates that high structural (and therefore genotypic) diversity does not correlate directly with high behavioral diversity. However, our results suggest that high structural diversity does still lead to better search performance in GMD-GP.
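A minimal sketch of this behavioral diversity measure, assuming each individual's behavior is available as a real-valued output vector; quantizing each output onto a grid of width 0.01 is one simple way to realize the threshold, and is an assumption rather than the dissertation's exact implementation.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BehavioralDiversity {
    // Proportion of unique behaviors in the population, where a behavior is
    // an individual's output vector over the fitness cases. Outputs within
    // `threshold` of each other are treated as equal via quantization.
    static double measure(List<double[]> behaviors, double threshold) {
        Set<String> unique = new HashSet<>();
        for (double[] outputs : behaviors) {
            StringBuilder key = new StringBuilder();
            for (double o : outputs) {
                key.append(Math.round(o / threshold)).append(',');
            }
            unique.add(key.toString());
        }
        return (double) unique.size() / behaviors.size();
    }
}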
Finally, we compared GMD-GP and the other approaches in terms of the standard deviation of fitness within the population. As we discussed in Section 2.1, it is beneficial to maintain a spread over the range of possible fitness values, since intermediate-fitness individuals (i.e., "stepping stones" [32]) may contain valuable building blocks that could otherwise be lost. High standard deviation of fitness is therefore an indication of how spread out individuals are in the fitness space.

Figure 4.6: Pairwise Comparison of Fitness Standard Deviation. Each plot shows the percentage of problems in which the method above the plot had significantly higher fitness standard deviation than each other method after 250,000 fitness evaluations.

Figure 4.6 shows the pairwise comparison of the fitness standard deviation of the populations at the end of the trial. GMD-GP populations had significantly higher fitness standard deviation than all the other approaches in a large percentage of the problems. This shows that while GMD-GP populations typically had significantly less behavioral diversity (and therefore fewer unique fitness values), the spread of fitness among individuals in GMD-GP populations was significantly larger. This also shows that the genetic marker density objective allows less-represented, and potentially less-fit, structures to survive alongside the superior structures. Furthermore, these results suggest that it is indeed the use of genetic markers that contributes to this capability, as the fitness standard deviation in GMD-GP was significantly higher than that of A/F (which uses a similar multi-objective scheme with genotypic age instead of genetic marker density).

Chapter 5
GP for Tuberculosis Screening: A Case Study

As we discussed in Chapter 1, GP has been widely used in a number of different application areas to solve real-world problems. In this chapter, we present a new application of GP: detecting the presence of tuberculosis in raw X-ray images. While GP has been applied to the problem of image classification, which we discuss in Section 5.2.1, this is the first application of GP for tuberculosis screening from raw X-ray images.

5.1 The Tuberculosis Epidemic

Tuberculosis (TB) is an infectious disease that typically affects the lungs. TB is caused by the bacterium Mycobacterium tuberculosis, and it is spread when an infected person expels the bacteria into the air, for example by coughing [60]. TB is a major worldwide epidemic that causes millions of deaths each year. According to the World Health Organization (WHO) 2016 global tuberculosis report, there were around 10.4 million new cases of TB and around 1.4 million TB deaths in 2015, making TB one of the top 10 causes of death worldwide [60]. The WHO 2016 global report also states that there were 132,120 cases of multidrug-resistant TB in 2015 [60]. Furthermore, TB is a leading cause of death among HIV-positive people [60], and opportunistic infections of HIV-positive people further worsen the TB epidemic [4].

Despite the fact that TB is such a serious worldwide issue, it can be treated and cured if it is detected early enough. Detection methods for TB include:

• sputum smear microscopy, wherein sputum (e.g. saliva and mucus) samples are checked to determine the presence of bacteria. Although this method is over 100 years old and slow, it is the current gold-standard definitive diagnostic [4]

• rapid molecular tests such as the Xpert® MTB/RIF assay, which is the only rapid test recommended by WHO as of the 2016 WHO global tuberculosis report [60]. The cost of such methods is an issue for the poverty-stricken countries in which the TB epidemic is a huge problem

• culture methods, in which the bacteria can be observed in a laboratory setting.
However, the bacteria are slow-growing, meaning that this process can take months [60]

• skin tests, which can determine whether or not a person has been exposed to TB, but do not indicate active TB and can be affected by vaccinations [4].

In addition to the above detection methods, chest X-ray (CXR) images are also used to aid diagnosis. Although CXR is a mandatory part of each TB evaluation [4], it is not used exclusively as the definitive diagnostic tool. Furthermore, many advances in computer-aided diagnosis from CXR have focused on specific abnormalities, such as nodule detection, or on sub-problems, such as lung region segmentation [4, 37].

5.2 Related Work

Before we discuss the experimental results of using our technique for TB screening, we first review some relevant work in image classification. In particular, we discuss the general image classification problem and some related work using GP for image classification. We then discuss the basic GP image classification approach that we incorporated into our framework in order to perform TB screening from X-ray images.

5.2.1 GP for Image Classification

The problem of detecting TB using raw X-ray images is an image classification problem. The basic image classification problem is as follows: given an image, determine the category to which it belongs based on information contained in the pixels of the image. GP has previously been used for image classification [2, 5, 34, 48, 66]. Particular examples include using GP for breast cancer detection [66], detection of cephalometric landmarks [34], and developing domain-independent raw-pixel-based GP classifiers [5].

The image classification problem is typically tackled by first performing feature extraction on the images in order to extract useful features that aid the classification task. This often involves image processing techniques such as segmentation [4], to focus on regions of interest, and image descriptors [4], which serve as features for the image. Then, given a set of features for an image, typical classification techniques can be applied to determine the class to which the image belongs.

A domain-independent genetic programming technique was previously introduced for image classification using raw image data [5]. Since the approach we use for tuberculosis screening is based on this work, we discuss this methodology in detail. As we discussed above, the typical approach to image classification is to first perform a feature extraction phase on the images and then perform classification using the features. The three-tiered GP approach, 3TGP, was instead designed to evolve programs that are capable of performing all the steps in one GP tree [5]. This is accomplished by using a hierarchical tree structure in which each layer is responsible for a different task in order to ultimately classify the image.

The 3TGP technique first stores each training input image as an integer array containing the raw pixel values. Then, given an image to classify, the lowest layer in the tree performs low-level image processing, such as filtering the raw input image by applying simple mathematical operations to its pixel values (e.g., adding a value to each pixel). The second layer receives as input the resulting filtered image from the first layer, and then performs aggregation functions such as min, max, median, mean, and standard deviation on the pixels of a given window in the image.
Finally, the topmost layer receives the single values produced by the aggregation functions and applies the typical mathematical functions that are often used in symbolic regression, such as those we described in Chapter 4. The root of the tree then returns a single real value, which is used to determine the class of the image: if the value is less than zero, the image belongs to the negative class; otherwise it belongs to the positive class.

A two-tiered version (2TGP) was introduced as a more effective and more efficient approach to raw image classification than 3TGP [2]. 2TGP uses a very similar architecture to 3TGP. However, rather than relying on three processing layers in the tree, 2TGP removes the need for the filtering functions of 3TGP by feeding the raw image data directly into the aggregation functions. This removes the extra CPU-intensive image filtering stage while also achieving higher classification accuracy than 3TGP [2]. Along with 2TGP, two additional versions (2TGP-line and 2TGP-mix) were introduced, which modify the aggregation functions to only include the pixels contained in certain types of windows in the image. 2TGP-line only considers a single vertical or horizontal line of pixels, rather than a rectangular region as in the original 2TGP version. The 2TGP-mix version is the same as 2TGP-line, except that it adds the capability of using circles and rectangles in the aggregation functions. All four versions (3TGP, 2TGP, 2TGP-line, and 2TGP-mix) were compared using different two-class image databases, and 2TGP-line and 2TGP-mix were the best-performing methods [2]. Furthermore, the two-tiered versions performed better than, or at least comparably to, a GP method that uses pre-defined image features [2].

5.2.2 Tuberculosis Screening from Chest X-rays

An automated technique for TB screening using chest X-rays was proposed in previous work [37]. This technique relies upon a lung segmentation algorithm to first extract the lung region from the rest of the image. Next, feature extraction is performed on the lung region in order to compute features that can be used to classify the image. Finally, a support vector machine (SVM) classifier is trained on the images using the precomputed features.

Two different sets of features for classifying the X-rays were compared in their research. The first feature set is inspired by object detection and includes six different shape and texture descriptors, each of which is a histogram containing 32 bins, for a total of 192 features [37]. The second feature set contains low-level features based on content-based image retrieval, with a total of 594 features [37]. The authors found that the second feature set did not yield significantly better performance than the first (and much smaller) feature set.

In a later study, three feature sets were used in this approach: the two feature sets described above and a third feature set containing six MATLAB shape features [4]. The authors exhaustively searched the three feature sets to find the optimal combination of 18 features that yielded the best performance. An artificial neural network classifier was used in this set of experiments and showed a significant improvement over the SVM classifier.

In that same study, the authors also used an artificial neural network classifier to perform classification of the X-rays without the need for lung segmentation [4].
This was done by measuring the changes in the rib profiles within the chest cavity [4]. This technique only requires approximate bounding boxes around the lung region. The feature set is composed of only one descriptor, the pyramid histogram of orientation gradients (PHOG) [9], using 15 bins. This technique was shown to perform nominally better than their previous system in terms of accuracy and the area under the ROC curve, and the authors state that it performs classification 25 times faster [4].

5.3 Experimental Settings

We used the 2TGP-line architecture, which we described in Section 5.2.1, for our tree structure, and we used GMD-GP (see Chapter 3) as our GP implementation. Table 5.1 shows the GP settings that we used for our experiments.

Table 5.1: GP Settings Used for the TB Experiments.

Parameter                      Value
Random initialization          Ramped half & half
Maximum generations            50
Total independent trials       30
Population size                1024
Tournament size                7
Crossover probability          0.81
Mutation probability           0.19
Mutation method                Subtree mutation
Max. tree depth                17
Max. tree size                 None
GMD-GP genetic marker start    Level 1 (root)
GMD-GP genetic marker depth    2

Since the X-rays can vary in size, we introduced a simple modification to the original 2TGP implementation. Instead of using a predefined width and height and restricting the x and y pixel coordinate nodes to be within that range, we simply changed the X,Y node to a real number in [0, 1] instead of an integer. This way, the X,Y node is treated as a fractional value that refers to a point along the X or Y dimension. For example, an X node with a value of 0.5 for an image of width 100 would refer to the pixel at index 50. Since the fractional values will not always map to an exact pixel, we simply round up. While this may not be the optimal implementation, and more sophisticated techniques may perform better, it is a very general way of handling images of varying size without needing to predefine a range of allowable coordinate values.

The primitive set containing the functions and terminals that we used is shown in Table 5.2. The function set includes the 2TGP-line functions from previous work [2]. Each function takes as arguments the input image, the x and y pixel coordinates of the line to consider, the shape of the line (horizontal or vertical), and the size of the line in pixels. The terminal set includes a node that is used to determine the x and y pixel coordinates for the different aggregation functions, an integer that determines the size of the line used in the aggregation functions, a shape node that determines whether the line is horizontal or vertical, a single terminal node to represent the image, and a random double.

Table 5.2: Image Classification GP Primitive Set.

Primitive   Description
AggMin      Returns the minimum pixel value in the given line.
AggMax      Returns the maximum pixel value in the given line.
AggMed      Returns the median of the pixel values in the given line.
AggMean     Returns the mean of the pixel values in the given line.
AggStdev    Returns the standard deviation of the pixels in the given line.
RandDoub    Random double in [0, 1].
Image       Java BufferedImage object representing the X-ray.
Size        Integer representing the size of the line to be used.
X,Y         Fractional value in [0, 1] representing either the x or y pixel coordinate.
Shape       Integer representing the shape of the line, horizontal or vertical.
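To illustrate how these primitives compose with the fractional coordinate mapping described above, the following is a minimal Java sketch of one hand-written tree of this two-tier form. The particular expression, the indexing against width − 1, and all names are assumptions for illustration, not an evolved classifier or the dissertation's actual implementation.

import java.awt.image.BufferedImage;

final class TwoTierSketch {
    // Fractional X,Y terminal: maps a value in [0, 1] to a pixel index,
    // rounding up between pixels (e.g., 0.5 on an image of width 100 -> 50).
    static int toIndex(double fraction, int dimensionSize) {
        int index = (int) Math.ceil(fraction * (dimensionSize - 1));
        return Math.min(index, dimensionSize - 1);
    }

    // AggMean over a horizontal line of `size` pixels starting at (x, y).
    static double aggMeanHorizontal(BufferedImage img, double xFrac, double yFrac, int size) {
        int x = toIndex(xFrac, img.getWidth());
        int y = toIndex(yFrac, img.getHeight());
        double sum = 0;
        int n = 0;
        for (int i = x; i < Math.min(x + size, img.getWidth()); i++, n++) {
            sum += img.getRaster().getSample(i, y, 0); // band 0 of the grayscale X-ray
        }
        return n == 0 ? 0 : sum / n;
    }

    // Root of the tree: aggregation values feed ordinary arithmetic, and the
    // sign of the result decides the class (negative value => negative class).
    static boolean isPositive(BufferedImage xray) {
        double a = aggMeanHorizontal(xray, 0.25, 0.40, 60);
        double b = aggMeanHorizontal(xray, 0.70, 0.40, 60);
        return (a - b) - 3.0 >= 0;
    }
}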
This dataset includes a total of 662 chest X-ray images, consisting of 326 TB-negative examples and 336 TB-positive examples. The images vary in width (from 1130 to 3001 pixels) and height (from 948 to 3001 pixels). We chose to use the Shenzen dataset because of the relatively balanced class distribution compared to other available datasets. Due to the large size of the images, coupled with the number of images in the training and testing sets, the amount of memory required for storing the images in memory is a technical challenge. However, reading each image every time an individual undergoes fitness evaluation is also very time consuming and inefficient. To overcome this issue, we utilized the high performance computing (HPC) resources at Michigan State University to store all the images in memory during evolution. Although this approach is not practical in the general sense, it serves as a proof of concept that GP can be used on more realistic datasets for raw image classification. Furthermore, although the classifiers are evolved using HPC resources, the classifiers 1 National Library of Medicine, National Institutes of Health, Bethesda, MD, USA and Shenzhen No.3 People’s Hospital, Guangdong Medical College, Shenzhen, China. 52 themselves can then be used stand-alone on typical computer hardware such as a personal laptop. In order to determine the accuracy of the evolved classifiers on unseen data, we used a 10-fold cross-validation approach using the entire Shenzen dataset. For each fold, we performed 30 replicate trials. The reported accuracy values in Section 5.4 reflect the mean accuracy of the best-of-run individual from each trial across all 10 folds. 5.4 Results We begin the discussion of our results with a comparison of the accuracy of our evolved classifiers versus those reported in the previous work that we discussed in Section 5.2.2. Table 5.3 shows the comparison of GP, GMD-GP, and the reported values from [37] and [4] from the Shenzen dataset. We report our mean accuracy on the testing data from the 10-fold cross-validation across all trials, compared to the accuracy values of the other techniques (without exhaustive feature selection). The best performing technique overall was the artificial neural network approach (ANN) with an accuracy of 85.9%, while GMD-GP had a mean accuracy of 76.35%. However, GMD-GP achieved a maximum testing accuracy of 89.39%. While this is still lower than the 97.05% accuracy achieved by the artificial neural network approach in which the features were selected by exhaustively searching for the optimal combination of 18 features [4], this is important because it demonstrates that the raw image classification approach using GP without feature extraction (2TGP-line, as we described in Section 5.2.1) is capable of performing better than other techniques that require lung segmentation and feature extraction. Also, while the results for GMDGP and standard GP were similar, this further demonstrates the effectiveness of the 53 Table 5.3: Accuracy Comparison of GP and GMD-GP to Previous Work. We compared against support vector machine, SVM, linear logistic regression, LLR, and artificial neural network, ANN, without lung segmentation from [4, 37]. SVM 82.1 LLR 84.1 ANN 85.9 GP GMD-GP 75.4 76.4 Table 5.4: Training Accuracy on the Shenzen Dataset. Min Max Med Mean Std. Dev. GP 0.6879 0.8420 0.7909 0.7901 0.0259 GMD-GP 0.6829 0.8403 0.7995 0.7954 0.0250 2TGP-line approach at evolving classifiers without prior feature extraction or feature selection. 
Table 5.4 and Table 5.5 show the accuracy of the best-of-run individuals from each trial on the training and testing data, respectively, across all 10 folds. Table 5.4 shows that GMD-GP performed similarly to GP in terms of training accuracy. However, Table 5.5 shows that, while the margin is small, GMD-GP achieved higher accuracy on the unseen testing data than GP. Most notable are the higher minimum and maximum accuracies that GMD-GP achieved compared to GP. Furthermore, although we used a straightforward classification technique that simply applies a threshold of 0 to the individual's output to determine the class to which an image belongs, more sophisticated techniques, such as evolving the class boundaries [22], may further increase the accuracy of our approach. These are future directions for this technique.

Table 5.5: Testing Accuracy on the Shenzen Dataset.

          Min      Max      Med      Mean     Std. Dev.
GP        0.5606   0.8636   0.7576   0.7538   0.0535
GMD-GP    0.6364   0.8939   0.7670   0.7635   0.0486

Table 5.6 and Table 5.7 show the confusion matrices of the best classifiers evolved with GP and GMD-GP, respectively, over the entire Shenzen dataset. One of the most important measurements for the TB classifiers is sensitivity, the proportion of TB-positive instances that the classifier correctly identifies. The best GP classifier achieved a sensitivity of 0.759, while the best GMD-GP classifier achieved a sensitivity of 0.726. In previous work [4], the radiologists were instructed to err on the side of "over-reading" in order to minimize the chance of misdiagnosing a TB-positive patient. While the GP and GMD-GP classifiers were not highly sensitive, incorporating the confusion matrix into the fitness function could further improve the sensitivity of the evolved classifiers.

Table 5.6: GP Best Classifier Confusion Matrix.

                   Predicted Positive   Predicted Negative
Actual Positive    255                  81
Actual Negative    28                   298

Table 5.7: GMD-GP Best Classifier Confusion Matrix.

                   Predicted Positive   Predicted Negative
Actual Positive    244                  92
Actual Negative    25                   301

Next, we compared the time taken for our technique to classify a given X-ray with the reported results of the ANN technique, which was reported as taking less than 20 seconds [4]. In order to time our evolved classifiers, we performed two experiments. In the first experiment, we took the highest-accuracy evolved classifier from all trials, executed it on the entire Shenzen dataset, and recorded the time taken to classify each image. Similarly, the second experiment recorded the timing of all best-of-run classifiers in order to identify the general behavior of our evolved classifiers. Table 5.8 and Table 5.9 show the detailed timing information for the best evolved classifier from all trials for GP and GMD-GP, respectively. This demonstrates that our technique is effective at evolving classifiers that are extremely fast at providing a classification from a raw input image.

Table 5.8: Time Taken (in seconds) for the Best Evolved GP Classifier to Classify an X-ray Image.

           Min      Max      Med      Mean     Std. Dev.
Load       0.0418   0.4958   0.3113   0.2954   0.0621
Classify   0.0003   0.0164   0.0005   0.0006   0.0008
Total      0.0422   0.5123   0.3119   0.2960   0.0623

Table 5.9: Time Taken (in seconds) for the Best Evolved GMD-GP Classifier to Classify an X-ray Image.

           Min      Max      Med      Mean     Std. Dev.
Load       0.0418   0.5598   0.3113   0.2964   0.0624
Classify   0.0006   0.0302   0.0014   0.0015   0.0012
Total      0.0425   0.5655   0.3127   0.2979   0.0627
While the total time taken to load and classify an image is already small (less than half a second, on average), the actual time the tree takes to classify an image, once it has been loaded, is even smaller than the time taken to load the image into memory. This can lead to an effective screening method in areas where large numbers of patients need to be screened. This does come at the cost of the substantial computing resources required to train a classifier, as we discussed in Section 5.3. However, these solutions could be evolved in an HPC setting and then readily used on inexpensive hardware, such as the laptop computer used in the system described in [4]. Further research into reducing the large memory requirements of training is necessary.

Next, we examined the sizes of the best-of-run classifiers from each trial for both traditional GP and GMD-GP. Table 5.10 shows the different metrics we considered for the two techniques. Given the complex nature of the classification task (using raw X-ray images with no preprocessing or feature extraction/selection), we expected the trees to become rather large. While both techniques produced relatively large trees, GMD-GP trees tended to be marginally smaller on average (with a mean size of 288.45 nodes) than those of GP (with a mean size of 302.15 nodes). The most accurate classifier produced by GMD-GP (89.39% accuracy, as shown in Table 5.5) contained 253 nodes, while the most accurate classifier produced by GP (86.36% accuracy, as shown in Table 5.5) contained 229 nodes.

Table 5.10: Size of the Best-of-run Classifiers.

          Min   Max    Med      Mean     Std. Dev.
GP        44    823    284.50   302.15   136.51
GMD-GP    49    1033   248.50   288.45   150.63

Finally, we analyzed the behavior of the best evolved classifiers on sample X-ray images from the Shenzen dataset. We first examined all the regions in the images that the classifiers used. To do this, we simply create a new image in which every pixel used by the classifier is set to white and every unused pixel is set to black, and we pair this new image side-by-side with the original X-ray. We chose this technique rather than attempting to highlight the used pixels on the original X-ray, since the classifiers only use single lines of pixels, which are very difficult to visualize on the grayscale X-ray images. Furthermore, we were only interested in providing a general view in order to gain some insight into the behavior of the classifiers with respect to the regions of the image they use for determining the class of the image.
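A minimal Java sketch of this mask construction, assuming the set of pixels touched by the classifier has already been recorded during tree evaluation; the packed-coordinate representation is an illustrative assumption.

import java.awt.Color;
import java.awt.image.BufferedImage;
import java.util.Set;

final class UsageMask {
    // Builds a black image with the X-ray's dimensions and paints every
    // pixel the classifier read as white.
    static BufferedImage build(BufferedImage xray, Set<Long> usedPixels) {
        BufferedImage mask = new BufferedImage(
                xray.getWidth(), xray.getHeight(), BufferedImage.TYPE_INT_RGB);
        int white = Color.WHITE.getRGB(); // unused pixels stay black by default
        for (long packed : usedPixels) {
            int x = (int) (packed >>> 32);
            int y = (int) packed;
            mask.setRGB(x, y, white);
        }
        return mask;
    }

    // Packs an (x, y) coordinate pair into one long for compact set storage.
    static long pack(int x, int y) {
        return ((long) x << 32) | (y & 0xFFFFFFFFL);
    }
}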
Figure 5.1 shows the regions of the X-ray used by the best classifier evolved by GP. This shows that the classifier, which achieved an accuracy of 86.36%, uses only a small fraction of the pixels in the image to predict tuberculosis. Furthermore, while part of the lung is used by the classifier, not all of the regions that the classifier uses are in the lung area. However, since the images vary in dimensions as well as in the actual anatomy of the patients, the exact positions used by the classifier will also vary from patient to patient to a certain extent.

Figure 5.1: Regions of the X-ray Used by the Best Evolved GP Classifier. The original X-ray image is shown on the left, and the image showing the pixels of the X-ray that were used by the classifier is shown on the right. In the image on the right, white pixels are those that were used by the classifier, while black pixels were not used. Note that the lines have been thickened to aid visualization.

We also analyzed the number of nodes in the tree that reference pixel locations, as well as the number of times a single pixel was used by the classifier. Out of 229 nodes, only 42 (i.e., 18.34%) are pixel coordinate nodes. The maximum number of times a particular pixel was referenced by this tree was 8, while the minimum was 1. This shows that some of the lines overlap, which means that the same pixels can be used in different parts of the tree for different calculations.

Figure 5.2 shows the regions of the X-ray used by the best classifier evolved by GMD-GP. This clearly shows that the best classifier evolved by GMD-GP is much more complex than that of GP: many more lines are used by this classifier than by the GP classifier in Figure 5.1. Also, more pixels of the lung region are used by the GMD-GP classifier, although not all of the lines belong to the lung region in this particular X-ray. Further analysis of the nodes in the GMD-GP classifier tree revealed that out of 253 nodes, 68 (i.e., 26.88%) are pixel coordinate nodes. In this case, the maximum number of times a particular pixel was referenced by the tree was 8, with a minimum of 1.

Figure 5.2: Regions of the X-ray Used by the Best Evolved GMD-GP Classifier. The original X-ray image is shown on the left, and the image showing the pixels of the X-ray that were used by the classifier is shown on the right. In the image on the right, white pixels are those that were used by the classifier, while black pixels were not used. Note that the lines have been thickened to aid visualization.

Figure 5.1 and Figure 5.2 show an interesting result. We hypothesized that the best classifiers would typically contain lines mostly within the lung region, since it is certain abnormalities in the lung that allow radiologists to diagnose a patient as having tuberculosis. However, this is not the case in these classifiers, since the pixels being used do not always belong to the lung region. Further analysis of the complex behavior of these classifiers is needed to provide better insight into how the classifications are derived.

Chapter 6
Hybridizing GMD-GP with Behavioral Diversity

As we discussed in Chapter 2, several techniques have been introduced to sustain the diversity of GP populations in order to prevent premature convergence and improve the search performance of GP. However, many such techniques focus explicitly on maintaining either structural or behavioral diversity. While structural and behavioral diversity are indeed related and affect each other, we argue that simultaneously focusing directly on both aspects of diversity can be more beneficial than focusing on either in isolation.

There are many potential benefits of simultaneously enforcing structural and behavioral diversity. As we discussed in Chapter 1, the relationship between structural and behavioral diversity is nontrivial because of the complex genotype-to-phenotype mapping in tree-based GP. Thus, techniques that focus solely on genotypic (structural) diversity may be ineffective at promoting and preserving behavioral diversity, since many genotypes express the same or similar behavior. Likewise, methods that focus solely on behavioral diversity may still benefit from structural diversity, since it is ultimately the genetic makeup that encodes the behavior expressed by the tree.
This is especially true for behavioral diversity methods that use the traditional genetic operators, which have been shown to lead to genotypic convergence [24, 53, 64]. We therefore argue, and demonstrate in Section 6.2, that hybrid structural and behavioral diversity techniques may overcome some of the shortcomings experienced by either technique in isolation.

We found that while GMD-GP often outperformed other state-of-the-art methods on many problems, GMD-GP populations tended to have significantly less behavioral diversity (unique behaviors) while having significantly higher structural diversity [14]. Therefore, incorporating behavioral diversity into GMD-GP may further improve its search performance by more effectively exploring the behavioral space.

6.1 Hybridizing GMD-GP with Behavioral Diversity Selection

In the canonical GP setup, selection pressure is imposed on the population by using fitness to select the individuals that produce offspring. However, since GMD-GP adopts a multi-objective scheme that places equal weight on fitness and genetic marker density, no selection pressure is used to create offspring. Instead, GMD-GP randomly selects parents and then uses Pareto tournament selection after the breeding phase to impose selection pressure on the population, as we described in Section 3.3. Therefore, incorporating behavioral diversity in place of random mating during the parent selection phase is a logical starting point for hybridizing GMD-GP with behavioral diversity techniques.

6.1.1 Lexicase Selection in GMD-GP

We investigated the feasibility of using Lexicase Selection instead of random mating in GMD-GP. Since Lexicase Selection is known to perform poorly in the symbolic regression problem domain [45] (several of the problems in our benchmark suite are symbolic regression problems), we introduced two simple extensions to Lexicase Selection that address this issue. These extensions place bias on the currently most difficult fitness cases, rather than randomly ordering the fitness cases each time a parent is to be selected. This is motivated by the fact that certain fitness cases tend to be solved by fewer individuals while other fitness cases are solved by a large proportion of the population, and this changes dynamically over time. Placing more preference on the currently most difficult fitness cases may lead to problem decomposition and thus faster solution discovery.

Before discussing our extensions, we first describe the original Lexicase Selection method in more detail than in Chapter 2. When a parent is to be selected, Lexicase Selection first sets the pool of candidate individuals to the entire population. Next, a random ordering is chosen for the set of fitness cases of the given problem. The first fitness case in the given ordering is then removed from the set, and only the individuals that currently have the exact best error on that fitness case are retained as candidates. This step is repeated until either a single candidate remains or all the fitness cases have been used. If all the fitness cases have been used and more than one candidate remains, one of the remaining candidates is chosen randomly.
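A minimal Java sketch of this selection loop, assuming each individual stores its per-case error vector; the Individual type and field names are illustrative assumptions.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class LexicaseSelection {
    private final Random rng = new Random();

    static final class Individual {
        double[] caseErrors; // error on each fitness case; lower is better
    }

    // Selects one parent: filter the candidate pool case by case, keeping
    // only the individuals with the exact best error on each case.
    Individual select(List<Individual> population, int numCases) {
        List<Individual> pool = new ArrayList<>(population);
        List<Integer> caseOrder = new ArrayList<>();
        for (int i = 0; i < numCases; i++) caseOrder.add(i);
        Collections.shuffle(caseOrder, rng); // the extensions replace this shuffle

        for (int c : caseOrder) {
            double best = Double.POSITIVE_INFINITY;
            for (Individual ind : pool) best = Math.min(best, ind.caseErrors[c]);
            final double bestError = best;
            pool.removeIf(ind -> ind.caseErrors[c] != bestError);
            if (pool.size() == 1) break;
        }
        return pool.get(rng.nextInt(pool.size())); // random tie-break
    }
}

The two extensions described next differ only in replacing the random shuffle with a deterministic, per-generation ordering of the fitness cases.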
While it is possible that the least-solved fitness case is not solved by any 62 Algorithm 1 Lex-C: Count Bias Sorting of Fitness Cases. Require: cases {Set of fitness cases for the given problem.} 1: counts ⇐ ∅ {Map of total individuals that solve each fitness case.} 2: for case ∈ cases do 3: counts[case] ⇐ total individuals that solve case 4: end for 5: return sortAscending(counts) {Return the sorted fitness case indices in ascending order.} individual in the population, the best-performing individuals with respect to that fitness case will still be selected. Also, since this ordering does not change for a given generation, we use a random subset of the population as parent candidates (instead of the entire population) to avoid selecting the same individual. Our second extension, which we refer to as Lexicase Selection with error bias (LexE), is shown in Algorithm 2. Lex-E is similar to Lex-C. However, Lex-E sorts the fitness cases in descending order based on the average error the population commits on each fitness case. This also places bias towards performing well on the current most difficult fitness cases, as they will have the highest average error. Since this ordering does not change in a given generation, the initial parent candidates are also a random subset of population. Lex-E slightly differs from Lex-C in that for problems for which error is realvalued, such as symbolic regression, the fitness case with the highest average error may not always be the least solved since whether or not an individual solves a fitness case is often based on some threshold. However Lex-E will perform similar to Lex-C on discrete problems for which there is no error information, such as in the Boolean domain. In this case, if error is 0 for pass and 1 for fail, the average error is equal to the proportion of individuals in the population that fail the fitness case, which yields the same ordering as Lex-C. 63 Algorithm 2 Lex-E: Error Bias Sorting of Fitness Cases. Require: cases {Set of fitness cases for the given problem.} 1: errors ⇐ ∅ {Map of average error for each fitness case.} 2: for case ∈ cases do 3: errors[case] ⇐ average population error on case 4: end for 5: return sortDescending(errors) {Return the sorted fitness case indices in descending order.} 6.1.2 Other Selection Techniques in GMD-GP In addition to the original Lexicase Selection algorithm and the different extensions that we introduced in Section 6.1.1, we also experimented with several other behavioral diversity selection techniques that we described in Chapter 2. Particularly, we hybridized GMD-GP with Epsilon Lexicase Selection, Complementary Phenotype Selection, and Semantics in Selection. 6.2 Experimental Validation of Hybrid Selection Techniques In this section, we compare the performance of each of the hybrid selection techniques that we discussed earlier in the chapter with that of their stand-alone counterparts. We demonstrate the performance advantage gained by the hybridization, and we also analyze the effect that the hybridization has on structural and behavioral diversity. We used the same experimental settings as in Chapter 4 for all the experiments in this section. 64 6.2.1 Performance of Hybrid Selection Techniques First, we conducted an initial set of experiments to determine the feasibility of hybridizing GMD-GP with the different versions of Lexicase Selection that we discussed in Section 6.1.1. 
We first compared the performance of the hybrid techniques to GMD-GP and the different variants of Lexicase Selection in isolation. For convenience, Table 6.1 lists the abbreviations for the techniques that we compared in these experiments. In all the following figures and tables, we abbreviate GMD-GP as GMD for space considerations. All experimental settings were the same as in Table 4.3. Since the original Lexicase Selection algorithm is known to often perform poorly for symbolic regression but well on the finite algebras problems [28], we included both the symbolic regression problems and the finite algebras problems that we discussed in Section 4.1, which yielded a total of 25 problems for the experiments using original Lexicase Selection.

Table 6.1: List of GP Techniques Used in the Comparison of the Hybrid Selection Techniques.

Technique                                        Abbreviation
Lexicase Selection                               Lex
Epsilon Lexicase Selection                       eLex
Lexicase Selection with Error Bias               Lex-E
Lexicase Selection with Count Bias               Lex-C
Complementary Phenotype Selection                CPS
Semantics in Selection                           SiS
GMD-GP and Lexicase Selection                    GMD+Lex
GMD-GP and Lexicase Selection with Error Bias    GMD+Lex-E
GMD-GP and Lexicase Selection with Count Bias    GMD+Lex-C
GMD-GP and Epsilon Lexicase Selection            GMD+eLex
GMD-GP and Complementary Phenotype Selection     GMD+CPS
GMD-GP and Semantics in Selection                GMD+SiS

Given the number of algorithms under comparison and the number of benchmark problems used, we divide our experiments into two sets. First, we present the results of the experiments comparing GMD-GP, the original Lexicase Selection, and the two extensions that we described above, along with their hybrid counterparts. Next, we compare the hybrid versions of GMD-GP using Epsilon Lexicase Selection, Complementary Phenotype Selection, and Semantics in Selection. We chose to include Epsilon Lexicase Selection in the second set of experiments since it was designed specifically for symbolic regression, which is what we consider in that set of experiments.

Figure 6.1 shows the pairwise comparison of the convergence rates for both the symbolic regression and finite algebras problem domains. In the case of symbolic regression, Lex tended to perform poorly in comparison to the other approaches, which can be expected [45]. However, the plots for the hybrids (especially GMD+Lex-C, as well as GMD+Lex-E) show that hybridization with GMD-GP significantly improved the convergence rate over Lex in isolation. In many cases on the symbolic regression problems, the hybrids did not improve the convergence rate of GMD-GP. This can also be expected, since Lex did not perform well in isolation for these problems. However, the hybrid GMD+Lex-C was significantly faster than GMD-GP in a few of the symbolic regression problems, while it was not significantly slower in any of them.

The results of the convergence rate comparison for the finite algebras problems are nearly the opposite of those for the symbolic regression problems. The plot for Lex reveals that it was significantly faster than all of the other approaches (including GMD-GP) in all of the finite algebras problems, with the exception of the hybrid GMD+Lex, against which Lex was still faster in over 60% of the problems. However, the plot for GMD+Lex reveals that hybridizing GMD-GP with Lex significantly improved the convergence rate of GMD-GP in all of the finite algebras problems.
Figure 6.1: Pairwise Comparison of the Convergence Rate of the Hybrids. Each plot shows the percentage of problems in which the method above the plot found a solution in significantly fewer fitness evaluations, on average, than each other method. (Panels: Symbolic Regression, Finite Algebras; y-axis: Percentage of Problems.)

Figure 6.2 further demonstrates the performance benefits of the hybrid techniques in terms of the success rate. The plots for each of the hybrids show that each hybridization significantly improved upon the success rate of Lex in most of the symbolic regression problems. Since the plot for GMD-GP reveals that it already performed very competitively in the symbolic regression problems, the GMD+Lex and GMD+Lex-E hybrids had significantly lower success rates than GMD-GP in 68% and 62% of the symbolic regression problems, respectively. However, the GMD-GP plot also reveals that the hybrid GMD+Lex-C did not negatively impact the success rate of GMD-GP, while, as Figure 6.1 shows, GMD+Lex-C did improve the convergence rate of GMD-GP in some cases.

Figure 6.2: Pairwise Comparison of the Success Rate of the Hybrids. Each plot shows the percentage of problems in which the method above the plot had a significantly higher success rate than each other method. (Panels: Symbolic Regression, Finite Algebras; y-axis: Percentage of Problems.)

The success rate comparison for the finite algebras problems reveals a similar result. Figure 6.2 shows that Lex outperformed GMD-GP in terms of success rate in the finite algebras problems. However, the hybrid GMD+Lex significantly improved the success rate of GMD-GP in around 80% of the problems, and it did not negatively impact the success rate of Lex in any problem. These results demonstrate that using behavioral diversity selection (which is gained through Lex) in GMD-GP can leverage the strengths of both techniques and counteract their weaknesses.

Figure 6.3: Pairwise Comparison of the Ending Fitness of the Hybrids. Percentage of problems in which the method shown above each plot had significantly higher mean ending fitness than each method listed. (Panel: Symbolic Regression; y-axis: Percentage of Problems.)

Figure 6.3 shows the pairwise comparison of the other hybrid selection techniques. Overall, GMD-GP outperformed the other stand-alone techniques in the largest percentage of the problems, having significantly higher ending fitness than eLex, CPS, and SiS in 33.33%, 73.33%, and 73.33% of the problems, respectively.
The plots for the other stand-alone techniques indicate that CPS and SiS performed similarly to each other in a majority of the problems, while eLex often outperformed CPS and SiS.

Comparing each technique to its hybrid counterpart, Table 6.2 demonstrates the effectiveness of the hybridization compared to the stand-alone behavioral diversity techniques. For CPS and SiS, the hybrid significantly outperformed the stand-alone technique in a majority of the problems. In fact, with the exception of SiS (in only 6.67% of the problems), none of the behavioral diversity selection techniques significantly outperformed its hybrid counterpart in any of the problems.

Table 6.2: Ending Fitness of the Behavioral Diversity Selection Techniques Compared to the Hybrids. Percentage of problems in which the stand-alone technique had significantly higher ending fitness than its hybrid counterpart, and vice versa.

         Stand-alone vs Hybrid    Hybrid vs Stand-alone
eLex     0.0                      20.0
CPS      0.0                      73.33
SiS      6.67                     66.67

Table 6.3 shows the comparison of GMD-GP to each of the hybrids. Since GMD-GP was typically the best-performing stand-alone algorithm, the hybridization with the other techniques did not yield performance superior to GMD-GP in most of the cases. However, it is important to note that GMD-GP did not have significantly higher ending fitness than any of its hybrid counterparts in a majority of the problems. The hybrid using SiS performed best relative to GMD-GP, with significantly higher ending fitness than GMD-GP in 26.67% of the problems, while being worse than GMD-GP in only 6.67% of the problems. The worst-performing hybrid, GMD-GP with Complementary Phenotype Selection, was worse than GMD-GP in only 13.33% of the problems. This shows that the hybridization not only tended to improve the performance of the behavioral diversity techniques, as we discussed above for Table 6.2, but also improved the performance of the structural diversity technique in some cases.

Table 6.3: Ending Fitness of GMD-GP Compared to the Hybrids. Percentage of problems in which the hybrid had significantly higher ending fitness than GMD-GP vs. significantly lower ending fitness than GMD-GP.

            Better than GMD-GP    Worse than GMD-GP
GMD+eLex    26.67                 40.00
GMD+CPS     0.00                  13.33
GMD+SiS     26.67                 6.67

6.2.2 Effect of Hybridization on Diversity

In addition to comparing the performance of the hybrids, we also analyzed the effect that hybridization had on different aspects of diversity. Using the same experimental settings described in Section 4.4, we compared the different measures of diversity in the final populations.

Figure 6.4 shows the pairwise comparison of the mean maximum genetic marker density of the final populations, for genetic markers composed of the top two tree levels, in both problem domains. Each of the hybrids had significantly lower ending genetic marker density than Lex in all of the problems in both domains. This shows that the hybridization with GMD-GP significantly and consistently increases structural diversity over Lex in isolation. Since GMD-GP had significantly lower genetic marker density than Lex, the hybridization with Lex generally increased the genetic marker density of GMD-GP. Interestingly, however, GMD+Lex-E and GMD+Lex-C significantly decreased the genetic marker density (and therefore increased the structural diversity) of GMD-GP in many of the finite algebras problems.
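The maximum genetic marker density itself is straightforward to compute. The following Python sketch is illustrative only (the tree, label, and children attributes are assumptions made for this example), taking a marker to be the top two tree levels and a marker's density to be the fraction of the population carrying it:

from collections import Counter

def top_two_level_marker(tree):
    # Genetic marker: the rooted fragment consisting of the root's label
    # and the labels of its immediate children (the top two tree levels).
    return (tree.label, tuple(child.label for child in tree.children))

def max_marker_density(population):
    # A marker's density is the fraction of the population whose tree
    # carries that marker; a low maximum density indicates diverse
    # rooted structures.
    counts = Counter(top_two_level_marker(ind.tree) for ind in population)
    return max(counts.values()) / len(population)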
Figure 6.4: Pairwise Comparison of the Genetic Marker Density of the Hybrids. Each plot shows the percentage of problems in which the method above the plot had significantly lower genetic marker density (for genetic markers composed of the top two tree levels) than each other method after 250,000 fitness evaluations. (Panels: Symbolic Regression, Finite Algebras; y-axis: Percentage of Problems.)

Figure 6.5 shows the pairwise comparison of the mean structural diversity in the top two tree levels for the other hybrid selection techniques. The plot for GMD-GP shows that GMD-GP again had significantly higher structural diversity than the other approaches in nearly all of the problems, which can be expected since structural diversity is the main focus of GMD-GP. Figure 6.5 also shows that each hybrid technique had significantly higher structural diversity than its stand-alone behavioral diversity counterpart in all of the problems. While Figure 6.5 shows the comparison of structural diversity across only the top two levels, the results were nearly identical across the top six levels. This suggests that the hybridization with GMD-GP indeed significantly increases the structural diversity of the other techniques.

Figure 6.5: Pairwise Comparison of the Structural Diversity of the Other Hybrids. Percentage of problems in which the method shown above each plot had significantly higher structural diversity than each method listed. (Panel: Symbolic Regression; y-axis: Percentage of Problems.)

The pairwise comparison of behavioral diversity in Figure 6.6 shows the reverse of that of structural diversity. We consider two individuals to have different behavior if their output vectors differ by at least one element. Behavioral diversity is then defined as the proportion of unique behaviors in the population. We discussed in Section 4.4 that GMD-GP populations often had significantly less behavioral diversity than the other approaches while having significantly higher structural diversity. The plots for each of the hybrids in Figure 6.6 show that in many cases, the hybrids significantly increased the behavioral diversity of GMD-GP populations. Since Lex maintains a high level of behavioral diversity, the hybridization typically decreased the behavioral diversity of Lex.¹

¹ In the finite algebras problems, Lex tended to have lower behavioral diversity than GMD-GP and the hybrids, since it typically converged to a solution quickly although the trial lasted for a fixed 250,000 fitness evaluations.

Figure 6.6: Pairwise Comparison of the Behavioral Diversity of the Hybrids. Each plot shows the percentage of problems in which the method above the plot had significantly higher behavioral diversity than each other method after 250,000 fitness evaluations. (Panels: Symbolic Regression, Finite Algebras; y-axis: Percentage of Problems.)
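As a concrete reading of the definition of behavioral diversity used above, the following one-function Python sketch (illustrative only; the outputs attribute holding an individual's output vector is an assumption for this example) computes the proportion of unique behaviors:

def behavioral_diversity(population):
    # Two individuals share a behavior only if their output vectors agree
    # on every fitness case; diversity is the proportion of distinct
    # behaviors in the population.
    behaviors = {tuple(ind.outputs) for ind in population}
    return len(behaviors) / len(population)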
Next, we compared the behavioral diversity of the other behavioral diversity selection approaches to that of their hybrid counterparts. Figure 6.7 shows the pairwise comparison of ending behavioral diversity across all the problems. In a majority of the problems, GMD-GP had significantly lower behavioral diversity than the behavioral diversity techniques (the exception being eLex, which had significantly higher behavioral diversity than GMD-GP in 40% of the problems). This can be expected, since the other techniques focus more directly on behavioral diversity. Interestingly, the plots for the other hybrids show that integrating the behavioral diversity techniques into GMD-GP does not always significantly increase the behavioral diversity of GMD-GP.

Figure 6.7: Pairwise Comparison of the Behavioral Diversity of the Other Hybrids. Percentage of problems in which the method shown above each plot had significantly higher behavioral diversity than each method listed. (Panel: Symbolic Regression; y-axis: Percentage of Problems.)

We further analyzed behavioral diversity at a lower level by examining the standard deviation of fitness at the end of the trials. This provides more information than the proportion of unique behaviors alone, because the fitness standard deviation indicates how different individuals' behaviors are, rather than solely whether or not they differ. Also, since many different behaviors can receive the same fitness value, the standard deviation in fitness provides a more in-depth view of the behavioral dynamics of the population.

Figure 6.8 shows the pairwise comparison of the fitness standard deviation of the hybrid approaches compared to GMD-GP and Lex in isolation.

Figure 6.8: Pairwise Comparison of the Fitness Standard Deviation of the Hybrids. Each plot shows the percentage of problems in which the method above the plot had significantly higher fitness standard deviation than each other method after 250,000 fitness evaluations. (Panels: Symbolic Regression, Finite Algebras; y-axis: Percentage of Problems.)
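As a complementary sketch (again illustrative, with an assumed fitness attribute on each individual), the fitness standard deviation captures spread that a count of unique behaviors cannot:

import statistics

def fitness_spread(population):
    # Many distinct behaviors can map to the same fitness value, so a
    # population can score high on behavioral_diversity above while its
    # individuals remain tightly clustered in the fitness landscape.
    return statistics.pstdev(ind.fitness for ind in population)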
Since GMD-GP populations tended to have significantly higher fitness standard deviation than the other approaches, as we discussed in Section 4.4, the hybrids had significantly higher fitness standard deviation than Lex in all of the symbolic regression problems and most of the finite algebras problems. Furthermore, GMD+Lex had significantly higher fitness standard deviation than GMD-GP in all of the finite algebras problems, and GMD+Lex-C had significantly higher fitness standard deviation than GMD-GP in around 40% of the symbolic regression problems. This suggests that the hybridization can improve the behavioral diversity of GMD-GP while also yielding more widespread exploration of the fitness landscape. This is important because many behaviors can still map to the same fitness value, and therefore high behavioral diversity does not necessarily guarantee effective search in the fitness space.

Figure 6.9: Pairwise Comparison of the Fitness Standard Deviation of the Other Hybrids. Percentage of problems in which the method shown above each plot had significantly higher standard deviation in fitness than each method listed. (Panel: Symbolic Regression; y-axis: Percentage of Problems.)

Figure 6.9 shows the pairwise comparison of the fitness standard deviation of the populations of the other behavioral diversity selection techniques across the benchmark problems. It shows that while GMD-GP typically had a significantly lower number of unique behaviors than the behavioral diversity techniques (as the GMD plot of Figure 6.7 shows), GMD-GP populations still tended to have significantly higher standard deviation in fitness in a majority of the problems. While the behavioral diversity techniques did not always provide GMD-GP with significantly higher behavioral diversity, Figure 6.9 shows that the hybridization with eLex, SiS, and CPS increased the fitness standard deviation of GMD-GP in around 60%, 53%, and 33% of the problems, respectively. Furthermore, Figure 6.9 also shows that the hybrids had significantly higher fitness standard deviation than each of the stand-alone behavioral diversity techniques in a majority of the problems. This suggests that the hybrid techniques improve upon the stand-alone techniques by causing individuals to be more spread out in the fitness landscape, yielding broader and perhaps more efficient exploration of the landscape.

6.3 Hybridizing GMD-GP with Behavior-Aware Crossover

Since the original implementation of GMD-GP uses traditional genetic operators such as subtree crossover, another natural place to incorporate behavioral diversity into GMD-GP is during crossover. Subtree crossover is known to be highly disruptive [49], because it is performed irrespective of individuals' behavior, using no behavioral information to aid in its selection of crossover points. By incorporating behavior-aware crossover into GMD-GP, we aim to improve its performance by increasing behavioral diversity. To do this, we use the LGX operator that we described in Section 2.2.3, because it was shown to perform well across different problem domains [61].
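A core structural ingredient of LGX is that it confines crossover to the region where the two parents are structurally homologous. The following Python sketch is illustrative only; it computes that common region, not the full LGX operator, which additionally replaces a subtree with a semantically intermediate one drawn from a library [61]. Nodes are assumed to expose a children attribute:

def common_region(a, b, path=()):
    # A node position belongs to the common (homologous) region if both
    # parents have a node there; descent continues from the root
    # downwards only where the two nodes have the same arity.
    region = [path]
    if len(a.children) == len(b.children):
        for i, (child_a, child_b) in enumerate(zip(a.children, b.children)):
            region.extend(common_region(child_a, child_b, path + (i,)))
    return region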
6.3.1 Performance of GMD-GP and LGX Hybrid

We now discuss the performance effect of hybridizing GMD-GP with the LGX crossover operator. Again, we used the same experimental settings and the symbolic regression benchmark problem suite that we discussed in Section 4.2. Figure 6.10 shows the pairwise comparison of GMD-GP and LGX to their hybrid counterpart in terms of the mean fitness at the end of the trial.

Figure 6.10: Pairwise Comparison of the Ending Fitness of the GMD-GP, LGX Hybrid. Percentage of problems in which the method shown above each plot had significantly higher mean ending fitness than each method listed. (Panel: Symbolic Regression; y-axis: Percentage of Problems.)

This shows a similar result to what we observed for the other hybrid selection techniques. GMD-GP outperformed LGX in 53.33% of the problems, while LGX outperformed GMD-GP in only 6.67% of the problems. Since GMD-GP tended to be the best-performing stand-alone technique, the hybridization with LGX did not tend to improve upon GMD-GP in most cases, while it significantly improved upon LGX in 53.33% of the problems.

6.3.2 Effect of LGX Hybridization on Diversity

Next, we analyzed the effect that the hybridization with LGX has on the same measures of diversity that we considered in Section 6.2.2. Figure 6.11 shows the pairwise comparison of mean structural diversity in the top two tree levels at the end of the trial.

Figure 6.11: Pairwise Comparison of the Structural Diversity of the GMD-GP, LGX Hybrid. Percentage of problems in which the method shown above each plot had significantly higher structural diversity than each method listed. (Panel: Symbolic Regression; y-axis: Percentage of Problems.)

Again, since GMD-GP focuses directly on maintaining a high level of structural diversity, it is no surprise that GMD-GP has significantly higher structural diversity than LGX (which we can see in the plot for GMD in Figure 6.11). In fact, GMD-GP has significantly higher structural diversity than LGX in all (100%) of the problems. This is also because LGX only performs crossover in homologous regions of the tree, from the root downwards, which further encourages the population to become more structurally homologous. Therefore, as we can see in the plot for the hybrid GMD+LGX in Figure 6.11, the hybridization with GMD-GP caused the structural diversity of LGX to increase significantly in all of the problems.

Figure 6.12: Pairwise Comparison of the Behavioral Diversity of the GMD-GP, LGX Hybrid. Percentage of problems in which the method shown above each plot had significantly higher behavioral diversity than each method listed. (Panel: Symbolic Regression; y-axis: Percentage of Problems.)

Figure 6.12 shows the pairwise comparison of the mean behavioral diversity at the end of the trial. This shows the opposite result to that of structural diversity. LGX has significantly higher behavioral diversity than GMD-GP in all (100%) of the problems, which can be expected since LGX, by design, promotes behavioral diversity. Figure 6.12 also shows that incorporating LGX into GMD-GP significantly increased the behavioral diversity of GMD-GP populations in all of the problems. Figure 6.13 shows the pairwise comparison of the standard deviation of fitness at the end of the trial.
Analyzing the standard deviation in fitness shows that even though GMD-GP had fewer unique behaviors, GMD-GP populations were more spread out in the fitness landscape than those of LGX in all of the problems. Furthermore, the hybridization with GMD-GP significantly increased the fitness standard deviation compared to LGX in 80% of the problems. These results are consistent with the results for the hybrid selection techniques that we discussed in Section 6.2.2.

Figure 6.13: Pairwise Comparison of the Fitness Standard Deviation of the GMD-GP, LGX Hybrid. Percentage of problems in which the method shown above each plot had significantly higher standard deviation in fitness than each method listed. (Panel: Symbolic Regression; y-axis: Percentage of Problems.)

Chapter 7

Conclusions

This dissertation has presented several contributions to the field of evolutionary computation. We have presented a novel measure of structural diversity for tree-based GP that is simple, efficient, and effective. We have demonstrated the utility of our structural diversity measure by incorporating it into a simple yet powerful GP technique for structural diversity preservation. We have demonstrated the effectiveness of our technique by comparing it to other state-of-the-art diversity techniques on a suite of benchmark problems, and by applying it to the real-world problem of tuberculosis screening from raw X-ray images. We then introduced several hybrid techniques that focus simultaneously on structural and behavioral diversity for more effective GP. We demonstrated the effectiveness of these techniques by comparing them to their stand-alone counterparts on a suite of benchmark problems.

7.1 Conclusions on Structural Diversity

In Chapter 3, we presented GMD-GP, a novel structural diversity technique for tree-based GP, which uses tree fragments as genetic markers to preserve structural diversity. We showed that by considering only a portion of the tree, we can prevent premature convergence while accelerating search, thereby improving the efficiency of GP. This is important because one major cause of premature convergence in GP is structural convergence, wherein the trees in the population rapidly converge to the same rooted structure in a top-down fashion [53]. By working directly to prevent this phenomenon, GMD-GP is capable of dramatically improving the search performance of GP. Furthermore, our measure of structural diversity is trivial to compute and therefore much more efficient than other genotypic distance measures that use the entire genotype and perform more complex comparisons in the calculation of distance.

We discussed in Section 4.3 that GMD-GP tended to find a solution significantly faster than other techniques in many of the benchmark problems. In particular, GMD-GP was able to improve upon the performance of the Age-Fitness Pareto Optimization algorithm (A/F), upon which it was based. Since GMD-GP uses the same underlying multi-objective evolutionary scheme as A/F, the only difference being the use of genetic marker density as the second objective instead of genotypic age, this further strengthens our argument for the utility of genetic markers.
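As a sketch of how this substitution fits the underlying scheme, the Pareto dominance test is unchanged except for the second objective. The following Python fragment is illustrative only (the objectives attribute, with both error and marker density minimized, is an assumption for this example) and is not a reproduction of the full A/F machinery:

def dominates(ind_a, ind_b):
    # ind.objectives = (error, marker_density), both minimized; in A/F
    # the second objective is genotypic age instead of marker density.
    a, b = ind_a.objectives, ind_b.objectives
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))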
Since the rooted portions of the trees define the overall context of the trees, and changes to the root are more dramatic than changes to the leaves, maintaining a diverse set of rooted structures, which GMD-GP does by preserving genetic marker diversity in the population, can contribute to more efficient search.

In Section 4.4, we conducted several experimental analyses to determine the differences between GMD-GP and several other diversity techniques. These analyses uncovered several differences in the behavior of GMD-GP compared to the other techniques. First, GMD-GP is capable of maintaining a significantly higher level of structural diversity by focusing on only a fragment of the tree, and this structural diversity also affects the level of behavioral diversity. This is evidenced by the fact that GMD-GP populations tended to have significantly higher standard deviations in fitness compared to the other techniques across all the problems in our benchmark suite. As the standard deviation in fitness is a measure of how spread out individuals are in the fitness landscape, this shows that even though GMD-GP focuses directly on preventing premature structural convergence, it is capable of promoting widespread search of the fitness landscape. This is a beneficial property for the exploration of complex fitness landscapes, since less-fit individuals that may contain important low-level building blocks are allowed to exist alongside fitter individuals, thus giving those building blocks the opportunity to propagate throughout the population.

Although GMD-GP populations typically had significantly higher standard deviation in fitness, which can be considered a measure of behavioral diversity, they did tend to have significantly fewer unique behaviors compared to the behavioral diversity techniques, which is an interesting phenomenon. As we discussed in Section 4.4, GMD-GP populations had fewer unique behaviors under the definition of behavior that we employed in our experiments. Further analysis showed that GMD-GP populations tended to have several groups of individuals (each consisting of at least two individuals) that expressed the same behavior even though they were genetically unique, and the number of such groups tended to be significantly greater than for the other techniques. This can be a beneficial property for problem domains such as dynamic environments, in which the population needs to readily adapt to changes. This is a potential application area for future research with GMD-GP.

7.2 Conclusions on Hybrid Structural and Behavioral Diversity Techniques

In Chapter 6, we investigated hybridizing several existing behavioral diversity techniques with our structural diversity technique. We demonstrated how GMD-GP can easily be extended to incorporate behavioral diversity selection into the parent selection phase, while still imposing structural diversity during its multi-objective optimization phase. We also showed that behavioral diversity can easily be incorporated into GMD-GP by utilizing behavior-aware crossover operators. Our results demonstrate the feasibility and effectiveness of such hybrid techniques and serve as a motivation for further research into new techniques that focus directly on sustaining both structural and behavioral diversity in GP.
Our results demonstrate that, in many cases, simultaneously focusing directly on both structural and behavioral diversity can overcome the shortcomings of both approaches and leverage their strengths in order to achieve more effective search than either approach in isolation. Since GMD-GP populations tended to have fewer unique behaviors than the behavioral diversity techniques, the hybridization typically increased the number of unique behaviors of GMD-GP populations significantly, while still maintaining a high standard deviation in fitness. Likewise, since the behavioral diversity techniques tended to remain susceptible to low structural diversity, the hybridization tended to significantly increase the structural diversity compared to the stand-alone behavioral diversity techniques.

7.3 Conclusions on GP for Tuberculosis Screening

Chapter 5 presented a new application of GP: using GP to evolve classifiers for tuberculosis (TB) screening from raw X-ray images. Our results demonstrated that, with the help of the two-tiered GP framework (2TGP), GMD-GP can evolve classifiers that are competitive in terms of accuracy with current methods that require image processing and feature extraction. Furthermore, our evolved classifiers are capable of classifying an X-ray image in only a fraction of a second, which is dramatically faster than the reported times for other techniques.

While our technique performed competitively in terms of accuracy and the time taken to classify an image, the large memory requirement during the training phase is still an issue. Since all the training images are currently loaded into memory at the beginning of evolution, this imposes a large memory overhead on the system due to the size of the images. While the evolved classifiers operate on a single image at a time and can therefore run on smaller hardware such as a laptop computer, making the training phase less memory intensive would make this approach more practical.

Another issue, which is very common to GP in general, is the interpretability of the evolved classifiers. One very desirable and important quality of a computer program that is to be used in the medical domain is that it is understandable by the professionals who use it. We discussed in Chapter 5 that the best evolved solutions are rather large trees, which makes interpreting their behavior very difficult. In the case of TB screening, however, we can observe the features of the X-ray images that are being used by the classifier (as we showed in Figure 5.1 and Figure 5.2) in order to gain a better understanding of which regions of the X-ray are important to the classifier. More research is needed into visualizing and analyzing the behavior of the classifiers on such regions to better understand how a classification is derived.

7.4 Future Directions

This research scratches the surface of developing techniques that focus simultaneously on structural and behavioral diversity. We demonstrated the feasibility of such techniques by hybridizing our structural diversity technique with existing behavioral diversity techniques. However, we aim to motivate GP researchers to further explore this promising area and develop novel techniques that accomplish the goal of sustaining both types of diversity. Specific research questions in this area are as follows.
Can we design novel operators that directly promote both types of diversity, rather than using two separate operators? Does one type of operator (e.g., crossover versus selection) provide a different advantage than the other in terms of these hybrid techniques?

Another area in which GMD-GP can readily be extended to incorporate behavioral diversity is during its exploratory phase, which occurs after Pareto tournament selection. As we discussed in Chapter 3, after GMD-GP performs Pareto tournament selection, it adds a new, randomly generated individual to the population in order to maintain a constant source of new genetic material. A possible extension is to use semantic geometric mutation to explore the semantic neighborhood of certain individuals. This could be done using the fittest individuals, the most unique individuals (structurally or behaviorally), or randomly chosen individuals. One potential benefit is that this would allow us to explore targeted areas of the behavioral space while still maintaining a constant source of new genetic material in the population.

Another area for future work is to demonstrate the effectiveness of GMD-GP and the hybrid techniques on other real-world applications. One area to which GP has been applied is the prediction of time series data, such as stock market prediction [70]. A related area of great potential is that of physiological signals. For example, GP has previously been used to classify electroencephalogram (EEG) signals to detect epileptic seizures [7], as well as for brain computer interfaces [3]. Taking this a step further, one question for future research is whether GP can be used to evolve EEG classifiers for a brain computer interface that controls prosthetic limbs to restore mobility to paralyzed patients. A machine learning technique was successfully applied to a similar task using a cortical implant to restore limb movement to a quadriplegic patient [10]. However, it would be invaluable if the same or better result could be achieved with non-invasive methods, for which GP could potentially be used. Another area where GP could be applied to physiological signals is continuous blood pressure monitoring. For example, some techniques have been proposed using signals such as pulse transit time for continuous, cuff-less blood pressure monitoring [20]. Furthermore, one study presented a wearable approach using GP for blood pressure monitoring [67]. Further research into designing affordable and more comfortable systems that can be used by everyday people would be beneficial.

Finally, our results in Chapter 5 only scratch the surface of using GP for tuberculosis screening from raw X-ray images. While our initial results show great promise for the accuracy of the evolved solutions, further research into whether GP can be used to detect general abnormalities in the lung from X-ray images would be valuable. Evolving classifiers without the large memory overhead required for the training images is another important goal for future research. Also, improving the true positive rate of the evolved classifiers is an essential goal for future work.

BIBLIOGRAPHY

[1] Michael Affenzeller and Stefan Wagner. Offspring selection: A new self-adaptive selection scheme for genetic algorithms. In Adaptive and Natural Computing Algorithms, pages 218–221. Springer, 2005.

[2] Harith Al-Sahaf, Andy Song, Kourosh Neshatian, and Mengjie Zhang.
Two-tier genetic programming: towards raw pixel-based image classification. Expert Systems with Applications, 39(16):12291–12301, 2012.

[3] Eva Alfaro-Cid, Anna Esparcia-Alcázar, and Ken Sharman. Using distributed genetic programming to evolve classifiers for a brain computer interface. In ESANN, pages 59–66, 2006.

[4] S. K. Antani. Automated detection of lung diseases in chest X-rays. Technical report to the LHNCBC Board of Scientific Counselors, 2015.

[5] Daniel Atkins, Kourosh Neshatian, and Mengjie Zhang. A domain independent genetic programming approach to automatic feature extraction for image classification. In Alice E. Smith, editor, Proceedings of the 2011 IEEE Congress on Evolutionary Computation, pages 238–245, New Orleans, USA, 5-8 June 2011. IEEE Computational Intelligence Society, IEEE Press.

[6] Lawrence Beadle and Colin Johnson. Semantically driven crossover in genetic programming. In Proceedings of the IEEE World Congress on Computational Intelligence, pages 111–116, Hong Kong, 1-6 June 2008. IEEE Computational Intelligence Society, IEEE Press.

[7] Arpit Bhardwaj, Aruna Tiwari, Ramesh Krishna, and Vishaal Varma. A novel genetic programming approach for epileptic seizure detection. Computer Methods and Programs in Biomedicine, 124:2–18, 2016.

[8] Josh C. Bongard and Gregory S. Hornby. Guarding against premature convergence while accelerating evolutionary search. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, GECCO '10, pages 111–118, New York, NY, USA, 2010. ACM.

[9] Anna Bosch, Andrew Zisserman, and Xavier Munoz. Representing shape with a spatial pyramid kernel. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval, pages 401–408. ACM, 2007.

[10] Chad E Bouton, Ammar Shaikhouni, Nicholas V Annetta, Marcia A Bockbrader, David A Friedenberg, Dylan M Nielson, Gaurav Sharma, Per B Sederberg, Bradley C Glenn, W Jerry Mysiw, et al. Restoring cortical control of functional movement in a human with quadriplegia. Nature, 533(7602):247–250, 2016.

[11] Edmund K. Burke, Steven Gustafson, and Graham Kendall. Diversity in genetic programming: An analysis of measures and correlation with fitness. IEEE Transactions on Evolutionary Computation, 8(1):47–62, February 2004.

[12] Edmund K. Burke, Steven Gustafson, Graham Kendall, and Natalio Krasnogor. Is increased diversity in genetic programming beneficial? An analysis of the effects on performance. In Ruhul Sarker, Robert Reynolds, Hussein Abbass, Kay Chen Tan, Bob McKay, Daryl Essam, and Tom Gedeon, editors, Proceedings of the 2003 Congress on Evolutionary Computation CEC2003, pages 1398–1405, Canberra, 8-12 December 2003. IEEE Press.

[13] Armand R. Burks and William F. Punch. An efficient structural diversity technique for genetic programming. In GECCO '15: Proceedings of the 2015 Genetic and Evolutionary Computation Conference, pages 991–998, Madrid, Spain, 11-15 July 2015. ACM.

[14] Armand R. Burks and William F. Punch. An analysis of the genetic marker diversity algorithm for genetic programming. Genetic Programming and Evolvable Machines, 18(2):213–245, June 2017.

[15] Sema Candemir, Stefan Jaeger, Kannappan Palaniappan, Jonathan P Musco, Rahul K Singh, Zhiyun Xue, Alexandros Karargyris, Sameer Antani, George Thoma, and Clement J McDonald. Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration. IEEE Transactions on Medical Imaging, 33(2):577–590, 2014.

[16] Jason M Daida, Hsiaolei Li, Ricky Tang, and Adam M Hilss.
What makes a problem GP-hard? Validating a hypothesis of structural causes. In Genetic and Evolutionary Computation, GECCO 2003, pages 1665–1677. Springer, 2003.

[17] Peter Day and Asoke K. Nandi. Binary string fitness characterization and comparative partner selection in genetic programming. IEEE Transactions on Evolutionary Computation, 12(6):724–735, December 2008.

[18] Edwin de Jong, Richard Watson, and Jordan Pollack. Reducing bloat and promoting diversity using multi-objective methods. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2001, pages 11–18. Morgan Kaufmann, 2001.

[19] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002.

[20] Xiao-Rong Ding, Yuan-Ting Zhang, Jing Liu, Wen-Xuan Dai, and Hon Ki Tsang. Continuous cuffless blood pressure estimation using pulse transit time and photoplethysmogram intensity ratio. IEEE Transactions on Biomedical Engineering, 63(5):964–972, 2016.

[21] Brad Dolin, Maribel Garcia Arenas, and Juan J. Merelo Guervos. Opposites attract: Complementary phenotype selection for crossover in genetic programming. In Juan J. Merelo-Guervos, Panagiotis Adamidis, Hans-Georg Beyer, Jose-Luis Fernandez-Villacanas, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature - PPSN VII, number 2439 in Lecture Notes in Computer Science, LNCS, pages 142–152, Granada, Spain, 7-11 September 2002. Springer-Verlag.

[22] Jeannie Fitzgerald and Conor Ryan. Exploring boundaries: optimising individual class boundaries for binary classification problems. In GECCO '12: Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation, pages 743–750, Philadelphia, Pennsylvania, USA, 7-11 July 2012. ACM.

[23] Edgar Galvan-Lopez, Brendan Cody-Kenny, Leonardo Trujillo, and Ali Kattan. Using semantics in the selection mechanism in genetic programming: A simple method for promoting semantic diversity. In Evolutionary Computation (CEC), 2013 IEEE Congress on, pages 2972–2979. IEEE, 2013.

[24] Chris Gathercole and Peter Ross. An adverse interaction between crossover and restricted tree depth in genetic programming. In Proceedings of the 1st Annual Conference on Genetic Programming, pages 291–296, Cambridge, MA, USA, 1996. MIT Press.

[25] David E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 1989.

[26] David E Goldberg and Jon Richardson. Genetic algorithms with sharing for multimodal function optimization. In Genetic Algorithms and Their Applications: Proceedings of the Second International Conference on Genetic Algorithms, pages 41–49. Hillsdale, NJ: Lawrence Erlbaum, 1987.

[27] Steven Gustafson, Edmund K Burke, and Graham Kendall. Sampling of unique structures and behaviours in genetic programming. In Genetic Programming, pages 279–288. Springer, 2004.

[28] Thomas Helmuth, Lee Spector, and James Matheson. Solving uncompromising problems with lexicase selection. IEEE Transactions on Evolutionary Computation, 19(5):630–643, October 2015.

[29] John H Holland. Building blocks, cohort genetic algorithms, and hyperplane-defined functions. Evolutionary Computation, 8(4):373–391, 2000.

[30] Gregory S. Hornby. ALPS: the age-layered population structure for reducing the problem of premature convergence.
In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, GECCO '06, pages 815–822, New York, NY, USA, 2006. ACM.

[31] Gregory S. Hornby. Steady-state ALPS for real-valued problems. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, GECCO '09, pages 795–802, New York, NY, USA, 2009. ACM.

[32] Jianjun Hu, Erik Goodman, Kisung Seo, Zhun Fan, and Rondal Rosenberg. The hierarchical fair competition framework for sustainable evolutionary algorithms. Evolutionary Computation, 13(2):241–277, Summer 2005.

[33] Jianjun Hu, Kisung Seo, Shaobo Li, Zhun Fan, Ronald C. Rosenberg, and Erik D. Goodman. Structure fitness sharing (SFS) for evolutionary design by genetic programming. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 780–787. Morgan Kaufmann Publishers, 2002.

[34] Andrew Innes. Genetic Programming for Cephalometric Landmark Detection. PhD thesis, School of Aerospace, Mechanical and Manufacturing Engineering, RMIT University, Victoria, Australia, 29 August 2007.

[35] David Jackson. Phenotypic diversity in initial genetic programming populations. In Proceedings of the 13th European Conference on Genetic Programming, EuroGP 2010, volume 6021 of LNCS, pages 98–109, Istanbul, 7-9 April 2010. Springer.

[36] David Jackson. Promoting phenotypic diversity in genetic programming. In PPSN 2010 11th International Conference on Parallel Problem Solving From Nature, volume 6239 of Lecture Notes in Computer Science, pages 472–481, Krakow, Poland, 11-15 September 2010. Springer.

[37] Stefan Jaeger, Alexandros Karargyris, Sema Candemir, Les Folio, Jenifer Siegelman, Fiona Callaghan, Zhiyun Xue, Kannappan Palaniappan, Rahul K Singh, Sameer Antani, et al. Automatic tuberculosis screening using chest radiographs. IEEE Transactions on Medical Imaging, 33(2):233–245, 2014.

[38] Joshua D Knowles, Richard A Watson, and David W Corne. Reducing local optima in single-objective problems by multi-objectivization. In International Conference on Evolutionary Multi-Criterion Optimization, pages 269–283. Springer, 2001.

[39] John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA, 1992.

[40] Krzysztof Krawiec and Pawel Lichocki. Approximating geometric crossover in semantic space. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, GECCO '09, pages 987–994, New York, NY, USA, 2009. ACM.

[41] Krzysztof Krawiec and Una-May O'Reilly. Behavioral programming: a broader and more detailed take on semantic GP. In GECCO '14: Proceedings of the 2014 Conference on Genetic and Evolutionary Computation, pages 935–942, Vancouver, BC, Canada, 12-16 July 2014. ACM. Best paper.

[42] Krzysztof Krawiec and Tomasz Pawlak. Locally geometric semantic crossover. In Proceedings of the 14th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO '12, pages 1487–1488, New York, NY, USA, 2012. ACM.

[43] Krzysztof Krawiec and Tomasz Pawlak. Approximating geometric crossover by semantic backpropagation. In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, GECCO '13, pages 941–948, New York, NY, USA, 2013. ACM.

[44] Krzysztof Krawiec and Tomasz Pawlak. Locally geometric semantic crossover: a study on the roles of semantics and homology in recombination operators. Genetic Programming and Evolvable Machines, 14(1):31–63, 2013.

[45] William La Cava, Lee Spector, and Kourosh Danai.
Epsilon-lexicase selection for regression. In Tobias Friedrich, editor, GECCO '16: Proceedings of the 2016 Genetic and Evolutionary Computation Conference, page fp376, Denver, USA, 20-24 July 2016. ACM.

[46] W. B. Langdon and Riccardo Poli. Foundations of Genetic Programming. Springer-Verlag, 2002.

[47] Sean Luke. When short runs beat long runs. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), pages 74–80, 2001.

[48] M. Maghoumi and B. J. Ross. A comparison of genetic programming feature extraction languages for image classification. In IEEE Symposium on Computational Intelligence for Multimedia, Signal and Vision Processing (CIMSIVP 2014), December 2014.

[49] Hammad Majeed and Conor Ryan. A less destructive, context-aware crossover operator for GP. In Pierre Collet, Marco Tomassini, Marc Ebner, Steven Gustafson, and Anikó Ekárt, editors, Proceedings of the 9th European Conference on Genetic Programming, volume 3905 of Lecture Notes in Computer Science, pages 36–48, Budapest, Hungary, 10-12 April 2006. Springer.

[50] Yuliana Martinez, Leonardo Trujillo, Enrique Naredo, and Pierrick Legrand. A comparison of fitness-case sampling methods for symbolic regression with genetic programming. In EVOLVE - A Bridge between Probability, Set Oriented Numerics, and Evolutionary Computation V, volume 288 of Advances in Intelligent Systems and Computing, pages 201–212, Peking, 1-4 July 2014. Springer.

[51] James McDermott, David R. White, Sean Luke, Luca Manzoni, Mauro Castelli, Leonardo Vanneschi, Wojciech Jaskowski, Krzysztof Krawiec, Robin Harper, Kenneth De Jong, and Una-May O'Reilly. Genetic programming needs better benchmarks. In GECCO '12: Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation, pages 791–798, Philadelphia, Pennsylvania, USA, 7-11 July 2012. ACM.

[52] R. I. (Bob) McKay. Fitness sharing in genetic programming. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000), pages 435–442, Las Vegas, Nevada, USA, 10-12 July 2000. Morgan Kaufmann.

[53] Nicholas Freitag McPhee and Nicholas J. Hopper. Analysis of genetic diversity through population history. In Proceedings of the Genetic and Evolutionary Computation Conference, volume 2, pages 1112–1120, Orlando, Florida, USA, 13-17 July 1999. Morgan Kaufmann.

[54] Nicholas Freitag McPhee, Brian Ohs, and Tyler Hutchison. Semantic building blocks in genetic programming. Working Paper Series Volume 3 Number 2, University of Minnesota Morris, 600 East 4th Street, Morris, MN 56267, USA, 12 December 2007.

[55] Alberto Moraglio, Krzysztof Krawiec, and Colin G. Johnson. Geometric semantic genetic programming. In Parallel Problem Solving from Nature, PPSN XII (part 1), volume 7491 of Lecture Notes in Computer Science, pages 21–31, Taormina, Italy, 1-5 September 2012. Springer.

[56] Gearoid Murphy and Conor Ryan. A simple powerful constraint for genetic programming. In Michael O'Neill, Leonardo Vanneschi, Steven Gustafson, Anna Isabel Esparcia Alcazar, Ivanoe De Falco, Antonio Della Cioppa, and Ernesto Tarantino, editors, Proceedings of the 11th European Conference on Genetic Programming, EuroGP 2008, volume 4971 of Lecture Notes in Computer Science, pages 146–157, Naples, 26-28 March 2008. Springer.

[57] Quang Uy Nguyen, Xuan Hoai Nguyen, and Michael O'Neill. Semantic aware crossover for genetic programming: The case for real-valued function regression.
In Leonardo Vanneschi, Steven Gustafson, Alberto Moraglio, Ivanoe De Falco, and Marc Ebner, editors, Proceedings of the 12th European Conference on Genetic Programming, EuroGP 2009, volume 5481 of LNCS, pages 292–302, Tuebingen, 15-17 April 2009. Springer.

[58] Quang Uy Nguyen, Xuan Hoai Nguyen, Michael O'Neill, and Alexandros Agapitos. An investigation of fitness sharing with semantic and syntactic distance metrics. In Proceedings of the 15th European Conference on Genetic Programming, EuroGP 2012, volume 7244 of LNCS, pages 109–120, Malaga, Spain, 11-13 April 2012. Springer Verlag.

[59] Quang Uy Nguyen, Tuan Anh Pham, Xuan Hoai Nguyen, and James McDermott. Subtree semantic geometric crossover for genetic programming. Genetic Programming and Evolvable Machines, 17(1):25–53, March 2016.

[60] World Health Organization. Global tuberculosis report, 2016.

[61] Tomasz P Pawlak, Bartosz Wieloch, and Krzysztof Krawiec. Review and comparative analysis of geometric semantic crossovers. Genetic Programming and Evolvable Machines, 16(3):351–386, 2015.

[62] Tomasz P. Pawlak, Bartosz Wieloch, and Krzysztof Krawiec. Semantic backpropagation for designing search operators in genetic programming. IEEE Transactions on Evolutionary Computation, 19(3):326–340, June 2015.

[63] Riccardo Poli. Hyperschema theory for GP with one-point crossover, building blocks, and some new results in GA theory. In Genetic Programming, volume 1802 of Lecture Notes in Computer Science, pages 163–180. Springer Berlin Heidelberg, 2000.

[64] Riccardo Poli and William B. Langdon. On the search properties of different crossover operators in genetic programming. In Genetic Programming 1998: Proceedings of the Third Annual Conference, pages 293–301, University of Wisconsin, Madison, Wisconsin, USA, 22-25 July 1998. Morgan Kaufmann.

[65] Justinian P. Rosca and Dana H. Ballard. Rooted-tree schemata in genetic programming. In Advances in Genetic Programming 3, chapter 11, pages 243–271. MIT Press, Cambridge, MA, USA, June 1999.

[66] Conor Ryan, Krzysztof Krawiec, Una-May O'Reilly, Jeannie Fitzgerald, and David Medernach. Building a stage 1 computer aided detector for breast cancer using genetic programming. In Miguel Nicolau, Krzysztof Krawiec, Malcolm I. Heywood, Mauro Castelli, Pablo Garcia-Sanchez, Juan J. Merelo, Victor M. Rivas Santos, and Kevin Sim, editors, 17th European Conference on Genetic Programming, volume 8599 of LNCS, pages 162–173, Granada, Spain, 23-25 April 2014. Springer.

[67] G. Sannino, I. De Falco, and G. De Pietro. Genetic programming for a wearable approach to estimate blood pressure embedded in a mobile-based health system. In 2015 IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI), pages 775–783, November 2015.

[68] J. David Schaffer. Multiple objective optimization with vector evaluated genetic algorithms. In Proceedings of the 1st International Conference on Genetic Algorithms, pages 93–100. L. Erlbaum Associates Inc., 1985.

[69] Michael Schmidt and Hod Lipson. Age-fitness Pareto optimization. In Genetic Programming Theory and Practice VIII, volume 8 of Genetic and Evolutionary Computation, chapter 8, pages 129–146. Springer, Ann Arbor, USA, 20-22 May 2010.

[70] Alaa Sheta, Hossam Faris, and Mouhammd Alkasassbeh. A genetic programming model for S&P 500 stock market prediction. International Journal of Control and Automation, 6(5):303–314, 2013.

[71] Guido Smits and Mark Kotanchek. Pareto-front exploitation in symbolic regression. In Una-May O'Reilly, Tina Yu, Rick L.
Riolo, and Bill Worzel, editors, Genetic Programming Theory and Practice II, chapter 17, pages 283–299. Springer, Ann Arbor, 13-15 May 2004.

[72] Terence Soule and James A Foster. Effects of code growth and parsimony pressure on populations in genetic programming. Evolutionary Computation, 6(4):293–309, 1998.

[73] Lee Spector, David M. Clark, Ian Lindsay, Bradford Barr, and Jon Klein. Genetic programming for finite algebras. In GECCO '08: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, pages 1291–1298, Atlanta, GA, USA, 12-16 July 2008. ACM.

[74] Nguyen Quang Uy, Nguyen Xuan Hoai, Michael O'Neill, R. I. McKay, and Edgar Galvan-Lopez. Semantically-based crossover in genetic programming: application to real-valued symbolic regression. Genetic Programming and Evolvable Machines, 12(2):91–119, June 2011.

[75] Leonardo Vanneschi, Mauro Castelli, Luca Manzoni, and Sara Silva. A new implementation of geometric semantic GP and its application to problems in pharmacokinetics. In Genetic Programming: 16th European Conference, EuroGP 2013, Vienna, Austria, April 3-5, 2013, Proceedings, pages 205–216. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.

[76] Leonardo Vanneschi, Mauro Castelli, and Sara Silva. A survey of semantic methods in genetic programming. Genetic Programming and Evolvable Machines, 15(2):195–214, 2014.