OPTIMIZING FOR MENTAL REPRESENTATIONS IN THE EVOLUTION OF ARTIFICIAL COGNITIVE SYSTEMS

By

Douglas Andrew Kirkpatrick

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2021

ABSTRACT

OPTIMIZING FOR MENTAL REPRESENTATIONS IN THE EVOLUTION OF ARTIFICIAL COGNITIVE SYSTEMS

By

Douglas Andrew Kirkpatrick

Mental representations, or sensor-independent internal models of the environment, are used to interpret the world and make decisions based upon that understanding. For example, a human sees dark clouds in the sky, recalls that dark clouds often mean rain (a mental representation), and consequently decides to wear a raincoat. I seek to identify, understand, and encourage the evolution of these representations in silico. Previous work identified an information-theoretic tool, referred to as R, that measures mental representations in artificial cognitive systems (e.g., Markov Brains or Recurrent Neural Networks). Further work found that selecting for R, along with task performance, in the evolution of artificial cognitive systems leads to better overall performance on a given task. Here I explore the implications and opportunities of this modified selection process, referred to as R-augmentation. After an overview of common methods, techniques, and computational substrates in Chapter 1, a series of working chapters experimentally demonstrate the capabilities and possibilities of R-augmentation. First, in Chapter 2, I address concerns regarding potential limitations of R-augmentation. This includes a refutation of suspected negative impacts on the system's ability to generalize within-domain and on the system's robustness to sensor noise. To the contrary, the systems evolved with R-augmentation tend to perform better than those evolved without, in the context of noisy environments and different computational components. In Chapter 3 I examine how R-augmentation works across different cognitive structures, focusing on the evolution of genetic programming related structures and the effect that augmentation has on the distribution of their representations. For Chapter 4, in the context of the all-component Markov Brain (referred to as a Buffet Brain, see [Hintze et al., 2019]), I analyze potential reasons that explain why R-augmentation works; the mechanism seems to be based on evolutionary dynamics as opposed to structural or component differences. Next, I demonstrate a novel usage of R-augmentation in Chapter 5: with R-augmentation, one can use far fewer training examples during evolution and the resulting systems still perform approximately as well as those that were trained on the full set of examples. This advantage of increased performance at low sample size is found in some examples of in-domain and out-domain generalization, with the "worst-case" scenario being that the networks created by R-augmentation perform as well as their unaugmented equivalents. Lastly, in Chapter 6 I move beyond R-augmentation to explore using other neuro-correlates - particularly the distribution of representations, called smearedness - as part of the fitness function. I investigate the possibility of using MAP-Elites to identify an optimal value of smearedness for augmentation or for use as an optimization method in its own right. Taken together, these investigations demonstrate both the capabilities and limitations of R-augmentation, and open up pathways for future research.
Copyright by
DOUGLAS ANDREW KIRKPATRICK
2021

ACKNOWLEDGMENTS

First and foremost, I would like to thank my advisors, Arend Hintze and Chris Adami, who have guided and mentored me through these 6 long years. Without their advice and wisdom, this dissertation would not have come to fruition. I would also like to thank my fiancée, Amber Goguen, along with our cat, Rufus, and our dog, Badger. Without your constant love and support, I would not have made it this far. My parents, Brian and Jackie Kirkpatrick, and my siblings, Brennan and Moe, have also been a constant source of encouragement and support. Thank you; I made it, even if it was a little late (perhaps as expected). Although the pandemic has hindered some of the lively banter and discussion that provided a wealth of inspiration for this work, I would like to thank the members of the Hintze and Adami labs, past and present, for their support. In particular, I send my thanks to Cliff Bohm, Jory Schossau, Vincent Ragusa, Acacia Ackles, Ali Tehrani, C.G. Nitash, and Leigh Sheneman for their camaraderie over the years. I would like to thank my other committee members, Charles Ofria and Bill Punch, for their guidance and patience. I would like to thank the other professors and instructors who have given me advice and guidance over the years, in particular Josh Nahum, Richard Enbody, and Kevin Ohl. I would like to thank Michigan State University for supporting me over the past 6 years. Without the wonderful resources available through ICER, the HPCC, and BEACON, I would never have made it this far.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction
1.1 Computational Substrates for Artificial Cognitive Systems
1.1.1 Markov Brains
1.1.2 ANNs and RNNs
1.1.3 GP Trees, Gates, and Networks
1.1.4 NEAT Gates
1.1.5 LSTM Networks
1.2 Mental Representations
1.3 Neurocorrelates and Information Theory
1.3.1 Entropy Measurement and Pseudo-counts
1.3.2 R: A measure of Representation
1.3.3 S: A measure of Smearedness
1.4 Optimization Methods and Genetic Algorithms
1.4.1 Roulette Selection
1.4.2 MAP-Elites
1.4.3 Multi-objective optimization and R-Augmentation
1.4.4 Machine Learning
1.5 Tasks
1.5.1 Active Categorical Perception Task
1.5.2 Number Discrimination Task
1.6 Overview
Chapter 2 Addressing concerns regarding potential limitations or effects of R-augmentation
2.1 Abstract
2.2 Introduction
2.3 Materials and Methods
2.3.1 Markov Brains
2.3.2 Active Categorical Perception Task
2.3.3 Representations and the neuro-correlate R
2.3.4 Evolutionary Algorithm
2.3.5 Robustness to Noise
2.3.6 Generalization
2.4 Results
2.5 Discussion
2.6 Conclusions
Chapter 3 The Evolution of Representations in Genetic Programming Trees
3.1 Abstract
3.2 Introduction
3.3 Material and Methods
3.3.1 Representations and the neuro-correlate R
3.3.2 Smearedness of Representations
3.3.3 Active Categorical Perception Task
3.3.4 Number Discrimination Task
3.3.5 The perception-action loop for stateful machines
3.3.6 Markov GP brains using CGP nodes
3.3.7 Genetic Encoding of GP Brains in a tree-like fashion
3.3.8 GP-Forest Brain
3.3.9 GP-Vector Brain
3.3.10 Evolutionary Process
3.3.11 Augmenting with R
3.4 Results
3.4.1 GP trees evolve to have representations
3.4.2 Does Augmentation using R improve the performance of a GA?
3.4.3 Smeared Representations
3.5 Discussion
3.6 Conclusions
3.7 Acknowledgements
Chapter 4 Evolutionary Dynamics Effects Account for the Improvement Caused By R-Augmentation
4.1 Abstract
4.2 Introduction
4.3 Materials and Methods
4.3.1 Markov Brains
4.3.2 Active Categorical Perception Task
4.3.3 Representations
4.3.4 Fitness Function and R-Augmentation
4.3.5 Evolutionary and Experimental Conditions
4.3.6 Network Structure and Composition Analysis
4.3.7 Evolutionary Dynamics Analysis
4.4 Results and Discussion
4.5 Conclusion
4.6 Acknowledgements
Chapter 5 R-Augmentation maintains Performance as Sample Size Decreases
5.1 Abstract
5.2 Introduction
5.3 Methods
5.3.1 Mental Representations and R
5.3.2 Active Categorical Perception Task
5.3.3 Fitness Function Augmentation With R
5.3.4 Recurrent Neural Networks
5.3.5 Markov Brains
5.3.6 Evolutionary Algorithm
5.3.7 Reduction of Trials
5.3.8 Generalization
5.3.8.1 In-Domain Generalization
5.3.8.2 Out-Domain Generalization
5.4 Results and Discussion
5.5 Conclusion
Chapter 6 Using MAP-Elites to direct the evolution of desired neural characteristics
6.1 Abstract
6.2 Introduction
6.3 Methods
6.3.1 Active Categorical Perception Task
6.3.2 Numerical Comparison Task
6.3.3 Markov Brains
6.3.4 Mental Representations and Neuro-correlates
6.3.4.1 Representations (R)
6.3.4.2 Smearedness (S)
6.3.5 Augmented Fitness Functions
6.3.6 Genetic Algorithm
6.3.7 MAP-Elites
6.4 Results and Discussion
6.5 Conclusion
6.6 Acknowledgements
Chapter 7 Conclusions and Future Work
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Conditions for Evolution and Testing of block sizes to catch (small) and avoid (large), as well as the size of the gap between the sensors.
Table 3.1: The possible mathematical operations of each node in a GP-Forest brain. Each node has two possible inputs, L and R; it performs a computation on them and returns the value of said computation. The last category, LOGIC, consists of 16 independent binary logic operations (e.g., AND, NAND, NOR, XOR, OR, etc.) performed on the L and R inputs.

Table 3.2: The allowed mathematical operations for the nodes of a GP-Vector Brain. Observe that all operations are either performed on all elements of the L and R vectors, resulting in an output vector O, or on individual elements of L and R. Only the LOGIC operation is defined solely as a point operation and not on the entire vector.

Table 5.1: Number of cases reporting execution times for each condition. Conditions were determined by subset size (n = 1, 2, 4, 6, 8, 10, 12, 14, 16, 32, 48, 64) and GA type (with and without R-augmentation). Each condition had 200 replicate experiments run; however, notifications were not received for each replicate.

LIST OF FIGURES

Figure 1.1: A simplified version of the MAP-Elites algorithm. Each organism is represented by a circle, and each square represents a bin for the algorithm. A prototypical organism is labeled (a), and demonstrates that each agent has a task performance score ω along with a measurement of the agent's R and S. The grid is defined in terms of S on the x-axis and R on the y-axis. The bins for R are from 0 to 1, 1 to 2, and 2 to 3, while the bins for S are from 0 to 2, 2 to 4, 4 to 6, 6 to 8, 8 to 10, and 10 to 12. The algorithm will place the organism labelled (b) in the square with the same light blue shade based on the organism's R and S values. The organism will be successfully placed in the bin because there are no organisms in that bin previously. The algorithm will place the organism labelled (c) in the square with the same light green shade based on the organism's R and S values. The placement will succeed as the organism labelled (c) has a higher task performance than the organism already in the cell. The algorithm will attempt to place the organism labelled (d) in the bin of the same light yellow color based on the organism's R and S values. The placement will fail, however, because the organism that is already in the bin has a higher task score and will remain, and the organism labelled (d) will be discarded.

Figure 1.2: A subsection of a single case of the Active Categorical Perception Task. The agent, visualized by the orange box, can move either to the left or right as indicated by the orange arrows. The four sensors, labelled 1, 2, 3, and 4, can each see directly upwards as indicated by the light blue shading. A small block with width 2 is falling towards the agent and moving to the left, as indicated by the green arrow. At the current point in time, sensor 3 is the only sensor that would return a signal, while on the next update, assuming that the agent did not move, both sensors 3 and 4 would return a signal.

Figure 1.3: A single case of the Number Discrimination Task. The agent receives a series of inputs over a period of 9 brain updates. The input values are shown in the squares under "Sensors", where each square represents a unique sensor input, and correlate with brain updates listed under "Time". The description of each sensor input is likewise listed under "Description".
The agent receives a first number, here 3 (note that 3 of the sensors have a 1 as input), and then has 3 intermediate updates where noise is fed to each sensor with probability p = 0.5. The agent then receives a second number, here 4 (note likewise that 4 of the inputs are 1), and then a second period of 3 intermediate updates where noise is fed to the sensors with the same probability p. After the second set of noise inputs, the agent receives an output signal of all '0' inputs, and is expected to output a 1 because the second number is larger than the first.

Figure 2.1: A subsection of a single case of the Active Categorical Perception Task. The agent, visualized by the orange box, can move either to the left or right as indicated by the orange arrows. The four sensors, labelled 1, 2, 3, and 4, can each see directly upwards as indicated by the light blue shading. A small block with width 2 is falling towards the agent and moving to the left, as indicated by the green arrow. At the current point in time, sensor 3 is the only sensor that would return a signal, while on the next update, assuming that the agent did not move, both sensors 3 and 4 would return a signal.

Figure 2.2: The two agent types used for generalization trials. The agent labelled 1 has the standard gap width of 2 units, while the agent labelled 2 has the modified gap width of 3 units. Blue squares indicate sensors, while the white squares indicate the gap in the middle.

Figure 2.3: Venn diagram of entropies and informations for the three random variables W, S, and B, describing the world, sensor, and agent internal (brain) states. The representation R = H(W : B|S) is shaded.

Figure 2.4: Average performance (W̄) over generations for agents evolved without using R (dotted line) and with augmenting the GA (solid line). Averages are computed for agents on the line of descent over 800 replicates. The standard error of the means is shown in grey, although due to the high replication count the standard error is negligible. The task performance of agents evolved using a GA augmented with R and a standard GA are significantly different when tested using a Wilcoxon rank-sum test (p = 1.88e−67). Here I evolved agents to catch blocks of size 2 and avoid blocks of size 4, with a gap width between the sensors of size 2.

Figure 2.5: Task performance in Markov Brains with probabilistic logic gates produced by augmented and unaugmented GAs. The blue line depicts results from the lines of descent of GAs using R-augmentation, while the orange line comes from unaugmented GAs. Each line is the performance averaged over 500 replicate experiments. The shaded area around each line is the standard error.

Figure 2.6: Task performance in Markov Brains with GP gates produced by augmented and unaugmented GAs. The blue line depicts results from the lines of descent of GAs using R-augmentation, while the orange line comes from unaugmented GAs. Each line is the performance averaged over 500 replicate experiments. The shaded area around each line is the standard error.

Figure 2.7: Task performance in Markov Brains with ANN gates produced by augmented and unaugmented GAs.
The blue line depicts results from the lines of descent of GAs using R-augmentation, while the orange line comes from unaugmented GAs. Each line is the performance averaged over 500 replicate experiments. The shaded area around each line is the standard error.

Figure 2.8: Task performance in Markov Brains with NEAT gates produced by augmented and unaugmented GAs. The blue line depicts results from the lines of descent of GAs using R-augmentation, while the orange line comes from unaugmented GAs. Each line is the performance averaged over 500 replicate experiments. The shaded area around each line is the standard error.

Figure 2.9: Performance of agents evolved under different evolutionary conditions after 40,000 generations. Each column shows the distribution of fitness as a box plot indicating the mean, the 25th and 75th quartiles, and outliers as '+'. R indicates an evolutionary experiment where the GA was augmented by using R; C labels the control. The box plots show the results for 8 different experimental conditions using combinations of different block sizes (2v4, 1v5, 3v5, 4v6) and gap widths (2, 3).

Figure 2.10: Performance of evolved agents after 40,000 generations tested on different conditions than they were evolved in. Each column shows the distribution of performance as a box plot indicating the mean, the 25th and 75th quartiles, and outliers as '+'. R indicates an experiment where the GA was augmented by using R; C labels the control. As defined in Table 2.1, I used 8 different environments to evolve the agents and consequently test them. Significant differences (Mann-Whitney U, p < 0.01) are marked with *. In each of the significant cases the mean tested performance was greater when evolved with R-augmentation than in the control (unaugmented) condition.

Figure 2.11: Mean performance of agents evolved with (solid line) and without (dashed line) augmenting the GA with R. The performance (W̄) is from agents from 800 replicate experiments after 40,000 generations of evolution, tested over different levels of sensor noise. Error bars indicate the standard error. The results were not significantly different when tested with a Wilcoxon rank-sum test. These agents were evolved to catch blocks of size 2 and avoid blocks of size 4 and had a gap width of 2.

Figure 2.12: Mean performance of agents evolved under different levels of noise (x axis) with (solid line) and without (dashed line) augmenting the GA with R. The performance (W̄) is from agents from 350 replicate experiments for each level of noise, after 40,000 generations of evolution. Error bars indicate the standard error. The results were significantly different when tested with a Wilcoxon rank-sum test (p = 5.95e−5). These agents were evolved to catch blocks of size 2 and avoid blocks of size 4 and had a gap width of 2.

Figure 3.1: Three different cognitive architectures - RNN, Markov Brain, and GP tree - and how they distribute or condense representations. RNNs have a joint input and recurrence layer, feed into arbitrarily many hidden layers of arbitrary size, and compute outputs which are also recurred back to the input layer.
These networks tend to smear representations about different contexts over all recurrent nodes. The red, blue, and green color bars illustrate that. Markov Brains use inputs and hidden states as inputs to the logic gates or to other kinds of computational units. The result of the computations creates recurrent states and outputs. These networks tend to form sparse, condensed representations, illustrated by narrower red, green, and blue bands. GP trees use inputs and potentially recurring information and combine them over several computational nodes into a result. For these structures, it is not clear to which degree they are smearing or condensing representations. It is also not obvious how to recur them, illustrated by the "?".

Figure 3.2: Venn diagram of entropies and information for the three random variables W, S, and B, describing the world, sensor, and agent internal (brain) states. The representation R = H(W : B|S) is shaded.

Figure 3.3: Illustration of the perception-action loop. The world is in a particular state and can be perceived by the agent. This perception defines specific input states. These states now allow the brain (computational substrate) to perform a computation, which leads to new output states. These states, in turn, cause the agent to act in the environment and thus can change the world state. In case the agent needs to remember information about the past, it has to have the ability to store these memories somewhere. Thus, any machine that needs to be stateful has to have internal states (here, hidden states) that can be set depending on the current input and the current hidden states. This scheme of a brain resembles the structure of a recurrent neural network, but can be generally applied to all sorts of computational machines that have to be stateful and act in an environment.

Figure 3.4: Schematic overview of the genetic encoding for computational nodes. The genome is a sequence of numbers (bytes), and specific subsets of numbers (42, 213) define start codons. The sequence of numbers behind the start codon defines a gene. As such, a gene stores all information necessary to define all properties of a computational node as it is needed to function within a genetic programming tree. Specifically: the computation the node has to perform (orange), the default for its left input (green), the default for its right input (blue), and the address where it needs to be inserted (purple). Observe that the defaults can be specific values (as seen for example for the left input of the node), or the address of a hidden node (as seen for example in the right input of the node).

Figure 3.5: Illustration of tree growth by adding nodes. 1) In the beginning the tree is defined as an empty root that defaults to an output (here 0.0). 2) Once the genome is parsed and a new gene that encodes a node has been found, the node replaces the root node. The left and right inputs now define two ranges, [0.0, 0.5] and ]0.5, 1.0]. 3) Each new node to be added has a specific address (a value from [0.0, 1.0] as defined by the gene; see Figure 3.4, the purple component of the gene). This number defines which default input to replace. The added node itself subdivides the range it is connected to, in order to allow new nodes to again be added. 4) The concept of adding nodes and subdividing the connection ranges continues until the tree is built.
In this way, a new node will always have a place in the tree. However, since nodes are added sequentially, deletions of genes/nodes will affect the topology greatly, but will not make it impossible to add nodes.

Figure 3.6: Illustration of a GP-Forest. In this example, I assume the system to have two input nodes (red and orange), two output nodes (white), and two hidden nodes (green and blue). When decoding the genome, nodes can get added to hidden and output nodes sequentially. Each node can either use a default input (white) or use the value specified by an input or hidden node. For example, the tree added to output node 1 reads from the orange input node, and twice from the blue hidden node. When a new brain state has to be computed, all trees (outputs, hidden) are evaluated simultaneously, and consequently a new set of outputs and hidden states becomes defined, based on previous inputs and hidden states.

Figure 3.7: Illustration of a GP-Vector Brain. This brain is structurally identical to conventional genetic programming trees, except that it can execute computations on vectors instead of single values. The inputs to nodes can be the state vector, including all inputs (red and orange) and all hidden states (blue to green). Alternatively, nodes can use default vectors as inputs. These defaults are defined by the genes encoding each node.

Figure 3.8: Performance over time for all 3 brain types (GP-Forest = Black, MBwGP = Red, GP-Vector = Blue). Left plot is active categorical perception, right is number comparison.

Figure 3.9: R over time for all 3 brain types (GP-Forest = Black, MBwGP = Red, GP-Vector = Blue). Left plot is active categorical perception, right is number comparison.

Figure 3.10: Performance over time for all 3 brain types (GP-Forest = Black, MBwGP = Red, GP-Vector = Blue). Left plot is active categorical perception, right is number comparison. Solid line is control, dotted line is augmented with R.

Figure 3.11: R over time for all 3 brain types (GP-Forest = Black, MBwGP = Red, GP-Vector = Blue). Left plot is active categorical perception, right is number discrimination. Solid line is control, dotted line is augmented with R.

Figure 3.12: Smearedness of Nodes over time for all 3 brain types (GP-Forest = Black, MBwGP = Red, GP-Vector = Blue). Left plot is active categorical perception, right is number comparison. Solid line is control, dotted line is augmented with R.

Figure 4.1: Venn diagram of entropies and informations for the three random variables E, S, and M, describing the environment, sensor, and agent internal (memory) states. The information-theoretic measure of representations R = H(E : M|S) is shaded. © 2020 IEEE.

Figure 4.2: Task Performance in Buffet Markov Brains produced by Augmented and Unaugmented GAs. The blue line represents results from the lines of descent of GAs using R-augmentation, while the orange line comes from unaugmented GAs. Each line is the average performance over 500 replicate experiments. The shaded area around each line represents the standard error.
R-Augmented GAs produce Markov Brains that have a significantly higher average performance than the Markov Brains produced by unaugmented GAs. © 2020 IEEE.

Figure 4.3: Average Proportion of Different Gate types in Buffet Markov Brains produced by Augmented and Unaugmented GAs over evolutionary time. Deterministic gates are blue. Probabilistic gates are green. ANN gates are red. NEAT gates are cyan (light blue). GP gates are yellow.

Figure 4.4: Average Number of Different Gate types in Buffet Markov Brains produced by Augmented and Unaugmented GAs over evolutionary time. Deterministic gates are blue. Probabilistic gates are green. ANN gates are red. NEAT gates are cyan (light blue). GP gates are yellow.

Figure 4.5: Proportion of 1- and 2-Cycles in Markov Brains produced by Augmented and Unaugmented GAs over evolutionary time. The purple lines represent the proportion of 1-cycles (e.g., data is read from node A, processed by a gate, and then fed back into node A) among all node-to-node connections. The light blue lines represent the proportion of 2-cycles (e.g., data is read from node B, processed by a gate, and fed into node C, which is read, processed by a gate, and fed back into node B), against all node-to-node connections. Although not identical, the proportions both hover around the same value of 8 percent in networks produced by augmented and unaugmented brains. © 2020 IEEE.

Figure 4.6: Mutational Changes in Markov Brains produced by Augmented and Unaugmented GAs. The blue line represents results from the lines of descent of GAs using R-augmentation, while the orange line comes from unaugmented GAs. Each point around the circle represents the average magnitude of the changes with a given proportion of change in R (vertical axis) and change in fitness (horizontal axis). The two distributions are significantly different (Kolmogorov–Smirnov 2-sample, p < 0.0001). © 2020 IEEE.

Figure 4.7: Mutational Changes in Markov Brains produced by Augmented and Unaugmented GAs. The blue line represents results from the lines of descent of GAs using R-augmentation, while the orange line comes from unaugmented GAs. Each point around the circle represents the average count of changes with a given proportion of change in R (vertical axis) and change in fitness (horizontal axis). The two distributions are significantly different (Kolmogorov–Smirnov 2-sample test, p < 1 × 10−7).

Figure 5.1: A subsection of a single case of the Active Categorical Perception Task. The agent, visualized by the orange box, can move either to the left or right as indicated by the orange arrows. The four sensors, labelled 1, 2, 3, and 4, can each see directly upwards as indicated by the light blue shading. A small block with width 2 is falling towards the agent and moving to the left, as indicated by the green arrow. At the current point in time, sensor 3 is the only sensor that would return a signal, while on the next update, assuming that the agent did not move, both sensors 3 and 4 would return a signal.

Figure 5.2: All possible cases in the Active Categorical Perception Task. Each square in the rectangle indicates a specific combination of block size, movement direction, and starting position referred to as a case, labelled here from 1 to 64 to distinguish their uniqueness.
For example, the case labelled 20 is the case where the small block starts at position 3 (relative to the agent at position 0), and falls to the right.

Figure 5.3: A possible evaluative subset for the Reduction of Trials experiments with n = 32. Here, the 32 cases marked with an 'E' would be evaluated for correctness and used to generate the fitness and measure R. Note that this is only one possible random selection, and any random selection is equally likely in practice.

Figure 5.4: Example Cases in the Withheld subset under the Random Scheme. Each square in the rectangle indicates a specific combination of block size, movement direction, and starting position referred to as a case. The 16 cases marked with a 'W' are withheld for testing, while the remaining 48 unmarked cases are used for evaluation during evolution. The cases withheld under this scheme are 16 randomly selected cases.

Figure 5.5: Example Cases in the Withheld subset under the Starting Position Spaced Scheme. Each square in the rectangle indicates a specific combination of block size, movement direction, and starting position referred to as a case. The 16 cases marked with a 'W' are withheld for testing, while the remaining 48 unmarked cases are used for evaluation during evolution. The cases withheld under this scheme are all of the cases for 4 different starting positions, where the starting positions are evenly spaced across the range of possible starting positions.

Figure 5.6: Example Cases in the Withheld subset under the Starting Position Grouped Scheme. Each square in the rectangle indicates a specific combination of block size, movement direction, and starting position referred to as a case. The 16 cases marked with a 'W' are withheld for testing, while the remaining 48 unmarked cases are used for evaluation during evolution. The cases withheld under this scheme are all of the cases for 4 different starting positions, where the starting positions are grouped together.

Figure 5.7: Example Cases in the Withheld subset under the One Block / Direction Scheme. Each square in the rectangle indicates a specific combination of block size, movement direction, and starting position referred to as a case. The 16 cases marked with a 'W' are withheld for testing, while the remaining 48 unmarked cases are used for evaluation during evolution. The cases withheld under this scheme are all of the cases for 1 block size and movement direction combination.

Figure 5.8: Performance at end of evolution under decreasing sample size for Markov Brain with Deterministic Logic Gates. The red line is the average performance of Markov Brains evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of Markov Brains evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes used were n = 1, 2, 4, 6, 8, 10, 12, 14, 16, 32, 48, 64.

Figure 5.9: Performance at end of evolution under decreasing sample size for RNN. The red line is the average performance of RNNs evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error.
The blue line is the average performance of RNNs evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes used were n = 1, 2, 4, 6, 8, 10, 12, 14, 16, 32, 48, 64.

Figure 5.10: Performance at end of evolution under decreasing sample size for Markov Brain with ANN Gates. The red line is the average performance of Markov Brain - ANN hybrids evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of Markov Brain - ANN hybrids evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes used were n = 1, 2, 4, 6, 8, 10, 12, 14, 16, 32, 48, 64.

Figure 5.11: Average Execution Time Reported under decreasing sample size for Markov Brain with Deterministic Logic Gates, with and without R-augmentation. The red line is the average execution time of runs using a GA augmented with R, while the shaded area in red is the corresponding standard error of those times. The blue line is the average execution time of runs using an unaugmented GA, and the shaded blue area is the corresponding standard error of those times. The sample sizes used were n = 1, 2, 4, 6, 8, 10, 12, 14, 16, 32, 48, 64.

Figure 5.12: Performance tested on the withheld group at end of evolution under decreasing sample size with a random subset removed. The red line is the average performance of Markov Brains evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of Markov Brains evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes tested were n = 3, 8, 13, 18, 23, 28, 33, 38, 43, 48.

Figure 5.13: Performance tested on the withheld group at end of evolution under decreasing sample size with all of 1 block type and direction withheld. The red line is the average performance of Markov Brains evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of Markov Brains evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes tested were n = 3, 8, 13, 18, 23, 28, 33, 38, 43, 48.

Figure 5.14: Performance tested on the withheld group at end of evolution under decreasing sample size with all of 4 different start locations (evenly spaced) withheld. The red line is the average performance of Markov Brains evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of Markov Brains evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes tested were n = 3, 8, 13, 18, 23, 28, 33, 38, 43, 48.

Figure 5.15: Performance tested on the withheld group at end of evolution under decreasing sample size with all of 4 different start locations (grouped together) removed. The red line is the average performance of Markov Brains evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error.
The blue line is the average performance of Markov Brains evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes tested were n = 3, 8, 13, 18, 23, 28, 33, 38, 43, 48.

Figure 5.16: Performance of Markov Brains tested on blocks 3 and 5 at end of evolution under decreasing sample size when evolved on blocks 2 and 6. The red line is the average performance of Markov Brains evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of Markov Brains evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes used were n = 1, 2, 4, 6, 8, 10, 12, 14, 16, 32, 48, 64.

Figure 5.17: Performance tested on blocks 2 and 6 at end of evolution under decreasing sample size when evolved on blocks 3 and 5. The red line is the average performance of Markov Brains evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of Markov Brains evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes used were n = 1, 2, 4, 6, 8, 10, 12, 14, 16, 32, 48, 64.

Figure 6.1: A subsection of a single case of the Active Categorical Perception Task. The agent, visualized by the orange box, can move either to the left or right as indicated by the orange arrows. The four sensors, labelled 1, 2, 3, and 4, can each see directly upwards as indicated by the light blue shading. A small block with width 2 is falling towards the agent and moving to the left, as indicated by the green arrow. At the current point in time, sensor 3 is the only sensor that would return a signal, while on the next update, assuming that the agent did not move, both sensors 3 and 4 would return a signal.

Figure 6.2: A single case of the Numerical Comparison Task. The agent receives a series of inputs over a period of 9 brain updates. The input values are shown in the squares under "Sensors", where each square represents a unique sensor input, and correlate with brain updates listed under "Time". The description of each sensor input is likewise listed under "Description". The agent receives a first number, here 3 (note that 3 of the sensors have a 1 as input), and then has 3 intermediate updates where noise is fed to each sensor with probability p = 0.5. The agent then receives a second number, here 4 (note likewise that 4 of the inputs are 1), and then a second period of 3 intermediate updates where noise is fed to the sensors with the same probability p. After the second set of noise inputs, the agent receives an output signal of all '0' inputs, and is expected to output a 1 because the second number is larger than the first.

Figure 6.3: The information-theoretic Venn diagram describing how R is calculated. R is defined as the information shared between the brain's memory and the salient features of the environment or task, given the state of the sensors.

Figure 6.4: A simplified version of the MAP-Elites algorithm. Each organism is represented by the circle, and each square represents a bin for the algorithm.
A prototypical organism is labeled (a), and demonstrates that each agent has a task performance score ω along with a measurement of the agent's R and S. The grid is defined in terms of S on the x-axis and R on the y-axis. The bins for R are from 0 to 1, 1 to 2, and 2 to 3, while the bins for S are from 0 to 2, 2 to 4, 4 to 6, 6 to 8, 8 to 10, and 10 to 12. The algorithm will place the organism labelled (b) in the square with the same light blue shade based on the organism's R and S values. The organism will be successfully placed in the bin because there are no organisms in that bin previously. The algorithm will place the organism labelled (c) in the square with the same light green shade based on the organism's R and S values. The placement will succeed as the organism labelled (c) has a higher task performance than the organism already in the cell. The algorithm will attempt to place the organism labelled (d) in the bin of the same light yellow color based on the organism's R and S values. The placement will fail, however, because the organism that is already in the bin has a higher task score and will remain, and the organism labelled (d) will be discarded.

Figure 6.5: Average percent correct decisions over evolutionary time for the ACP and NCT tasks, where each line represents a different version of the fitness function used for that experiment. Each line is the average of 100 replicate runs using different random seeds. The blue line represents the original fitness function (Eq. 6.1). The black line represents R-augmentation (Eq. 6.4). The red line represents S-maximization (Eq. 6.5). The green line represents S-minimization (Eq. 6.6). The shaded area of corresponding color around each line represents the standard error. The inset of the graph provides an expanded view of the indicated subset, to better distinguish the difference between conditions.

Figure 6.6: Average percent correct decisions over evolutionary time for the ACP and NCT tasks. Each line represents a different version of the fitness function used for that experiment and is the average of 100 replicate runs using different random seeds. As in Fig. 6.5, the black line represents R-augmentation (Eq. 6.4), while the blue line represents the control. Each of the other differently colored lines represents S-augmentation towards a target value (Eq. 6.7), for a range of different values. The correspondence of color to target values is: orange to 2.0, green to 4.0, red to 6.0, cyan to 8.0, and magenta to 10.0. The shaded area of corresponding color around each line represents the standard error. The inset of the graph provides an expanded view of the indicated subset, to exemplify the difference between conditions.

Figure 6.7: Heatmap of average MAP-Elites results for the Active Categorical Perception Task and the Number Comparison Task. The top heatmap represents results from the ACP task, while the bottom heatmap represents results from the NCT task. Each rectangle represents the average value for a bin in MAP-Elites, for some subset of the range of R on the Y-axis and S on the X-axis. The value is the average percentage of correct decisions, with the highest average value occurring where R is between 0.8 and 1.0 for the ACP, and between 1.0 and 1.2 for the NCT task; and where S is between 3.6 and 4.5 for the ACP, and between 2.7 and 3.6 for the NCT task.
Chapter 1

Introduction

True generalized intelligence, as far as we know, has only ever arisen by evolving through biological processes. In recent history, a number of attempts have come to the fore with the aim of producing an approximation of or an equivalent to general or human-level intelligence. These attempts include various approaches in machine learning and deep learning, which generally tune a particular system or network to best learn a set of training data and subsequently also perform well on testing data. These systems produce impressive results, particularly in the game playing (recently, Go [Fu, 2016]) and object recognition (e.g., on the MNIST [Deng, 2012, Mohapatra et al., 2015] or CIFAR [Krizhevsky et al., 2009, Liu and Deng, 2015] datasets) domains. These systems gain mastery of specific tasks, but often seem to lack the ability to generalize in broader directions. With the goal of overcoming these issues and creating a general artificial intelligence, I study neuro-evolution, or the use of genetic algorithms to generate artificial cognitive systems or neural networks. By approaching the problem from a bio-inspired perspective, neuro-evolution can follow in the footsteps of biological intelligence and hopefully produce a truly generalizable, human-level artificial intelligence.

The genetic algorithms that are used for optimizing artificial cognitive systems require an assessment of how well different systems perform relative to each other, either in terms of specific task performance (e.g., the number of decisions made correctly in a game) or by some other measure. Evolving cognitive systems solely using task performance may ultimately lead to the rise of intelligent behavior, but this may take a longer period of time than is feasible for experimental or practical purposes. To combat this defect, we need to examine alternate ways of achieving better task performance in shorter evolutionary time. Thankfully, as we no longer need to strictly follow the whims of biological evolution, we are at liberty to tune the process to our liking. That is not to say that we should disregard the operation of biological evolution in its entirety; to the contrary, we want to take the best parts of biological evolution and tune them even further.

Although task performance may be sufficient to define a fitness function and a genetic algorithm, such a simple function lacks the nuance and complexity of biological evolution. Biological organisms are often evaluated on a number of factors at once. For example, a mouse's fitness depends on its ability to avoid predators, find food, and interact with other mice. Although not directly measured, the mouse's ability to understand its environment is still a strong component of its evolutionary fitness. A mouse that can, for example, construct a mental model of what the warning signs of predators are, or what the shape and smell of a good food source is, will be more likely to survive and have offspring. Thus mental models and an ability to understand one's environment may have a real impact on fitness in biological evolution. Fortunately, in silico we can subvert the intermediary steps of building up a repertoire of tasks and directly attempt to measure and select for the agent's understanding, or mental models. If we can successfully identify useful mental models that an agent produces, and select for those along with task performance, we can potentially boost task performance overall by creating an opportunity for exaptation.
The model would be selected for earlier in evolution as part of the modified fitness function, so that it can later turn into a useful part of the network and complete the desired task. Such an approach would be analogous to informing someone of a specific sequence that will occur in a video game, and enforcing that they remember how to exploit it. Although that memory may not improve their ability to beat the game at the exact moment, by ensuring that they remember how to exploit the sequence, it improves their task performance (e.g., their ability to complete the game) later on. As we have complete control over these artificial cognitive systems, we can access internal states to analyze what they know about their actions. To borrow terminology from philosophy and psychology, we can examine their internal representations - information that they have integrated from the environment into a usable form. To establish a common vernacular for these investigations, the following sections describe a variety of computational substrates, methods, genetic algorithms, and tasks used over the course of this dissertation.

1.1 Computational Substrates for Artificial Cognitive Systems

This dissertation, in order to pursue an understanding of memory and mental representations that is broadly applicable, utilizes a variety of cognitive systems and structures. These structures vary in their origin and functionality, but all share a common thread in that they have a recurrent memory component. This memory structure is vital, as without this structure, representations as I define them are impossible. Each system also has the ability to read from a set of inputs and write to a set of outputs, which allows the cognitive system to interact with the task environments. Lastly, each cognitive system performs a series of computational operations to transform the data from the inputs to the outputs. The cognitive systems outlined below include Markov Brains, Artificial or Recurrent Neural Networks, Genetic Programming Trees, NeuroEvolution of Augmenting Topologies components, and Long Short-Term Memory networks.

Markov Brains [Hintze et al., 2017] have composable computational components with a flexible, sparse network. Artificial Neural Networks (ANNs) [Russell et al., 2003] are a broad class of networks where a matrix of weights determines the calculation of values from one layer to the next, working from inputs to internal layer(s) to outputs. I use a subset of this class, the Recurrent Neural Network (RNN), that has an additional recurrent state added to both the input and output layers. Long Short-Term Memory Networks (LSTMs) [Hochreiter and Schmidhuber, 1997] are a different subset of ANNs, where the internal layers and computational processes are strictly defined to provide both long-term storage of information and short-term memory. NeuroEvolution of Augmenting Topologies (NEAT) defines an algorithm for generating ANNs or RNNs. Genetic Programming (GP) trees are defined as a tree-like structure of mathematical operations [Koza, 1994, Miller, 2011], where the inputs to the operations are either constants, input or recurrent node values, or the results of earlier computations in the tree. Although each of these systems follows the same broad structure of input to computational process to output, it is in the differences both major and minor that I can examine the true nature of representations.

1.1.1 Markov Brains

Markov Brains are the primary cognitive system used in the research for this dissertation.
A Markov Brain, in the abstract form, is a series of input, output, and hidden (i.e., recurrent) nodes, which are read from and written to by a series of computational modules or gates. As a result, Markov Brains are sparse - their connectivity must be evolved - and highly composable. Each gate can be defined as any input/output pattern-matching system or network, up to and potentially including other cognitive systems, such as another Markov Brain. However, in this dissertation I limit the gates available to evolution to a subset of all possible gates. The canonical version of Markov Brains uses deterministic and probabilistic logic gates with between 1 and 4 inputs and between 1 and 4 outputs. A more recent version of the Markov Brain, referred to as a "Buffet Brain" [Hintze et al., 2019], uses those two original gates in addition to ANN gates, GP gates, and NEAT gates. Each gate reads from (i.e., is wired from) a set number of input or hidden nodes or states and writes to (i.e., is wired to) a set number of output or hidden nodes. The number of input and output states
In the GP gate used as part of the Markov Brain, I use a more limited subset of common unary and binary input functions (see [Hintze et al., 2019] for a complete list) to provide computational capability. GP Trees have been used to evolve functions that model real-world process, and have also been adapted to create agent controllers. For the full network structures used here, particularly in Chapter 3, I look at the versions that serve as agent controllers. I use two primary types of GP Tree based networks, the GP-Forest and GP-Vector Brains. For the GP-Forest Brain, I construct an indirectly-encoded GP Tree for each output and hidden/recurrent node. Each node in the tree is a function, a constant, or the previous value of an input or hidden node. For the GP- Vector Brain, vector operations are used instead of unary or binary operations on individual values. Each node is either a constant vector, the entire state vector (input, output, and recurrent/hidden states combined), or the result of a mathematical operation applied to all elements in the child node vector value(s). 1.1.4 NEAT Gates While the ANN and GP gates are described in conjunction with the corresponding standalone brain type descriptions, here I describe the NEAT gate used by the Markov Brain in a few parts of this dissertation. The NEAT gate type, described comprehensively in [Hintze et al., 2019], is a hybrid between the GP and ANN gates. As used here, it takes a set of inputs, with a weighted summation and hyperbolic tangent activation function (similar to the ANN gate) and applies a mathematical function to the result, similar to the GP gate. This formulation of the NEAT gate draws on the mathematical transformation of networks concept sourced from the original NEAT work. 6 1.1.5 LSTM Networks The Long Short-Term Memory network (LSTM) is a variation of the RNN specifically designed so that there are two main channels of memory: the cell states and the hidden states. The mathematical operations that describe the connectivity of each of these layers are designed to maintain one set of states as a short-term memory analogue (the hidden states), while the other is meant to track more closely with biological long-term memory storage (the cell states). Both the cell states and hidden states are recurrent, and as such the total number of hidden and cell states is set in this work to equal the number of recurrent states in the other cognitive system types, expanded as needed by padding the cell layer. A number of sigmoid and hyperbolic tangent functions operate directly on the hidden nodes and integrate information directly from the inputs, causing it to be more volatile. The opportunity to manipulate the cell state, on the other hand, is gated by sigmoid and point multiplication operations, leading to higher long-term stability. 1.2 Mental Representations Mental representations have a long and storied history in philosophy, psychology, and cognitive science. The idea that the mind has some sort of internal understanding has been a central point across multiple disciplines, and underlies our understanding of biological intelligence. To be clear, I refer here to mental representations as internal, sensor-independent models of the world possessed by an individual. These representations can range from the simplest of concepts, including amounts or relative values, all the way up to complex understandings of other individuals and the actions they might take [Collins and Gentner, 1987]. 
Although it is clear that these representations are present in biological and human instances, their utility in the creation of artificial cognitive systems has been questioned. Some have gone so far as advocating to abandon representations altogether, and to instead pursue ”intelli- gence without representation.” [Brooks, 1991] This work emphatically disputes that notion. 7 ’Intelligent’ systems without representations would be purely reactive systems, unable to respond to dynamic environments when prior states or information must be considered to make the optimal decision. Instead I posit that representations are a necessary component of artificial cognitive systems, and furthermore that the creation of such representations must be encouraged to create truly intelligent artificial cognitive systems. Previous work supports this notion, both that mental representations exist in artificial cognitive systems [Marstaller et al., 2013], and also that by encouraging mental representations I may create systems that perform better on a given task [Schossau et al., 2015]. The crux of this work is to better understand how such representations form, and how I may better encourage the evolution of representation and better-performing artificial cognitive systems. 1.3 Neurocorrelates and Information Theory In order to assess mental representations in artificial cognitive systems, I must quantitatively evaluate their presence or absence. To do so, I turn to neurocorrelates, which are measured correlations between the network state and conditions related to the environment, either in what the actual relevant state of the environment is or what information the agent is presented with. I use information-theoretic measures to formally quantify these neurocorrelates. 1.3.1 Entropy Measurement and Pseudo-counts In order to calculate a Shannon entropy, one can either find the mathematical real value or come to an approximation of that value from sampling. Here I use the latter, finding the shared information by an approximate calculation over the observed network behavior, generated by running the cognitive systems through the environment and recording relevant states. I can then apply Shannon entropy calculations to calculate correlations, entropy, and shared information between relevant world (or environmental), recurrent (or hidden), and sensor (or input) states. That is, I record the state of the world, the hidden states of the 8 cognitive system, and the states that the system is receiving at every time point during the execution of the task. The hidden states of the brain collectively form the independent variable B, the environmental states form the independent variable E, and sensor states the independent variable S. The probabilities of each state are determined by counting the number of times each state occurs over the course of the agent’s lifetime. These counts conducted over the lifetime of the agents are only a portion of the state space, and thus are a pseudo-count or approximation of the true counts. For example, if the agent’s lifetime was 10 time steps long, and a certain sensor state sq appeared twice, the probability P (sq ) would be 0.2. The probability for each state of each variable and joint combinations of different variables are calculated, and used to calculate the Shannon entropies. The Shannon entropy H for a given variable X with unique states xi is shown in Equation 1.1. 
H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i) (1.1) A joint entropy for a given set of states (e.g., S and B) would use the probability that the same brain and sensor states occurred simultaneously, also calculated using the pseudo-count method. By calculating the individual entropy of each set of states or the joint entropy of two or three sets of states, I can determine the mutual or shared information between those states. With this shared information, I can formalize our assessment of what representations the cognitive system has about its environment. 1.3.2 R: A measure of Representation R is an information-theoretic measure of mental representations in artificial cognitive systems. Formally, it is the information shared between the hidden states of the cognitive system and the environment not currently present in the sensors (i.e., information that has already been integrated into the system, but not what is actively presented). However, this definition does not say that R is independent of the sensors entirely; the internal models used can and often are conditional on previous inputs. This can be explicitly formulated as I(B : E|S), that is, the information shared between the entirety of the brain hidden or recurrent states and the entirety of the environmental or task states, given all of the information currently present in the sensors. The shared conditional information I(B : E|S) can also be visualized using the information-theoretic Venn diagram, shown in Figure 2.3. Previously, R was calculated using the expanded form of Equation 3.2, although a portion of this work (primarily 3.3.1) demonstrates how the calculation may be simplified to the more efficient Equation 3.3. 1.3.3 S: A measure of Smearedness While R measures the absolute amount of information that the cognitive system stores about the environment, smearedness presents a more nuanced view of how the information that the system integrates is spread across the hidden nodes. Formally, smearedness measures how much information about discrete concepts in the environment is spread across, or concentrated in, individual hidden nodes in the cognitive systems. Smearedness was originally introduced in [Hintze et al., 2018], and is used here as a complement to R. In [Hintze et al., 2018], we define two different formulations of smearedness: smearedness of nodes Sn and smearedness of concepts Sc. To calculate either, I must first quantify the amount of information that each hidden or recurrent node in the cognitive system has about each concept of the task or environment. This is done for all nodes n and all concepts c by calculating the 'atomic R' for each pair and filling out the matrix of combinations M as follows. The whole memory state in the joint random variable B is separated into the individual random variables Bi, where B = B1 B2 ... By. Likewise, the environmental state variable E is decomposed into the individual random variables Ej, where E = E1 E2 ... Ez. Each individual entry in the matrix Mi,j is defined for a memory node Bi and a concept Ej, and is calculated in a manner identical to R, but only considering the individual node and concept instead of the entire state space for each. That is, each matrix element of M is calculated as Mi,j = I(Bi : Ej |S). Using M, I can take a pairwise minimum either over each node for Sn or over each concept for Sc, and sum the minima to create the final calculation. The formula for Sc is shown in Equation 1.2, and the formula for Sn is shown in Equation 1.3.
S_c = \sum_{i} \sum_{j>k} \min(M_{ji}, M_{ki}) (1.2) S_n = \sum_{i} \sum_{j>k} \min(M_{ij}, M_{ik}) (1.3) In recent work [Kirkpatrick and Hintze, 2019] we show that the two formulations of smearedness, Sc and Sn, broadly correlate, and here I will use the notation S to refer to the smearedness of nodes, Sn, or to the broader concept of smearedness. 1.4 Optimization Methods and Genetic Algorithms In order to create a network that controls an agent with the goal of completing a given task, I must be able to refine and optimize the network for better performance. There are many algorithms used for these optimizations [LeCun et al., 1998, Kaur et al., 2018, Yao, 1999, Stanley and Miikkulainen, 2002], although for this work I focus on a subset of optimization methods that fall under the category of Evolutionary Optimization, and more specifically Genetic Algorithms (GAs). GAs operate primarily on systems that can be described using a genome, and that are typically organized in populations of different networks. Here I use circular genomes with point, insertion, and deletion mutations in populations of typical size 100. In most GAs, the population or a portion of the population is evaluated on the task using a fitness function, and the best performers are selected to create the next generation. I use two primary GAs in this work, Roulette Selection and MAP-Elites, and investigate how using R and S in conjunction with fitness in the GAs can improve the task performance of agents produced by the GA. 1.4.1 Roulette Selection The primary GA used in this work is Roulette selection, or fitness-proportionate selection. Suppose agents are evaluated on a task, resulting in a fitness ω. In fitness-proportional selection, the agents are chosen to propagate to the next generation with probability proportional to ω. For example, if there are two agents in a population where one has fitness ω = 7, and the other ω = 3, the first has a 70% chance of being selected to be in the next generation, while the other has a 30% chance of being selected for the next generation. Under this selection regime, the proportion of different fitness values matters, and so multiplicative increases in fitness that happen under R-augmentation may have a larger effect than under other selection regimes like Tournament selection, where only whether one fitness is greater than the other matters, not their relative sizes. 1.4.2 MAP-Elites MAP-Elites is a genetic algorithm designed to maintain phenotypic or genotypic diversity among the organisms in the population. To achieve this goal, different measures are established as axes in a multidimensional space, and further subdivided into subsections by selecting different ranges of values for those measures. The intersection of these subsections creates bins in which organisms meeting the criteria for falling into that bin (i.e., by having the values for each measure or metric fall into a specific range) only compete with each other on a final metric, typically task performance. The MAP-Elites algorithm can be described as follows. I create an initial population, and evaluate each agent on the different metrics. I classify each organism or agent in the population into the correct bin based on the values of its corresponding metrics. If two or more agents fall into the same bin, I maintain the agent (or agents in some variations) with the highest fitness, and discard the others.
I then select organisms from the population stored in the grid at random, evaluate them on the various metrics, and assign the agents to bins as specified. Again, if any organisms fall into the same 12 bin as a preexisting organism, only the best is kept. I show this process in Figure 1.1 This allows for the exploration of the genotypic or phenotpyic space, as well as identifying the values of the metrics which allow for optimal performance. As used in this work, the axes for MAP-Elites are defined by R and S, and task performance (ω) is used as the deciding metric. 1.4.3 Multi-objective optimization and R-Augmentation Many different options exist for multi-objective optimization [Deb, 2001], ranging from particle swarm optimization [Parsopoulos and Vrahatis, 2002] to directed search [Wang et al., 2017] or other various explorations of the Pareto-front [Konak et al., 2006]. Although I view these methods as viable paths of exploration for finding agents that achieve higher task performance and have better internal representations, I do not seek to compare them in the context of this work. The integration of such methods would introduce additional complexity and distract from our goal of understanding how representations are formed and can be used to create agents with higher task performance. Here I narrow my focus on a simple multi-objective method that has been shown to accomplish the stated goals [Schossau et al., 2015]. This method, referred to as R-augmentation, involves modifying the original fitness function (e.g., Equation 2.2) so that the measured mental representation, R, is also taken into account and increases in R are rewarded alongside fitness. This results in the modified fitness function, shown in Equation 2.3. My primary focus in this thesis is on demonstrating the applicability and utility of R-augmentation, as well as investigating what factors allow the method to produce agents with better task performance. I also explore some alternative methods of augmentation and multi-objective optimization by the inclusion of alternate neurocorrelates (namely S), including the use of the MAP-Elites algorithm, in Chapter 6. 13 Figure 1.1: A simplified version of the MAP-Elites algorithm. Each organism is represented by the circle, and each square represents a bin for the algorithm. A prototypical organism is labeled (a), and demonstrates that each agent has a task performance score ω along with a measurement of the agent’s R and S. The grid is defined in terms of S on the x-axis and R on the y-axis. The bins for R are from 0 to 1, 1 to 2, and 2 to 3 while the bins for S are from 0 to 2, 2 to 4, 4 to 6, 6 to 8, 8 to 10, and 10 to 12. The algorithm will place the organism labelled (b) in the square with the same light blue shade based on the organism’s R and S values. The organism will be successfully placed in the bin because there are no organisms in that bin previously. The algorithm will place the organism labelled (c) in the square with the same light green shade based on the organism’s R and S values. The placement will succeed as the organism labelled (c) has a higher task performance than the organism already in the cell. The algorithm will attempt to place the organism labelled (d) in the bin of the same light yellow color based on the organism’s R and S values. The placement will fail, however, because the organism that is already in the bin will remain because it has a higher task score, and the organism labelled (d) will be discarded. 
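To make the binning logic above concrete, the following minimal sketch (Python) mirrors the grid of Figure 1.1: each bin holds at most one elite, and a newcomer replaces the resident only if its task performance ω is higher. The helper names and the dictionary-based grid are assumptions for illustration, not the implementation used in this dissertation.

```python
import random

# Bin edges mirroring Figure 1.1: R in [0, 3) split into three bins, S in [0, 12) split into six bins.
R_EDGES = [0, 1, 2, 3]
S_EDGES = [0, 2, 4, 6, 8, 10, 12]

def bin_index(value, edges):
    """Return the index of the bin containing value, clamping values at or above the top edge."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    return len(edges) - 2

def try_place(grid, agent):
    """Keep agent (a dict with keys 'R', 'S', 'w') only if its bin is empty
    or its task performance w beats the resident elite."""
    key = (bin_index(agent["R"], R_EDGES), bin_index(agent["S"], S_EDGES))
    resident = grid.get(key)
    if resident is None or agent["w"] > resident["w"]:
        grid[key] = agent

def map_elites_step(grid, mutate, evaluate):
    """One steady-state step: pick a random elite, mutate it, evaluate R, S, and w, and re-place it."""
    parent = random.choice(list(grid.values()))
    child = evaluate(mutate(parent))  # evaluate is assumed to return a dict with 'R', 'S', and 'w'
    try_place(grid, child)
```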
14 1.4.4 Machine Learning While GAs are the only method of optimization addressed in this work, it should be noted that the information-theoretic methods used here should be no less applicable to networks or systems optimized using Machine Learning (ML) techniques such as gradient descent for backpropagation. Given the unpredictable and stochastic nature of evolutionary optimiza- tion, the performance benefits of R-augmentation are especially notable. That being said, adding R as a portion of metrics calculated for the loss function in ML applications has the potential to produce similar gains in task performance. The increased size of ML-trained networks, particularly in the context of Deep Learning, may limit the applicability of R as a training metric unless further optimizations are made. Regardless, R, S, and other information-theoretic tools may yet prove to be useful in the analysis of black-box systems produced by ML methods. 1.5 Tasks To evolve intelligence, I must first have a world or environment for intelligent agents to evolve in. Moreover, when seeking to evolve or assess mental representations on these cognitive systems, I must have a well-defined environmental state to serve as a starting point for the evaluation of the information-theoretic measures used here. Tasks that lack a well-defined environmental state (i.e., a state that is useful to the cognitive system) will not produce meaningful results in terms of the information-theoretic neurocorrelates. Rather than create new tasks to serve our purposes, in this dissertation I use two tasks identified in previous work as having suitable environmental states: the Active Categorical Perception Task [Marstaller et al., 2013], and the Number Discrimination Task [Kirkpatrick and Hintze, 2019]. 15 1.5.1 Active Categorical Perception Task The Active Categorical Perception Task (ACP) is a primary task used in the study of representations in evolved artificial cognitive systems [Marstaller et al., 2013, Schossau et al., 2015, Hintze et al., 2018]. The task itself, which originates in work on perception [Beer, 1996, Beer, 2003, van Dartel et al., 2005], is relatively straightforward; an agent sees a block falling towards it, and must either catch or avoid it based on an observed property or set of properties of the block. The original version of the task [Marstaller et al., 2013] has an agent with a pair of sensors of width 2 facing upwards (i.e., towards the falling blocks), separated by a break of width 2 between the sensor blocks (i.e., a blind spot). Blocks fall downwards and either to the left or the right at a constant pace. In the default task, small blocks (of size 2) must be caught while large blocks (of size 4) must be avoided, but these requirements may be changed to suit the experimenter. A block is deemed to be caught if any part of the agent (sensor or blind spot) makes physical contact with the block. The agent can move to the left or to the right, or remain still on any given time step. The world is toroidal (i.e., if the agent or block goes too far to the left it will wrap around to the right, and vice versa) with a width of 16 and a height of 32. For each evaluation of the agent, every possible permutation of block size (large or small), direction (falling left or right), and starting position relative to the agent (out of the 16 positions in the environment) is used to test the agent. 
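Written as code, this exhaustive evaluation is a loop over the 2 x 2 x 16 = 64 permutations. The sketch below (Python) is a minimal illustration in which the hypothetical run_trial function stands in for the actual world update and agent control loop.

```python
from itertools import product

WORLD_WIDTH = 16

def evaluate_acp(agent, run_trial):
    """Tally correct (C) and incorrect (I) catch/avoid decisions over every permutation of
    block size, fall direction, and starting position relative to the agent."""
    correct = incorrect = 0
    for size, direction, offset in product(("small", "large"), ("left", "right"), range(WORLD_WIDTH)):
        # run_trial is assumed to simulate one falling block and return True when the agent
        # makes the right decision (catch the small block, avoid the large one).
        if run_trial(agent, size, direction, offset):
            correct += 1
        else:
            incorrect += 1
    return correct, incorrect  # 2 sizes x 2 directions x 16 positions = 64 decisions in total
```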
The environmental states that are used for analysis in R and S are whether the block is large or small, whether the block is falling to the left or to the right, and whether the block is to the left or to the right of the agent at the current time step. Although other variants of the task are possible (e.g., using blocks that change shape as they fall), for this work I will keep the original layout of the task as described here unless explicitly stated otherwise. The agent is evaluated on the number of correct (C) and incorrect (I) catch or avoid decisions made over all permutations. Figure 1.2: A subsection of a single case of the Active Categorical Perception Task. The agent, visualized by the orange box, can move either to the left or right as indicated by the orange arrows. The four sensors, labelled 1, 2, 3, and 4, can each see directly upwards as indicated by the light blue shading. A small block with width 2 is falling towards the agent and moving to the left, as indicated by the green arrow. At the current point in time, sensor 3 is the only sensor that would return a signal, while on the next update, assuming that the agent did not move, both sensors 3 and 4 would return a signal. 1.5.2 Number Discrimination Task The Number Discrimination Task is a secondary task used for the evaluation of representations in evolved artificial cognitive systems, adopted more recently [Kirkpatrick and Hintze, 2019] in silico. The task, however, has its roots in biological [Nieder, 2018, Merritt et al., 2009] and psychological [Merritt and Brannon, 2013] analogues. For the task as used here, the agent has a set number of binary inputs N. On the first update, a subset of the inputs x (where 0 ≤ x ≤ N) is set to 1, and the remainder are set to 0. After a given number of intermediate updates i, where the agent may receive a constant signal or noise, the agent then receives a second signal where a second number of the agent's inputs y are set (where 0 < y ≤ N and y ≠ x). After a second set of intermediate updates, again of length i, the agent is given a final constant signal where it must output a 1 if x > y, and a 0 otherwise. The agent will see all possible permutations of x and y, both in the variations of patterns shown (i.e., for N = 5, x = 2 could present as 00101 or 11000) and variations in the pairs of numbers (i.e., for N = 5, some examples would be x = 2 and y = 3, x = 3 and y = 2, x = 5 and y = 1, or x = 4 and y = 5). I show an example of one case of the task in Figure 1.3. The relevant environmental states used for R and S are a binary state corresponding to which values of x and y are currently being tested, and a binary state indicating if the first number was larger than the second. The agents are evaluated on the number of correct (C) and incorrect (I) evaluations of the ordering over all permutations. Figure 1.3: A single case of the Number Discrimination Task. The agent receives a series of inputs over a period of 9 brain updates. The input values are shown in the squares under "Sensors", where each square represents a unique sensor input, and correlate with brain updates listed under "Time". The description of each sensor input is likewise listed under "Description". The agent receives a first number, here 3 (note that 3 of the sensors have a 1 as input), and then has 3 intermediate updates where noise is fed to each sensor with probability p = 0.5. The agent then receives a second number, here 4 (note likewise that 4 of the inputs are 1), and then a second period of 3 intermediate updates where noise is fed to the sensors with the same probability p. After the second set of noise inputs, the agent receives an output signal of all '0' inputs, and is expected to output a 1 because the second number is larger than the first. 1.6 Overview This work focuses on the use and effects of augmenting the fitness function to emphasize mental representations in artificial cognitive systems. In Chapter 2 I study the effect of R-augmentation on different computational components, and investigate the cognitive system's ability to generalize to within-domain tasks and its ability to resist sensor noise, finding that the performance advantages are broadly retained. In Chapter 3 I look at the effect of augmentation on different network structures.
The advantage in performance seen primarily in Markov Brains is also seen in GP-related structures and ANNs. Chapter 4 explores how augmentation affects brains made of multiple computational components in the same structure, and uses this to examine reasons as to why augmentation enhances the evolutionary regime. In Chapter 5 I focus on the idea that R-augmentation can allow cognitive systems to integrate more information from the environment and maintain task performance as sample size decreases (i.e., as fewer examples from the task are shown during evolution). This hypothesis is examined in the context of in-domain and out-domain generalization (i.e., R-augmentation allows better performance on unseen trials, across different evolutionary sample sizes). Lastly, Chapter 6 goes beyond the scope of simple R-augmentation, where I examine how augmentation could target a value beyond simple minimization and maximization. Specifically, I look at augmenting the fitness function to bring smearedness to an optimal value. In order to determine this optimal value, I use MAP-Elites to examine the fitness landscape in terms of representation R and smearedness S, and investigate the potential use of MAP-Elites as an optimizer in its own right. Chapter 2 Addressing concerns regarding potential limitations or effects of R-augmentation A portion of this chapter was originally published as "Kirkpatrick, D., & Hintze, A. (2019, July). Augmenting neuro-evolutionary adaptation with representations does not incur a speed accuracy trade-off. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (pp. 177-178)." Significant portions were edited or added for inclusion in this dissertation. 2.1 Abstract Representations, or sensor-independent internal models of the environment, are important for any type of intelligent agent to interact with an environment. Imbuing an artificially intelligent system with a model of the world it functions in remains a difficult problem. However, using neuro-evolution as the means to optimize such an artificial neural network or other similar artificial cognitive system allows the genetic algorithm to evolve networks with proper models of the environment. Previous work introduced an information-theoretic measure, R, which quantifies how much information a neural computational architecture (henceforth loosely referred to as a brain) has about its environment. In addition, R can be used to speed up the neuro-evolutionary process by including an evaluation of R as part of the fitness function.
However, it is not known whether that this improved evolutionary adaptation comes at a cost to the brain’s ability to generalize or the brain’s robustness to noise. It is also not known how this process of using R as part of the selection scheme (referred to as R-augmentation), will perform when different internal mathematical and logical functions (referred to as computational components) are evolved in the same network structure. In this chapter, I show that neither of these issues are a concern. R-augmentation produces results similar to a standard genetic algorithm when examining robustness and ability to generalize, and the improved performance remains constant across different computational components. Furthermore, I find an improved ability of the brain to evolve in noisy environments when R is used to augment evolutionary adaptation. 2.2 Introduction Neuro-evolution has shown promising results in creating more capable and adaptable solutions for problems in artificial intelligence [Yao, 1999, Floreano et al., 2008, Stanley et al., 2019]. One of the downsides to using evolution rather than the more standard backpropagation to ”train” a network is that evolutionary search itself is inherently random and as a consequence, an optimal agent may not evolve quickly. Similarly, it is possible that the evolutionary search gets stuck, and results in a less than optimal solution. To combat this defect, many methods have been proposed to accelerate neuro-evolution, for example by using an indirect encoding [Stanley and Miikkulainen, 2002], increasing the diversity in the population [De Jong, 1975], or by protecting new mutations with novelty search [Lehman and Stanley, 2008]. Another method rewards solutions that are more modular [Clune et al., 2013], and indeed this method 22 is technically the closest to the approach used here. Each of these methods shares in common the influence of a measurable quantity on the agents’ evolutionary optimization. One issue that has already been identified to be specifically troublesome in deep learning is representations. It seems as if neural networks are easily fooled [Nguyen et al., 2015] either because they are over-fitted, or because they simply cannot generalize well. Deep neural networks lack a robust internal model, and appear to rely too heavily on specific irregularities in the data used for training. In image recognition these could be surface irregularities in specific images that are learned, for example [Jo and Bengio, 2018]. The fact that these deep learned models do not form true representations mirrors previous work which suggests that one might even get away with creating intelligence without representations [Brooks, 1991]. However, direct programming or training is not the only way to generate internal models; they can be evolved as well. Neuro-evolution allows systems to create cognitive agents that have meaningful and robust representations about their environment [Marstaller et al., 2013]. Furthermore, it has been shown that one can augment a genetic algorithm (GA) using this measurement of representations to improve network task performance relative to a standard GA [Schossau et al., 2015]. This process works by first measuring representations and then rewarding the agent for having these representations by including the measurement in the fitness function. This is similar to a multiple objective optimization [Zhou et al., 2011], but where both parameters are mostly in line with the same goal. 
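Anticipating the formal definitions given later in this chapter (Equations 2.2 and 2.3), the augmentation amounts to a single extra multiplicative term in the fitness calculation. The sketch below is a minimal Python rendering, with Rmax = 3.0 taken from the value used later in this chapter; the function names are chosen only for illustration.

```python
R_MAX = 3.0  # maximal size of I(W : B) for the three binary world concepts used in this chapter

def fitness(correct, incorrect):
    """Unaugmented exponential fitness (Equation 2.2): W = 1.10^(C - I)."""
    return 1.10 ** (correct - incorrect)

def fitness_r_augmented(correct, incorrect, r):
    """R-augmented fitness (Equation 2.3): W = 1.10^(C - I) * (1 + R / Rmax)."""
    return 1.10 ** (correct - incorrect) * (1.0 + r / R_MAX)

# Example: 50 correct and 14 incorrect decisions with R = 1.5 scales fitness by a factor of 1.5.
print(fitness(50, 14), fitness_r_augmented(50, 14, 1.5))
```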
It might at first sight seem impossible to measure ”how much a brain is modelling about the world” but as the purpose of these models is to make predictions about the world, there must be a correlation between the brain and the world that can be measured quantitatively using information theory. Such a measure, referred to as R, was introduced in 2013 and quantifies the information shared between a brain’s internal states and the environment that is not shared with the sensors. Colloquially, R is the information that the brain knows about the environment that is not coming from the sensors. Previous work has shown that using R as a co-evolutionary factor (i.e., by rewarding the agent for the presence of R alongside 23 task performance) increases the rate of adaptation in the genetic algorithm [Schossau et al., 2015]. In the following, I will call this method of using R as a portion of the fitness function “augmenting the genetic algorithm with R”, or R-augmentation. While the benefits of R-augmentation have been demonstrated on a variety of fitness functions, there is some concern that this method might impair a network’s robustness, since it is known that the rate of adaptation and the degree of robustness are tightly connected [Wag- ner, 2012]. In other words it is possible that augmenting a genetic algorithm with R might speed up evolution but cause detrimental trade-offs for other aspects of intelligence. Here I investigate two aspects of intelligence that might suffer from trade-offs and are relevant for artificial intelligence applications: the brain’s ability to generalize as well as a brain’s resilience to noise. Often in machine learning, (e.g., when training a classifier), the distinction between robustness and generalization is hard to make, and the terms are often used interchangeably. Here, I define generalization as the ability to deal with a new task the system was not explicitly trained on. Robustness on the other hand is the ability to perform the same task even with significant levels of noise added to sensors. Previous work [Schossau et al., 2015] has focused on the use of a specific architecture (the Markov Brain) which uses a specific computational processing component, the deterministic logic gate. Other work [Hintze et al., 2019, Hintze et al., 2017] has introduced the possibility of using alternative computational units to expand the capability of the Markov Brain system. However, we do not know how (or even if) R-augmentation works when Markov Brains utilize computational components other than Boolean logic. In this paper, I show that neither the ability of Markov Brains to generalize nor the brain’s robustness to noise is reduced when the evolutionary process is augmented by using R. In particular I find that the R-augmentation process creates an increase in performance regardless of the computational component used. Moreover, I found co-evolving with R results 24 in better performance when evolving in noisy environments. 2.3 Materials and Methods 2.3.1 Markov Brains Markov Brains are a special form of evolvable neural networks, also sharing similarities with Cartesian genetic programming [Miller and Thomson, 2000]. Here they are used to control a virtual agent, and are optimized using a genetic algorithm. The optimization is based on the performance of the controlled agent in a test environment. 
Like a conventional Recurrent Neural Network (RNN) [Russell and Norvig, 2016] Markov Brains have input nodes that receive data from the agent’s sensors, can process information over time using recurrent hidden nodes, and have output nodes that are used as actuators allowing the agent to move in the environment. However, instead of having a fixed, layered topology like an RNN, Markov Brains use computational gates that can read from and write into input, hidden, and output nodes. The way the gates connect to nodes, how many, and which types of logic each gate executes are all evolvable properties. Markov Brains have originally been designed using deterministic [Marstaller et al., 2013] or probabilistic logic gates [Edlund et al., 2011], but have since then been extended to include many types of computational elements [Hintze et al., 2019]. A Markov Brain uses a linear vector of bytes as a genome, with circular wraparound as necessary. The genome is read sequentially, and each time a specific sequence of numbers is found (e.g., 42 followed by 213 for a deterministic logic gate) the following numbers are used to specify the functionality, logic, and connections of each gate. This is loosely analogous to a start codon defining the start of a gene, and the following sequence of nucleotides encoding a protein. For a more detailed description of Markov Brains see [Hintze et al., 2017]. All experiments are implemented using the MABE code framework [Bohm et al., 2017]. For the primary experiments about generalization and robustness, I only use the deterministic logic gate. For the examination of gate types, I run experiments where the brain uses an 25 individual gate type, with brains constructed using each of the Probabilistic Logic Gate, the Genetic Programming (GP) Gate, the Artificial Neural Network (ANN) Gate, and the NeuroEvolution of Augmenting Topologies (NEAT) Gate. Markov Brains are evolved for either 40,000 (for the generalization and robustness ex- periments) or 10,000 generations (for the individual gate experiments) in populations of 100 organisms. The start population is initialized with random genomes of length 5000. At each generation the genomes experience gene duplications (with probability 2x10−5 per site of the parent genome, duplicate the next 128 to 512 contiguous sites), deletions (with probability 2x10−5 per site of the parent genome, delete the next 128 to 512 contiguous sites), and point mutations (with probability 5x10−3 per site of the parent genome, randomize the site). The genome size is capped at a maximum of 20,000 sites and a minimum of 2,000 sites. 2.3.2 Active Categorical Perception Task I selected a task previously used in the quantification of R [Marstaller et al., 2013] as well as in the original discovery of R-augmentation [Schossau et al., 2015]. This is done to take advantage of the previously examined and well-formed definitions of the environmental concepts of the task, which is essential to the proper assessment of an agent’s representation. The task is referred to as the Active Categorical Perception (ACP) task, and originally comes from previous work on agent control schemes [Beer, 1996, Beer, 2003, van Dartel et al., 2005]. In this task, an agent is endowed with a set of upward-facing sensors, as well as a motor that allows the agent to the left and right, as seen in Figure 2.1. Agents exist in a world with a width of 16 units and a height of 34 units, where the left and right sides connect toroidally. 
The sensors are two groups of two individual sensor inputs, separated by a space in the middle. Previous work had set the distance between the two groups to be two units, although here I allow for a variation of either 2 or 3 unit spaces between the sensor groups, as shown in Figure 2.2. The sensors are capable of determining whether an object, referred 26 to as a block, is directly above it or not. The blocks are dropped one at a time towards the agent, and move at a constant rate of 1 unit per time step down and 1 unit either to the left or to the right. Agents have a fixed amount of time (here, 34 time steps) to either be under the block and thus catch it, or to not be under the block and thus avoid it. In previous work, the small and large blocks were of size 2 and size 4, respectively. Here, I vary the size of the large and small blocks, where the small blocks can be between 1 and 3 units wide, and the large blocks can be between 4 and 6 units wide. The task for the agent is to catch the small blocks and avoid the large ones. Figure 2.1: A subsection of a single case of the Active Categorical Perception Task. The agent, visualized by the orange box, can move either to the left or right as indicated by the orange arrows. The four sensors, labelled 1,2,3, and 4, can each see directly upwards as indicated by the light blue shading. A small block with width 2 is falling towards the agent and moving to the left, as indicated by the green arrow. At the current point in time, sensor 3 is the only sensor that would return a signal, while on the next update, assuming that the agent did not move, both sensors 3 and 4 would return a signal. To study generalization, I use a variety of gap widths and small and large block sizes. The combinations of block sizes used during evolution and for testing at the end of evolution are 27 Figure 2.2: The two agent types used for generalization trials. The agent labelled 1 has the standard gap width of 2 units, while the agent labelled 2 has the modified gap width of 3 units. Blue squares indicate sensors, while the white squares indicate the gap in the middle. listed in Table 2.1. For all other experiments, I use the standard gap width of 2 and blocks of size 2 and size 4. Regardless of the unique combination of gap width and block size, the agent’s objective in the ACP task is unchanged: the agent either makes the correct or incorrect decision to avoid or catch each block. Following from previous work, I comprehensively tested each agent over a range of decisions based on the starting conditions of the block. Given the width of the world, the block could start at 16 unique locations relative to the agent which always started with its leftmost sensor at the position with index 0. Given that the blocks are either small or large, move to the left or to the right, and can start in 16 unique starting positions, there are 64 unique combinations which are comprehensively evaluated both during evolution and testing. Agents need to observe the block for multiple updates to make the correct decision; this leads to the integration of information over time which is aided by the creation of internal models and representations. 28 Table 2.1: Conditions for Evolution and Testing of block sizes to catch (small) and avoid (large) as well as the size of the gap between the sensors. 
condition | gap size | small block size (evolution) | large block size (evolution) | small block size (testing) | large block size (testing)
2v4/2/1v5 | 2 | 2 | 4 | 1 | 5
2v4/3/1v5 | 3 | 2 | 4 | 1 | 5
1v5/2/2v4 | 2 | 1 | 5 | 2 | 4
1v5/3/2v4 | 3 | 1 | 5 | 2 | 4
3v5/2/2v6 | 2 | 3 | 5 | 2 | 6
3v5/3/2v6 | 3 | 3 | 5 | 2 | 6
2v6/2/3v5 | 2 | 2 | 6 | 3 | 5
2v6/3/3v5 | 3 | 2 | 6 | 3 | 5
2.3.3 Representations and the neuro-correlate R Representations are defined here as the information that an agent has about its environment (in this case the ACP task) that is not currently reflected in the sensors. This definition limits representations to consist of information that has been integrated into the agent's memory, and thus representations are a reflection of preexisting mental models that the agent has constructed. The information-theoretic measure R (where R stands for Representation) is designed to quantify the amount of representations present under this definition, shown in Equation 2.1. The measurement of R is the result of an entropic calculation over environmental or world states (W), memory or brain states (B), and sensor states (S). R = H(W : B|S) = H(W : B) − I(W : B : S) (2.1) This relationship can also be more easily visualized by the information-theoretic Venn diagram, seen in Figure 2.3. R is calculated in practice using a sampling technique. The states of all three variables W, B, and S are recorded during the lifetime of the agent, and the approximate value of R is found by applying Shannon entropy calculations to the probability of these recorded states. Figure 2.3: Venn diagram of entropies and informations for the three random variables W, S, and B, describing the world, sensor, and agent internal (brain) states. The representation R = H(W : B|S) is shaded. Although the brain and sensor states are easily determined by examination of the Markov Brain's sensor and hidden states, the environmental states require some consideration. In order for R to be an effective measure and be useful for R-augmentation, the world states need to be carefully chosen by the practitioner. The variables chosen as world states attempt to capture the concepts whose knowledge is most likely predictive of the agent's performance, or is the most useful for the agent to understand in order to solve the task. In a slight variation from the original formulation of [Marstaller et al., 2013], I only use three of the environmental concepts proposed there, as these should be sufficient for completing the task: the size of the block (large or small), the direction of block movement (left or right), and the position of the block relative to the agent (left or right). Preliminary work suggested that agents typically do not evolve representations about whether they are currently under the block (the fourth variable used in [Marstaller et al., 2013]), and so I omit it in this work. 2.3.4 Evolutionary Algorithm I test the performance of each agent in each generation using the active categorical perception task. Specifically, since the world is 16 units wide and there are two different sized blocks that could fall in two directions, each agent can experience at most 64 different start conditions. Agents are tested using all these possible start conditions and thus can make in total 64 correct catch-or-avoid decisions.
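Before defining the fitness function, it may help to make the sampling-based estimate of R from Section 2.3.3 concrete. The sketch below (Python) builds pseudo-count probability tables from the recorded per-update states and evaluates R through the joint-entropy expansion R = H(W,S) + H(B,S) − H(S) − H(W,B,S), which is equivalent to Equation 2.1; the tuple encoding of the states is an assumption made purely for illustration.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Shannon entropy (in bits) of a list of hashable states, estimated from
    pseudo-counts over the agent's lifetime rather than the full state space."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def representation_r(world, brain, sensors):
    """R = I(W : B | S) = H(W,S) + H(B,S) - H(S) - H(W,B,S), computed from
    per-update recordings of world (W), brain (B), and sensor (S) states."""
    return (entropy(list(zip(world, sensors)))
            + entropy(list(zip(brain, sensors)))
            - entropy(sensors)
            - entropy(list(zip(world, brain, sensors))))

# Example: states recorded at three updates, each encoded as a tuple of bits.
world   = [(0, 1, 0), (0, 1, 1), (1, 0, 1)]
brain   = [(0, 0), (0, 1), (1, 1)]
sensors = [(1, 0, 0, 0), (0, 0, 0, 0), (0, 1, 0, 0)]
print(representation_r(world, brain, sensors))
```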
I use the number of correct decisions C and incorrect decisions I to define the fitness function as: W = 1.10^{(C-I)} (2.2) I use the same formula for augmenting the genetic algorithm with the measurement of R as defined in [Schossau et al., 2015]: W = 1.10^{(C-I)} (1 + R/R_{max}) (2.3) where R_{max} is defined by the maximal size of I(W : B) (here 3.0). I use roulette wheel or fitness-proportional selection [Thomas, 1996] in conjunction with the calculated fitness values to determine which organisms shall be parents for the next generation of agents. I apply the mutations as described in Section 2.3.1 to the selected parents, and use these mutated offspring to construct a new generation of organisms with the same number of organisms as the previous generation. 2.3.5 Robustness to Noise To evaluate an evolved agent's robustness towards noise, I test all agents on all possible start conditions of the active categorical perception task. Unlike the execution of the task as used during evolution, at every temporal update of the environment I add different levels of noise to the agent's sensor values. Specifically, each of the four sensors had an independent chance to not show the actual environment but contain a random bit. 2.3.6 Generalization Generalization in machine learning, for example in classification tasks, is often described as the ability to not only perform well on the training data, but to also perform well on test data that was not used for training [Russell and Norvig, 2016], but the concept of "generalization" is really more general than that. Here, I instead think of the ability to generalize as the capacity of a mental model to be applied to more than a specific case so that the agent can extrapolate to other similar tasks [Johnson-Laird, 1980]. To be more specific, a classifier trained to distinguish handwritten numerals, for example, might only be able to answer correctly when confronted with images from the training set. Ideally, it should also have the ability to generalize to classify numerals in images it has not seen before. Even better would be the ability to perform related but new tasks (this is similar to transfer learning [Pan and Yang, 2009]). An example of this ability to complete new tasks would be asking the classifier trained on predicting numerals to now distinguish letters. While for typical classifiers this ability to tackle new tasks is an extremely difficult problem, generalizing is something mammalian brains seem to be able to do easily. For the active categorical perception task, this generalization to a new task (also called out-domain generalization) can be examined by testing agents on blocks of sizes that have not been seen during evolution. For example, agents with a gap size of 2 between the sensors that were evolved to catch size 2 and avoid size 4 are then tested to see how well they can catch size 1 and avoid size 5. I evolved the agents on the following small/large block pairings: 2/4, 1/5, 3/5, and 2/6. The first two and the last two sets of sizes were reversed for testing (i.e., agents evolved on 2/4 were tested on 1/5, and agents evolved on 2/6 were tested on 3/5). I tested all block size combinations with both the gap width sizes, 2 and 3. For a complete enumeration of the tested conditions see Table 2.1. 2.4 Results I first test that using a GA augmented by rewarding R increases adaptation as shown in [Schossau et al., 2015].
I use this repetition to also test that our modified definition of the environment and the different implementation of measuring R does not adversely affect the positive effect R has on adaptation. I find that for agents evolved to catch blocks of size 2 and avoid blocks of size 4 with a gap width of 2, as expected, using R (see Equation 2.3) allows the genetic algorithm to evolve faster and to produce better solutions on average (see Figure 2.4) as opposed to not using R (Equation 2.2). Figure 2.4: Average performance (W̄) over generations for agents evolved without using R (dotted line) and with augmenting the GA (solid line). Averages are computed for agents on the line of descent over 800 replicates. The standard error of the means is shown in grey, although due to the high replication count the standard error is negligible. The task performances of agents evolved using a GA augmented with R and a standard GA are significantly different when tested using a Wilcoxon Rank-Sum test (p = 1.88e−67). Here I evolved agents to catch blocks of size 2 and avoid blocks of size 4, with a gap width between the sensors of size 2. With the baseline increase in performance established in Markov Brains with Deterministic Logic Gates, I then confirmed that the average performance at the end of evolution was higher in Markov Brains evolved with R-augmentation than those evolved without, for a variety of gate types. This result is shown for Probabilistic Logic Gates (Figure 2.5), GP Gates (Figure 2.6), ANN Gates (Figure 2.7), and NEAT Gates (Figure 2.8). For all gate types but the Probabilistic Logic Gates, the average performance is greater for Markov Brains evolved under R-augmentation than those without for at least 80% of the 10,000 generations. Figure 2.5: Task performance in Markov Brains with probabilistic logic gates produced by augmented and unaugmented GAs. The blue line depicts results from the lines of descent of GAs using R-augmentation, while the orange line comes from unaugmented GAs. Each line is the performance averaged over 500 replicate experiments. The shaded area around each line is the standard error. I also confirmed that for all other agent configurations and objectives such as different sized blocks and different gap sizes (see Table 2.1), I observe the same phenomenon: using R to augment the GA improved performance across all tested conditions (see Figure 2.9). The difference in performance of agents evolved with a GA augmented with R and those evolved without augmentation is statistically significant (Mann-Whitney U, p < 0.01 for all conditions), and the mean performance of those evolved in a GA augmented with R was always higher than those evolved in the control conditions. Different environmental conditions may affect the performance of the GA, as well as the magnitude of impact that augmentation with R had on the GA. I test if augmenting the GA with R in any way compromises the ability of the evolved agents to generalize. To demonstrate the agent's ability to generalize, agents evolved in one environment are tested on a new task where blocks had different sizes, but with a constant gap size so that the agent's internal understanding of the gap width between sensors was not altered. I find no significant loss in the ability to generalize across different experimental conditions (see Figure 2.10).
I find no significant 34 55 50 45 W 40 35 0 2000 4000 6000 8000 10000 Generation Figure 2.6: Task performance in Markov Brains with GP gates produced by augmented and unaug- mented GAs. The blue line depicts results from the lines of descent of GAs using R-augmentation, while the orange line comes from unagumented GAs. Each line is the performance averaged over 500 replicate experiments. The shaded area around each line is the standard error. 55 50 W 45 40 35 0 2000 4000 6000 8000 10000 Generation Figure 2.7: Task performance in Markov Brains with ANN gates produced by augmented and unaug- mented GAs. The blue line depicts results from the lines of descent of GAs using R-augmentation, while the orange line comes from unagumented GAs. Each line is the performance averaged over 500 replicate experiments. The shaded area around each line is the standard error. 35 44 42 40 38 W 36 34 32 30 28 0 2000 4000 6000 8000 10000 Generation Figure 2.8: Task performance in Markov Brains with NEAT gates produced by augmented and unaugmented GAs. The blue line depicts results from the lines of descent of GAs using R- augmentation, while the orange line comes from unagumented GAs. Each line is the performance averaged over 500 replicate experiments. The shaded area around each line is the standard error. 2v4,2 2v4,3 1v5,2 1v5,3 3v5,2 3v5,3 2v6,2 2v6,3 Figure 2.9: Performance of agents evolved under different evolutionary conditions after 40,000 generations. Each column shows the distribution of fitness as a box plot indicating the mean, the 25th quartile, the 75th quartiles, and outliers as ‘+’. R indicates an evolutionary experiment where the GA was augmented by using R; C labels the control. The box plots show the results for 8 different experimental conditions using combinations of different block sizes (2v4, 1v5, 3v5, 4v6) and gap widths (2,3). 36 loss in the ability to generalize across different experimental conditions (see Figure 2.10). * * * evolved: 2v4 2v4 1v5 1v5 3v5 3v5 2v6 2v6 gap size: 2 3 2 3 2 3 2 3 tested: 1v5 1v5 2v4 2v4 2v6 2v6 3v5 3v5 Figure 2.10: Performance of evolved agents after 40,000 generations tested on different conditions than they were evolved in. Each column shows the distribution of performance as a box plot indicating the mean, the 25th quartile, the 75th quartiles, and outliers as ‘+’. R indicates an experiment where the GA was augmented by using R, C labels the control. As defined in Table 2.1 I used 8 different environments to evolve and consequently test them. Significant differences (Mann-Whitney U, p < 0.01) are marked with *. In each of the significant cases the mean tested performance was greater when evolved with R-augmentation than in the control (unaugmented) condition. Whenever I observe a significant difference in the agents ability to generalize, agents evolved in a GA that are augmented with R have a mean performance greater than those that were not evolved with R. Now that I have shown that using R is not detrimental but might even be slightly beneficial to the agents ability to generalize, I ask how R affects robustness. To that end, I test agents evolved in a GA with and without augmentation using R under different levels of noise. Specifically, each sensor had a chance to show a random 0 or 1 instead of the original signal, and I modify the likelihood for the bit to flip over the range from 0.0 to 0.2 in intervals of 0.01. 
I find that the R-augmented and unaugmented GAs produce networks with no consistent difference in performance for each of the tested conditions (see Figure 2.11). This result, where robustness is the same regardless of augmented or unaugmented GA, shows that augmentation by using R does not impede the GA's ability to select for agents that are robust to sensor noise under these conditions. Figure 2.11: Mean performance of agents evolved with (solid line) and without (dashed line) augmenting the GA with R. The performance (W̄) is from agents from 800 replicate experiments after 40,000 generations of evolution tested over different levels of sensor noise. Error bars indicate the standard error. The results were not significantly different when tested with a Wilcoxon Rank-Sum test. These agents were evolved to catch blocks of size 2 and avoid blocks of size 4 and had a gap width of 2. This experiment presents agents with sensor noise for the first time in the test environment; however, the agents never experienced noise during evolution. I now repeat the evolutionary experiment but also apply the different levels of noise during evolution. I find that agents evolved under noise while their GA was augmented with R perform significantly better than those evolved without the augmentation by R (see Figure 2.12) in non-noisy environments. These results from evolution under noise show an interesting feature of augmenting a GA with R. In the presence of noise, agents typically face a more difficult environment to master than without noise. At the same time, noise reduces the accuracy of fitness as a measurement of capability because noise affects the agent's ability to predict the true state of the environment. The noise should, at least in theory, also affect the calculation of R since it takes the sensor states into account as well. Surprisingly, the presence of noise during evolution does not hinder the improvements to task performance provided by using R-augmentation. Figure 2.12: Mean performance of agents evolved under different levels of noise (x-axis) with (solid line) and without (dashed line) augmenting the GA with R. The performance (W̄) is from agents from 350 replicate experiments for each level of noise, after 40,000 generations of evolution. Error bars indicate the standard error. The results were significantly different when tested with a Wilcoxon Rank-Sum test (p = 5.95e−5). These agents were evolved to catch blocks of size 2 and avoid blocks of size 4 and had a gap width of 2. 2.5 Discussion Augmenting a genetic algorithm by using the information-theoretic measure R accelerates evolutionary adaptation. It was previously not known whether this result would be adaptable to other computational components. In this chapter I show that R-augmentation also works with a variety of different computational gates in the Markov Brain system. While it was previously unclear if this method incurs trade-offs, such as causing the evolved solutions to be less robust or to have a diminished ability to generalize to a new task, I show that comparing the robustness of solutions evolved with and without the augmentation of R reveals no significant trade-off. However, when also evolving under the influence of sensor noise, I find that augmenting the GA with R results in more capable solutions than those evolved by an unaugmented GA.
While the direct mechanism for this improvement in performance regardless of sensor noise is not clear, I hypothesize that any negative effect of noise on the fitness measurement is compensated by the R term in the augmented fitness function (see Equation 2.3). In the case where R-augmentation is used, the GA is able to take a second measure (R) into account and is no longer dependent only on the noisy task performance measurement. As an alternative explanation of the increased performance under varying levels of sensor noise, I hypothesize that using R to augment the fitness of the GA results in the selection of agents that experience more unique conditions of the task (i.e., explore the environment) over those that do not explore the environment to the same extent but have the same fitness. Specifically, for an agent to refine its internal representation (have a higher value of R), the agent needs to explore more of the environment. For an agent to improve R, that agent may need to take into account additional observations that it might have otherwise discarded. Of course, it is precisely these additional observations that might mitigate the effect of sensor noise. Consequently, R does not improve task performance directly, but indirectly encourages the GA to select agents that are more robust to noise, which ultimately produces agents that perform better.

I also found that the agents' ability to generalize to a new task is mostly unaffected, or even improved, by augmenting the GA with R. In 3 out of the 8 generalization experiments, R-augmented GAs produced networks with significantly higher performance on the new version of the task, although the effect size is not large. In the other 5, the performance of the networks produced by R-augmented GAs is not significantly different from the performance of networks produced by unaugmented GAs. In other words, the networks produced by R-augmented GAs generalize at least as well as or better than networks produced by unaugmented GAs.

In all experiments I evolved Markov Brains to solve an active categorical perception task, and used varying types of computational components. Representations are arguably an important aspect of general purpose intelligence, but these experiments do not allow us to extrapolate the effect of R-augmentation to other cognitive abilities or other structurally different cognitive systems such as RNNs. However, given that R-augmentation is effective when applied to architectures with a variety of computational components, I can be confident that this method can be effective when applied to artificial cognitive systems that vary in structure. The computation of R depends on the specification of variables identifying world state that a well-performing agent should know about. If the experimenter chooses these variables poorly, the resulting R could be irrelevant to the task. As an example, the blocks in our experiment had two distinct colors, but since the objective remains to catch small and avoid large blocks regardless of color, selecting for knowledge about the color would probably not improve adaptation. In principle, the relative value or cost of such counterfactual representations can and should be tested.

2.6 Conclusions

The work described here reaffirms the benefits of R as a neuro-correlate capable of accelerating rates of adaptation, and prompts several new avenues of research to follow. I propose that future work adapting R-augmentation to machine learning is a worthwhile endeavor.
Machine or deep learning systems seem to struggle with creating internal models and representations, and R-augmentation could aid in the development of representations for those systems. Furthermore, the use of R does not compromise the ability of an evolved agent to generalize or to be robust against sensor noise, which are both important properties for AI systems. Unfortunately, performance optimization by a GA is very different from gradient descent or other optimization methods in deep learning. As such, R cannot simply be applied in the context of machine learning, even though these systems appear similar at first glance. Further research will be necessary to adapt R-augmentation to machine learning or deep learning.

Perhaps the most interesting lesson learned is that R-augmentation affects how brains evolve to deal with noisy sensor data. While using R made no difference to sensor noise robustness when agents were evolved without noise, evolution under noise resulted in significantly better adaptation when the GA was augmented with R. This implies two things: evolving representations under noise might be a harder challenge than performing in noisy situations after normal (i.e., noise-free) evolution, and the augmentation is therefore advantageous as it increases the ability of the agent to adapt to the noisy situation. Augmenting the GA with R evolves better performing agents under noise than the unaugmented GA, which implies that the R-augmentation process makes up for a lack of information presented to the agent. Additionally, because the noise in the sensors will affect the calculation of R, the result of R-augmentation performing better under noise implies that the measurement of R is robust to incomplete information. This implication opens up an interesting line of thinking: how can I use this robustness to incomplete information to increase the performance of the R-augmentation process? I propose that instead of having noise cause the incomplete information, using a reduction of the cases presented to the agent would have a similar effect. This evaluation on only a subset of cases would result in a less optimal assessment of fitness, which could be counteracted by using R as a co-evolutionary factor at the same time. Obviously, using R as the means for augmentation imposes an extra computational cost, but that would now be offset by using fewer test cases. Using R to offset a reduction in test cases is an interesting extension of this work that I seek to investigate further.

Chapter 3
The Evolution of Representations in Genetic Programming Trees

This chapter was originally published as "Kirkpatrick, D., & Hintze, A. (2020). The Evolution of Representations in Genetic Programming Trees. In Genetic Programming Theory and Practice XVII (pp. 121-143). Springer, Cham." Used with permission. Minor changes have been made to fit the tone and style of this dissertation.

3.1 Abstract

Artificially intelligent machines have to explore their environment, store information about it, and use this information to improve future decision making. As such, the quest is to either provide these systems with internal models about their environment or to imbue machines with the ability to create their own models - ideally the latter. These models are mental representations of the environment, and research has previously shown that neuro-evolution is a powerful method to create artificially intelligent machines (also referred to as agents) that can form said representations.
Furthermore, previous work has shown that one can quantify representations and use that quantity to augment the performance of a genetic algorithm. Instead of just optimizing for performance, one can also positively select for agents that have better representations. The neuro-evolutionary approach, which improves performance and lets these agents develop representations, works well for Markov Brains, which are a form of Cartesian Genetic Programming network. Conventional artificial neural networks and their recurrent counterparts, RNNs and LSTMs, are however primarily trained by backpropagation and not evolved, and they behave differently with respect to their ability to form representations. When evolved, RNNs and LSTMs do not form sparse and distinct representations; they "smear" the information about individual concepts of the environment over all nodes in the system. This ultimately makes these systems more brittle and less capable. The question I seek to address now is: how can we create systems that evolve to have meaningful representations while preventing them from smearing these representations? I look at genetic programming trees as an interesting computational paradigm, as they can take in a lot of information through their various leaves, but at the same time condense that computation into a single node in the end. I hypothesize that this computational condensation could also prevent the smearing of information. Here, I explore how these tree structures evolve and form representations, and I test to what degree these systems either "smear" or condense information.

3.2 Introduction

One of the most promising approaches in the quest for general purpose artificial intelligence (AI) is neuro-evolution [Yao, 1999, Floreano et al., 2008, Stanley et al., 2019]. The idea is to use a genetic algorithm to optimize the AI that controls an embodied agent to solve one or many cognitive tasks. Many different AI systems have been proposed for this task, such as Artificial Neural Networks (ANNs) and their recurrent versions (RNNs) [Russell et al., 2003], Genetic Programs (GPs) [Koza, 1994], Cartesian Genetic Programming (CGP) [Miller, 2011], Neuro Evolution of Augmenting Topologies (NEAT) [Stanley and Miikkulainen, 2002], and Markov Brains (MBs) [Hintze et al., 2017], among many others. One of the most pressing problems is that AIs have difficulty generating internal models of the environment they act in. Humans seem to be particularly good at creating these models for themselves, but we struggle greatly when it comes to imbuing artificial systems with such models, also called representations1. It has even been argued that one should not try to do so and instead create "intelligence without representation" [Brooks, 1991]. However, previous work found that neuro-evolution is very capable of solving this problem of creating structures to store representations.

To prove that a system indeed has representations, previous work introduced an information-theoretic measure R that quantifies exactly how much information the internal states of a system store about its environment independent of the information coming directly from its sensors [Marstaller et al., 2013]. Over the course of evolution, Markov Brains stored more information about the environment necessary to solve complex tasks. They did so by associating specific hidden nodes with different environmental concepts.
These sparse and distributed representations are believed to be similar to how humans, for example, store information: not all neurons fire, but certain areas perform computations based on specific firing patterns of other neurons. I also found that RNNs, when evolved to perform a task that requires memory, form less sparse representations and smear the information across all their hidden states, in contrast to MBs. However, it remains to be shown whether genetic programming (GP) or systems similar to GP (such as CGPs) also evolve to solve tasks that require representations, to what extent they evolve the ability to form representations, and to what degree they smear their representations.

While it is academically interesting to find out how much a system knows about its environment, quantifying representations also has an application for genetic algorithms. Similar to multiple objective optimization [Zhou et al., 2011], which improves at least two different qualities of a system at the same time, one can use not only the performance of an agent but also the degree to which the system has developed representations to define the mean number of offspring in a genetic algorithm (GA), thereby augmenting the GA [Schossau et al., 2015]. While it has been shown how this helps to optimize Markov Brains and RNNs during evolution, again it is not known if the same is true for genetic programming systems. Lastly, representations can be measured not only for the whole system but also for individual sub-components. By testing all sub-components versus all concepts in the environment, I can pinpoint where these representations are exactly stored in the system. This results in a matrix defining the amount of information each element has about each concept. I found earlier [Hintze et al., 2018] that substrates differ in how distributed or condensed these representations are. Markov Brains create sparsely distributed representations, while RNNs smear the knowledge about the environment across all nodes. In addition, I found for Markov Brains, RNNs, and LSTMs [Hochreiter and Schmidhuber, 1997] that smearedness and robustness correlate negatively.

Perhaps the difference in representation structure between Markov Brains and RNNs is not surprising when one considers the topological features of these systems. RNNs receive their inputs and sequentially propagate this information through one layer after the other by connecting each node from one layer to all nodes of the next. While this obviously ensures that the information can reach all components, it presumably also prevents representations from becoming local and sparse. Markov Brains, on the other hand, start with sparse connections and only retain additional ones if there is an evolutionary benefit at the time of addition. This results in sparsely interconnected graphs, which might be the reason why these kinds of networks evolve to have sparse and condensed representations. Genetic programming trees present a great tool to evolve and fit almost arbitrary mathematical functions to data.

1 Observe that the term representations in computer science sometimes also refers to the structure of data, or how an algorithm, for example, is encoded. I mean neither, but instead use the term representation to refer to the information a cognitive system has about its environment, as defined in Marstaller et al. 2013 [Marstaller et al., 2013]. The term representation, as I use it, is adapted from the fields of psychology and philosophy.
They can take a lot of different inputs, perform computations on them, and, due to their tree structure, condense the result of all those computations into a single result at the root of the tree. It stands to reason that this topology provides an advantage for the question of condensing representations as well. These trees can be as wide as the input layers of an RNN, but do not have the property of further distributing the information at every step (see Figure 3.1 for a comparison of topologies and representational smearedness across different kinds of computational substrates). Genetic programming trees, however, have the disadvantage that they do not have an intuitive way to implement recurrence. One way of dealing with the problem of forming memory is to add indexed memory and additional computational nodes into the tree that can read from and write to said memory [Teller, 1994]. Similarly, the nodes that perform the read and write operations, or push and pop in the case of a memory stack, can themselves be evolved by genetic programming trees. These trees implement commands that can then be included into linear forms of genetic programming [Langdon, 1995]. Finally, the genetically evolvable push programming language [Spector and Robinson, 2002] provides a more complex solution for memory stored in stacks, specifically for linear genetic programs. While these methods all solve the problem of adding memory to the system, they essentially deviate from the idea of using the property of the tree to potentially condense information. Therefore, I test two alternative ways to create recurrence in genetic programming trees. I test whether, as a consequence, the tree topology provides any advantage with respect to the ability to evolve sparsely condensed or smeared representations.

Figure 3.1: Three different cognitive architectures - RNN, Markov Brain, and GP tree - and how they distribute or condense representations. RNNs have a joint input and recurrence layer, feed into arbitrarily many hidden layers of arbitrary size, and compute outputs which are also recurred back to the input layer. These networks tend to smear representations about different contexts over all recurrent nodes. The red, blue, and green color bars illustrate that. Markov Brains use inputs and hidden states as inputs to the logic gates or to other kinds of computational units. The result of the computations creates recurrent states and outputs. These networks tend to form sparse, condensed representations, illustrated by narrower red, green, and blue bands. GP trees use inputs and potentially recurring information and combine them over several computational nodes into a result. For these structures, it is not clear to which degree they smear or condense representations. It is also not obvious how to recur them, illustrated by the "?".

3.3 Material and Methods

3.3.1 Representations and the neuro-correlate R

The neuro-correlate R seeks to quantify the amount of information an agent has about its environment that is not currently provided by its sensors. The information-theoretic Venn diagram (see Figure 3.2) illustrates the relation between environment or world states (W), memory or brain states (B), and sensor states (S) and defines R to be quantified as:

R = H(W : B|S) = H(W : B) − I(W : B : S)   (3.1)
Figure 3.2: Venn diagram of entropies and information for the three random variables W, S, and B, describing the world, sensor, and agent internal (brain) states. The representation R = H(W : B|S) is shaded.

In order to actually measure R, one needs to record the states of all three random variables. In the case of Markov Brains, this means the brain and sensor states are just the states of the hidden and input nodes. The world states, however, are not that obvious. In fact, they need to be defined and well chosen by the experimenter and depend highly on the environment. Here I use a block catching task (active categorical perception, see 3.3.3) and a number comparison task (number discrimination, see 3.3.4), and thus use the relevant concepts of these environments to define the world states. For the block catching task, I chose the size of the block, the direction of the block falling, and whether the block is to the left or right of the agent. This is different from Marstaller [Marstaller et al., 2013], where they also measured whether the block is currently above or next to the agent. It was observed that agents typically do not evolve representations about hitting or missing the block, and thus this state was omitted. For the number comparison task, I chose the actual numbers to be the world states. As such, I ask to what degree the agent evolves representations about each number explicitly. In addition, I also defined a state to represent whether the first or second number was larger, to see if the agent is able to represent the difference between the two numbers.

With regards to the actual computation of R, a reduction was found that decreases the number of steps needed to compute R. The original equation 3.1 expands in practice to

R = H(W) + H(B) + H(S) − H(W, B, S) − I(S : W) − I(S : B) ,   (3.2)

but I found a reduction to only four entropy calculations:

R = H(S, W) + H(S, B) − H(S) − H(W, B, S) .   (3.3)

The proof for this reduction is as follows:

Proof. I start with the original formula for R from [Marstaller et al., 2013]:
R = HCorr − I(S : W) − I(S : B),
where HCorr is defined as
HCorr = H(W) + H(B) + H(S) − H(W, B, S).
This expands to
R = H(W) + H(B) + H(S) − H(W, B, S) − I(S : W) − I(S : B).
I then use the formula for shared (mutual) information from [Marstaller et al., 2013],
I(X : Y) = H(X) + H(Y) − H(X, Y),
to expand
I(S : W) = H(S) + H(W) − H(S, W)
and
I(S : B) = H(S) + H(B) − H(S, B).
Substituting these leads to
R = H(W) + H(B) + H(S) − H(W, B, S) − (H(S) + H(W) − H(S, W)) − (H(S) + H(B) − H(S, B)).
By distributing the negative signs I get
R = H(W) + H(B) + H(S) − H(W, B, S) − H(S) − H(W) + H(S, W) − H(S) − H(B) + H(S, B).
The positive and negative entropies cancel, leaving
R = −H(S) − H(W, B, S) + H(S, W) + H(S, B).
A slight reordering gives the final equation,
R = H(S, B) + H(S, W) − H(S) − H(W, B, S).

As the reduced equation (3.3) requires fewer mathematical operations than the original equation (3.2), there is less chance for computational error and round-off in the reduced equation, which makes it more accurate in practice than the original. In addition, the reduced equation is more computationally efficient to calculate.
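To make the reduced equation concrete, the following sketch estimates R from recorded world, sensor, and brain state sequences using exactly the four entropy terms of equation 3.3. It is an illustrative re-implementation rather than the original analysis code, and it assumes every state has already been discretized into a hashable label such as a tuple of bits.

from collections import Counter
from math import log2

def entropy(samples):
    """Shannon entropy (in bits) of a sequence of discrete observations."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def representation_R(world, sensors, brain):
    """Estimate R = H(S,W) + H(S,B) - H(S) - H(W,B,S) from state recordings.

    `world`, `sensors`, and `brain` are equal-length sequences of discrete
    states recorded at every time step of an agent's lifetime.
    """
    SW = list(zip(sensors, world))
    SB = list(zip(sensors, brain))
    WBS = list(zip(world, brain, sensors))
    return entropy(SW) + entropy(SB) - entropy(sensors) - entropy(WBS)

# Toy example over 8 time steps: the brain mirrors the world state while the
# sensors are uninformative about it, so R comes out to one full bit.
w = [(0,), (0,), (1,), (1,), (0,), (1,), (1,), (0,)]
s = [(0,), (1,), (0,), (1,), (0,), (0,), (1,), (1,)]
b = [(0,), (0,), (1,), (1,), (0,), (1,), (1,), (0,)]
print(representation_R(w, s, b))  # 1.0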
3.3.2 Smearedness of Representations

While the neuro-correlate R calculates the amount of mental representations as a whole, it does little to examine the structure or layout of said representations. Recent work [Hintze et al., 2018] proposes a new measure, referred to as smearedness, to work around this deficiency. Smearedness looks at the amount of representation that each memory state in the brain has about each individual concept in the environment, and takes a pairwise minimum to see how much the representations are spread across the different nodes. The equation for smearedness is given in equation 3.4, where M is the measure of the representation for a specific node and concept, taken over all nodes i and all combinations of concepts j and k:

S_N = Σ_i Σ_{j>k} min(M_j,i , M_k,i)   (3.4)
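A minimal sketch of how S_N could be computed from such a matrix; the layout of M (concepts as rows, hidden nodes as columns) and the function name are assumptions made for illustration, not the original analysis code.

def smearedness(M):
    """Compute S_N = sum over nodes i and concept pairs j > k of min(M[j][i], M[k][i]).

    M[j][i] is the amount of representation that hidden node i has about world
    concept j. High values indicate that several concepts are represented in the
    same nodes, i.e., representations are smeared rather than condensed.
    """
    n_concepts = len(M)
    n_nodes = len(M[0]) if M else 0
    total = 0.0
    for i in range(n_nodes):
        for j in range(n_concepts):
            for k in range(j):  # every pair of concepts with j > k
                total += min(M[j][i], M[k][i])
    return total

# Two concepts stored in disjoint nodes (condensed) versus shared nodes (smeared).
condensed = [[1.0, 0.0], [0.0, 1.0]]
smeared = [[0.5, 0.5], [0.5, 0.5]]
print(smearedness(condensed), smearedness(smeared))  # 0.0 1.0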
3.3.3 Active Categorical Perception Task

In order to augment the performance of a genetic algorithm using the neuro-correlate R, the environmental states need to be described well. Since the neuro-correlate R has already been shown to correlate positively with performance on the active categorical perception task [Marstaller et al., 2013], and been shown to boost performance in learning the task when the GA was augmented with R [Schossau et al., 2015], the same task was used here as well to allow for a direct comparison. Augmentation is done by multiplying the attained score for the task by the amount of representations the agent has about the environment (see section 3.3.11).

In this task [Beer, 1996, Beer, 2003] an agent that can move left and right has to either catch or avoid a block that is falling towards it. The agent has 4 sensors, split into two groups of 2 sensors, separated by a gap of 2 or 3 units. Blocks of various size are dropped one at a time, with the agent having 34 time steps to catch or avoid the block. The agents in the original experiments were evolved to catch blocks of width 2 and avoid blocks of width 4. I extended the task to try different combinations of block sizes. The blocks move 1 unit to the right or left on any given time step. On each time step, the agent receives the sensory input and can move 1 unit to the right or to the left. The agent is expected to determine the size of the block and its direction of movement, so that it can decide whether to avoid or catch the specific block being dropped. The agents were tested over both block types and all possible permutations of starting positions relative to the sensors and block movement patterns. Due to the configuration of the sensors and the relative location of the agent to the block, agents can only very rarely decide if a block is large or small without movement and thus need to navigate to be directly under the block - hence the "active" in active categorical perception. This can only be done if the block was observed before for at least two updates in a row, since otherwise the direction of the fall could not be determined. All this information needs to be integrated into a decision about catching or avoiding the block. It is this integration of sensor information over multiple time steps that requires agents to first evolve memory and with it representations in order to make proper decisions.

3.3.4 Number Discrimination Task

In addition to the active categorical perception task, recent work [Kirkpatrick and Hintze, 2019] has identified a different task in which I can also investigate the role of representations and smearedness in the neuro-evolutionary process. This task, referred to as the Number Discrimination Task, presents agents with two values in sequence and requires the agent to identify the larger one. This task is inspired by previous biological [Nieder, 2018, Merritt et al., 2009] and psychological [Merritt and Brannon, 2013] studies. When adapted for use in silico, the agents are presented with every possible pair of numbers between 0 and 5, with the values being shown as bit strings where the number of ones is the value the agent must comprehend (e.g., 00000 is 0, 01000 is 1, 11000 is 2, etc.). Agents are additionally presented with all possible permutations of each pair (e.g., 10001 is equivalent to 01100 and both are used to represent 2). This task requires agents to store the first number and then perform a comparison to the second. The agents are first shown one value, then they are given 3 updates to process the information, followed by a second value and an additional 3 updates to process the information, before being required to make a determination of which number is larger. The world states used in calculating R are defined by the individual numbers used, and the information of whether the first or second value presented is larger.
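To make the stimulus encoding concrete, the sketch below enumerates the bit-string presentations of each value and the ordered pairs an agent would be judged on. It reconstructs the described setup under the assumption that only pairs of distinct values are scored; it is not the original task implementation.

from itertools import combinations

def encodings(value, width=5):
    """All distinct bit strings of length `width` that contain `value` ones."""
    patterns = set()
    for positions in combinations(range(width), value):
        bits = [0] * width
        for pos in positions:
            bits[pos] = 1
        patterns.add(tuple(bits))
    return sorted(patterns)

def number_pairs(max_value=5):
    """Yield (first, second, first_is_larger) for every ordered pair of distinct values."""
    for a in range(max_value + 1):
        for b in range(max_value + 1):
            if a != b:
                yield a, b, a > b

print(len(encodings(2)))          # 10 distinct presentations of the value 2
print(len(list(number_pairs())))  # 30 ordered pairs of distinct values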
3.3.5 The perception-action loop for stateful machines

Genetic programming has been used before to control robots in virtual [Reynolds, 1993, Reynolds, 1994, Handley, 1994] as well as in actual environments [Nordin and Banzhaf, 1995a, Nordin and Banzhaf, 1995b, Nordin and Banzhaf, 1997]. In order to create a meaningful controller, the evolved machine or program needs to map the sensor inputs it receives to motor outputs which control the actions of the robot (see Figure 3.3). A purely reactive machine would have a direct mapping of inputs to outputs, while a stateful machine would use hidden states (memory) to store information from the sensors to use for later decisions. Previous approaches used genetic programming methods to evolve computer-like code running on a register machine [Nordin, 1994]. Some other approaches encoded the program controlling the robot as a tree of binary operations executed on only the sensor inputs [Koza and Rice, 1992]. While it is these kinds of tree-like encodings that I seek to investigate here, this specific approach did not use hidden states and thus created purely reactive machines. Adding memory has been done differently before [Teller, 1994, Langdon, 1995, Spector and Robinson, 2002]. However, I think that the tree structure is of particular importance. This structure allows the information from all sensor and hidden states to be funneled together into a coherent representation, like all the leaves of a tree feed into the root (see Figure 3.1 for an illustration). Although intuition may indicate that this structure is correct, I must empirically test the viability of these structures.

Figure 3.3: Illustration of the perception-action loop. The world is in a particular state and can be perceived by the agent. This perception defines specific input states. These states now allow the brain (computational substrate) to perform a computation, which leads to new output states. These states, in turn, cause the agent to act in the environment and thus can change the world state. In case the agent needs to remember information about the past, it has to have the ability to store these memories somewhere. Thus, any machine that needs to be stateful has to have internal states (here hidden states) that can be set depending on the current input and the current hidden states. This scheme of a brain resembles the structure of a recurrent neural network, but can be generally applied to all sorts of computational machines that have to be stateful and act in an environment.

This does not imply that other structures, for example linear programming structures or other memory models, cannot perform such functions. As such, the task here is to use genetic programming to encode tree-like computational structures that can take the inputs of the machine and the hidden states into account to create new outputs while also setting hidden states in the process. A typical recurrent neural network [Russell et al., 2003] does this by extending the input and output layer of an artificial neural network and then recurring the additionally computed output states back to the inputs (see Figure 3.1 RNN). This is exactly the way Cartesian genetic programming [Miller, 2011] would be used to create systems that can take advantage of hidden states. Similarly, Markov Brains [Hintze et al., 2017] use the same concept of recurrence to store hidden states (see Figure 3.1 Markov Brain). In fact, when Markov Brains use the same nodes that CGPs use [Hintze et al., 2019], they become very similar to each other. This kind of recurrence is very capable of creating systems that are stateful and have representations that are condensed, but the resulting structures are not necessarily tree-like and are instead more arbitrarily connected networks. Beyond using Markov Brains equipped with the computational nodes found in CGPs, I use two other methods to create genetic programming trees that are stateful. GP-Forests use one tree per hidden and output state and can use default values as well as inputs and hidden states from the last update2. GP-Vector trees are generic trees, but instead of performing computations on single continuous variables, they execute vector operations. The vector supplied contains the hidden states of the last update as well as the current inputs, and part of the vector defines the output computed. I assume that this obvious extension of normal variables to vectors has certainly been done previously, but I cannot find an explicit reference for this approach. Generally speaking, here I introduce one form of recurrence into genetic programming tree structures. The programming language push [Spector and Robinson, 2002] should in principle allow for the same structures to emerge. However, these structures would need to arise by chance, in comparison to the models that I introduce here, which achieve the desired structural goals by their definition.

2 Inspired by the multiple trees used to encode memory in Langdon 1995.

3.3.6 Markov GP brains using CGP nodes

Markov Brains have been used extensively on similar tasks and also to introduce and confirm the definitions of representations used here [Marstaller et al., 2013]. In a nutshell, a genome is used to encode the computational units of a Markov Brain. The genome defines the function and connectivity of these units, which can connect to input, output, and hidden states. The connections allow the computational units to read from these states and compute new outputs and new hidden states. As such, a Markov Brain defines the update function of a state vector from time point t to t + 1. Some values of this vector are inputs, some outputs, the rest hidden states. This is very similar to how an RNN is constructed. The number of hidden states in a Markov Brain or RNN can be chosen by the experimenter arbitrarily.
Generally speaking, too many states slows evolution down, while too few states makes evolving the task impossible. The number of hidden states used here (8) was found to be relatively optimal for the task at hand [Marstaller et al., 2013]. The computational units a Markov Brain uses were originally probabilistic and deterministic logic gates. As such, a Markov Brain defines a hidden Markov Model conditional on an input sequence [Bengio and Frasconi, 1995] - hence the name Markov Brain. For a more detailed description of Markov Brains see Hintze et al. 2017 [Hintze et al., 2017]. In this work, I use Markov Brains with CGP gates to approximate a CGP network. This creates a dynamic, GP-like structure that can be contrasted with the more rigid tree structures.

3.3.7 Genetic Encoding of GP Brains in a tree-like fashion

Genetic programming trees and also Cartesian Genetic Programming define the computations they perform by identifying computational nodes, their connections, and how they receive inputs. When evaluated, they take those inputs to perform all computations defined by their structure and return a solution. When mutated, the nodes and inputs can change, and rewiring between the components can occur. In addition, when genetic programming trees are recombined, for example, sub-branches of the trees are identified and exchanged between the trees [Banzhaf et al., 1998]. Since mutations directly identify the components to be changed, this could be called a direct encoding. An alternative to this approach is the use of indirect encoding schemes, which can sometimes provide an advantage during evolutionary adaptation, such as the geometric encoding used by HyperNEAT [Clune et al., 2011]. Markov Brains, for example, use a type of indirect encoding where a genome specifies genes, and these genes then define the computational components and their connectivity. Mutations happen to the genome and not to the components themselves. The genetic programming substrates used here are defined in a similar genetic encoding scheme. I use the term genetic encoding to imply that it is not entirely direct nor as indirect as other systems might be.

Here different kinds of computational structures (GP-Forest and GP-Vector Brains) are defined by such a genetic encoding scheme. Starting from a root node, new nodes have to be sequentially added. Therefore, a linear genome (a vector of bytes) is scanned for start codons (a set of two numbers) which define the start of a gene. The numbers following such a start codon are then used to define the mathematical operation of a node, and how this node is inserted into the genetic programming tree (see Figure 3.4). Observe that the order of genes in the genome matters: genes found earlier encode nodes that will be inserted into a tree earlier. Mutations affect the genome and can not only shuffle genetic material around but also insert and delete whole parts. As such, genes can disappear or be created de novo. This means that at the time when a node is created by reading the gene defining it, it is not known whether other nodes will be appended to it later or not. Since each node in these implementations has two possible inputs, each gene encodes default values or addresses for hidden states or input values. If another node is added later, it replaces the default value with its own output. Since nodes are sequentially added, they always get appended to an already existing node.
To know where, each gene also encodes a continuous number in [0.0, 1.0] that specifies where in the tree the node has to be added (see Figure 3.5 for an illustration of that process).

Figure 3.4: Schematic overview of the genetic encoding for computational nodes. The genome is a sequence of numbers (bytes), and specific subsets of numbers (42, 213) define start codons. The sequence of numbers behind the start codon defines a gene. As such, a gene stores all information necessary to define all properties of a computational node as it is needed to function within a genetic programming tree. Specifically: the computation the node has to perform (orange), the default for its left input (green), the default for its right input (blue), and the address where it needs to be inserted (purple). Observe that the defaults can be specific values (as seen for example for the left input of the node), or the address of a hidden node (as seen for example in the right input of the node).

This scheme applies to both GP-Forest and GP-Vector. However, what each gene encodes differs with respect to the function of the node or the specific default values or addresses.

3.3.8 GP-Forest Brain

Any kind of computational system controlling an agent that needs to be stateful cannot just map outputs to inputs but instead has to rely on hidden states, which are also sometimes called recurrent states. In the case of an RNN, for example, extra outputs that the network produces are just mapped back as additional inputs. Similarly, Markov Brains use inputs, outputs, and hidden states upon which all computations are performed. While the inputs are generated by the environment, the Markov Brain generates the outputs and the new hidden states to complete a perception-action loop. In order to allow for the same computational complexity in a genetic programming tree, I need to somehow make it recurrent. Conventional trees compute a single output based on a selection of inputs. However, this would allow us to only create systems without hidden states. Here I use what I call a genetic programming forest (GP-Forest), where each output, as well as each hidden state, is computed by an individual genetic programming tree (hence the forest analogy). Each tree has access to all inputs and hidden states and computes new outputs and new hidden states at every update. The tree itself does not allow for cycles. However, after each update of the entire tree, the new hidden states as well as new inputs are again made available to the tree. This is what creates the recurrence, which happens at the time step boundaries. For a list of computations that each node can potentially compute, see Table 3.1. The number of hidden states plus the number of output nodes to be computed defines the number of trees that need to be specified. To allow for a proper comparison of GP-Forests with Markov Brains, the number of hidden states used was kept the same in all brains (Markov Brain, GP-Forest, and GP-Vector Brain, see below). The genetic encoding specifies for each newly added node which of the trees it should be appended to, as well as the location in the tree (see Figure 3.5 for how nodes are added, and Figure 3.6 for a structural overview of a GP tree brain).

Figure 3.5: Illustration of tree growth by adding nodes. 1) In the beginning the tree is defined as an empty root that defaults to an output (here 0.0). 2) Once the genome is parsed and a new gene that encodes a node has been found, the node replaces the root node. The left and right inputs now define two ranges [0.0, 0.5] and ]0.5, 1.0]. 3) Each new node to be added has a specific address (a value from [0.0, 1.0] as defined by the gene, see Figure 3.4, the purple component of the gene). This number defines which default input to replace. The added node itself subdivides the range it connected to, in order to allow new nodes to again be added. 4) The concept of adding nodes and subdividing the connection ranges continues until the tree is built. In this way, a new node will always have a place in the tree. However, since nodes are added sequentially, deletions of genes/nodes will affect the topology greatly, but will not make it impossible to add nodes.
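The following is a simplified sketch of the insertion-by-address scheme illustrated in Figure 3.5: each gene carries an operation, two default inputs, and an address in [0.0, 1.0], and the new node replaces whichever open default slot currently owns the range containing that address, subdividing the range for later insertions. The data layout and names are illustrative assumptions, not the MABE implementation.

class Node:
    def __init__(self, op, left_default, right_default):
        self.op = op
        self.children = [left_default, right_default]  # defaults or child Nodes

def build_tree(genes, root_default=0.0):
    """Grow a tree by inserting each gene (op, left, right, address) at the open
    slot whose address range contains the gene's address."""
    root = [root_default]
    # Each open slot: (container, index, low, high) covering part of [0.0, 1.0].
    open_slots = [(root, 0, 0.0, 1.0)]
    for op, left, right, address in genes:
        for slot_index, (container, pos, low, high) in enumerate(open_slots):
            if low <= address < high or (address == 1.0 and high == 1.0):
                node = Node(op, left, right)
                container[pos] = node
                mid = (low + high) / 2.0
                # The node's two inputs become new open slots, splitting the range.
                open_slots[slot_index:slot_index + 1] = [
                    (node.children, 0, low, mid),
                    (node.children, 1, mid, high),
                ]
                break
    return root[0]

# Addresses steer the second and third genes onto the first node's left and right inputs.
tree = build_tree([("ADD", 0.0, "h0", 0.3),
                   ("SUB", 1.0, 2.0, 0.1),
                   ("MUL", "in0", 0.5, 0.9)])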
Table 3.1: The possible mathematical operations of each node in a GP-Forest brain. Each node has two possible inputs, L and R, performs a computation on them, and returns the value of said computation. The last category, LOGIC, consists of 16 independent binary logic operations (e.g., AND, NAND, NOR, XOR, OR, etc.) performed on the L and R inputs.

Operation   Return value
NOP         0.0
LEFT        L
RIGHT       R
ADD         L + R
SUB         L - R
MUL         L * R
DIV         L / R, 0.0 if R == 0.0
EQU         1.0 if L == R, else 0.0
NEQU        1.0 if L != R, else 0.0
LOW         1.0 if L <= R, else 0.0
HIGH        1.0 if L >= R, else 0.0
LOW         1.0 if L < R, else 0.0
LOGIC       all 16 possible Boolean logic operations performed on L and R
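As an illustration of the node semantics in Table 3.1, a single GP-Forest node could be evaluated as sketched below. The operation names follow the table where possible, and the guarded division matches it; the thresholding convention used for the single Boolean LOGIC example (AND) is an assumption, since the table only states that 16 two-input logic operations exist.

def evaluate_node(op, L, R):
    """Evaluate one GP-Forest node on its left (L) and right (R) inputs."""
    operations = {
        "NOP": lambda: 0.0,
        "LEFT": lambda: L,
        "RIGHT": lambda: R,
        "ADD": lambda: L + R,
        "SUB": lambda: L - R,
        "MUL": lambda: L * R,
        "DIV": lambda: 0.0 if R == 0.0 else L / R,  # guarded division, as in Table 3.1
        "EQU": lambda: 1.0 if L == R else 0.0,
        "NEQU": lambda: 1.0 if L != R else 0.0,
        "LOW": lambda: 1.0 if L <= R else 0.0,
        "HIGH": lambda: 1.0 if L >= R else 0.0,
        # One of the 16 Boolean logic operations, treating an input as true when > 0.0.
        "AND": lambda: 1.0 if (L > 0.0 and R > 0.0) else 0.0,
    }
    return operations[op]()

print(evaluate_node("DIV", 1.0, 0.0))  # 0.0 rather than a division error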
3.3.9 GP-Vector Brain

While genetic programming forests deal with the hidden state problem by adding many trees, one for each new state to be computed, GP-Vector brains try to solve this problem differently. I assume that the input states, output states, and hidden states form a continuous vector. At each update of the brain, a new vector has to be computed. Therefore, I extend the genetic programming paradigm that conventionally allows nodes to be mathematical operators for single variables to now compute vectors. Each node now defines a vector operation, and the inputs of each node can be constant vectors or the current state vector. Regardless, such a tree of vector operations will result in a newly computed vector which itself contains the values of the current hidden state as well as the outputs of the agent's brain. Inputs from the environment are fed into this vector before each update of the brain.

Figure 3.6: Illustration of a GP-Forest. In this example, I assume the system to have two input nodes (red and orange), two output nodes (white), and two hidden nodes (green and blue). When decoding the genome, nodes can get added to hidden and output nodes sequentially. Each node can either use a default input (white) or use the value specified by an input or hidden node. For example, the tree added to output node 1 reads from the orange input node, and twice from the blue hidden node. When a new brain state has to be computed, all trees (outputs, hidden) are evaluated simultaneously, and consequently a new set of outputs and hidden states becomes defined, based on previous inputs and hidden states.

Each node is again encoded by a gene which provides the mathematical operation to be performed and where in the tree it gets added, and which defines whether the default input is the state vector or an individual constant vector. Remember that when a node is added, it replaces one of the defaults of the current node with its own output (see Figure 3.7 for a structural overview of a GP-Vector brain). Each node can now perform vector operations on its inputs and will return a vector as a result of that computation. However, such vector operations are often computationally expensive, and maybe not always necessary. Thus, nodes are also allowed to perform operations on the individual elements of the vectors. Each gate uses either the result of other nodes, a default vector of zeroes, or the state vector. These vectors are called L and R to capture the idea that each node in the tree can have two inputs, one from a left and one from a right branch. If an addition on these vectors is performed, the resulting output O is just L + R, which means that every element O_i is the result of L_i + R_i for all i. To allow individual elements to be manipulated by each node, all operations are also defined as "point" operations. An ADD_i,j would then compute an addition only on the i-th element of vector L and the j-th element of vector R. The result would be O_i = L_i + R_j, while all other elements of O would be those defined by L.

Figure 3.7: Illustration of a GP-Vector Brain. This brain is structurally identical to conventional genetic programming trees, except that it can execute computations on vectors instead of single values. The inputs to nodes can be the state vector, including all inputs (red and orange) and all hidden states (blue to green). Alternatively, nodes can use default vectors as inputs. These defaults are defined by the genes encoding each node.

Table 3.2: The allowed mathematical operations for the nodes of a GP-Vector brain. Observe that all operations are either performed on all elements of the L and R vectors, resulting in an output vector O, or on individual elements of L and R. Only the LOGIC operation is defined exclusively as a "point" operation and not on the entire vector.

Operation   Return value
ADD         O = L + R
SUB         O = L − R
MUL         O = L ∗ R
DIV         O = L / R
SIN         O = sin L
COS         O = cos L
TANH        O = tanh L
MIN         O = min L
MAX         O = max L
MEAN        O = 0.5 (L + R)
ABS         O = |L|
NEG         O = |−L|
LOGIC       O_i = logic(L_i, R_j)

The size of the vector is defined by the number of inputs, outputs, and hidden states. While the number of input and output states is defined by the task, the number of hidden states for this brain was set to be equal to the number of hidden states in the Markov Brain, to allow all brains to be directly comparable.

3.3.10 Evolutionary Process

Each of the agent populations with different brain types was evolved for 40,000 generations in populations of 100 organisms. The start population was initialized with random genomes of length 5,000. The genome size was capped at a maximum of 20,000 sites and a minimum of 2,000 sites. For each experimental condition, I ran 400 independent replicate experiments. At the end of the evolutionary process, the line of descent was reconstructed, and all analyses were performed on these agents. All experiments, using each of the three agent types and two worlds, were implemented using the MABE code framework [Bohm et al., 2017], which among many other things includes an extensive version of Markov Brains as well as all other tools required to perform neuro-evolution on arbitrary substrates and tasks.
In both worlds, I used the number of correct decisions C (e.g., blocks caught or avoided correctly, pairs of numbers correctly ordered) and incorrect decisions I (e.g., blocks caught that should have been avoided, pairs of numbers incorrectly ordered) to define the fitness function as:

W = 1.1^(C−I)   (3.5)

This is an exponential fitness function in which an additional correct decision of the agent improves fitness not linearly but exponentially. This increases the chances of an agent to be selected for reproduction exponentially, which in my experience allows the GA to select for functional agents better. The exponential base value (1.1) was taken from previous work [Marstaller et al., 2013]. After the fitness has been computed for every organism in the population, I use roulette wheel selection to determine which organisms contribute offspring to the next generation [Thomas, 1996]. Roulette wheel selection is a method that ensures selection is directly proportional to fitness. Every time an agent made an offspring, the genome experienced gene duplications (with probability 2 × 10^-5 per nucleotide of the parent genome, duplicate the next 128 to 512 contiguous sites), deletions (with probability 2 × 10^-5 per nucleotide of the parent genome, delete the next 128 to 512 contiguous sites), and point mutations (with probability 5 × 10^-3 per nucleotide of the parent genome, randomize the site).

3.3.11 Augmenting with R

In order to augment the performance of the GA, not only the performance of each agent is measured but also the amount of representations each agent has about its environment. In the block catching task these are specifically the size of the block, the direction it falls, and whether it is to the left or right of the agent. In the number comparison task, representations are about the individual values of the numbers to be compared, and the ordering of the pair (i.e., whether the first or the second was smaller). To assess performance, the number of correct catches (C) and misses (I), or the number of correct (C) versus incorrect answers (I), are counted. R is quantified as described above, and Rmax is the maximum hypothetical value for R, which is dependent on the environmental and memory states. The total fitness of the agent is determined by the following equation:

W = 1.1^(C−I) (1 + R/Rmax)   (3.6)

Alternatively, I could have used a multi-objective optimization method [Deb, 2001] that would allow some agents in the population to have a high score and low R, while others might perform poorly but have proper representations. This keeps the diversity high, but only a recombination step might take full advantage of such diversity. Recombination, however, would introduce another level of complexity that might have confounded the results. As such, I did not pursue this option here.
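A small sketch of the two fitness functions (equations 3.5 and 3.6) together with fitness-proportional (roulette wheel) selection; the variable names are illustrative.

import random

def fitness(correct, incorrect):
    """Baseline exponential fitness, W = 1.1 ** (C - I) (equation 3.5)."""
    return 1.1 ** (correct - incorrect)

def augmented_fitness(correct, incorrect, R, R_max):
    """R-augmented fitness, W = 1.1 ** (C - I) * (1 + R / R_max) (equation 3.6)."""
    return fitness(correct, incorrect) * (1.0 + R / R_max)

def roulette_select(population, weights, rng=random):
    """Pick one parent with probability proportional to its fitness."""
    total = sum(weights)
    pick = rng.uniform(0.0, total)
    running = 0.0
    for individual, weight in zip(population, weights):
        running += weight
        if running >= pick:
            return individual
    return population[-1]

# An agent with 28 correct and 4 incorrect decisions and a representation of 1.2 bits
# (out of a hypothetical maximum of 3.0 bits) under the augmented fitness function.
print(augmented_fitness(28, 4, R=1.2, R_max=3.0))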
3.4 Results

3.4.1 GP trees evolve to have representations

Each system responds differently to optimization by a genetic algorithm (GA), and as such I expect GP-Vector, GP-Forest, and Markov Brains using GP gates (further called MBwGP) to generally evolve to solve the task at hand but to be optimized at different speeds. Similarly, systems might tend to get stuck in local optima differently, and thus the resulting performance might differ between brains. As expected, I find the different brains to respond differently to the optimization of the GA (see Figure 3.8). I find all systems to evolve generally well; however, GP-Forests tend to perform better on the number task (see Figure 3.8 right panel). MBwGP and GP-Forest are almost indistinguishable with respect to their performance in the active categorical perception (ACP) task (see Figure 3.8 left panel). GP-Vector brains, however, perform poorest on both tasks. Interestingly, I find a clearer delineation of results with respect to the degree to which these systems evolve the ability to have representations (see Figure 3.9). GP-Forests evolve the highest degree of representations, followed by MBwGP, trailed by GP-Vector Brains. This is particularly interesting since previous work [Marstaller et al., 2013] found that RNNs evolve representations early, but struggle with improving their performance, even though they already have a high level of representations. Markov Brains on the other hand evolve representations slowly over time, and thus can only improve their performance later, after they have sufficient representations. GP-Forest Brains seem to evolve like Markov Brains but store more representations.

Figure 3.8: Performance over time for all 3 brain types (GP-Forest = black, MBwGP = red, GP-Vector = blue). The left plot is active categorical perception, the right is number comparison.

Figure 3.9: R over time for all 3 brain types (GP-Forest = black, MBwGP = red, GP-Vector = blue). The left plot is active categorical perception, the right is number comparison.

3.4.2 Does Augmentation using R improve the performance of a GA?

I know that one can augment the optimization a GA performs by using R (see Augmenting with R). The idea is that mutations that improve an agent's ability to store more representation would normally be neutral, as they do not necessarily also immediately provide an advantage to performance. However, in order to improve performance, representations are often needed. The augmentation process, therefore, turns mutations that improve R but are neutral into beneficial ones. While this is true for Markov Brains and RNNs, it is not clear if GP-Forest or GP-Vector brains will behave the same. I tested this hypothesis by evolving all three types of brains again on both environments, but also measured the representations they have during evolution, and used this quantity to augment the GA. This is in contrast to the results above, which merely examined the unaugmented GA. I find that this augmentation has almost no effect on GP-Forests and only a little on MBwGP brains (see Figure 3.10). However, it significantly improves the performance of GP-Vector Brains in the active categorical perception task. I also find that GP-Forests and MBwGP already evolve the ability to store representations well, but GP-Vector Brains struggle. This supports the hypothesis that augmenting a GA by using R works only for systems that struggle to form representations in the first place. I further find that while the augmentation with R might not improve performance, it still had an effect on the ability to store representations in each system. Each brain, when evolved using R-augmentation, evolved to have more representation over evolutionary time than the unaugmented control (see Figure 3.11). This proves that the augmentation indeed has an effect, but again, it only improves the performance of the GA when a system struggled to evolve representations.
Figure 3.10: Performance over time for all 3 brain types (GP-Forest = black, MBwGP = red, GP-Vector = blue). The left plot is active categorical perception, the right is number comparison. The solid line is the control, the dotted line is augmented with R.

Figure 3.11: R over time for all 3 brain types (GP-Forest = black, MBwGP = red, GP-Vector = blue). The left plot is active categorical perception, the right is number discrimination. The solid line is the control, the dotted line is augmented with R.

The most notable finding is that GP-Forest Brains do evolve a large degree of representations, and respond to the augmentation with an increase in R, but do not significantly benefit from the augmentation with respect to their performance.

3.4.3 Smeared Representations

I found earlier [Hintze et al., 2018] that brains that evolve to have sparse and condensed representations are more robust to noise, and conjecture that overly smeared representations negatively affect performance and evolvability. When representations are smeared, every mutation changing the system would affect the computation of all concepts. A similar argument has been made explaining the benefit of the dropout method in deep learning [Srivastava et al., 2014]. By removing nodes while training an ANN, representations might become more condensed, improving the ANN's ability to generalize. Another way of thinking about this is that the more clearly separated the concepts in the cognitive system are, the better it can classify. I already showed that neuro-evolution of RNNs and LSTMs results in systems that smear representations greatly, while Markov Brains evolve to have sparse, condensed representations [Hintze et al., 2018]. The question here is whether the structure of genetic programming trees helps to further condense representations, or whether these systems smear representations nevertheless. I find that the degree to which these systems evolve smeared representations again varies greatly between the systems. While MBwGP and GP-Vector Brains only smear their representations marginally when evolved to solve the active categorical perception task, GP-Forest Brains smear their representations much more (see Figure 3.12 left panel). When evolved to solve the number task, both GP-Forest and GP-Vector smear more than the MBwGP (see Figure 3.12 right panel).

Figure 3.12: Smearedness of nodes over time for all 3 brain types (GP-Forest = black, MBwGP = red, GP-Vector = blue). The left plot is active categorical perception, the right is number comparison. The solid line is the control, the dotted line is augmented with R.

3.5 Discussion

I used neuro-evolution to optimize the performance of three different computational systems. Markov Brains that use GP nodes are similar to CGPs and evolve a network topology that I know is capable of forming representations which are not smeared. I confirmed that Markov Brains using GP nodes benefit from augmenting the GA using R, and asked if the same principles apply to other cognitive topologies. Of particular interest are the tree-like structures from genetic programming trees.
Unlike the topology of ANNs, which connects every node with every other node and leads to extreme smearing of representations [Hintze et al., 2018], I assumed that tree structures would behave differently in this regard. The tree structure can take advantage of all possible inputs and hidden states but converges them into a single output at the end. The assumption was that these systems would condense representations better and thus smear less. The hope was that such a construction allows a system to take full advantage of having a high degree of representations while at the same time not smearing them.

While the GP-Vector implementation performed poorly compared to GP-Forests and MBwGP, I found that GP-Forests evolve equally well or even better when compared to MBwGP (see Figure 3.8). GP-Forests also evolve to have more representations than the less structured MBwGP Brains (see Figure 3.9), and these representations do smear (see Figure 3.12). However, augmenting the GA with R has no effect on their performance (see Figure 3.10) and only increases the amount of representations they evolve to have, as expected (see Figure 3.11). As such, GP-Forests have the ability to evolve representations and take advantage of them by construction, and are thus "immune" to further augmentation by using R. Lastly, the smearing of representations in GP-Forests is much less than what has been observed in RNNs before [Hintze et al., 2018]. This further supports the idea that the tree structure provides a benefit over both the networks that Markov Brains or CGPs form and the fully connected regular structures that RNNs have when it comes to having representations and taking advantage of them over the course of evolution. In other words, GP-Forest Brains evolve equally well or better than Markov Brains and form representations easily by their construction, and thus do not benefit from further augmentation with R. At the same time, they do smear their representations, but not to the degree that it becomes a problem as it does in RNNs.

3.6 Conclusions

The two genetic programming systems (GP-Forest and GP-Vector) solve the quest to make genetic programming trees have recurrence, thus allowing them to become stateful machines. In the case of GP-Forests, their structure provides an advantage over other regular (ANN) or network structures (Markov Brains) with respect to evolving meaningful representations. The immunity of GP-Forests to augmentation with R shows that this system automatically retains representations sufficiently well, without smearing them, by nature of its structure. This insight suggests that one should take further advantage of tree-like structures in neuro-evolution. The capabilities of the GP-Forests also suggest investigating tree structures and how they affect the concept of smearedness in the context of deep learning.

3.7 Acknowledgements

I thank Stephan Winkler for insightful discussions on hidden states in genetic programming trees, and for implementing the prototype for GP-Forests in EvoSphere.

Chapter 4
Evolutionary Dynamics Effects Account for the Improvement Caused By R-Augmentation

A portion of this chapter was originally published as "Kirkpatrick, D., & Hintze, A. (2020, November). Evolutionary Dynamics Effects Account for the Improvement Caused by R-Augmentation. In 2020 7th International Conference on Soft Computing & Machine Intelligence (ISCMI) (pp. 96-100). IEEE." A number of revisions and additions have been made for inclusion in this dissertation.
4.1 Abstract

Previous work has found a method for augmenting a genetic algorithm (GA), referred to as R-augmentation, that produces better-scoring results earlier in evolutionary time. R-augmentation works by using an information-theoretic measure quantifying mental representations (denoted as R) as an additional contribution to the fitness function of the GA. It has been shown that this method improves the performance of the GA by encouraging the evolution of high-performing artificial agents controlled by Markov Brains or Recurrent Neural Networks. Different mechanisms could explain this phenomenon, ranging from structural changes to differences in the associated mutational effects. Here I examine a range of potential sources of improvement and demonstrate that the majority of the improvement is caused by R-augmentation altering the evolutionary dynamics of the GA, reflected in the changed effects of mutational operators and changes to the fitness landscape.

4.2 Introduction

Evolutionary algorithms used to train artificial intelligences can produce novel and often unexpected results as compared to other types of optimization, particularly reinforcement learning [Lehman et al., 2018]. However, an evolutionary optimization system may take a long time to reach an optimal solution to a problem, if it ever reaches that optimal solution at all. Many methods have been proposed to reduce the run time of these evolutionary algorithms, in order to create better-performing solutions with less compute time. Previous work has identified one method of boosting the evolutionary process, referred to as R-augmentation, that improves GAs by promoting the evolution of artificial intelligences with increased mental representations [Schossau et al., 2015]. This augmentation process has been found to result in a higher number of perfect performers as well as higher task performance in fewer generations, when compared to an unaugmented GA. Here, mental representations are measured as the amount of sensor-independent information that the artificial intelligence has integrated into its internal memory from the environment, and are evaluated using the information-theoretic neurocorrelate R [Marstaller et al., 2013]. Colloquially, R quantifies how much a system knows about its environment without looking at the environment. This method can only be applied to tasks that require agents to form recurrent internal models of their environment, which are generally stored in memory through some form of network feedback loop. R-augmentation works by changing the original fitness function so that R is evaluated in addition to the task-related components (see subsection 4.3.3 for a fuller description of the R-augmentation process). While R-augmentation, as well as augmentation using other neurocorrelates, allows GAs to generate better-performing solutions in less evolutionary time, it is not clear how the improvement is generated. Although R-augmentation comes at a cost, previous work was unable to identify a cost-benefit trade-off beyond the computational expense of calculating R (see Ch. 2). Understanding how and why these augmentation methods improve the performance of GAs is a crucial step towards accelerating evolutionary search as a whole. A number of changes to the networks created by R-augmented GAs could account for the difference in performance as opposed to those created by an unaugmented GA.
First and foremost are changes to the internal structure of the cognitive network, particularly those geared at improving access to memory. In order for representations to form, information must be maintained in the network. In order for information to be maintained in the network, it must be repeatedly propagated into a usable location, and thus form part of an internal looping structure. Since the formation of representations requires an internal looping structure to maintain information, I will attempt to quantify the recurrent structures of the network by analysing the connections between nodes. I primarily consider two cases: first, a node may have an output that feeds back (through computational components) into its own inputs, forming a self-recurrence. Second, a node's output may feed into another node, whose output is then fed back into the original node, creating a two-step recurrence. Although there may be larger, more complex chains of recurrence, in the context of this chapter I only consider the first two cases, as a difference in the recurrent structure should be obvious there.

The cognitive network that I study, the Markov Brain, is highly composable; that is, it can be made of different internal computational components, referred to as gates. These gates can have internal mechanisms that depend on deterministic or probabilistic logic, or even mimic the functionality of other network structures like artificial neural networks (ANNs) or Genetic Programming (GP) trees. Moreover, recent work [Hintze et al., 2019] suggests a new paradigm for approaching the question of how R-augmentation works by offering a new method of configuring Markov Brains, referred to as the "Buffet Brain." That work suggests that evolution proceeds more effectively if the Markov Brain has access to all gate types. Here, I can allow the GA access to all gate types and see what differences arise in the gates chosen by R-augmented and unaugmented GAs. Presumably, if the computational substrate matters for the R-augmentation process to be effective, I would see a difference in gate type distributions for the runs using the buffet method.

Beyond the changes to the network structure or composition, I must examine alternate sources of the difference in performance. I will frame this difference by first considering a method analogous to R-augmentation, namely multi-objective optimization [Zhou et al., 2011]. As opposed to R-augmentation, objectives in multi-objective optimization usually involve a trade-off; that is, they are often antagonistically linked. In R-augmentation the relation between performance (or fitness W) and R is not necessarily antagonistic. Their relation could be synergistic, where high levels of R drive high performance. Alternatively, one might be a necessary condition for the other. For example, the amount of information present in the system could be limiting the performance, and only once more information is acquired can performance follow. With R-augmentation, a positive change to R would increase fitness and thus not be neutral. As a consequence, later mutations would then be able to take advantage of the increase in R and improve performance directly. Without R-augmentation, the first mutation (the one that only increases R) would be neutral and effectively invisible to selection, and any change in performance would require an additional, separate mutation that exploits that increase in R. An alternative answer to the question of how R-augmentation works involves the fitness landscape itself.
The fitness landscape metaphor assumes a space defined by all possible genotypes. In this space, the neighborhood relations are defined by the mutational operators. Two genotypes that are one mutation apart would be direct neighbors, whereas genotypes that are separated by many mutations would be far from each other in the fitness landscape. The "height" at each genetic location is given by the fitness of that genotype. Without R-augmentation this landscape has a particular slope and ruggedness that determines the rate of adaptation. R-augmentation changes the landscape, perhaps making it smoother than the unaugmented GA's landscape. As a consequence, the ability to cross valleys in this altered landscape might be increased, or perhaps valley-crossings are eliminated altogether. Here I examine differences in the network structure and composition, as well as several different indicators of a change to the evolutionary dynamics of the augmented and unaugmented GAs, to gain a fuller understanding of how R-augmentation results in increased task performance.

4.3 Materials and Methods

4.3.1 Markov Brains

Markov Brains are a form of neural network composed of computational modules (most often Boolean logic gates) that read from and write to data nodes. The entire topology of the network can evolve along with the functionality of each individual computational module that connects nodes. The topology and functionality of the network are encoded in a genome containing an arbitrary number of genes, where each gene encodes a computational unit. The genes are delimited by markers ("start codons") that specify a gate, with the subsequent sites of each gene determining the gate's function and connectivity. Gates can read from and write into common data nodes, allowing for data interaction and complex computations. Data nodes can receive sensor information and provide output to generate motor actions, and gates can additionally read from and write to a persistent set of recurrent nodes that allow for memory and representations. The sensor input and motor output make it possible for the brain to behave as an agent in a virtual environment. In principle, Markov Brains can use almost any conceivable input-output computational structure as a gate; however, here I limit the gate types to deterministic logic, probabilistic logic, Genetic Programming (GP) logic, artificial neural network (ANN), and NeuroEvolution of Augmenting Topologies (NEAT) derived mathematical functions. In contrast, the original R-augmentation work only allowed deterministic logic gates. I use a version of the Markov Brain from recent literature [Hintze et al., 2019], referred to as a Buffet Brain, which allows the Markov Brain to select from all of the listed gate types at once. For a fuller description of Markov Brains, see [Hintze et al., 2017].

4.3.2 Active Categorical Perception Task

Agents are evaluated in an active categorical perception task which has been extensively described in prior literature [Schossau et al., 2015, Kirkpatrick and Hintze, 2019]. In brief, agents can move laterally and have to either catch or avoid blocks of different sizes dropped at them. Due to a blind spot in the agent's sensor, a single observation is not sufficient to unambiguously determine either the size of the block or the direction in which the block is moving. Thus, the agents have to actively move in order to position themselves properly and then make multiple observations to categorize block size and movement direction.
This information has to be stored so that the agent can use it to successfully catch the small blocks or avoid the large blocks at a later time point. An agent's ability to complete the task is tested over 64 different drop conditions: permutations of large or small block size, block dropping to the left or right, and 16 different block start positions relative to the agent. The performance of the agent in the task is measured in terms of the number of correct and incorrect decisions made over all drop conditions.

4.3.3 Representations

"Representation" is a term I use to quantify the information about the world, environment, or task stored internally by the network that is not shared with the current sensor input. In other words, representations are the stored information in the brain that correlates with information in the environment but does not correlate with sensory cues. Such a quantity can be measured using the information-theoretic measure R [Marstaller et al., 2013], which is used in the process of R-augmentation to change the fitness function so that mental representations are evaluated along with task performance. Generally speaking, information is shared between the world, the sensors, and the hidden states of the computational system. Each state can be expressed as an independent random variable, and their intersection defines how much each of these random variables can explain the others (see Figure 4.1). The representation R (how much the memory states M know about the environment E given the sensory data S) can be expressed as:

R = H(E : M | S) = I(E : M) − I(E : M : S)   (4.1)

Here, I(E : M) (from Eq. 4.1) refers to the shared Shannon entropy (mutual information) between world states and brain states, while I(E : M : S) stands for the information shared between the environment, memory, and sensors. In practice I use the simplified equation (Eq. 3.3) found in Chapter 3.

Figure 4.1: Venn diagram of entropies and informations for the three random variables E, S, and M, describing the environment, sensor, and agent internal (memory) states. The information-theoretic measure of representations R = H(E : M | S) is shaded. © 2020 IEEE.

4.3.4 Fitness Function and R-Augmentation

To implement R-augmentation, I supplement the standard fitness function, Equation 4.2, defined in terms of correct (C) and incorrect (I) decisions,

W = 1.10^(C−I)   (4.2)

with a term involving R (see [Schossau et al., 2015] or Chapter 2). The measured value of R is normalized by dividing by the theoretical maximal attainable value of R, Rmax, which is dependent on the number of bits of information present in the specific task states and brain hidden states (here, 4 and 8 bits, respectively). The addition of this term results in a new function for fitness (W), as seen in Equation 4.3, that is used in the augmented GA.

W = 1.10^(C−I) · (1 + R/Rmax)   (4.3)

The base value of 1.10 is a scaling factor that transforms an otherwise linear fitness function into an exponential one. Previous work [Schossau et al., 2015] has shown that such a scaling function works well, and I adopt it here.

4.3.5 Evolutionary and Experimental Conditions

I evolve agents controlled by Markov Brains [Hintze et al., 2017] in populations of 100, with 500 replicate runs for each experimental condition (unaugmented and R-augmented GAs evolving the Buffet Markov Brain). At the beginning of each evolutionary run, I generate all agents stochastically using a replicate-specific random number seed.
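As a concrete illustration of Eqs. 4.2 and 4.3, the following minimal Python sketch computes the unaugmented and R-augmented fitness of an agent. The function name, signature, and example values are assumptions made for illustration; this is not the MABE implementation used in the experiments.

```python
def fitness(correct, incorrect, r_value=0.0, r_max=4.0, base=1.10, augmented=False):
    """Sketch of Eq. 4.2 (unaugmented) and Eq. 4.3 (R-augmented) fitness.

    correct/incorrect are the counts C and I over all drop conditions; r_value is
    the measured representation R and r_max its theoretical maximum for the given
    task/brain combination. Illustrative only.
    """
    w = base ** (correct - incorrect)       # Eq. 4.2: W = 1.10^(C - I)
    if augmented:
        w *= 1.0 + (r_value / r_max)        # Eq. 4.3: multiply by (1 + R / Rmax)
    return w

# Example: 40 correct and 24 incorrect decisions, R = 1.0 bit, assumed Rmax = 4 bits
print(fitness(40, 24))                                   # unaugmented fitness
print(fitness(40, 24, r_value=1.0, augmented=True))      # augmented fitness
```

Note that under this augmented function a mutation that only raises R (say from 0.5 to 1.0 with an assumed Rmax = 4) multiplies fitness by (1 + 0.25)/(1 + 0.125) ≈ 1.11, whereas under the unaugmented function the same mutation is selectively neutral; this is the dynamic discussed in the introduction to this chapter.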
Over the course of evolution, agents can experience point mutations (with probability p = 0.005), or insertions and deletions (with probability p = 0.0002) of genetic material. At every generation, I evaluate the agents on the active categorical perception task, with both task performance and R calculated. After 10,000 generations the line of descent [Lenski et al., 2003] is reconstructed, with data stored for every time point. The line of descent is the lineage (parent, grandparent, great-grandparent, etc.) of the highest-performing individual in the final population (i.e., from generation 10,000). I run two experimental conditions: the first where I use the original fitness function (Equation 4.2), and one where I apply the R-augmentation process using the augmented fitness function (Equation 4.3).

4.3.6 Network Structure and Composition Analysis

To examine how network structures affect representations, I look specifically at the connections that feed back into themselves, a structure that is necessary for information storage and representations. Without a feedback structure, the information will not be maintained in the network, which makes memory, and therefore representations, impossible. For Markov Brains in particular, as the topology is evolved and not predetermined, these network loops must be evolved more or less from scratch, which makes their presence worth examining. To determine the looping structures, I first process each computational gate from every individual brain on the line of descent, determining in each brain which data nodes are read from and written into. With these links, I create a connectivity matrix for each brain, and determine the number of 1-cycles (where data is read from and written back into the same node) and 2-cycles (where data is read from a node, written to a second node, and then read from the second node and written back to the original node). In order for representations to be persistent and therefore useful, they must be stored in these feedback network cycles.

While studying differences in network composition to explain how R-augmentation improves task performance, I can also examine which gate types are preferred by brains evolved by GAs with and without R-augmentation. To study differences in the types of gates selected for, I can visualize the distribution of different gate types over evolutionary time. Furthermore, I can examine the total number of gates evolved, to get an approximation of the networks' complexity.

4.3.7 Evolutionary Dynamics Analysis

After I reconstruct the lines of descent, I compare the differences from one time point to the next. This makes it possible to calculate the change in fitness (∆W) and the change in R (∆R) associated with each mutation along the line of descent. While a majority of the mutations are neutral (having zero ∆W and zero ∆R), a significant number of mutations result in a non-zero ∆W or ∆R. By examining how these changes vary in each experimental condition, I can uncover how R-augmentation affects the evolutionary dynamics.
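The cycle-counting procedure of Section 4.3.6 can be sketched in a few lines of Python. Here each gate is assumed to have already been reduced to the sets of data-node indices it reads from and writes to; this data layout and the function names are illustrative stand-ins rather than the actual MABE brain representation.

```python
import numpy as np

def count_cycles(gates, n_nodes):
    """Count 1-cycles and 2-cycles in a brain's node-to-node connectivity.

    gates: list of (input_nodes, output_nodes) index tuples, one per gate.
    Returns (one_cycles, two_cycles, total_connections).
    """
    # Connectivity matrix: conn[i, j] = 1 if some gate reads node i and writes node j.
    conn = np.zeros((n_nodes, n_nodes), dtype=int)
    for inputs, outputs in gates:
        for i in inputs:
            for j in outputs:
                conn[i, j] = 1

    one_cycles = int(np.trace(conn))                       # node feeds back into itself
    # 2-cycles: i -> j and j -> i for i != j; each unordered pair counted once.
    two_cycles = int((np.sum(conn * conn.T) - one_cycles) // 2)
    return one_cycles, two_cycles, int(conn.sum())

# Example: node 0 writes back to itself; nodes 1 and 2 feed each other.
print(count_cycles([([0], [0]), ([1], [2]), ([2], [1])], n_nodes=4))  # -> (1, 1, 3)
```

Roughly speaking, the proportions plotted later in Figure 4.5 correspond to these cycle counts taken relative to the total number of node-to-node connections.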
4.4 Results and Discussion

Since I am using a different selection of computational gates for the Markov Brain structure, I first replicate the results from [Schossau et al., 2015], demonstrating the utility of R-augmentation. I find that, despite the difference in choice of gates, R-augmentation allows the GA to evolve solutions that have higher task performance overall, and arrive at these high-performing solutions earlier in evolutionary time relative to unaugmented GAs (see Figure 4.2). Many possible factors could explain how R-augmentation improves the performance of models created by the GA: structural changes to the network, different exploratory behavior of evolved agents, selection of different computational primitives, or the order of steps in which evolution optimizes the behavior, to name only a few. In this chapter, I will examine the structural differences, component differences, and differences in evolutionary dynamics.

Figure 4.2: Task Performance in Buffet Markov Brains produced by Augmented and Unaugmented GAs. The blue line represents results from the lines of descent of GAs using R-augmentation, while the orange line comes from unaugmented GAs. Each line is the average performance over 500 replicate experiments. The shaded area around each line represents the standard error. R-augmented GAs produce Markov Brains that have a significantly higher average performance than the Markov Brains produced by unaugmented GAs. © 2020 IEEE.

First, I look at the gate distribution of the evolved Markov Brains. At first glance, the distribution of gates as a proportion of all gates used, as seen in Figure 4.3, does not show a noticeable difference between the Markov Brains created by R-augmentation and those created without. However, when the distribution is taken not as a proportion of the total gates but rather as the average count of gates, as shown in Figure 4.4, a more interesting effect becomes clear.

Figure 4.3: Average Proportion of Different Gate types in Buffet Markov Brains produced by Augmented and Unaugmented GAs over evolutionary time. Deterministic gates are blue. Probabilistic gates are green. ANN gates are red. NEAT gates are cyan (light blue). GP gates are yellow.

Figure 4.4: Average Number of Different Gate types in Buffet Markov Brains produced by Augmented and Unaugmented GAs over evolutionary time. Deterministic gates are blue. Probabilistic gates are green. ANN gates are red. NEAT gates are cyan (light blue). GP gates are yellow.

The Markov Brains created using R-augmented GAs have a higher average number of gates, where the difference is approximately half a gate per brain on average. This added complexity is a result of evolutionary dynamics: gates that improve R but not fitness are likely maintained under R-augmentation but fall away due to genetic drift without it. It may be that this added complexity contributes to the increase in task performance; once an additional gate is present and contributing to R, it can also be more easily mutated to provide an increase in fitness, as opposed to the unaugmented case where a gate would need to be created from scratch. As a continuation of this investigation, I test whether the improvement from R-augmentation was caused by a change to the underlying network structure of the evolved Markov Brains.
As mental representations and R require the persistence of information through looping constructs in the network, I examine the cyclical nature of the networks produced by unaugmented and augmented GAs by counting the number of self-referential information cycles of length 1 and length 2. There does not appear to be a substantive difference in the cyclical structures produced by either type of GA (see Figure 4.5). This measurement is only an approximation and does not address how well the network uses any looping structures, which may explain why I detect no notable difference between networks created by augmented or unaugmented GAs.

Figure 4.5: Proportion of 1- and 2-Cycles in Markov Brains produced by Augmented and Unaugmented GAs over evolutionary time. The purple lines represent the proportion of 1-cycles (e.g., data is read from node A, processed by a gate, and then fed back into node A) among all node-to-node connections. The light blue lines represent the proportion of 2-cycles (e.g., data is read from node B, processed by a gate, and fed into node C, which is read, processed by a gate, and fed back into node B), against all node-to-node connections. Although not identical, the proportions both hover around the same value of 8 percent in networks produced by augmented and unaugmented GAs. © 2020 IEEE.

After concluding that changes in network structure alone could not explain the change in performance from R-augmentation, I examine how mutations resulted in changes to task fitness and R along the line of descent, hoping to find evidence of a change in the overall evolutionary dynamics. I examine the mutations by using the lines of descent from each replicate experiment from R-augmented and unaugmented GAs. I measure the effect of each mutation on R and performance (W) independently using the difference between pairs of organisms on the line of descent. Thus, each mutation can be understood as a movement in a 2D space defined by two axes. I define the y-axis to be the change in R, and the x-axis to be the change in W. Neutral mutations do not cause any movement in this space, while an improvement in R corresponds to a move upwards and a decrease to a move downwards. Similarly, an improvement to W results in a move to the right, and a decrease in a move to the left. I first measure the average magnitude of the movements along each angle of these axes (see Figure 4.6). I also measure the number of mutations that moved along each angle of the axes, demonstrated in Figure 4.7.

Figure 4.6: Mutational Changes in Markov Brains produced by Augmented and Unaugmented GAs. The blue line represents results from the lines of descent of GAs using R-augmentation, while the orange line comes from unaugmented GAs. Each point around the circle represents the average magnitude of the changes with a given proportion of change in R (vertical axis) and change in fitness (horizontal axis). The two distributions are significantly different (Kolmogorov–Smirnov 2-sample test, p < 0.0001). © 2020 IEEE.
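A minimal sketch of this mutational-effect analysis is given below, assuming the line of descent is available as a sequence of (W, R) pairs, one per generation. The binning resolution and the exact normalization are arbitrary illustrative choices and may differ from the analysis used to produce Figures 4.6 and 4.7.

```python
import numpy as np

def mutational_effect_profile(line_of_descent, n_bins=36):
    """Bin (dW, dR) steps along a line of descent by angle in the dW/dR plane.

    line_of_descent: sequence of (W, R) pairs along the reconstructed lineage.
    Returns (bin_edges, counts_per_bin, mean_magnitude_per_bin).
    """
    values = np.asarray(line_of_descent, dtype=float)
    deltas = np.diff(values, axis=0)                 # rows of (dW, dR)
    dw, dr = deltas[:, 0], deltas[:, 1]

    nonzero = (dw != 0) | (dr != 0)                  # discard strictly neutral steps
    dw, dr = dw[nonzero], dr[nonzero]

    angles = np.arctan2(dr, dw)                      # dW on the x-axis, dR on the y-axis
    magnitudes = np.hypot(dw, dr)

    bin_edges = np.linspace(-np.pi, np.pi, n_bins + 1)
    counts, _ = np.histogram(angles, bins=bin_edges)
    sums, _ = np.histogram(angles, bins=bin_edges, weights=magnitudes)
    mean_magnitude = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return bin_edges, counts, mean_magnitude

# Toy lineage of (W, R) values
bins, counts, mags = mutational_effect_profile(
    [(35.0, 0.1), (35.0, 0.4), (41.0, 0.4), (41.0, 0.4), (45.0, 0.9)], n_bins=8)
print(counts, np.round(mags, 2))
```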
The positive effect generated by R-augmentation seems to be a result of a change to the fitness landscape or, more broadly, a change in the evolutionary dynamics of the genetic algorithm. GAs augmented with R have more mutations that produce larger changes in both R and fitness, as opposed to unaugmented GAs, whose mutations produce changes more exclusively in terms of task fitness. This difference was also seen in the count of the mutations, where fewer but perhaps more effective movements were made by the mutations of R-augmented GAs. These different mutations may allow the augmented GAs to explore the fitness landscape more efficiently, finding solutions that have better task performance in less evolutionary time.

Figure 4.7: Mutational Changes in Markov Brains produced by Augmented and Unaugmented GAs. The blue line represents results from the lines of descent of GAs using R-augmentation, while the orange line comes from unaugmented GAs. Each point around the circle represents the average count of changes with a given proportion of change in R (vertical axis) and change in fitness (horizontal axis). The two distributions are significantly different (Kolmogorov–Smirnov 2-sample test, p < 1 × 10^-7).

4.5 Conclusion

I found that the improved performance of R-augmented GAs cannot be explained by changes to the structure of the evolved neural network, but can likely be explained by evolutionary dynamics effects. The mutations along the line of descent produce movement through the fitness landscape primarily in terms of task performance in unaugmented GAs, while augmented GAs have mutational effects that result in greater changes in terms of both R and task performance. While there might be other possible factors at play causing the improvement from augmentation, a change to the fitness landscape appears likely, as the augmented fitness function is significantly altered by integrating R. Further, R measures a property that is desirable for the task: it is better for the agent to know more about the environment. As such, R and performance both have the chance to positively synergize in the augmented fitness function. That might not be true for other neuro-correlates or for multi-parameter optimization in general. Thus I do not believe that this result necessarily generalizes to all other ways of combining a performance measure with a property of the optimized agent. However, further exploring the exact mechanisms of how R-augmentation or other augmentation methods change the fitness function, and thus the evolutionary dynamics of the GA, could lead to more powerful GAs as a whole. As there are additional possible explanations contributing to the success of R-augmentation, in the future I will explore other factors, such as computational or behavioral changes to the optimized networks, and their effect on R-augmentation.

4.6 Acknowledgements

This work was in part sponsored by the BEACON Center for the Study of Evolution in Action, NSF Cooperative Agreement No. DBI-0939454. I would like to thank Vincent Ragusa, Cliff Bohm, and Chris Adami for their insights and advice. All research was conducted using the MABE software platform, run on the Michigan State HPCC.

Chapter 5
R-Augmentation Maintains Performance as Sample Size Decreases

5.1 Abstract

Cognitive systems, both biological and artificial, require an understanding of the world in order to act and make decisions. This understanding is aided by the creation of mental models or representations.
Although artificial cognitive systems created by deep learning struggle to form representations, previous work has shown that using neuro-evolution to generate representations in artificial cognitive systems is a viable method. This has been confirmed by using an information-theoretic measure, R, to determine the amount of information a cognitive system has integrated about its environment. Furthermore, by modifying the fitness function to select for both R and task performance (through a process known as R-augmentation), neuro-evolution can more quickly select for cognitive systems that have high task performance. In this chapter, I extend this work to investigate a method of improving the efficiency of R-augmentation, showing that an R-augmented GA requires only a small subset of the task cases during evolution to achieve high performance across the entire task. Furthermore, I show that this improvement does not come at a cost to the ability of the cognitive system to generalize either within-domain or out-domain. This work has the potential to allow for more complex tasks and faster neuro-evolution of artificial cognitive systems.

5.2 Introduction

The field of Artificial Intelligence, and the sub-field of Machine Learning in particular, is continuously searching for better and more efficient ways to optimize neural networks to complete various tasks [Sun et al., 2019]. Many of these optimizations have focused on more efficient algorithms [Johnson and Zhang, 2013] or on tweaking network architectures to more efficiently learn a specific problem (e.g., [Dao et al., 2020]). However, neither of these approaches addresses a fundamental problem with optimization for optimization's sake: namely, optimization alone does not necessarily produce a network with a better understanding of any given task, and thus fails to produce networks that are able to generalize to similar tasks. For this chapter, I investigate how creating a better understanding of a specific task may allow for more efficient adaptation and better task performance.

In the psychological and philosophical literature [Phillips and Singer, 1997, Pinker, 2013], the ability of a cognitive system to understand a given environment or task is referred to as a mental model or mental representation. Improvements to the task performance of cognitive systems have been shown to correlate with improvements in the system's mental representations. More specifically, this mental representation is defined as a sensor-independent model of the environment stored within the system's memory. These representations may encapsulate an entire task or focus on individual concepts that the agent deals with [Halford, 2014]. More importantly, mental representations may allow an individual to assess, understand, and form internal models about the entirety of a task even though they are only exposed to a small subset of the entire possible experience. As mental representations are thus desirable or even necessary for human intelligence [Collins and Gentner, 1987], recent work has attempted to replicate these representations in silico [Edlund et al., 2011, Marstaller et al., 2013, Albantakis et al., 2014, Tehrani-Saleh et al., 2018, Olson et al., 2016, Tehrani-Saleh et al., 2019]. [Marstaller et al., 2013] developed an information-theoretic measure, referred to as R, to measure these representations in a formal way.
Formally, R is the information shared between the task or environmental states and the hidden states of the cognitive system, given the information from the sensor states. R correlates with task performance, indicating that it measures the cognitive system's understanding of the given task. Moreover, [Schossau et al., 2015] created a process of selecting for both task performance and R, referred to as R-augmentation, to encourage the formation of representations in artificial cognitive systems. That paper also demonstrates that selecting for networks with better mental representations results in higher task performance in Markov Brains.

The core idea behind this chapter is simple: with a greater understanding of the task, as measured by R and promoted during evolution through R-augmentation, I should be able to evolve cognitive systems towards high performance despite only viewing a few examples at any given time. This enhanced understanding should allow networks to infer the missing pieces and evolve the ability to complete the entire task. Furthermore, the ability of the cognitive system to generalize to tasks beyond what it has been evolved to solve should remain unchanged by the R-augmentation process.

5.3 Methods

5.3.1 Mental Representations and R

In order to quantify mental representations, I turn to information-theoretic analysis. More specifically, I focus on the information-theoretic measure R [Marstaller et al., 2013]. R is defined to be the information shared between the task and the cognitive system's memory that is not currently being provided by the sensors. This can be quantified in the theoretical definition of Eq. 5.1, where T is the task variable, M is the random variable corresponding to the cognitive system's memory, and S is the variable corresponding to the states of the sensors.

R = I(T : M | S)   (5.1)

Note that conditioning on sensor states is intended to isolate the information that has been integrated and stored in memory separately from current sensory experience, thus finding true representations. This conditioning prevents the identification of internal models that are purely reactive to sensory experience. In practice, during the evaluation of the task, the relevant task states, the states of the input provided by the task to the cognitive system, and the memory states of the agent (after it has processed the given input) are all recorded for all time points of the evaluation. The memory states of the cognitive system are defined by the particular network structure, while the input states are also easily identified for any given task. However, the relevant task or environmental states must be carefully selected to identify salient concepts within the task. A task where an agent must find its way through a maze, for example, would likely encode the absolute direction to the goal location as well as a locally optimal direction as salient concepts. For the salient features that I use in this work, see the task description in Section 5.3.2.

When all of the state vectors have been collected, the probability of each state is calculated. These probabilities are used in a series of Shannon entropy calculations to find the measurement of R, as optimized previously (see Chapter 3). The total equation for R as used is shown in Eq. 5.2.

R = H(S, T) + H(S, M) − H(S) − H(T, M, S)   (5.2)

The joint entropies are calculated by appending the individual state vectors together over all time points and finding probabilities for the appended vectors. Then, similarly to the process for individual states, these probabilities are used to calculate the joint entropies required for Eq. 5.2.
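The calculation just described can be illustrated with a short Python sketch that computes R directly from recorded per-time-step state vectors according to Eq. 5.2. The variable names and the tuple encoding of states are assumptions made for illustration; in practice the states are recorded during the task evaluation described in Section 5.3.2.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Shannon entropy (in bits) of a list of hashable joint states."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def representation_R(sensor_states, task_states, memory_states):
    """Sketch of Eq. 5.2: R = H(S,T) + H(S,M) - H(S) - H(T,M,S).

    Each argument is a list with one entry (e.g., a tuple of bits) per recorded
    time step; joint variables are formed by pairing the per-step entries.
    """
    S, T, M = sensor_states, task_states, memory_states
    return (entropy(list(zip(S, T))) + entropy(list(zip(S, M)))
            - entropy(S) - entropy(list(zip(T, M, S))))

# Toy example: memory perfectly tracks the task while the sensor is uninformative.
S = [(0,), (0,), (0,), (0,)]
T = [(0,), (1,), (0,), (1,)]
M = [(0,), (1,), (0,), (1,)]
print(representation_R(S, T, M))   # -> 1.0 bit of representation
```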
As the memory, task, and sensor states collected are often not a comprehensive enumeration of all possible states, R is often only an approximation of the true amount of representation that a cognitive system has. Previous work [Kirkpatrick and Hintze, 2019] shows that it is possible to measure R using only some of the task states. Given that R is an approximation, using fewer states to calculate R may make the approximation slightly less accurate, but this inaccuracy does not invalidate the measurement itself.

5.3.2 Active Categorical Perception Task

The Active Categorical Perception Task (ACP) is the primary task used in the study of mental representations and R in evolved artificial cognitive systems. The task is both complex enough that agents perform interesting behavior when evolved on the task and simple enough that the integral components of the task can be easily identified and quantified, and therefore the agent's understanding of those components can be readily measured. The task itself consists of an agent that sits underneath a series of falling blocks. The agent, controlled by the artificial cognitive system, has two sensors, each of width two, that are separated by a gap also of width two, and the agent can move to the right, move to the left, or stay put on any given update (see Figure 5.1). The agent has two outputs: if only the first output is set to 1 the agent moves left, if only the second output is set to 1 the agent moves right, and for all other combinations the agent remains stationary. The blocks that fall towards the agent move at a constant rate either to the left or to the right. The blocks may also be large or small, and the agent must position itself to catch the small blocks and avoid the large blocks. The agent is considered to be catching the block if any part of the agent overlaps with the block. Unless stated otherwise, the small blocks have a width of 2 units, and the large blocks have a width of 4 units. The block drops towards the agent for 32 time steps, and exists in a toroidal world (i.e., where the edges wrap around) of width 16. With the agent starting at point zero by definition, this means that the block may drop from any of 16 different starting positions relative to the agent. By default the agent is tested on all possible permutations of block size, movement direction, and relative starting position, as enumerated in Figure 5.2.

Figure 5.1: A subsection of a single case of the Active Categorical Perception Task. The agent, visualized by the orange box, can move either to the left or right as indicated by the orange arrows. The four sensors, labelled 1, 2, 3, and 4, can each see directly upwards as indicated by the light blue shading. A small block with width 2 is falling towards the agent and moving to the left, as indicated by the green arrow. At the current point in time, sensor 3 is the only sensor that would return a signal, while on the next update, assuming that the agent did not move, both sensors 3 and 4 would return a signal.

The agent is evaluated on the individual trials it sees by determining whether it correctly or incorrectly avoids or catches the block for that trial.
Based on previous work [Marstaller et al., 2013], the fitness for this task is calculated by raising a base value (here 1.10) to the power of the difference between correct (C) and incorrect (I) actions over all trials. This equation for fitness W is formalized in Equation 5.3.

W = 1.10^(C−I)   (5.3)

Figure 5.2: All possible cases in the Active Categorical Perception Task. Each square in the rectangle indicates a specific combination of block size, movement direction, and starting position, referred to as a case, labelled here from 1 to 64 to distinguish their uniqueness. For example, the case labelled 20 is the case where the small block starts at position 3 (relative to the agent at position 0) and falls to the right.

For the purposes of calculating R, the relevant world states, as identified in previous literature (and reaffirmed in Chapter 2), are the size of the block (i.e., whether it is large or small), the position of the block relative to the agent (i.e., whether the block is over the agent or not), and whether the block is falling to the right or to the left. The sensor states used in calculating R are the values input to each of the four sensors at each time step.

5.3.3 Fitness Function Augmentation With R

Recent work has identified a process of modifying the fitness function to encourage the evolution of representation. This process, referred to as R-augmentation, is shown in [Schossau et al., 2015] to improve task performance relative to an unaugmented GA. The change from the original fitness function (Eq. 5.3) is best described as multiplying the original function by a term utilizing R, which approximates a multi-objective function and encourages the simultaneous evolution of R and task performance. This may allow changes in R, and hence improvements to the agent's understanding of its environment, to be selected for ahead of when they are needed to improve performance, which allows for faster evolutionary adaptation (see Ch. 4). This selection for R before it is needed is a form of exaptation, where the increase in R that is originally selected for is later co-opted to increase task performance. The added term is a normalized value between 1 and 2, which allows for a multiplicative boost to the fitness. This value is achieved by adding 1 to an R value normalized to between 0 and 1, where the normalization is performed by dividing R by a task- and network-dependent theoretical maximum value of R, denoted Rmax. This R-augmented fitness function is shown in Eq. 5.4.

W = 1.10^(C−I) · (1 + R/Rmax)   (5.4)

5.3.4 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a standard neural network type commonly used in a variety of applications. While they may take many forms, here I use a fully connected, feed-forward, multi-layer network with genetically encoded weights. The recurrent nodes used are appended to the network's input and output layers, and persist between time steps. More specifically, there is a layer of input and recurrent nodes, one hidden layer of nodes that is reset between updates, and a layer of output and recurrent nodes. The activation function is the standard hyperbolic tangent, applied to a weighted sum of the previous layer's values. The recurrent nodes fulfill the role of internal memory for the calculation of R.

5.3.5 Markov Brains

Markov Brains [Hintze et al., 2017] are sparse, evolvable neural networks commonly used for the study of mental representations in artificial cognitive systems, among other applications.
They consist of an input buffer that receives data from the environment, an output buffer that allows the network to provide data to the environment, a hidden buffer that stores data, and computational gates that read from and write to each of those sets of states. The hidden buffer recurs across time points, allowing for the preservation of memories. The sizes of the input and output buffers are determined by the selected task, and the size of the hidden buffer is determined by the user. For the ACP task, the input buffer is size 4 and the output buffer is size 2. Here, I select a hidden buffer size of 8, in line with previous literature. The gates, genetically encoded, read from a fixed set of 1-4 input or hidden nodes, and write to a set of 1-4 output or hidden nodes. While there exist many different types of gates [Hintze et al., 2019], I only use two in the context of this chapter. The first is the deterministic logic gate, where a binary logic table is encoded that defines the output states provided for any given input. This is the gate type most commonly used in previous work. The second is the ANN gate, which operates as a single hidden layer with functionality identical to the RNN, reading from a set of 1-4 genetically encoded input and hidden nodes as the prior layer, and writing to a set of 1-4 genetically encoded output and hidden nodes as the next layer. I construct Markov Brains using only one gate type or the other: Markov Brains using deterministic gates are referred to simply as Markov Brains, while Markov Brains using ANN gates are referred to as Markov Brain - ANN hybrids. The naming of the Markov Brain - ANN hybrids indicates their functionality, which has been shown in recent work [Albani et al., 2021] to approximate a sparse RNN. For the purposes of the calculation of R, the hidden nodes are considered the internal memory.

5.3.6 Evolutionary Algorithm

All of the evolutionary runs and experiments used in this chapter follow the same pattern described here. I generate each network from a circular genome, with a starting length of 5,000 sites, a maximum length of 20,000 sites, and a minimum length of 2,000 sites. There are three types of mutation allowed on these genomes, namely point, insertion duplication, and deletion. The point mutation changes a site to a different random value, with a per-site probability of 0.005. The insertion duplication mutation, which occurs with a per-site probability of 0.00002, copies a chunk of the genome between 128 and 512 sites long and inserts that chunk at a random point in the genome. The deletion mutation deletes a chunk of the genome between 128 and 512 sites long, with a per-site probability of 0.00002. Genomes and the resulting networks are generated randomly in initial populations of 100 organisms. I apply the fitness function for each individual only after evaluation on the ACP task and the subsequent calculation of R, where the fitness function is either the unaugmented function (Eq. 5.3) or the R-augmented function (Eq. 5.4). I then apply roulette (i.e., fitness-proportional) selection [Thomas, 1996] using the generated fitness values to choose a new population of size 100. I apply the three mutation operations (point, insertion duplication, and deletion) to the new population. I repeat this process for 10,000 generations for each experimental condition.
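The evolutionary loop of Section 5.3.6 can be summarized with the following Python sketch. The evaluate callback, the genome alphabet, and the single-draw treatment of the per-site insertion/deletion probabilities are simplifying assumptions; the actual experiments use the MABE framework rather than this code.

```python
import random

POP_SIZE = 100
GENERATIONS = 10_000
POINT_RATE = 0.005      # per-site point mutation probability
INDEL_RATE = 0.00002    # per-site insertion-duplication / deletion probability

def mutate(genome):
    """Apply point, insertion-duplication, and deletion mutations (simplified)."""
    genome = [random.randrange(256) if random.random() < POINT_RATE else site
              for site in genome]
    # Approximate the per-site indel probabilities with one draw per genome.
    if random.random() < INDEL_RATE * len(genome) and len(genome) < 20_000:
        size = random.randint(128, 512)
        start = random.randrange(max(1, len(genome) - size))
        chunk = genome[start:start + size]
        insert_at = random.randrange(len(genome))
        genome = genome[:insert_at] + chunk + genome[insert_at:]
    if random.random() < INDEL_RATE * len(genome) and len(genome) > 2_000:
        size = random.randint(128, 512)
        start = random.randrange(max(1, len(genome) - size))
        genome = genome[:start] + genome[start + size:]
    return genome

def roulette(population, fitnesses):
    """Fitness-proportional (roulette) selection of a full new population."""
    return random.choices(population, weights=fitnesses, k=len(population))

def evolve(evaluate):
    """evaluate(genome) is a hypothetical callback that builds the brain, runs the
    ACP task, measures R, and returns the (optionally R-augmented) fitness."""
    population = [[random.randrange(256) for _ in range(5_000)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        fitnesses = [evaluate(g) for g in population]
        population = [mutate(g) for g in roulette(population, fitnesses)]
    return population
```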
At the end of evolution, I gather the agents along the line of descent [Lenski et al., 2003], and use the agents at the end of the line of descent for further analysis.

5.3.7 Reduction of Trials

The first major experiment of this chapter is the comparison of a standard GA to a GA using R-augmentation across varying subset sizes of trials provided to the agent. For any given assessment of the ACP, there are 64 individual trials run, although for these experiments the agent is only presented with a random subset of those trials of size n, for selected n ≤ 64. Using n = 64 would be equivalent to the standard evaluation of the ACP Task. As an example, for n = 32, on any given generation a random 32 trials are selected and used to evaluate the population (see Fig. 5.3 for an example of such a subset). For each generation, I randomly re-select a different subset of the same size. This is similar in concept to lexicase selection [Helmuth et al., 2015], as a small random subset of the cases is used, although in contrast to lexicase selection the number of cases used here remains fixed and the selection method is a typical roulette selection. I run these experiments with each of the three brain types (Markov Brain, RNN, and Markov Brain - ANN Hybrid), a range of subset sizes (n = 1, 2, 4, 6, 8, 10, 12, 14, 16, 32, 48, 64), and GAs with and without R-augmentation. I evaluate the different network types to determine the applicability of the method to different cognitive structures. I selected the subset sizes to provide sufficient coverage around the portions of the graph where the results changed most drastically, without being computationally wasteful. For each experimental condition, I run 200 replicate experiments with the other evolutionary parameters as described in Section 5.3.6. At the end of evolution, I evaluate the evolved networks on the entire test case set (i.e., all 64 cases). For the experimental conditions using Markov Brains, I also record the time of execution for the entire program when it is returned, although not all experimental runs return a notification. All experimental runs were conducted on the Michigan State University High Performance Computing cluster on the Intel-16 architecture.

Figure 5.3: A possible evaluative subset for the Reduction of Trials experiments with n = 32. Here, the 32 cases marked with an 'E' would be evaluated for correctness and used to generate the fitness and measure R. Note that this is only one possible random selection, and any random selection is equally likely in practice.
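As a small illustration of the reduction-of-trials procedure from Section 5.3.7 above, the sketch below draws a fresh random subset of the 64 ACP cases each generation and counts correct and incorrect decisions only over that subset. The helper names and the decision encoding are hypothetical.

```python
import random

N_CASES = 64   # all permutations of block size, drop direction, and starting position

def sample_cases(n, rng=random):
    """Pick the n ACP case indices (0..63) used for this generation's evaluation."""
    return rng.sample(range(N_CASES), n)

def score_subset(agent_decisions, subset):
    """agent_decisions: hypothetical mapping case index -> True (correct) / False.
    Returns the (C, I) counts over the evaluated subset only."""
    correct = sum(1 for case in subset if agent_decisions[case])
    incorrect = len(subset) - correct
    return correct, incorrect

# Each generation draws a new subset of the same size, e.g. n = 16:
subset = sample_cases(16)
```

The resulting (C, I) counts then feed into Eq. 5.3 or Eq. 5.4 exactly as in the full 64-case evaluation; only the number of cases contributing to the counts changes.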
5.3.8 Generalization

As a natural consequence of the primary experiment, where agents are expected to evolve to perform well on the entire task set despite only seeing a subset at any given time, I must also consider how this affects generalization. Generalization refers to the ability of a network to perform correctly on conditions that it has not seen before. There are two main forms of generalization: in-domain generalization and out-domain generalization. The difference between these forms lies in the similarity of the new testing data to the data the network was trained or evolved under. When the new data is more similar to or comes from the same distribution as the training data, I consider it to be in-domain generalization. On the other hand, if the data is dissimilar to or comes from a different distribution than the training data, it is considered to be out-domain generalization.

5.3.8.1 In-Domain Generalization

For the purposes of this chapter, I consider in-domain generalization to be the case where the task parameters, in particular block size, are held constant. To enable the evaluation of generalization within the ACP task, I withhold a subset of the cases to serve as a testing set, and then use the remainder for fitness evaluation during evolution. As an extensive, although perhaps not comprehensive, assessment of in-domain generalization, I examine four different withholding schemes that determine which cases are part of the withheld subset. The first and simplest scheme is the Random Scheme, where 16 cases are selected at random from the 64 permutations or cases of the ACP task (see Fig. 5.4 for a visualization). I choose a different selection of cases for each experimental run, but this set remains constant over evolutionary time. For the Starting Position Spaced Scheme I withhold cases for 4 of the possible block starting positions, with those positions evenly spaced across the 16 possible positions (see Fig. 5.5). Next is the Starting Position Grouped Scheme, where again I withhold cases for 4 of the block starting positions, but the starting positions are grouped together in one chunk (see Fig. 5.6). Lastly, the One Block / Direction Scheme is where I withhold all of the cases for one block size (e.g., small) moving in one direction (e.g., to the left; see Fig. 5.7). For each of these four in-domain generalization schemes, I construct experiments where I use a standard GA or an R-augmented GA, and also use varying subsets of the non-withheld cases for evolution in a manner identical to the reduction of trials experiments, although here I use n = 3, 8, 13, 18, 23, 28, 33, 38, 43, 48. For each experimental combination, 200 replicate experiments are run with the other evolutionary parameters as described in Section 5.3.6. At the end of evolution, the networks created are evaluated on the withheld test set (unique to each scheme).

Figure 5.4: Example Cases in Withheld subset under the Random Scheme. Each square in the rectangle indicates a specific combination of block size, movement direction, and starting position referred to as a case. The 16 cases marked with a 'W' are withheld for testing, while the remaining 48 unmarked cases are used for evaluation during evolution. The cases withheld under this scheme are 16 randomly selected cases.

Figure 5.5: Example Cases in Withheld subset under the Starting Position Spaced Scheme. Each square in the rectangle indicates a specific combination of block size, movement direction, and starting position referred to as a case. The 16 cases marked with a 'W' are withheld for testing, while the remaining 48 unmarked cases are used for evaluation during evolution. The cases withheld under this scheme are all of the cases for 4 different starting positions, where the starting positions are evenly spaced across the range of possible starting positions.

Figure 5.6: Example Cases in Withheld subset under the Starting Position Grouped Scheme. Each square in the rectangle indicates a specific combination of block size, movement direction, and starting position referred to as a case. The 16 cases marked with a 'W' are withheld for testing, while the remaining 48 unmarked cases are used for evaluation during evolution. The cases withheld under this scheme are all of the cases for 4 different starting positions, where the starting positions are grouped together.

Figure 5.7: Example Cases in Withheld subset under the One Block / Direction Scheme. Each square in the rectangle indicates a specific combination of block size, movement direction, and starting position referred to as a case. The 16 cases marked with a 'W' are withheld for testing, while the remaining 48 unmarked cases are used for evaluation during evolution. The cases withheld under this scheme are all of the cases for 1 block size and movement direction combination.

5.3.8.2 Out-Domain Generalization

In order to study the capacity of networks for out-domain generalization, I turn to a straightforward modification of the task: manipulation of the block sizes that are present. More specifically, I evolve the networks on one set of large and small blocks (e.g., width 2 and width 6), and then see if they can generalize to another set of blocks (e.g., width 3 and width 5). I focus on a subset of the possible combinations shown in Chapter 2, selected to investigate those combinations without a previously known bias towards R-augmentation, so that I can isolate any effect on out-domain generalization caused by the reduction of trials. These combinations are where I evolve the networks on widths 2 and 6 and test their ability to generalize to widths 3 and 5, and where I evolve the networks on widths 3 and 5 and test their ability to generalize to widths 2 and 6, both using the standard agent setup as seen in [Marstaller et al., 2013]. For each of these two out-domain generalization setups, I construct experiments where I use a standard GA or an R-augmented GA, and also use varying subsets of the test cases for evolution in a manner identical to the reduction of trials experiments. For each experimental condition, I run 200 replicate experiments with the other evolutionary parameters as described in Section 5.3.6. At the end of evolution, I evaluate the evolved networks on the 64 cases of the other block pairing that was not used during evolution.

5.4 Results and Discussion

First, in order to evaluate the feasibility of the method for different network structures, I examine the results across decreasing sample size for the 3 major brain types: Markov Brain with Deterministic Logic Gates (Figure 5.8), RNN (Figure 5.9), and the Markov Brain - ANN Hybrid (Figure 5.10). Each of these shows the average performance tested at the end of evolution across the entire test set, regardless of the subset size n used, evolved either with the R-augmented GA or with the unaugmented GA. I observe that for both the standard Markov Brain and the Markov Brain with ANN gates, R-augmentation produces networks that perform at least as well as or better than those produced by an unaugmented GA, even when a smaller subset is used for the augmented GA. This result is seen most clearly in Markov Brains, where networks created by the augmented GA for n = 16 achieve an average tested task performance across the whole set (92.4% correct) that is similar to what the networks evolved using an unaugmented GA achieve at n = 32 (93.1% correct). That is, when using R-augmentation, one can potentially attain the same level of performance while only using a quarter of the test cases to measure performance. A similar increase in performance at low values of n can be seen for the Markov Brain - ANN hybrid.

Figure 5.8: Performance at end of evolution under decreasing sample size for Markov Brain with Deterministic Logic Gates.
The red line is the average performance of Markov Brains evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of Markov Brains evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes used were n = 1, 2, 4, 6, 8, 10, 12, 14, 16, 32, 48, 64.

Figure 5.9: Performance at end of evolution under decreasing sample size for RNN. The red line is the average performance of RNNs evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of RNNs evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes used were n = 1, 2, 4, 6, 8, 10, 12, 14, 16, 32, 48, 64.

Figure 5.10: Performance at end of evolution under decreasing sample size for Markov Brain with ANN Gates. The red line is the average performance of Markov Brain - ANN hybrids evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of Markov Brain - ANN hybrids evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes used were n = 1, 2, 4, 6, 8, 10, 12, 14, 16, 32, 48, 64.

The RNN does not see the same performance boost from the R-augmentation process, regardless of sample size, but instead performs at approximately the same level when optimized by either the augmented or unaugmented GA. A notable difference is that neither GA optimizes RNNs to a task performance comparable to the other structures. Both the Markov Brain and the Markov Brain - ANN hybrid achieve close to 90% correct or better, whereas the RNN reaches a maximum of less than 85% at its peak. This difficulty in achieving peak performance, as opposed to the other cognitive substrates, may explain the different response to R-augmentation. Most of the difference in peak performance can likely be explained by the relative structure of the RNN versus the Markov Brain and Markov Brain - ANN hybrid. RNNs are fully connected, as opposed to the sparse network connections of the Markov Brain and the Markov Brain - ANN hybrid. Previous work (e.g., [Hintze et al., 2018, Kirkpatrick and Hintze, 2019] and Ch. 3) has demonstrated that fully connected networks evolve differently from sparse networks. More specifically, fully connected networks have preexisting information flow due to their connectivity and must reduce their connections to form useful representations. On the other hand, sparse networks need to evolve connections to create information flow and form representations. In this light, the benefits of R-augmentation should, and do, have a greater effect on the sparse network types. Another point of interest is that the RNN achieves peak performance over the entire set when it is only evaluated on between an eighth and a quarter of the whole set. Given that there are more connections and weights to evolve in the RNN as opposed to its sparsely connected relatives, it seems likely that the fully connected network struggles to produce good behavior on all of the cases at once.
However, when evaluated on only a subset of the cases, the network is able to produce more meaningful results, and evolutionary pressures are more effective in directing networks towards high performance.

I will now examine the computational time required to evolve with or without R-augmentation. For any given evolutionary experiment, using R-augmentation incurs some cost due to the calculation of R. I validate this by collecting the total execution time from the experimental replicate runs for the Markov Brains. Notifications were returned containing execution time data for approximately two-thirds of the experiments (3,267 of 4,800 experimental runs), distributed across the 24 conditions. The exact number of data points recorded for each condition is shown in Table 5.1; at least 110 execution times are averaged for each condition. The averages of these times for each condition are shown in Figure 5.11. Note that for all subset sizes, the R-augmented GA requires more computational time than the unaugmented GA. However, given the reduction of trials shown in Figures 5.8 and 5.10, the networks created by an R-augmented GA can still outperform those created by an unaugmented GA in less computational time. For example, the augmented GA ran for an average of approximately 1800 seconds with n = 16, while the unaugmented GA ran for an average of approximately 2370 seconds for n = 32, while achieving comparable performance. As such, the extra cost of calculating R may be outweighed by the reduction in trials; that is, an R-augmented GA may be able to evolve higher-performing networks in less computational time than an unaugmented GA because the R-augmented GA can use fewer cases during evolution.

Figure 5.11: Average Execution Time Reported under decreasing sample size for Markov Brain with Deterministic Logic Gates, with and without R-augmentation. The red line is the average execution time of runs using a GA augmented with R, while the shaded area in red is the corresponding standard error of those times. The blue line is the average execution time of runs using an unaugmented GA, and the shaded blue area is the corresponding standard error of those times. The sample sizes used were n = 1, 2, 4, 6, 8, 10, 12, 14, 16, 32, 48, 64.

Subset Size | R-Augmented | Unaugmented
1           | 148         | 155
2           | 143         | 146
4           | 125         | 146
6           | 113         | 126
8           | 137         | 128
10          | 140         | 132
12          | 140         | 133
14          | 111         | 150
16          | 115         | 127
32          | 146         | 155
48          | 152         | 142
64          | 122         | 135

Table 5.1: Number of cases reporting execution times for each condition. Conditions were determined by subset size (n = 1, 2, 4, 6, 8, 10, 12, 14, 16, 32, 48, 64) and GA type (with and without R-augmentation). Each condition had 200 replicate experiments run; however, notifications were not received for every replicate.

The remainder of the results presented here focuses on validating the performance of Markov Brains evolved under varying subset sizes, primarily focusing on the ability of the agent to generalize to different data as previously studied in Chapter 2. I expand upon that work to include a more thorough investigation of in-domain vs. out-domain generalization, whereas previous work (Ch. 2) focused only on out-domain generalization.
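To make the withholding schemes concrete before walking through them, here is a minimal sketch (not the MABE implementation used for the experiments) of how the 64 ACP cases might be enumerated and split. The representation of a case as a (block size, fall direction, starting position) tuple and the widths 2 and 6 for the standard pairing are assumptions for illustration; withholding one block/direction combination yields the 16 'W' cases and 48 training cases described in 5.3.8.1.

# Minimal sketch of the block/direction withholding split for the ACP task.
# A case is assumed to be (block_size, direction, start_position); the 64
# cases come from 2 sizes x 2 directions x 16 starting positions.
from itertools import product

BLOCK_SIZES = (2, 6)          # assumed small/large widths for this pairing
DIRECTIONS = (-1, +1)         # -1 = falling leftward, +1 = falling rightward
START_POSITIONS = range(16)   # relative starting offsets of the block

def split_cases(withheld_size, withheld_dir):
    """Return (training_cases, withheld_cases) for one withholding scheme."""
    all_cases = list(product(BLOCK_SIZES, DIRECTIONS, START_POSITIONS))
    withheld = [c for c in all_cases
                if c[0] == withheld_size and c[1] == withheld_dir]
    training = [c for c in all_cases if c not in withheld]
    return training, withheld

train, test = split_cases(withheld_size=2, withheld_dir=-1)
assert len(train) == 48 and len(test) == 16

The other in-domain patterns discussed next differ only in which predicate marks a case as withheld (a random subset, or a set of starting positions that is either evenly spaced or contiguous).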
First, I focus on the four versions of in-domain generalization: withholding a random subset of the cases, withholding all of one block/direction combination, withholding a block of starting positions, and withholding a distributed sample of starting positions.

The generalization results of the first pattern of in-domain withholding, random subset withholding, are seen in Figure 5.12. The data shown is the average performance on the withheld subset at end of evolution; in this case, the subset is randomly selected in each trial, with the caveat that the withheld subset is kept constant over evolutionary time. The results seen here track with the pattern seen in the first three reduction of trials results. That is, networks generated by an R-augmented GA typically perform better on average during the final analysis (in this case, on the withheld subset). The performance difference is again particularly noticeable when n < 20.

Figure 5.12: Performance tested on withheld group at end of evolution under decreasing sample size with random subset removed. The red line is the average performance of Markov Brains evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of Markov Brains evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes tested were n = 3, 8, 13, 18, 23, 28, 33, 38, 43, 48.

The second of the in-domain withholding methods, namely withholding all of one block/direction combination, is shown in Figure 5.13. Unlike the previous in-domain scheme, this withholding pattern prevents the cognitive system from learning anything about the block-direction pair, which seems to serve as a major impediment to generalizing to that withheld set. Networks produced by both R-augmented and unaugmented GAs suffer the same impediment, and their ability to generalize in this case seems near-identical.

Figure 5.13: Performance tested on withheld group at end of evolution under decreasing sample size with all of 1 block type and direction withheld. The red line is the average performance of Markov Brains evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of Markov Brains evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes tested were n = 3, 8, 13, 18, 23, 28, 33, 38, 43, 48.

Figure 5.14 shows the performance of Markov Brains tested on a withheld subset where the withheld cases have the same starting positions, and the starting positions are evenly distributed. In other words, given that I withhold 4 starting positions, each withheld position is 4 units away from the next. For this withholding pattern, I see no notable difference between the ability to generalize of Markov Brains evolved using an R-augmented GA and those evolved using a standard GA. This pattern is notable because the agents have the highest generalization performance as compared to the other withholding patterns.

Figure 5.14: Performance tested on withheld group at end of evolution under decreasing sample size with all of 4 different start locations (evenly spaced) withheld. The red line is the average performance of Markov Brains evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of Markov Brains evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes tested were n = 3, 8, 13, 18, 23, 28, 33, 38, 43, 48.
The last withholding pattern that I consider is where all of a group of starting positions are withheld, with the difference that the starting positions form a contiguous block. The performances measured on the withheld set, as shown in Figure 5.15, suggest that Markov Brains evolved using R-augmentation produce a higher average performance than those evolved with an unaugmented GA. This is particularly noticeable for n < 18, where the difference is often 5% or more.

Figure 5.15: Performance tested on withheld group at end of evolution under decreasing sample size with all of 4 different start locations (grouped together) removed. The red line is the average performance of Markov Brains evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of Markov Brains evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes tested were n = 3, 8, 13, 18, 23, 28, 33, 38, 43, 48.

In summary, the in-domain generalization capabilities of networks produced either with or without R-augmentation seem largely identical. In the few cases where there is a notable difference, particularly at lower sample size, the mean performance of the networks produced by R-augmented GAs is higher than that of networks produced by unaugmented GAs (see Fig. 5.12 and 5.15). These results are consistent with the results in Chapter 2, where R-augmented GAs produce networks that generalize at least as well as or better than networks produced by unaugmented GAs. Interestingly, the one withholding pattern which prevents agents from seeing all of a block/direction pair, and is therefore closest to out-domain generalization, leads to very low performance measured on the withheld set. In either case, the ability of Markov Brains to generalize in-domain is not negatively affected by the R-augmentation process.

Finally, I look at the ability of networks evolved with different subset sizes to generalize out-domain. I evolved Markov Brains using GAs with and without R-augmentation, for a small block size of 2 and a large block size of 6, and also a small block size of 3 and a large block size of 5. At the end of evolution, the resulting cognitive systems were tested on the opposite block pairing (i.e., the networks evolved with the 2,6 pairing were tested on 3,5 and vice-versa). The results of these tests on the opposite sizing are visualized in Figures 5.16 and 5.17 for the 2,6 to 3,5 and 3,5 to 2,6 combinations, respectively. The mean difference between the performance of Markov Brains evolved using unaugmented and R-augmented GAs is less than 3% in almost all cases.
As such, I conclude that the out-domain ability of Markov Brains to generalize to unseen data is not negatively affected by R-augmentation.

Figure 5.16: Performance of Markov Brains tested on blocks 3 and 5 at end of evolution under decreasing sample size when evolved on blocks 2 and 6. The red line is the average performance of Markov Brains evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of Markov Brains evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes used were n = 1, 2, 4, 6, 8, 10, 12, 14, 16, 32, 48, 64.

Figure 5.17: Performance tested on blocks 2 and 6 at end of evolution under decreasing sample size when evolved on blocks 3 and 5. The red line is the average performance of Markov Brains evolved using a GA augmented with R, while the shaded area in red is the corresponding standard error. The blue line is the average performance of Markov Brains evolved using an unaugmented GA, and the shaded blue area is the corresponding standard error. The sample sizes used were n = 1, 2, 4, 6, 8, 10, 12, 14, 16, 32, 48, 64.

5.5 Conclusion

The results presented in this chapter offer a tantalizing opportunity for the application of R-augmentation. The original promise of R-augmentation relied on the fact that even though there was a computational cost to calculating R, the improved performance benefits could still outweigh that extra computational cost. With this work, I demonstrate that it may be possible to have your cake and eat it, too: you can evolve with R-augmentation, and achieve comparable performance while minimizing execution time by reducing the number of trials during evolution. Furthermore, this method of reducing computational time does not significantly impede the cognitive system's ability to generalize to other problems, either in-domain or out-domain. In fact, in some cases, the increased value of R (and the corresponding increase in mental representation) caused by R-augmentation may allow for a better understanding of the problem, resulting in an increased ability to generalize.

Although this research was conducted primarily on a simple task, using evolutionary adaptation, I see potential applications that reach far beyond this original remit. For more complex tasks, particularly those with simulated physics governing object behavior or where 3-dimensional vision is used, the computational run-time of the task itself can rise exponentially. For such a task, where assessment can take hours or days, a case reduction of one half would allow for longer or more replicate experiments, assuming that the relevant environmental states can be encoded efficiently. This work also hints at the applicability of R-augmentation to other optimization methods, particularly machine learning. Given that for some tasks (e.g., image segmentation), the network only sees a portion or subset of the whole image and is expected to understand broader context, I see an immediate parallel. R-augmentation applied to machine or deep learning models would take the form of a modification to the loss function instead of the fitness function, and could reduce the number of epochs needed to achieve high task performance.
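As a purely illustrative sketch of that idea (not an implementation used in this dissertation), the multiplicative bonus of R-augmentation, (1 + R/R_max), could be mirrored as a discount on a training loss. The r_estimate argument below is hypothetical: it stands in for some estimator of R over the model's recurrent hidden states, inputs, and labels, which this sketch does not provide.

# A purely illustrative sketch of mirroring the R-augmentation bonus as a loss
# modification for gradient-based training. task_loss is the ordinary training
# loss; r_estimate is assumed to come from a hypothetical estimator of R.
def r_augmented_loss(task_loss, r_estimate, r_max):
    """Scale the task loss so that higher measured representation lowers it."""
    clipped_r = max(0.0, min(r_estimate, r_max))
    bonus = 1.0 + clipped_r / r_max      # in [1, 2], as in the fitness bonus
    return task_loss / bonus             # the bonus becomes a loss discount

# A batch loss of 0.80 with R at half of its assumed maximum becomes ~0.53.
print(r_augmented_loss(0.80, r_estimate=1.0, r_max=2.0))

Whether dividing the loss, subtracting a scaled R term, or some other coupling works best in practice is exactly the kind of open question raised below.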
Regardless of the direction in which the methods developed here are applied, there remain a number of open questions to be solved, ranging from how best to apply these methods to machine learning to understanding why the R-augmentation process does not work as effectively in fully connected RNNs. While these results are promising, further refinement will be needed to allow for wider application and use. The results and methods here provide a new starting point for the efficient creation of more capable cognitive systems.

Chapter 6

Using MAP-Elites to direct the evolution of desired neural characteristics

This chapter was originally published as "Douglas Kirkpatrick, & Arend Hintze (2021, July). Using MAP-Elites to direct the evolution of desired neural characteristics. In Proceedings of the ALIFE 2021: The 2021 Conference on Artificial Life. ALIFE 2021: The 2021 Conference on Artificial Life (pp. 113). ASME." Used with permission. Several additions and edits have been made, as well as changes to fit the tone and style of this dissertation.

6.1 Abstract

Artificial cognitive systems (e.g., artificial neural networks) have taken an ever more present role in the modern world, providing enhancements to everyday life in our cars, in our phones, and on the internet. In order to produce systems more capable of achieving their designated tasks, previous work has sought to direct the evolution of networks using a process referred to as R-augmentation. This process selects for the maximisation of an information-theoretic measure of the agent's stored understanding of the environment, or its representation (R), in addition to the task performance. This method was shown to induce increased task performance in a shorter amount of evolutionary time compared to a standard genetic algorithm. Extensions of this work have looked at how R-augmentation affects the distribution of representations across the neurons of the brain "tissue" or nodes of the network, referred to as smearedness (S). Here I seek to improve upon the prior methods by moving beyond the simple maximization used in the original augmentation formula, using the MAP-Elites algorithm to identify intermediate target values to optimize towards. I also examine the feasibility of using MAP-Elites itself as an optimization method as opposed to the traditional selection methods used with R-augmentation, to mixed success. These methods will allow us to shape how the network evolves, and produce better-performing artificial cognitive systems.

6.2 Introduction

While recent interest, funding, and computational breakthroughs have driven artificial intelligence to new heights in both capability and widespread use, there is still a lack of understanding of how best to shape the behavior, structure, and performance of artificial cognitive systems to reflect modern needs. In order to broadly adapt artificial cognitive systems to a wide range of tasks and problems, we should focus our efforts on generating cognitive systems that understand the tasks that they are presented with. For guidance on how to generate artificial cognitive systems that possess such understanding, I seek inspiration from beyond computer science. In the fields of philosophy, psychology, and cognitive science, for example, researchers have stressed the importance of mental representations. In these fields, representations are the internal imagery of objects that are not currently perceived by the sensory organs.
According to cognitive scientists (and many philosophers, e.g., [Phillips and Singer, 1997, Pinker, 2013]), these internal models of the world are essential for intelligent decision-making, and form the very foundation of human intelligence [Collins and Gentner, 1987]. While the importance of mental representations for Artificial Intelligence (AI) has been clear from the outset, it has proven difficult to program mental representations into computational substrates, leading proponents at the forefront of AI research to search for "Intelligence without Representation" [Brooks, 1991], an approach that has largely failed to generate truly intelligent systems. Instead, modern AI research has focused on Convolutional Neural Networks (CNNs), which appear to be powerful classifiers of complex contextual visual scenes [LeCun et al., 1998]. However, many of the researchers that were spearheading the "Deep Learning" revolution now concede that those networks are vulnerable to so-called "adversarial perturbations" [Jo and Bengio, 2018], and are unlikely to represent a step towards artificial general intelligence, in part because they do not have robust mental models.

Recent research has shown that instead of directly programming mental representations, it is possible to produce them by evolving artificial brains that control agents that act and behave in complex simulated environments [Edlund et al., 2011, Marstaller et al., 2013, Albantakis et al., 2014, Tehrani-Saleh et al., 2018, Olson et al., 2016, Tehrani-Saleh et al., 2019]. Within these digital brains (Markov Brains, see [Hintze et al., 2017]) mental representations are used to make decisions in conjunction with (and sometimes even without) reference to sensory information. The ability to mathematically define representations in terms of information theory [Marstaller et al., 2013] has made it possible to monitor the evolution of representations quantitatively, and to measure what is being represented in the brain, even when those representations are to be found in subsets of neurons. As it turns out, the relative structure of representations (i.e., how localized or distributed representations are) can be quantified as well. [Hintze et al., 2018] introduced a measure of smearedness (S) that quantifies how much representations are localized within a single neuron (or a small group of neurons), or conversely how much representations are smeared out over large groups of neurons. In previous work I showed that how smeared-out representations are depends on the type of cognitive system (see Chapter 3), and additionally that this smearedness can affect the robustness of the computation [Kirkpatrick and Hintze, 2019].

Furthermore, previous work shows that rewarding representations (referred to as R) explicitly (along with other aspects of fitness) in a Genetic Algorithm (GA) leads to significantly better performance, because "augmenting evolution with R" encourages the evolution of AIs with robust mental models (see [Schossau et al., 2015] or Chapter 2). However, many open questions remain. Is there an optimal balance between rewarding R and rewarding fitness? Should representations be modular (different concepts represented in different areas of the brain) or should they be non-local, "smeared out" over the brain, as it were? Here I focus on the role and impact of smearedness and how it can be used to augment genetic search. One might assume that, like R-augmentation, increasing smearedness is also advantageous.
I will show that this is not the case, and that trying to minimize it is equally useless. Viewed in the context of the neuro-correlate maximization introduced by [Schossau et al., 2015], the smearedness S will not help us to improve the performance of the GA. However, I will show that integrating the neuro-correlates S and R with MAP-Elites optimization [Mouret and Clune, 2015] produces a novel, useful method to investigate this issue.

MAP-Elites was originally designed to improve performance by maintaining a diverse population of solutions. MAP-Elites optimization does not allow all solutions in the population to compete against each other, but instead creates a grid of solutions. The grid is superimposed on a multidimensional parameter space, where each parameter describes a set of solutions binned by their value on that parameter. The best performance for each combination of parameter values is stored, creating a "Map of Elites". For example, instead of just evolving a predator for hunting success, one could also measure predator size and weight. When evaluating a predator's performance, the size and weight of the predator are also measured and used to determine the bin on the grid into which it is categorized. This predator is then not competing against the rest of the population, but only against a previous solution with similar size and weight as defined by the grid. When applying the MAP-Elites method using R and S as the dimensions for the grid, the search of the GA can be improved. Here I find that, perhaps as expected, the optimal solutions have an intermediary S rather than a maximal or minimal S. However, surprisingly, those best solutions do not maximize R but are found to have intermediary values of R as well. This shows that, while maximizing or minimizing S does not improve the search of a GA, S is still a useful neurocorrelate when applying MAP-Elites selection. Further, using S in conjunction with R in MAP-Elites outperforms R-augmentation in some cases.

6.3 Methods

6.3.1 Active Categorical Perception Task

The Active Categorical Perception Task (ACP) is the preeminent task used for the evolution of artificial agents (especially Markov Brains) for the assessment of mental representations and R [Marstaller et al., 2013, Schossau et al., 2015, Hintze et al., 2018]. In this task, an agent equipped with a sensor made from four sensor neurons arranged in two groups of two (with a blind spot in between) must either catch or avoid blocks falling diagonally at a constant rate, depending on the size of the block (see Figure 6.1). Blocks of varying width (here, width 2 or width 4) are dropped one at a time towards the agent on a toroidal surface of height 34 units and width 16 units, with the blocks moving by 1 unit to the left or to the right as they fall 1 unit down towards the agent. The agent must identify the size and direction of the falling block, and position itself to either be under (i.e., catch) the small blocks or not under (i.e., avoid) the large blocks. As the agent cannot see the whole block at one point in time due to the blind spot, it must integrate information from successive timesteps in order to decide which actions to take. Given the sizes of the blocks, the movement direction, and 16 different starting positions relative to the agent, there are 64 different conditions on which the agent is comprehensively evaluated.
Agents are assigned fitness based on the number of correct (C) and incorrect (I) catch/avoid decisions that they make, using an exponential fitness function (see Equation 6.1).

W = 1.1^(C−I)    (6.1)

Figure 6.1: A subsection of a single case of the Active Categorical Perception Task. The agent, visualized by the orange box, can move either to the left or right as indicated by the orange arrows. The four sensors, labelled 1, 2, 3, and 4, can each see directly upwards as indicated by the light blue shading. A small block with width 2 is falling towards the agent and moving to the left, as indicated by the green arrow. At the current point in time, sensor 3 is the only sensor that would return a signal, while on the next update, assuming that the agent did not move, both sensors 3 and 4 would return a signal.

6.3.2 Numerical Comparison Task

The Numerical Comparison Task (NCT) is derived from real-world psychological [Merritt and Brannon, 2013] and biological [Nieder, 2018, Merritt et al., 2009] experiments, with demonstrated utility in measuring R experimentally in silico [Kirkpatrick and Hintze, 2019]. In short, the agent receives one number and waits for a number of updates, over which the agent is expected to remember the first number. After that time period the agent is given a second number, and again waits for the same number of updates, after which the agent must report which of the numbers was larger. In silico, the agent has a number of inputs, here 5, which are set to be either 0 or 1. The agent must keep track of how many of these inputs are provided as 1s (i.e., remember the number of 1s). There are a number of intermediary brain updates (here 3) after which the inputs are reset to have a different number of 1s. After a second intermediary period of brain updates (of length equal to the first period), the agent must provide an output with a value of 1 if the second input was larger than the first value, and a 0 otherwise. During the intermediary periods, the agent may be provided with a constant signal (all 1s or 0s) or the input bits may be assigned randomly. For the experiments described herein, the agent was provided with random noise during the intermediary timesteps, with each bit being randomly assigned with probability p = 0.5. The agents are tested over all possible permutations of input variations (i.e., the value of 2 could be input as 01100 or 11000, etc.) as well as permutations of relative number comparisons where unequal values are compared (i.e., 1 vs. 2, 2 vs. 1, 2 vs. 5, etc.). I show an example of an agent being tested on a single permutation in Figure 6.2. The NCT is structured so that agents must receive information at two points in time, and then make a decision about them at a third time point, requiring information integration to occur between all three time points. Similarly to the ACP, agents are given fitness exponentially relative to the number of correct (C) and incorrect (I) decisions they make regarding which number is larger, again using the fitness function from Equation 6.1.
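The exponential scoring of Equation 6.1 is straightforward to make concrete; the case counts below are only an assumed example of an agent's tally of correct and incorrect decisions.

# A minimal sketch of the exponential fitness function in Equation 6.1, used
# for both the ACP and the NCT: C and I are the counts of correct and
# incorrect decisions an agent makes over all evaluated cases.
def fitness(correct, incorrect, base=1.1):
    return base ** (correct - incorrect)

# Assumed example: an agent that decides 60 of 64 ACP cases correctly.
print(fitness(correct=60, incorrect=4))   # 1.1 ** 56, roughly 208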
6.3.3 Markov Brains

Markov Brains are a sparsely-connected evolvable neural network structure. These networks interact with their environment through sensory and motor neurons (also called nodes), which can receive input from and provide output to their environment, respectively. The number of input and output neurons in each Markov Brain is task-dependent. Here I use 4 input neurons and 2 output neurons for the Active Categorical Perception Task, and 5 input neurons and 1 output neuron for the Numerical Comparison Task. In addition, there are a variable number of hidden nodes, here 8 for both tasks, which are only accessible internally and are recurrent to allow for the persistence of information and the formation of memory. These nodes are read from or written to by modular computational units that I simply call "gates". While a wide variety of gate types are available (e.g., single layer ANNs, Genetic Programming functions, etc. - see [Hintze et al., 2017] for examples), here I only use the deterministic and probabilistic logic gates to connect neurons.

Figure 6.2: A single case of the Numerical Comparison Task. The agent receives a series of inputs over a period of 9 brain updates. The input values are shown in the squares under "Sensors", where each square represents a unique sensor input, and correlate with brain updates listed under "Time". The description of each sensor input is likewise listed under "Description". The agent receives a first number, here 3 (note that 3 of the sensors have a 1 as input), and then has 3 intermediate updates where noise is fed to each sensor with probability p = 0.5. The agent then receives a second number, here 4 (note likewise that 4 of the inputs are 1), and then a second period of 3 intermediate updates where noise is fed to the sensors with the same probability p. After the second set of noise inputs, the agent receives an output signal of all '0' inputs, and is expected to output a 1 because the second number is larger than the first.

Figure 6.3: The information-theoretic Venn diagram describing how R is calculated. R is defined as the information shared between the brain's memory and the salient features of the environment or task given the state of the sensors.

6.3.4 Mental Representations and Neuro-correlates

In order to quantify what the artificial cognitive system knows about its environment, I turn to information theory. Previous work has demonstrated the utility of two different measures: R or representation, which measures the amount of information about the world that the system has integrated, and S or smearedness, which measures the distribution of that information across the hidden state space or memory of the system.

6.3.4.1 Representations (R)

[Marstaller et al., 2013] identify an information-theoretic measure, called R, that quantifies the amount of mental representation that an agent has stored about its environment. The mental representation is defined to be the information shared between the agent's recurrent memory and the environmental states, given the information currently coming from the agent's sensors. R can be visualized in the information-theoretic Venn diagram, Fig. 6.3. In practice, I use Equation 6.2 to calculate R, which was introduced in previous work (see Chapter 3).

R = H(S, B) + H(S, E) − H(S) − H(E, B, S)    (6.2)

In Eq. 6.2, B refers to the hidden state or memory of the brain, E refers to the salient features of the environment or task, and S refers to the information coming from the sensors.
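As a minimal sketch of Equation 6.2 (assuming discrete sensor, brain, and environment states have already been recorded at matching time steps, and ignoring the pseudo-count corrections discussed in Chapter 1), R can be computed directly from empirical state counts:

# A minimal sketch of Equation 6.2 from empirical state counts. Each argument
# is a list of discrete (hashable) states recorded at the same time steps:
# sensor states S, brain (hidden/memory) states B, and environment states E.
from collections import Counter
from math import log2

def entropy(observations):
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def representation(S, B, E):
    H_SB  = entropy(list(zip(S, B)))
    H_SE  = entropy(list(zip(S, E)))
    H_S   = entropy(S)
    H_SBE = entropy(list(zip(S, B, E)))
    return H_SB + H_SE - H_S - H_SBE   # R = H(S,B) + H(S,E) - H(S) - H(E,B,S)

# Toy usage: the brain state mirrors the environment while the sensors are
# uninformative, giving exactly 1 bit of representation.
S = [0, 0, 0, 0]; B = [0, 1, 0, 1]; E = [0, 1, 0, 1]
print(representation(S, B, E))   # 1.0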
6.3.4.2 Smearedness (S)

I am not only interested in the amount of representation measured, but also investigate the structure of said representations by considering an additional information-theoretic measure, smearedness or S [Hintze et al., 2018], to quantify the distribution of mental representations across the hidden states of the artificial cognitive system. First, observe that the memory state B can be deconstructed into individual neurons B_i, where B = B_1 B_2 ... B_n. Similarly, the environmental state E can be deconstructed into individual concepts E_j, where E = E_1 E_2 ... E_m. S is calculated by first recording the 'atomic R' or R_ij, which is the information shared between individual brain states B_i and individual concepts E_j from the task, given the entire state of the sensors S. I then construct the matrix M where M_ij = R_ij = H(B_i : E_j | S). I then take a pairwise minimum over all concepts i and all pairs of nodes j and k to calculate S (see Equation 6.3).

S = S_C = Σ_i Σ_{j>k} min(M_ji, M_ki)    (6.3)

This can alternatively be done pairwise over nodes and concepts; however, due to the high correlation between the two different calculations [Kirkpatrick and Hintze, 2019], I only use the first as S in the context of this paper.
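To make Equation 6.3 concrete, here is a minimal sketch assuming the matrix M of atomic R values has already been computed, with rows indexed by node and columns by concept (so M[j][i] stands for M_ji above); the example matrix is purely illustrative.

# A minimal sketch of the smearedness sum in Equation 6.3.
def smearedness(M):
    n_nodes = len(M)
    n_concepts = len(M[0]) if n_nodes else 0
    total = 0.0
    for i in range(n_concepts):          # sum over concepts i
        for j in range(n_nodes):         # and over node pairs with j > k
            for k in range(j):
                total += min(M[j][i], M[k][i])
    return total

# Illustrative matrix: concept 0 is represented in two nodes at once
# (smeared), while concept 1 lives in a single node.
M = [[0.5, 0.0],
     [0.4, 0.7],
     [0.0, 0.0]]
print(smearedness(M))   # 0.4: only the node pair sharing concept 0 contributes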
6.3.5 Augmented Fitness Functions

For both the ACP and NCT, the typical fitness function used (see Equation 6.1) takes into account the correct (C) and incorrect (I) decisions that the agent makes to determine the score. The traditional method for encouraging mental representations in the fitness function, referred to as R-augmentation, involves multiplying the original function by a normalized term including R [Schossau et al., 2015]. Schossau et al. find that the augmented function (Equation 6.4) results in the evolution of agents which have both higher levels of mental representation and better task performance.

W = 1.1^(C−I) (1 + R/R_max)    (6.4)

The augmentation function is modified with the addition of the R term so that the evolutionary process encourages higher values of R and simultaneously the formation of useful mental representations. However, previous work [Hintze et al., 2018] shows that the structure of the mental representation may itself matter as much as the amount of representation present in the brain. Therefore, it seems plausible that I may also want to select for the structure of the representation. For example, I may choose to modify the augmented function to maximise the smearedness of the representations (see Equation 6.5).

W = 1.1^(C−I) (1 + S/S_max)    (6.5)

However, simple maximization may not work as well for the structure of the representations. Notably, [Hintze et al., 2018] suggests that a lower value of smearedness may work better for network robustness. Thus, I may try a matching augmentation method that encourages the evolution of minimal smearedness. The original augmented fitness function is structured so that the term containing R evaluates to a floating point value between 1 and 2, so I shall do the same for the minimisation augmentation, with the reversed importance of smearedness (i.e., higher smearedness results in a lesser evolutionary benefit and vice-versa). This formulation results in a new fitness function, Equation 6.6.

W = 1.1^(C−I) (2 − S/S_max)    (6.6)

Alternatively, an absolute minimum of smearedness may not be ideal either, as some smearedness, and consequently some distribution of information across the brain, is likely necessary to perform computation and make decisions based on multiple concepts from the environment. Following the same constraints as for the minimal smearedness modification, I instead propose a new augmented fitness formula, Equation 6.7, that promotes the evolution of networks with smearedness measured close to a hypothetical ideal target value, T.

W = 1.1^(C−I) (2 − |S − T| / max{T, S_max − T})    (6.7)

I structure the equation so that the difference between the observed value of S and T is minimized (i.e., the closer that T and S are, the larger the increase to W). I also maintain the normalization between 1 and 2 for the multiplicative term, as in Equations 6.4, 6.5, and 6.6. Here, I used target values of 2.0, 4.0, 6.0, 8.0, and 10.0 based on previously observed values of smearedness.

6.3.6 Genetic Algorithm

All experiments were implemented using the MABE software framework [Bohm et al., 2017]. The circular genomes were restricted in size to be at least 2,000 sites and at most 20,000 sites long, where sites contained values in the range of 0 to 256. When I do not use MAP-Elites, I apply roulette-wheel selection to populations of 100 initially random agents with genomes of size 5,000. I apply mutations between generations using the default per-site point mutation rate of 0.005, an insertion duplication rate of 0.00002 for chunks of size between 128 and 512, and a per-site deletion rate of 0.00002 for chunks with size between 128 and 512. After evolutionary optimization concluded, the line of descent (LOD) was recovered [Lenski et al., 2003], and subsequent analysis was conducted on agents from the LOD. I construct the LOD by maintaining a reference to the parent of each organism as the agents are evolved, and subsequently finding the entire lineage of the best-performing agent in the population at the end of evolution. For each task, the experiments were run for 20,000 generations, with 100 replicate experiments conducted for each task and augmentation method.

6.3.7 MAP-Elites

The MAP-Elites algorithm [Mouret and Clune, 2015] works by maintaining a grid in multidimensional space (here, I will use R and S as the axes of a 2-dimensional space), where individual squares along the grid contain the highest-performing individual on a different measure (here the task fitness from Equation 6.1, W). After starting with an empty grid, 100 randomly generated Markov Brains are evaluated and placed on the 2-dimensional grid according to their respective values of R and S, using bins of fixed width for each dimension. The intersection of one bin from the R axis (e.g., where R is between 1.4 and 1.6) and one bin from the S axis (e.g., where S is between 6.3 and 7.2) constitutes a cell at that grid location, where the corresponding task score W is recorded. Afterwards, I take a randomly-selected sample of 10 individuals from all of the agents in the grid, and then mutate them and evaluate them on the given task to generate scores for R, S, and W. Using these values for each new agent, I find the corresponding bin locations on the grid, and if the new agent has a higher W than the agent in the cell, or the cell is empty, I place the new agent at that spot on the grid. I show an example of this process in Figure 6.4. I repeat this process for 200,000 updates, and run 100 replicate experiments. I averaged the values in the resulting heatmaps of the 100 replicate experiments, to obtain a more accurate estimate of the ideal values of R and S. Based on the observation of values of R in previous work [Marstaller et al., 2013, Schossau et al., 2015, Kirkpatrick and Hintze, 2019] and in accordance with preliminary experiments (data not shown), I select a range of 0.0 to 2.0 for R, with 10 equally distributed bins across those values to provide proper resolution without subdividing too far.
As S has been shown to have a greater range of potential values (see Chapter 3 or [Hintze et al., 2018, Kirkpatrick and Hintze, 2019]), I use a similar selection methodology to set the bins for smearedness to be more numerous and to span a slightly larger range: 20 bins equally distributed over the range from 0.0 to 18.0. Any values below the range specified for either dimension are grouped together in 1 bin, and any values above the range specified are also grouped together, in a separate bin.

6.4 Results and Discussion

Schossau et al. show that R-augmentation can improve the performance of a genetic algorithm. Without R-augmentation, mutations that improve a network's internal model often do not convey an immediate fitness advantage. When using R-augmentation, those mutations are not neutral to evolution but instead contribute to fitness. Consequently, R-augmentation accelerates genetic search [Schossau et al., 2015]. R-augmentation assumes that a system with a high R-value is in general better than a system with a lower R-value. Is the same true for smearedness? Mutations that affect information smearing could be neutral and only later, in conjunction with other mutations, provide a fitness advantage.

Figure 6.4: A simplified version of the MAP-Elites algorithm. Each organism is represented by a circle, and each square represents a bin for the algorithm. A prototypical organism is labeled (a), and demonstrates that each agent has a task performance score ω along with a measurement of the agent's R and S. The grid is defined in terms of S on the x-axis and R on the y-axis. The bins for R are from 0 to 1, 1 to 2, and 2 to 3, while the bins for S are from 0 to 2, 2 to 4, 4 to 6, 6 to 8, 8 to 10, and 10 to 12. The algorithm will place the organism labelled (b) in the square with the same light blue shade based on the organism's R and S values. The organism will be successfully placed in the bin because there is no organism in that bin already. The algorithm will place the organism labelled (c) in the square with the same light green shade based on the organism's R and S values. The placement will succeed because the organism labelled (c) has a higher task performance than the organism already in the cell. The algorithm will attempt to place the organism labelled (d) in the bin of the same light yellow color based on the organism's R and S values. This placement will fail, however, because the organism already in that bin has a higher task score and will remain; the organism labelled (d) will be discarded.
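To complement Figure 6.4, here is a minimal sketch of the placement rule described in Section 6.3.7, using the bin ranges stated there (10 bins over 0.0-2.0 for R, 20 bins over 0.0-18.0 for S, with extra under/overflow bins). The Agent tuple and dictionary grid are simplified stand-ins for evaluated Markov Brains and the map of elites, not the MABE implementation.

# A minimal sketch of the MAP-Elites placement rule with the bin ranges used
# in this chapter. Agent is a simplified stand-in for an evaluated brain.
from collections import namedtuple

Agent = namedtuple("Agent", ["w", "r", "s"])   # task score W, R, and S

def bin_index(value, low, high, n_bins):
    """Map a value to a bin; values outside [low, high) go to edge bins."""
    if value < low:
        return -1                 # shared underflow bin
    if value >= high:
        return n_bins             # shared overflow bin
    return int((value - low) / ((high - low) / n_bins))

def try_place(grid, agent):
    """Place the agent if its cell is empty or it beats the current elite."""
    cell = (bin_index(agent.r, 0.0, 2.0, 10),    # 10 bins of width 0.2 for R
            bin_index(agent.s, 0.0, 18.0, 20))   # 20 bins of width 0.9 for S
    incumbent = grid.get(cell)
    if incumbent is None or agent.w > incumbent.w:
        grid[cell] = agent
        return True
    return False

grid = {}
try_place(grid, Agent(w=0.82, r=1.25, s=6.5))    # fills an empty cell
try_place(grid, Agent(w=0.90, r=1.30, s=6.7))    # same cell, higher W: replaces
try_place(grid, Agent(w=0.75, r=1.35, s=6.6))    # same cell, lower W: discarded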
However, I do not know if smearedness should be maximized or minimized. Previous work [Hintze et al., 2018] finds that computational systems with sparse connections evolve low S, while dense systems like fully connected neural networks evolve a high degree of smearedness. To explore whether S should be maximized or minimized, I compare both options for augmentation with a control experiment without augmentation, and an experiment using conventional R-augmentation. I find that, as expected, R-augmentation improves the performance of the genetic algorithm compared to the control on both tasks (see Figure 6.5, where the blue line represents the unaugmented control and the black line represents the R-augmented data). At the same time, trying to augment by maximizing (using Eq. 6.5) or minimizing (using Eq. 6.6) smearedness results in worse performance than either R-augmentation or the unaugmented control (see Figure 6.5, where the red line represents S-maximization and the green line represents S-minimization). This confirms our intuition that intermediary levels of smearedness might be optimal as opposed to the extremes.

To determine which value for smearedness is optimal, I repeated the experiment, but used an augmentation function that optimizes for a specific value of smearedness instead of just maximizing or minimizing it (see Equation 6.7). Specifically, the following target values were used: 2.0, 4.0, 6.0, 8.0, and 10.0. I find several target values for smearedness that, when used for augmentation, perform equal to or better than the unaugmented control (see Figure 6.6) at varying points in evolutionary time. The target values of 4.0, 6.0, and 10.0 applied to the ACP task result in higher task performance than control at the end of evolution, while the target value of 6.0 causes higher task performance than control at around 3000 generations into the evolutionary process, and equivalent performance to the unaugmented control at the end of evolution. I find that the exact target values of S that maximize performance seem to be task-dependent, and targeted augmentation only outperforms R-augmentation on the ACP task, but not the NCT task.

Figure 6.5: Average percent correct decisions over evolutionary time for the ACP and NCT tasks, where each line represents a different version of the fitness function used for that experiment. Each line is the average of 100 replicate runs using different random seeds. The blue line represents the original fitness function (Eq. 6.1). The black line represents R-augmentation (Eq. 6.4). The red line represents S-maximization (Eq. 6.5). The green line represents S-minimization (Eq. 6.6). The shaded area of corresponding color around each line represents the standard error. The inset of each graph provides an expanded view of the indicated subset, to better distinguish the difference between conditions.

While this confirms that intermediary values can be used for augmentation towards a target value, such an approach is extremely wasteful with respect to computational resources: the performance gain is relatively small while many extra runs are needed to narrow in on the optimal value. This could mean that using augmentation towards a target value of S is only viable if the optimal value of S is known beforehand. It might be the case that there is an optimal value for smearing that is general for all computational systems, but from observations in other contexts I found no evidence for a consistent value of S across computational systems (data not shown). As a simple sweep of values for smearedness may or may not produce an optimal value for the targeted evolution, and would require excessive computational time to verify, I elect to use a more comprehensive method, MAP-Elites, to identify a hypothetical ideal value for targeted evolution.
Previous work [Mouret and Clune, 2015] shows that the MAP-Elites selection method explores fitness landscapes more effectively and potentially finds better solutions when other phenotypic or genotypic properties beyond task performance are measured. In this case, I use R and smearedness as the dimensions for the MAP-Elites space. Based on previous work [Kirkpatrick and Hintze, 2019] and preliminary experiments, there are 10 bins for R in the range from 0.0 to 2.0, leading to a bin width of 0.2. Due to increased variance in the previously observed [Kirkpatrick and Hintze, 2019, Hintze et al., 2018] values of S, there are 20 bins for S over the range of 0.0 to 18.0, leading to a bin width of 0.9. When evolving a population of agents using MAP-Elites with R and smearedness as the dimensional variables, I find that, as expected, optimally performing agents can be found at intermediary values of S (see Figure 6.7). However, surprisingly, those solutions that perform well at intermediary smearedness do not have maximal values of R at the same time. These results suggest that even though R-augmentation works by maximizing R, a maximized value of R might not even be the ideal goal for augmentation, and also that there is an interaction between the neuro-correlates R and S.

Figure 6.6: Average percent correct decisions over evolutionary time for the ACP and NCT tasks. Each line represents a different version of the fitness function used for that experiment and is the average of 100 replicate runs using different random seeds. As in Fig. 6.5, the black line represents R-augmentation (Eq. 6.4), while the blue line represents the control. Each of the other differently colored lines represents S-augmentation towards a target value (Eq. 6.7), for a range of different values. The correspondence of color to target values is: orange to 2.0, green to 4.0, red to 6.0, cyan to 8.0, and magenta to 10.0. The shaded area of corresponding color around each line represents the standard error. The inset of each graph provides an expanded view of the indicated subset, to exemplify the difference between conditions.

I find that when using MAP-Elites, the best average performance over all replicate experiment heatmaps for a specific combination of S and R is 99.66% correct decisions in the ACP and 86.66% correct decisions in the NCT. This result can be directly compared to the previous optimizations using a regular GA or R-augmentation (see Figure 6.5). In both cases 2,000,000 solutions were evaluated in total, and the average performance for the non-MAP-Elites cases from an unaugmented GA was 97.41% for the ACP and 97.39% for the NCT. Further, R-augmentation with roulette selection improves the outcome performance by 0.5% or 1% for the NCT and ACP, respectively. While the absolute differences are not necessarily significant on their own, R-augmentation finds the same performance as unaugmented evolution much earlier. When evaluated using the Mann-Whitney U test, I find that the performance at end of evolution for the agents evolved on the ACP task using MAP-Elites is significantly higher than the performance at end of evolution for the agents evolved on the ACP task using R-augmentation (p < 0.0001).
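For reference, the statistical comparison itself is a standard two-sample test; a minimal sketch using SciPy is shown below, where the two score lists are placeholders standing in for the per-replicate end-of-evolution performances (they are not the actual data from these experiments).

# A minimal sketch of the Mann-Whitney U comparison between the end-of-
# evolution performances of two conditions; the lists are placeholder values.
from scipy.stats import mannwhitneyu

map_elites_scores = [0.997, 0.995, 0.998, 0.996, 0.999]   # placeholder
r_augmented_scores = [0.975, 0.981, 0.978, 0.980, 0.976]  # placeholder

stat, p_value = mannwhitneyu(map_elites_scores, r_augmented_scores,
                             alternative="two-sided")
print(stat, p_value)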
This shows that optimization with MAP-Elites using R and S can outperform R-augmentation. Although I do not see the same effect when using MAP-Elites for the NCT task, it stands to reason that a more optimal setup of MAP-Elites may produce a better result for that task also. Notably, the heatmap for the NCT in Fig. 6.7 shows a large amount of unused space where high values of S (e.g., greater than 10.0) are not reached. This indicates that the range of values chosen may be too large, reducing the selective pressure on the agents. This discrepancy in MAP-Elites performance on the two tasks, related to the range of S, may also be linked to the nature of the tasks, as the ACP task requires more information integration than the NCT task, and as such can tolerate a larger range of values for S. The identification of the exact reason for the discrepancy in performance between the two tasks, and more broadly the further refinement of the MAP-Elites method using neurocorrelates for generalized applications, remains an open direction for future research.

6.5 Conclusion

Using neurocorrelates to improve the ability of a genetic algorithm to search for optimal solutions is a promising direction for current and future research. As R-augmentation showed, a simple joint fitness function already improves performance. Naturally, one might think that adding more complex neurocorrelates would improve the augmentation process further. However, I have shown that maximizing or minimizing for S is not a rewarding endeavor. Although it appears that using augmentation towards an optimal target neurocorrelate value for either S or R could produce a better outcome than either evolution alone or even the original R-augmentation method, the lack of a priori knowledge about optimal target values limits the applicability of this method. MAP-Elites serves as a potential method to overcome these issues, both in the process of identifying ideal target values and also as an alternate evolutionary scheme whose performance is on par with R-augmentation. Measuring R and S during evolution and using them to define the "Map of Elites" not only identifies which values of R and S correlate with finding optimal solutions; the method itself also outperforms a standard genetic algorithm as well as R-augmentation.

Obviously, the neurocorrelates R and S can only be used in systems that have recurrent hidden states, as only then can these values be properly measured. Successfully measuring these neurocorrelates depends on the ability of the experimenter to determine the salient features of the environment that should be reflected in mental representations. But given these limitations, using MAP-Elites in conjunction with these neurocorrelates is a promising approach. The future will tell to what degree other neurocorrelates could be used. An example of an alternative neurocorrelate is information integration [Albantakis et al., 2014], which also does not necessarily linearly correlate with the performance of an agent [Joshi et al., 2013]. Another limitation is that here I only used two different simple computational tasks, active categorical perception and number comparison. While I see no reason that this method should not work for other tasks, it would be good for future research to use this method on different tasks to confirm that it performs well.
Additionally, although the focus of this paper was on Markov Brains, previous work (Chapter 3 or [Kirkpatrick and Hintze, 2019]) has shown that the information-theoretic tools used here work on other computational substrates as well, which opens a number of questions about how we can best use this method to evolve other types of artificial cognitive systems (e.g., RNNs, LSTMs, or any other recurrent network architecture). To conclude, the methods presented here broadly allow for the shaping of representation quantity and structure, and are not limited to any one task or cognitive system.

6.6 Acknowledgements

The author would like to thank Chris Adami for his input and guidance on the paper. This work was supported in part by Michigan State University through computational resources provided by the Institute for Cyber-Enabled Research. This material is based in part upon work supported by the National Science Foundation under Cooperative Agreement No. DBI-0939454.

Chapter 7

Conclusions and Future Work

Although this dissertation covers a number of different perspectives and questions, an underlying thread of understanding R-augmentation runs through it all. Though the experimental questions may take the form of evaluating the effects of R-augmentation on evolved networks, learning why R-augmentation is able to create better performing cognitive systems, or understanding how to best apply R-augmentation, each facet reflects the central theme of promoting task understanding in evolved artificial cognitive systems. In this chapter, I will give an overview of the major conclusions of this work as well as illuminate paths to future research.

When R-augmentation was first introduced, some worried about how the process might affect the properties of the evolved networks, including the potential for introduced drawbacks or flaws. Several of the chapters (e.g., Ch. 2 and Ch. 5) in this dissertation alleviate these concerns, demonstrating that beyond the additional computational cost, there are no noticeable drawbacks. I examined the ability of cognitive systems evolved with R-augmentation to generalize to different tasks and to be robust to sensor noise. In some cases, Markov Brains evolved with R-augmentation were able to generalize significantly better than those evolved without, but this result was not found in all experimental conditions. Also notable was that Markov Brains evolved with R-augmentation were able to evolve more effectively when affected by sensor noise than those evolved with a standard GA.

Another concern for R-augmentation was that it had primarily been shown to work for Markov Brains, using only deterministic and probabilistic logic gates, but little was known about how the process would affect other network types. I examined the use of various other gate types, and found that R-augmentation works to improve task performance regardless of the computational components selected for Markov Brains. Also, I investigated several new cognitive structures based on Genetic Programming (GP) trees, and found that although the effect of R-augmentation was not pronounced in every case, generally cognitive systems evolved with R-augmentation outperformed those evolved without. Notably, the systems that struggled the most with forming representations received the greatest performance benefit from the use of R-augmentation.
While the advantage of R-augmentation in terms of increased task performance was obvious, it was less clear how precisely this increased performance was achieved. Chapter 4 examines this question of how R-augmentation works, with the conclusion that a change in evolutionary dynamics drives the difference in performance relative to networks created by an unaugmented GA. This change in dynamics includes the selection of mutations that produce changes in R and task fitness simultaneously, as opposed to mutations that only favor one or the other. It may also be that R-augmentation allows for the persistence of mutations that support useful mental representations, when they would otherwise fall victim to mutational drift, although I was not able to confirm this exact situation.

One of the notable results discovered in the course of this investigation of R-augmentation is that the process allows for maintaining the same evolved performance while using fewer task evaluations than a standard GA. This can result in reduced total processing time even when factoring in the computational overhead necessary for the evaluation of R. Fewer evaluations means reduced server time, more eco-friendly and efficient computation, and the opportunity to pursue more complex tasks. The effect is seen here on a relatively simple task, and the reduction of trials would be even more beneficial on a computationally taxing task (i.e., a physics-based or 3D vision related task).

Finally, I show that optimizing for R can be achieved using other selection methods, using the MAP-Elites method as a demonstration. Instead of solely optimizing for a maximal measurement of mental representations, I investigate optimizing towards a hypothetical intermediary value for the distribution of representations, or smearedness. Such an approach may potentially provide better results, but the search necessary to find an ideal target value would likely be too costly to provide a net advantage. Instead, using the MAP-Elites optimizer to maintain diversity in both R and smearedness may produce a better result with the same number of agent evaluations. Both methods show promise in certain circumstances, but further refinement is required.

Overall, there are two main findings from this work. The first is that R-augmentation works best for sparsely-connected networks that do not have a predetermined information flow (i.e., fixed connectivity), in tasks with well-defined states. That is, fully-connected networks like RNNs already have full information flow and potentially too much representation. Although 'too much representation' may seem like an odd concept, there is a potential for information overload. That is, if there are too many internal models that need to be selected from or combined together, it may be impossible for the network to come to a consensus and make the correct decision. In that case, R-augmentation is not useful, as those networks require a reduction of R, whereas in sparse networks the representations need to be created, and R-augmentation is more effective. The second finding is that in the cases seen thus far, R-augmentation produces networks that perform better than or equal to the systems produced by unaugmented GAs. In tasks where the cost of evaluating R is small relative to the cost of task evaluation, there may be a large performance benefit to adding R-augmentation to the GA with little overhead.
This work covers a lot of ground in evolutionary AI, but it also has the potential to drive impactful research in other fields, most notably machine learning and deep learning. Given the larger scale at which those fields operate, in terms of both tasks and network sizes, either the increase in performance offered by R-augmentation or the potential reduction in trial cases could be useful to researchers and practitioners alike. Although perhaps not a direct analogue, modifying the loss function of a machine learning model to take the network's representation into account could produce an effect similar to evolutionary R-augmentation. The performance increases from such a change could allow for more efficient training with lower computational cost and run time. In deep learning there is a well-known issue whereby a network overfits to a specific set of inputs and can no longer generalize to new data [Zhang et al., 2018]. Even worse, deep neural networks appear to latch onto surface statistical regularities and are susceptible to adversarial images that cause incorrect classifications [Jo and Bengio, 2018]. The ability of R-augmented networks to evolve on a small subset of cases and still generalize to the whole set of data provides a framework for addressing both of these issues. Namely, by forcing the network to maintain a representation of the whole task, it could be trained on fewer data points while still performing well on the full set. Using a rotating subset of training cases together with R-augmentation of the loss function, in a manner similar to that demonstrated in Chapter 5, could provide a method for avoiding overfitting to any one sample.

Another potential avenue of research falls in the realm of generalized intelligence. Optimizing for understanding across multiple tasks could lead to the evolution of a cognitive system that can perform previously unseen tasks with little or no additional training, and thus generalizes better and exhibits a higher degree of adaptive cognition. This would be a form of meta- or multi-R-augmentation. In order to evolve cognitive systems that can generalize, fitness landscapes encompassing more complex tasks and environments would be necessary. However, it is in this complexity that we would finally be evolving artificial cognitive systems in the same way as their biological equivalents.

In summary, in this dissertation I discussed a range of topics related to R-augmentation, creating a foundation for understanding how to encourage the evolution of mental representations. These topics include understanding the changes, or lack of changes, that R-augmentation causes in the networks created, investigating the evolutionary source of the improvement caused by R-augmentation, and exploring other optimization methods and adaptations made possible by optimizing for mental representations. Although the results covered here are a good foundation for understanding R-augmentation, there remain many outstanding questions for future research to explore.

BIBLIOGRAPHY

[Albani et al., 2021] Albani, S., Ackles, A., Ofria, C., and Bohm, C. (2021). The comparative hybrid approach to investigate cognition across substrates. In Artificial Life Conference Proceedings (pending).
[Albantakis et al., 2014] Albantakis, L., Hintze, A., Koch, C., Adami, C., and Tononi, G. (2014). Evolution of integrated causal structures in animats exposed to environments of increasing complexity. PLoS Computational Biology, 10(12):e1003966.
[Banzhaf et al., 1998] Banzhaf, W., Nordin, P., Keller, R. E., and Francone, F. D. (1998). Genetic programming: An introduction: On the automatic evolution of computer programs and its applications. dpunkt-Verlag and Morgan Kaufmann Publishers, Heidelberg and San Francisco, CA.
[Beer, 1996] Beer, R. (1996). Toward the evolution of dynamical neural networks for minimally cognitive behavior. In P. Maes et al., editor, Proc. 4th Intern. Conf. on Simulation of Adaptive Behavior, pages 421–429, Cambridge, MA. MIT Press.
[Beer, 2003] Beer, R. D. (2003). The dynamics of active categorical perception in an evolved model agent. Adaptive Behavior, 11(4):209–243.
[Bengio and Frasconi, 1995] Bengio, Y. and Frasconi, P. (1995). An input output HMM architecture. In Advances in Neural Information Processing Systems, pages 427–434.
[Bohm et al., 2017] Bohm, C., CG, N., and Hintze, A. (2017). MABE (modular agent based evolver): A framework for digital evolution research. In Proceedings of the European Conference on Artificial Life.
[Brooks, 1991] Brooks, R. A. (1991). Intelligence without representation. Artificial Intelligence, 47(1-3):139–159.
[Clune et al., 2013] Clune, J., Mouret, J.-B., and Lipson, H. (2013). The evolutionary origins of modularity. Proc. R. Soc. B, 280(1755):20122863.
[Clune et al., 2011] Clune, J., Stanley, K. O., Pennock, R. T., and Ofria, C. (2011). On the performance of indirect encoding across the continuum of regularity. IEEE Transactions on Evolutionary Computation, 15(3):346–367.
[Collins and Gentner, 1987] Collins, A. and Gentner, D. (1987). How people construct mental models. Cultural Models in Language and Thought, 243:243–265.
[Dao et al., 2020] Dao, D. V., Ly, H.-B., Vu, H.-L. T., Le, T.-T., and Pham, B. T. (2020). Investigation and optimization of the C-ANN structure in predicting the compressive strength of foamed concrete. Materials, 13(5):1072.
[De Jong, 1975] De Jong, K. (1975). An analysis of the behavior of a class of genetic adaptive systems. Ph.D. Thesis, University of Michigan.
[Deb, 2001] Deb, K. (2001). Multi-objective optimization using evolutionary algorithms, volume 16. John Wiley & Sons.
[Deng, 2012] Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142.
[Edlund et al., 2011] Edlund, J. A., Chaumont, N., Hintze, A., Koch, C., Tononi, G., and Adami, C. (2011). Integrated information increases with fitness in the evolution of animats. PLoS Comput Biol, 7(10):e1002236.
[Floreano et al., 2008] Floreano, D., Dürr, P., and Mattiussi, C. (2008). Neuroevolution: from architectures to learning. Evolutionary Intelligence, 1(1):47–62.
[Fu, 2016] Fu, M. C. (2016). AlphaGo and Monte Carlo tree search: the simulation optimization perspective. In 2016 Winter Simulation Conference (WSC), pages 659–670. IEEE.
[Halford, 2014] Halford, G. S. (2014). Children's understanding: The development of mental models. Psychology Press.
[Handley, 1994] Handley, S. G. (1994). The automatic generation of plans for a mobile robot via genetic programming with automatically defined functions. Advances in Genetic Programming, 18:391–407.
[Helmuth et al., 2015] Helmuth, T., Spector, L., and Matheson, J. (2015). Solving uncompromising problems with lexicase selection. IEEE Transactions on Evolutionary Computation, 19(5):630–643.
[Hintze et al., 2017] Hintze, A., Edlund, J. A., Olson, R. S., Knoester, D. B., Schossau, J., Albantakis, L., Tehrani-Saleh, A., Kvam, P., Sheneman, L., Goldsby, H., Bohm, C., and Adami, C. (2017). Markov brains: A technical introduction. arXiv preprint arXiv:1709.05601.
[Hintze et al., 2018] Hintze, A., Kirkpatrick, D., and Adami, C. (2018). The structure of evolved representations across different substrates for artificial intelligence. In Artificial Life Conference Proceedings, pages 388–395. MIT Press.
[Hintze et al., 2019] Hintze, A., Schossau, J., and Bohm, C. (2019). The evolutionary buffet method. In Genetic Programming Theory and Practice XVI, pages 17–36. Springer.
[Hochreiter and Schmidhuber, 1997] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
[Jo and Bengio, 2018] Jo, J. and Bengio, Y. (2018). Measuring the tendency of CNNs to learn surface statistical regularities. arXiv preprint arXiv:1711.11561.
[Johnson and Zhang, 2013] Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26:315–323.
[Johnson-Laird, 1980] Johnson-Laird, P. N. (1980). Mental models in cognitive science. Cognitive Science, 4(1):71–115.
[Joshi et al., 2013] Joshi, N. J., Tononi, G., and Koch, C. (2013). The minimal complexity of adapting agents increases with fitness. PLoS Comput Biol, 9(7):e1003111.
[Kaur et al., 2018] Kaur, J., Kalra, A., and Sharma, D. (2018). Comparative survey of swarm intelligence optimization approaches for ANN optimization. In Intelligent Communication, Control and Devices, pages 305–314. Springer.
[Kirkpatrick and Hintze, 2019] Kirkpatrick, D. and Hintze, A. (2019). The role of ambient noise in the evolution of robust mental representations in cognitive systems. In Artificial Life Conference Proceedings, pages 432–439. MIT Press.
[Konak et al., 2006] Konak, A., Coit, D. W., and Smith, A. E. (2006). Multi-objective optimization using genetic algorithms: A tutorial. Reliability Engineering & System Safety, 91(9):992–1007.
[Koza, 1994] Koza, J. R. (1994). Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4(2):87–112.
[Koza and Rice, 1992] Koza, J. R. and Rice, J. P. (1992). Automatic programming of robots using genetic programming. In AAAI, volume 92, pages 194–207. Citeseer.
[Krizhevsky et al., 2009] Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images. University of Toronto.
[Langdon, 1995] Langdon, W. B. (1995). Evolving data structures with genetic programming. In ICGA, pages 295–302.
[LeCun et al., 1998] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
[Lehman et al., 2018] Lehman, J., Clune, J., and Misevic, D. (2018). The surprising creativity of digital evolution. In Artificial Life Conference Proceedings, pages 55–56. MIT Press.
[Lehman and Stanley, 2008] Lehman, J. and Stanley, K. O. (2008). Exploiting open-endedness to solve problems through the search for novelty. In ALIFE, pages 329–336.
[Lenski et al., 2003] Lenski, R. E., Ofria, C., Pennock, R. T., and Adami, C. (2003). The evolutionary origin of complex features. Nature, 423:139–144.
[Liu and Deng, 2015] Liu, S. and Deng, W. (2015). Very deep convolutional neural network based image classification using small training sample size. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pages 730–734. IEEE.
[Marstaller et al., 2013] Marstaller, L., Hintze, A., and Adami, C. (2013). The evolution of representation in simple cognitive networks. Neural Computation, 25(8):2079–2107.
[McCulloch and Pitts, 1943] McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133.
[Merritt and Brannon, 2013] Merritt, D. J. and Brannon, E. M. (2013). Nothing to it: Precursors to a zero concept in preschoolers. Behavioural Processes, 93:91–97.
[Merritt et al., 2009] Merritt, D. J., Rugani, R., and Brannon, E. M. (2009). Empty sets as part of the numerical continuum: conceptual precursors to the zero concept in rhesus monkeys. Journal of Experimental Psychology: General, 138(2):258.
[Miller, 2011] Miller, J. F. (2011). Cartesian genetic programming. In Cartesian Genetic Programming, pages 17–34. Springer.
[Miller and Thomson, 2000] Miller, J. F. and Thomson, P. (2000). Cartesian genetic programming. In European Conference on Genetic Programming, pages 121–132. Springer.
[Minsky and Papert, 1969] Minsky, M. and Papert, S. (1969). An introduction to computational geometry. MIT Press, Cambridge, MA.
[Mohapatra et al., 2015] Mohapatra, R. K., Majhi, B., and Jena, S. K. (2015). Classification performance analysis of MNIST dataset utilizing a multi-resolution technique. In 2015 International Conference on Computing, Communication and Security (ICCCS), pages 1–5. IEEE.
[Mouret and Clune, 2015] Mouret, J.-B. and Clune, J. (2015). Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909.
[Nguyen et al., 2015] Nguyen, A., Yosinski, J., and Clune, J. (2015). Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436.
[Nieder, 2018] Nieder, A. (2018). Honey bees zero in on the empty set. Science, 360(6393):1069–1070.
[Nordin, 1994] Nordin, P. (1994). A compiling genetic programming system that directly manipulates the machine code. Advances in Genetic Programming, 1:311–331.
[Nordin and Banzhaf, 1995a] Nordin, P. and Banzhaf, W. (1995a). Genetic programming controlling a miniature robot. In Working Notes for the AAAI Symposium on Genetic Programming, volume 61, page 67. MIT, Cambridge, MA, USA, AAAI.
[Nordin and Banzhaf, 1995b] Nordin, P. and Banzhaf, W. (1995b). A genetic programming system learning obstacle avoiding behavior and controlling a miniature robot in real time. Univ., Systems Analysis Research Group.
[Nordin and Banzhaf, 1997] Nordin, P. and Banzhaf, W. (1997). Real time control of a Khepera robot using genetic programming. Control and Cybernetics, 26:533–562.
[Olson et al., 2016] Olson, R., Adami, C., and Tehrani-Saleh, A. (2016). Flies as ship captains? Digital evolution unravels selective pressures to avoid collision in Drosophila. In Proceedings of the Artificial Life Conference 2016 13, pages 554–561. MIT Press.
[Pan and Yang, 2009] Pan, S. J. and Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.
[Parsopoulos and Vrahatis, 2002] Parsopoulos, K. E. and Vrahatis, M. N. (2002). Particle swarm optimization method in multiobjective problems. In Proceedings of the 2002 ACM Symposium on Applied Computing, pages 603–607.
[Phillips and Singer, 1997] Phillips, W. and Singer, W. (1997). In search of common foundations for cortical computation. Behav Brain Sci, 20:657–683.
[Pinker, 2013] Pinker, S. (2013). Learnability and Cognition, new edition: The Acquisition of Argument Structure. MIT Press.
[Reynolds, 1993] Reynolds, C. W. (1993). An evolved, vision-based behavioral model of coordinated group motion. From Animals to Animats, 2:384–392.
[Reynolds, 1994] Reynolds, C. W. (1994). Evolution of obstacle avoidance behavior: using noise to promote robust solutions. Advances in Genetic Programming, 1:221–241.
[Russell and Norvig, 2016] Russell, S. J. and Norvig, P. (2016). Artificial intelligence: a modern approach. Malaysia; Pearson Education Limited.
[Russell et al., 2003] Russell, S. J., Norvig, P., Canny, J. F., Malik, J. M., and Edwards, D. D. (2003). Artificial intelligence: a modern approach, volume 2-9. Prentice Hall, Upper Saddle River, NJ.
[Schossau et al., 2015] Schossau, J., Adami, C., and Hintze, A. (2015). Information-theoretic neuro-correlates boost evolution of cognitive systems. Entropy, 18(1):6.
[Spector and Robinson, 2002] Spector, L. and Robinson, A. (2002). Genetic programming and autoconstructive evolution with the Push programming language. Genetic Programming and Evolvable Machines, 3(1):7–40.
[Srivastava et al., 2014] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
[Stanley et al., 2019] Stanley, K. O., Clune, J., Lehman, J., and Miikkulainen, R. (2019). Designing neural networks through neuroevolution. Nature Machine Intelligence, 1(1):24–35.
[Stanley and Miikkulainen, 2002] Stanley, K. O. and Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127.
[Sun et al., 2019] Sun, S., Cao, Z., Zhu, H., and Zhao, J. (2019). A survey of optimization methods from a machine learning perspective. IEEE Transactions on Cybernetics, 50(8):3668–3681.
[Tehrani-Saleh et al., 2018] Tehrani-Saleh, A., LaBar, T., and Adami, C. (2018). Evolution leads to a diversity of motion-detection neuronal circuits. In Artificial Life Conference Proceedings, pages 625–632. MIT Press.
[Tehrani-Saleh et al., 2019] Tehrani-Saleh, A., McAuley, J. D., and Adami, C. (2019). Mechanism of perceived duration in artificial brains suggests new model of attentional entrainment. bioRxiv, page 870535.
[Teller, 1994] Teller, A. (1994). The evolution of mental models. Advances in Genetic Programming, pages 199–220.
[Thomas, 1996] Thomas, B. (1996). Evolutionary algorithms in theory and practice. Oxford University Press, New York.
[van Dartel et al., 2005] van Dartel, M., Sprinkhuizen-Kuyper, I., Postma, E., and van den Herik, J. (2005). Reactive agents and perceptual ambiguity. Adaptive Behavior, 13:227–42.
[Wagner, 2012] Wagner, A. (2012). The role of robustness in phenotypic adaptation and innovation. Proceedings of the Royal Society B: Biological Sciences, 279(1732):1249–1258.
[Wang et al., 2017] Wang, D.-J., Liu, F., and Jin, Y. (2017). A multi-objective evolutionary algorithm guided by directed search for dynamic scheduling. Computers & Operations Research, 79:279–290.
[Yao, 1999] Yao, X. (1999). Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447.
[Zhang et al., 2018] Zhang, C., Vinyals, O., Munos, R., and Bengio, S. (2018). A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893.
[Zhou et al., 2011] Zhou, A., Qu, B.-Y., Li, H., Zhao, S.-Z., Suganthan, P. N., and Zhang, Q. (2011). Multiobjective evolutionary algorithms: A survey of the state of the art. Swarm and Evolutionary Computation, 1(1):32–49.