THE EVOLUTION OF FUNDAMENTAL NEURAL CIRCUITS FOR COGNITION IN SILICO

By

Ali Tehrani-Saleh

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2021

ABSTRACT

THE EVOLUTION OF FUNDAMENTAL NEURAL CIRCUITS FOR COGNITION IN SILICO

By Ali Tehrani-Saleh

Despite decades of research on intelligence and the fundamental components of cognition, we still know very little about the structure and functionality of nervous systems. Questions in cognition and intelligent behavior are addressed by scientists in the fields of behavioral biology, neuroscience, psychology, and computer science. Yet, it is difficult to reverse-engineer the sophisticated intelligent behaviors observed in animals, and even more difficult to understand their underlying mechanisms. In this dissertation, I use a recently developed neuroevolution platform, called Markov brain networks, in which Darwinian selection is used to evolve both the structure and functionality of digital brains. I use this platform to study some of the most fundamental cognitive neural circuits: 1) visual motion detection, 2) collision-avoidance based on visual motion cues, 3) sound localization, and 4) time perception. In particular, I investigate both the selective pressures and environmental conditions in the evolution of these cognitive components, as well as the circuitry and computations behind them. This dissertation lays the groundwork for an evolutionary agent-based method to study the neural circuits for cognition in silico.

Copyright by ALI TEHRANI-SALEH 2021

ACKNOWLEDGEMENTS

I would like to thank my advisor, Christoph Adami, for his help and support over my six years as his student. During my Ph.D., Chris was always available and willing to help me with my academic work and otherwise. I certainly would not have been this productive without Chris as my mentor. I also thank the past and current members of the Adami lab, especially for sharing their knowledge and experience when I started as a new member of the lab. I also thank my committee members, Arend Hintze, Charles Ofria, J. Devin McAuley, and Wolfgang Banzhaf, for their guidance and comments during my Ph.D.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
  1.1 In search of building blocks of intelligence and cognition in light of artificial life
    1.1.1 Why use computational evolution?
    1.1.2 Why Markov Brains?
    1.1.3 Cognitive Widgets: Fundamental Neural Circuits
  1.2 Outline
    1.2.1 Visual motion detection
    1.2.2 Collision avoidance mechanisms in Drosophila melanogaster
    1.2.3 Information flow in evolved in silico motion detection and sound localization circuits
    1.2.4 Evolution of event duration perception and implications on attentional entrainment
CHAPTER 2 EVOLUTION LEADS TO A DIVERSITY OF MOTION-DETECTION NEURONAL CIRCUITS
  2.1 Methods
  2.2 Results
  2.3 Discussion
CHAPTER 3 FLIES AS SHIP CAPTAINS? DIGITAL EVOLUTION UNRAVELS SELECTIVE PRESSURES TO AVOID COLLISION IN DROSOPHILA
  3.1 Introduction
  3.2 Methods
    3.2.1 Markov Networks
    3.2.2 Experimental Configurations
    3.2.3 Collision Probability in Events with Regressive Optic Flow
      3.2.3.1 Proposition 1
      3.2.3.2 Proof
      3.2.3.3 Definition 1
      3.2.3.4 Proposition 2
      3.2.3.5 Proof
      3.2.3.6 Definition 2
      3.2.3.7 Proposition 3
      3.2.3.8 Proof
      3.2.3.9 Definition 3
  3.3 Results
  3.4 Discussion
CHAPTER 4 CAN TRANSFER ENTROPY INFER INFORMATION FLOW IN NEURONAL CIRCUITS FOR COGNITIVE PROCESSING?
  4.1 Introduction
  4.2 Materials and Methods
    4.2.1 Markov Brains
    4.2.2 Motion Detection
    4.2.3 Sound Localization
  4.3 Results
    4.3.1 Gate Composition of Evolved Circuits
    4.3.2 Transfer Entropy Misestimates Caused by Encryption or Polyadicity
    4.3.3 Transfer Entropy Measurements from Recordings of Evolved Brains
  4.4 Discussion
  4.5 Conclusions
CHAPTER 5 MECHANISM OF DURATION PERCEPTION IN ARTIFICIAL BRAINS SUGGESTS NEW MODEL OF ATTENTIONAL ENTRAINMENT
  5.1 Introduction
  5.2 Results
    5.2.1 Discrimination thresholds of evolved Markov Brains comply with Weber's law
    5.2.2 Evolved Brains show systematic duration perception distortion patterns similar to human subjects
    5.2.3 Algorithmic analysis of duration judgement task in Markov Brains
      5.2.3.1 Temporal information about stimuli is encoded in sequences of Markov Brain states
    5.2.4 Algorithmic analysis of distortions in duration judgements: Experience and perception during misjudgements of early/late oddballs
      5.2.4.1 The onset of the tone does not alter a Brain's perception of the tone
      5.2.4.2 Experience of early or late oddball is similar to adapting entrainment to phase change
  5.3 Discussion
  5.4 Methods
    5.4.1 Markov Brains
    5.4.2 Evolution of Markov Brains
    5.4.3 Experimental Setup
    5.4.4 Discrete time in Markov Brains
    5.4.5 Markov Brains as finite state machines
    5.4.6 Attention, experience, and perception in Markov Brains
    5.4.7 Information shared between perception and the oddball tone
  5.5 Additional Experiments and Analysis
    5.5.1 Fitness landscape structure and historical contingencies result in Markov Brains using smaller regions of state space in trials with longer IOIs
      5.5.1.1 Longer evolutionary time does not resolve systematic behavioural distortions in longer rhythms/standard tones
      5.5.1.2 Training Markov Brains equally in all IOIs and standard tones has a minor effect on behavioural deviations in longer rhythms
      5.5.1.3 Constant errors in longest rhythms are greater than zero regardless of trial size
CHAPTER 6 CONCLUSION
  6.1 Visual Motion Detection
  6.2 Intraspecific Collision-Avoidance Strategy based on Apparent Motion Cues
  6.3 Information Flow in Motion Detection and Sound Localization Circuits
  6.4 Event Duration Perception in Rhythmic Auditory Stimuli
  6.5 Information-Driven Image Classification via Saccadic Eye Movements
  6.6 Experimental Setup
    6.6.1 Proof of concept
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Genetic Algorithm configuration. We evolved 100 populations of 100 MBs for 10,000 generations with point mutations, deletions, and insertions. We used roulette wheel selection, with 5% elitism, and with no cross-over or immigration.

Table 3.1: Configurations for GA and environmental setup.
Table 4.1: Transfer entropies and information in all possible 2-to-1 binary logic gates with or without feedback. The logic of the gate is determined by the value $Z_{t+1}$ (second column) as a function of the inputs $X_t Y_t = (00, 01, 10, 11)$. $H(Z_{t+1})$ is the Shannon entropy of the output assuming equal-probability inputs, and $TE_{X \to Z}$ is the transfer entropy from $X$ to $Z$. In 2-to-1 gates without feedback, the transfer entropies $TE_{X \to Z}$ and $TE_{Y \to Z}$ reduce to $I(X_t : Z_{t+1})$ and $I(Y_t : Z_{t+1})$, respectively. Similarly, the transfer entropy of a process to itself is simply $I(Z_t : Z_{t+1})$, which is the information processed by $Z$.

Table 5.1: This table contains the point of subjective equality (PSE), just noticeable difference (JND), and their standard deviations (SD), as well as relative JNDs and constant errors (CE) of on-time oddballs for all inter-onset-intervals and standard tones. Responses are averaged across all 50 Brains to generate psychometric curves.

Table 5.2: Genetic Algorithm configuration. We evolved 50 populations of Markov Brains for 2,000 generations with point mutations, deletions, and insertions. We used roulette wheel selection, with 5% elitism, and with no cross-over or immigration.

Table 5.3: Complete set of all inter-onset-intervals, standard tones, and oddball durations used for the evolution of duration judgement. Oddballs can occur in either the 5th, 6th, 7th, or 8th position in the rhythmic sequence. Also, oddball durations are always either shorter or longer than the standard tone. The total number of trials for each pair ⟨IOI, tone⟩ is four times the IOI minus 2 (excluding oddball duration = standard tone and oddball duration = IOI), because the oddball can appear in four different positions within the rhythmic sequence.

Table 5.4: Non-linear regression analysis used to explain the correlation between the constant errors (CE) and $\delta_{\mathrm{IOI}}$, which is a function of the distinct number of states used in encoding stimuli. Residual sums of squares (RSS) and the Bayesian information criterion (BIC) are reported. A BIC difference > 10 provides very strong support for one model over the other [155].

Table 5.5: Complete set of all inter-onset-intervals, standard tones, and oddball durations used for evolution of the duration judgement task. Oddballs can occur in either the 5th, 6th, 7th, or 8th position in the rhythmic sequence. Also, oddball durations are always either shorter or longer than the standard tone.

Table 5.6: Non-linear regression analysis used to explain the correlation between the constant errors (CE) and $\delta_{\mathrm{IOI}}$, which is a function of the distinct number of states used in encoding stimuli. Residual sums of squares (RSS) and the Bayesian information criterion (BIC) are reported.

Table 5.7: Non-linear regression analysis used to explain the correlation between the constant errors (CE) and $\delta_{\mathrm{IOI}}$, which is a function of the distinct number of states used in encoding stimuli. Residual sums of squares (RSS) and the Bayesian information criterion (BIC) are reported.

Table 5.8: Non-linear regression analysis used to explain the correlation between the constant errors (CE) and $\delta_{\mathrm{IOI}}$, which is a function of the distinct number of states used in encoding stimuli. Residual sums of squares (RSS) and the Bayesian information criterion (BIC) are reported.
LIST OF FIGURES

Figure 2.1: (A) A half Reichardt detector circuit. An object (star) moving from left to right stimulating two adjacent receptors, n1 and n2, at time points $t$ and $t + \Delta t$. (B) A full Reichardt detector circuit. In full Reichardt detector circuits, the results of the multiplications from each half circuit are subtracted.

Figure 2.2: (A) A Markov brain with 11 neurons and 2 gates shown at two time steps $t$ and $t + 1$. The states of neurons at time $t$ and the logic operations of gates determine the states of neurons at time $t + 1$. (B) One of the gates of the MB whose inputs are neurons 0, 2, and 6 and whose outputs are neurons 6 and 7. (C) Probabilistic logic table of gate 1.

Figure 2.3: A Markov brain is encoded in a sequence of bytes that serves as the agent's genome.

Figure 2.4: Schematic examples of three types of input patterns received by the two sensory neurons at two consecutive time steps. Grey squares show presence of the stimuli in those neurons. (A) Preferred direction (PD). (B) Stationary stimulus. (C) Null direction (ND).

Figure 2.5: Markov brains evolve alternative circuits to encode a motion detection circuit (duplicated logic gates with same inputs and outputs are omitted). (A) Example simple evolved motion detection circuit. (B) Example complex evolved motion detection circuit. Gate symbols are US Standard.

Figure 2.6: Evolved motion detection circuits vary greatly in complexity. (A) Histogram of the number of essential gates (i.e., gates that resulted in a fitness loss when removed) for each evolved motion detection circuit. (B) Histogram of the number of redundant gates (i.e., gates that resulted in no fitness loss when removed) for each evolved circuit.

Figure 2.7: Distribution of specific gates used in evolved motion detectors. (A) Average number of essential logic gates of each type of logic gate per evolved brain. Error bars represent 95% confidence intervals. (B) Average number of redundant logic gates of each type of logic gate per evolved brain. Error bars represent 95% confidence intervals.

Figure 2.8: Evolution of a simple Reichardt detector leads to greater complexity. (A) Diagram of a hand-written Markov Brain encoding a simple Reichardt detector. (B) Distribution of the number of essential gates for brains evolved from a hand-written ancestor. (C) Mutational sensitivity of evolved motion detectors.

Figure 3.1: An illustration of regressive (back-to-front, left) and progressive (front-to-back, right) optic flows in a fly's retina.

Figure 3.2: Probabilistic logic gates in Markov network brains with three inputs and two outputs. One of the outputs writes into one of the inputs of this gate, so its output is "hidden." Because after firing all Markov neurons automatically return to the quiescent state, values can only be kept in memory by actively maintaining them. The probability table shows the probability of each output given input values.
Figure 3.3: An illustration of a portion of a genome containing two genes that encode two HMGs. The first two loci represent the start codon (red blocks), followed by two loci that determine the number of inputs and outputs, respectively (green blocks). The next four loci specify which nodes are inputs of this gate (blue blocks) and the following four specify output nodes (yellow blocks). The remaining loci encode the probabilities of the HMG's logic table (cyan blocks).

Figure 3.4: The digital fly and its visual field in the model. Flies have a 12-pixel retina that is able to sense surrounding objects in 280° within a limited distance (250 units). The red circle is an external object that can be detected by the agent within its vision field. Activated sensors are shown in red, while inactive sensors are blue. In (A) the object activates two sensors, in (B) the object is detected in one sensor, and in (C) the object is outside the range.

Figure 3.5: An illustration of a moving fly at the onset of the event.

Figure 3.6: Probability of collision $\Pi_{\mathrm{coll}}(\nu, \rho)$ with an object that creates regressive motion on the retina as a function of the ratio of vision radius to collision radius $\rho$, for different fly-object velocity ratios $\nu$.

Figure 3.7: The stop probability of the evolved agent vs. the angular velocity of the image on its retina for 100 events. Positive values of angular velocity show progressive motion events and negative angular velocities stand for regressive motion events. The average velocity of the agent is also shown during each event.

Figure 3.8: Fitness and regressive-collision-cue (RCC) value on the line of descent for an agent that evolved RCC as a strategy to avoid collisions. Only the first 20,000 generations are shown, for every 500 generations.

Figure 3.9: Mean values of fitness and regressive-collision-cue (RCC) over all 20 replicates vs. evolutionary time in the line of descent in the environment with a penalty-reward ratio of 2. Standard error lines are shown with shaded areas around mean values. Only the first 20,000 generations are shown, for every 500 generations.

Figure 3.10: RCC value distribution in environments with different penalty-reward ratios. Each box-plot shows the RCC value averaged over the last 1000 generations on the line of descent for 20 replicates.

Figure 4.1: (A) A network where processes $X$ and $Y$ influence the future state of $Z$: $Z_{t+1} = f(X_t, Y_t)$. (B) A feedback network in which processes $Y$ and $Z$ influence the future state of $Z$: $Z_{t+1} = f(Y_t, Z_t)$.

Figure 4.2: (A) A Reichardt detector circuit. In this circuit, the results of the multiplications from each pathway are subtracted to generate the response. The circuit's outcome for PD is +1, for ND is -1, and for stationary patterns is 0. (B) Schematic examples of three types of input patterns received by the two sensory neurons at two consecutive time steps. Grey squares show presence of the stimuli in those neurons. The sensory pattern shown here for PD is 10 at time $t$ and 01 at time $t + 1$, which we write as 10 → 01. Patterns 11 → 01 and 00 → 10 also represent PD. Similarly, pattern 01 → 10 is shown as an example of ND, but patterns 11 → 10 and 01 → 11 are also instances of ND.
Figure 4.3: (A) Schematic of 5 sound sources at different angles with respect to a listener (top view) and the Jeffress model of sound localization. (B) Schematic examples of 5 time sequences of input patterns received by the two sensory neurons (receptors of the two ears) at three consecutive time steps. Black squares show presence of the stimuli in those neurons.

Figure 4.4: Frequency distribution of all, as well as essential, gates in evolved Markov Brains that perform the motion detection or sound localization task perfectly. (A) All gates. (B) Essential gates.

Figure 4.5: Transfer entropy measures, exact measures and misestimates by transfer entropy, on essential gates of perfect circuits for the motion detection and sound localization tasks. Columns show mean values and 95% confidence intervals of misestimates and exact measures (A) per Brain, and (B) per gate.

Figure 4.6: (A) Transfer entropy measures from neural recordings of a Markov Brain evolved for sound localization. (B) Influence map (also receptive field) of neurons derived from a combination of the logic gate connections and the Boolean logic functions for the same evolved Markov Brain, shown in (C). (C) The logic circuit of the same evolved Markov Brain; neurons $N_0$ and $N_1$ are sensory neurons, and neurons $N_{11}$ through $N_{15}$ are actuator (or decision) neurons.

Figure 4.7: Transfer entropy performance in detecting relations among neurons of evolved (A) motion detection circuits, (B) sound localization circuits. Presented values are averaged across best performing Brains along with 95% confidence intervals. Receiver operating characteristic (ROC) curve representing TE performance with different thresholds to detect neuron relations in evolved (C) motion detection, (D) sound localization circuits.

Figure 5.1: A schematic of the auditory oddball paradigm in which an oddball tone is placed within a rhythmic sequence of tones, i.e., standard tones. Standard tones are shown as grey blocks and the oddball tone is shown as a red block. Oddball tone duration may be longer or shorter than the standard tones.

Figure 5.2: (A) Psychometric curves generated from averaged responses of 50 evolved Brains for every inter-onset-interval, standard tone. Oddball durations on the x-axis are normalised by standard tone to lie in the range (-1, 1). (B) Relative JND values and their 95% confidence interval as a function of inter-onset-interval, standard tone. Dashed line shows the average value of relative JNDs. (C) Constant errors, the difference between PSE and standard tone, and their 95% confidence interval as a function of inter-onset-interval, standard tone. Dashed line shows CE=0.

Figure 5.3: Duration distortion factors (DDF) and their 95% confidence interval as a function of the onset of the oddball for all IOI, standard tones. Negative onset values represent early oddballs and positive values of onset represent late oddballs. A DDF greater than 1 shows an overestimation of the duration of the oddball and a DDF less than unity shows an underestimation of the duration of the oddball. The dashed line indicates DDF=1 and the dotted line shows the DDF for the on-time oddball tone.
Figure 5.4: State-to-state transition diagram of a Markov Brain for IOI=10 and standard tone=5, with oddball tones of duration 5, 6 shown in (A) and 4 shown in (B). Before the stimulus starts, all neurons in the Brain are quiescent so the initial state of the Brain is 0. The stimulus presented to the Brain is a sequence of ones (representing the tone) followed by a sequence of zeros (denoting the intermediate silence). The stimulus at each time step is shown as the label of the transition arrow in the directed graph. The input sequence is shown for the standard and oddball sequences at the bottom of the state-to-state diagrams. (A) State-to-state transition diagram of a Markov Brain when exposed to a standard tone of length 5, as well as a longer oddball tone of length 6. This Brain judges an oddball tone of duration 6 by following the same sequence of states as the original loop, because the transition from state 485 to 1862 occurs irrespective of the sensory input value, 0 or 1. This Brain correctly issues the judgement "longer" from state 3911, indicated by the red triangle at the end of the time interval (see Supplementary Movie 1 and Supplementary Movie 2 for standard tone and longer oddball tone, respectively). (B) The state-to-state transition diagram of the same Brain when presented with a shorter oddball tone of length 4. The decision state is marked with a down-pointing blue triangle. Once the Brain is entrained to the rhythm of the stimulus, the shorter oddball throws the Brain out of this loop. The exit from the loop transitions this Brain into a different path. After four ones the Brain transitions to state 359 (instead of continuing to 485), and then continues along a path where it correctly judges the stimulus to be "shorter" in state 2884 (see also Supplementary Movie 3).

Figure 5.5: The distribution of loop sizes of 50 evolved Brains for each inter-onset-interval (IOI). The size of the markers is proportional to the number of Brains (out of 50) that evolve a particular loop length in each IOI. The dashed line shows the identity function.

Figure 5.6: (A) The mutual information between perception, i.e., the decision state of the Brain, and 1) the oddball tone ending time step (shown in black), 2) the oddball tone duration (shown in red), 3) the oddball tone onset (shown in blue), and their 95% confidence intervals. (B) Sequence of inputs for a standard tone, an on-time longer oddball tone that is correctly judged as longer, and a shorter late oddball tone that is misjudged as longer. Sequence of inputs for a standard tone, an on-time shorter oddball tone that is correctly judged as shorter, and a longer early oddball tone that is misjudged as shorter. Sequences of Brain states along with input sequences for on-time longer oddballs and shorter late oddballs. (C) The fraction of misperceived out-of-time oddball tones that resulted from having the same perception in on-time and out-of-time stimuli with the same oddball end points (left data point), compared to the null hypothesis: the likelihood that Brains' misjudgements were issued from any one of the states in the set of "shorter-judging" or "longer-judging" states (middle and right data points, respectively).
Figure 5.7: (A) Distribution of similarity depth of experiences (sequences of states) of on-time and early/late oddball tones in trials in which onset does not change the perception of the tone in Markov Brains. Similarity depth one implies that the experiences are identical throughout the tone perception. (B) The distribution of the difference between the total similarity and similarity depth in each trial.

Figure 5.8: (A) A simple Markov Brain with 12 neurons and two logic gates at two consecutive time steps $t$ and $t + 1$. (B) Gate 1 of (A) with 3 input neurons and 2 output neurons. (C) Underlying probabilistic logic table of gate 1. (D) Markov Network Brains are encoded using sequences of numbers (bytes) that serve as the agent's genome. This example shows two genes that specify the logic gates shown in (A), so that, for example, the byte value '194' that specifies the number of inputs $N_{\mathrm{in}}$ to gate 1 translates to '3' (the number of inputs for that gate).

Figure 5.9: (A) A schematic of the auditory oddball paradigm in which an oddball tone is placed within a rhythmic sequence of tones, i.e., standard tones. Standard tones are shown as grey blocks and the oddball tone is shown as a red block. (B) The oddball auditory paradigm, which is converted to a sequence of binary values, shown as sensed by the input neuron of a Markov Brain. When a stimulus is present, a sequence of '1's (shown by black blocks) is supplied to the sensory neuron while during silence, a sequence of '0's is fed to the sensory neuron. Each block shows one time step of the sequence experienced by the Brain.

Figure 5.10: (A) Mean fitness across all 50 lineages and 95% confidence interval as a function of generation, shown every 20 generations. (B) Mean fitness (and 95% intervals) of the best agents picked from each of the 50 populations after 2000 generations as a function of inter-onset-interval, standard tone.

Figure 5.11: State-to-state transition diagram of a Markov Brain for IOI=10, standard tone=5, and oddball tones=4 and 6, where the onset of the oddball tone can be 2 time steps early or 2 time steps late.

Figure 5.12: (A) The distribution of loop sizes of 50 evolved brains for each inter-onset-interval (IOI). The size of the markers is proportional to the number of Brains (out of 50) that evolve a particular loop length in each IOI. (B) The distribution of the number of distinct states in loops visited by Markov Brains in a sequence of rhythmic standard tones, as a function of IOI. The dashed line shows the identity function.

Figure 5.13: (A) Mean fitness across all 50 lineages and 95% confidence interval, color-coded at different evolutionary times, as a function of inter-onset-interval, standard tone.

Figure 5.14: Constant errors and their 95% confidence interval for the 50 best performing Brains as a function of inter-onset-interval, standard tone at different evolutionary times. Dashed line shows zero constant error.
Figure 5.15: The distribution of the number of distinct states used to encode rhythm and standard tone duration, i.e., the number of distinct states in each loop, as a function of inter-onset-interval at different evolutionary times. The size of the circle is proportional to the likelihood at that loop size. The dashed line shows the identity function.

Figure 5.16: Absolute constant errors (CE) shown in grey as a function of $\delta_{\mathrm{IOI}}$, as well as the binned data and the fitted softplus curve.

Figure 5.17: Constant errors and their 95% confidence interval for the 50 best performing Brains as a function of inter-onset-interval, standard tone at different evolutionary times. Dashed line shows zero constant error.

Figure 5.18: The distribution of the number of distinct states used to encode rhythm and standard tone duration, i.e., the number of distinct states in each loop, as a function of inter-onset-interval at different evolutionary times. The dashed line shows the identity function.

Figure 5.19: Absolute constant errors (CE) shown in grey as a function of $\delta_{\mathrm{IOI}}$, as well as the binned data and the fitted softplus curve.

Figure 5.20: Constant errors and their 95% confidence interval for the 50 best performing Brains as a function of inter-onset-interval, standard tone at different evolutionary times. Some data points are missing in these plots because in those trials the performance of all 50 Brains is 100%; as a result, the PSE is exactly equal to the standard tone and the slope of the psychometric function is infinite. Dashed line shows zero constant error.

Figure 5.21: The distribution of the number of distinct states used to encode rhythm and standard tone duration, i.e., the number of distinct states in each loop, as a function of inter-onset-interval at different evolutionary times. The dashed line shows the identity function.

Figure 5.22: Absolute constant errors (CE) shown in grey as a function of $\delta_{\mathrm{IOI}}$, as well as the binned data and the fitted softplus curve.

Figure 5.23: Constant errors and their 95% confidence interval for the 50 best performing Brains as a function of inter-onset-interval, standard tone at different evolutionary times. Some data points are missing in these plots because in those trials the performance of all 50 Brains is 100%; as a result, the PSE is exactly equal to the standard tone and the slope of the psychometric function is infinite. Dashed line shows zero constant error.

Figure 5.24: The distribution of the number of distinct states used to encode rhythm and standard tone duration, i.e., the number of distinct states in each loop, as a function of inter-onset-interval at different evolutionary times. The dashed line shows the identity function.

Figure 5.25: Absolute constant errors (CE) shown in grey as a function of $\delta_{\mathrm{IOI}}$, as well as the binned data and the fitted softplus curve.

Figure 6.1: The images in the dataset are 28×28 pixels. (A) The entropy content (in bits) of MNIST dataset images per pixel, $H(X)$. (B) The information shared between each pixel and the class of the image, $I(C : X)$. (C) The probability distributions of the class variable $C$ given that the pixel in the center is 0 or 1.
Figure 6.2: The performance of ANNs trained on masked images. Maskings were based on (A) the entropy content of sub-images in the dataset, and (B) the information shared between $C$ and the sub-images.

CHAPTER 1

INTRODUCTION

1.1 In search of building blocks of intelligence and cognition in light of artificial life

Scientists have long been studying animal behavior and brain function in search of components contributing to intelligent behavior, and of how cognitive processes enable such behaviors that are essential to an organism's survival. These studies have taken a wide variety of approaches, ranging from studying the behavior of animals in the wild to that of trained animals in the lab, utilizing tools from fMRI (Functional Magnetic Resonance Imaging) to genetic engineering to modify neural structure, and building computational models to unravel the mysteries of intelligence.

General intelligence has long been the holy grail of AI (Artificial Intelligence), and scientists have always been fascinated by the question: "can machines think?" [196]. But after decades of work, and in spite of an exponential increase in computational power throughout this period, we still do not have a definitive answer. Researchers in the field of AI have constantly speculated about possible routes that should be taken in order to advance toward, or perhaps achieve, general intelligence, yet controversies remain.

It seems essential to me that our approach toward understanding intelligence, and perhaps building cognitive machines, must begin by creating its building blocks. I believe a bottom-up approach can take us closer to intelligence, by building up simpler and more fundamental cognitive widgets first and then attempting to join them together. As such, the main theme of this thesis is to build and study some of the simple yet fundamental neural circuits for cognition using neuroevolution. This approach enables us to study the components of cognition from an evolutionary standpoint, where I investigate the selective pressures and fitness landscape structures, as well as how they impact the evolved brains and their evolutionary history. This approach also allows us to analyze these cognitive components at different levels by investigating their behavioral characteristics, circuitry structure, and algorithms and computations.

1.1.1 Why use computational evolution?

Ever since Charles Darwin published On the Origin of Species, his evolutionary theory has become the foundation of modern biology. In The Descent of Man he writes that the "mental faculties", similar to any other trait, vary in populations and are heritable, and as a result are subject to natural and sexual selection [40] (also see [18]). Thus, it seems inevitable to study intelligence and its building blocks in the light of evolution. Studying intelligence through computational evolutionary biology has been one of several active fields of research to shed light on intelligence, alongside evolutionary psychology, evolutionary neuroscience, evolutionary behavioral ecology, etc. The advantage of using evolutionary methods in training artificial neural networks has started to attract more attention and has been emphasized especially in recent years (see for example [208]).
Scientists are slowly beginning to take advantage of neuroevolution because they are realizing that, in order to build a simulated version of an intelligent organism, it is only reasonable to follow the natural process by which intelligence emerged in the first place, i.e., evolution. It should come as no surprise that we cannot reverse-engineer extremely complex biological brains, nor can we design machines with the same degree of complexity and performance. For example, Nguyen et al. used evolutionary computation to build images with random patterns that fooled CNNs (Convolutional Neural Networks) into classifying them as actual images with high confidence [133]. It is noteworthy that image recognition is one of the leading areas in AI, and scientists have been more successful in image processing than in other areas such as natural language processing, social intelligence, or knowledge representation. This example and many similar findings [131, 80, 46, 178] underscore how far away a biological visual cortex is from our sophisticated designed image processing machines. This is perhaps an indication that we need to approach the problem from a higher level, for example by designing the substrate or components rather than designing the entire apparatus. Using evolutionary approaches enables us to avoid engineering the networks and lets the evolutionary process take its course to build both the structure and function of the network [49, 48, 51].

Furthermore, from an evolutionary perspective it may be more important to discover what selective pressures and environmental conditions might have resulted in the evolution of a particular intelligent behavior than to understand the behavior or the network itself. In other words, in an evolutionary process all we need is to build the right fitness landscape that leads to the evolution of the desired behavior. Ultimately, the problem of building general intelligence can be reduced to building the set of fitness landscapes within which we can evolve "thinking machines." Needless to say, building such fitness landscapes and evolving the agents within a proper substrate is still a very difficult problem, and may well be as difficult as designing thinking machines from scratch.

Computational evolution has become of utmost interest to many scientists, especially due to the rise of modern computers and the unprecedented increase in computational power. Computational evolution, and computational methods as a whole, are essential components of studying intelligence. Computational methods allow us to run "experiments" in silico, and to easily change experimental parameters and explore conditions that have not been (or could not be) tested empirically. Computational models use different levels of abstraction to build the processing component, i.e., the brain, ranging from very detailed simulations of individual neurons, their networks, and their interactions with their environment, such as neocortical column modeling [31] (NEURON platform) or Project Blue Brain [112], to less intricate models that partly capture the behavior of biological neurons, such as common ANNs (Artificial Neural Networks), which are more efficient in computation and can achieve high performance in particular tasks such as pattern recognition.

1.1.2 Why Markov Brains?
Markov Brain Networks are a class of evolvable artificial brains in which populations of agents embedding digital brains undergo Darwinian evolution, through natural selection of inherited variations that increase an individual's ability to compete, survive, and reproduce. These digital brains have the Markov property, i.e., the future state of the network is influenced only by its present state. This property inspired the name Markov Network Brains, or Markov Brains for short. More specifically, Markov Brains are neural networks in which neurons are binary variables that are connected via probabilistic or deterministic logic gates that represent synaptic excitatory or inhibitory connections. The connectivity and structure of the network, and the functionality of the logic gates, are determined by an evolutionary process.
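To make this update rule concrete, the following is a minimal sketch of a single Markov Brain update step, written for illustration only: the function, data layout, and example gate are hypothetical and are not taken from the actual Markov Brains or MABE code. Neurons are bits, each gate reads its input neurons at time $t$ and writes the pattern stored in its logic table to its output neurons at time $t+1$, so the next state depends only on the current state.

```python
# Minimal, hypothetical sketch of one Markov Brain update step (illustration only;
# not the actual Markov Brains / MABE implementation).

def step(neurons, gates):
    """Compute the brain state at time t+1 from the state at time t.

    neurons: list of 0/1 neuron states at time t
    gates:   list of (input_ids, output_ids, table), where `table` maps a tuple
             of input bits to a tuple of output bits (a deterministic logic gate;
             a probabilistic gate would sample its output pattern instead)
    """
    next_state = [0] * len(neurons)            # neurons stay quiescent unless written to
    for input_ids, output_ids, table in gates:
        in_pattern = tuple(neurons[i] for i in input_ids)
        out_pattern = table[in_pattern]        # look up the gate's response
        for j, bit in zip(output_ids, out_pattern):
            next_state[j] |= bit               # multiple writes to a neuron are OR-ed here
    return next_state

# Example: one 2-input/1-output gate computing XOR of neurons 0 and 1 into neuron 2.
xor_gate = ([0, 1], [2], {(0, 0): (0,), (0, 1): (1,), (1, 0): (1,), (1, 1): (0,)})
print(step([1, 0, 0], [xor_gate]))             # -> [0, 0, 1]
```

In the evolved brains studied in this thesis, both the wiring (which neurons each gate reads and writes) and the logic tables themselves are encoded in the genome and shaped by selection.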
This sparsity in connections has been shown to enable us to better detect and follow information in Markov Brains compared to ANNs [114]. For example, Marstaller et al. introduced an information-theoretic measure of “representation” and showed that Markov Brains evolved to perform an active categorical perception task have higher values of repre- sentation compared to ANNs evolved to perform the same task [116]. This is not to say that Markov Brains can evolve to have internal representation of the environment while ANNs cannot (note that any given Markov Brain can be recreated by a network of perceptrons that has the exact same computations and functions). Rather, the main differences in structure, connections, and functionality makes detecting and storing representations easier [70, 116]. 2. High variety in types of logic gates. Markov Brains are networks of binary variables (neurons) that are connected via logic gates. These logic gates can take any number of inputs and based on their logic computation (logic table that is also subject to evolution) return a number of outputs. For example, a logic gate that takes two inputs and returns one output can have 16 different Boolean logic functions and as the number of inputs to a logic gate increase, the number of possible functions increases exponentially. This flexibility in functionality of logic gates in Markov Brains makes them more suitable especially for the purposes of this thesis. For example, it is well-known that single-layer perceptrons cannot perform an XOR (exclusive OR) operation [43] since the XOR operation is linearly inseparable. Thus, it is required to do the XOR operation in multiple 5 layers and with induction, as the non-linearity in the operation increases the required number of layers to perform it increases. On the contrary, Markov Brains handle such non-linearity in the operations in a more efficient way and as a result, they can evolve to be more sparse with higher information density. Here I should mention that a recent method called Xnor-net has employed binary operations in convolution and filtering components of CNNs, and achieved state-of-art performance on the ImageNet dataset [158]. I should also acknowledge that an exponentially increased number of functions introduces an exponentially larger search space for optimization, but note that a set of smaller logic gates can always replace a larger set, and a smooth fitness landscape in which partially-optimized functions are rewarded is guaranteed to result in the optimum solution. The more significant advantage of logic gates that connect neurons in Markov Brains is that they can mimic a more complex wiring in biological brains, with high density in synaptic or dendritic connections. For example, it has been shown that the non-linearity of dendritic connections makes them operate as computational subunits that take place before the summation at the synapse, which further facilitates pattern recognition in pyramidal neurons [150, 151]. Furthermore, Hawkins et al. show that a neuron with several thousand synapses segregated on active dendrites is capable of classifying several independent patterns and they can perform this task with large amounts of noise and variation introduced in those patterns [66]. 
Obviously, I am not suggesting that the logic tables of Markov Brains is an equivalent to more complex layered computations in dendritic and synaptic connections, but the more complex and non-linear computations of these logic tables and the accessibility of exponentially more complex functions is certainly in this respect, a closer model of biological neurons’ connections compared to ANNs. 3. Possibility of analysis of computations and algorithms of the evolved networks. Understanding the mechanisms and algorithms at work in evolved networks is crucially 6 important for two main reasons. First, it seems necessary to understand the apparatus if we would want to correct its errors, prevent unexpected behavior, and improve its performance in the future. The second reason, which is more central to my thesis, is that we are attempting to recreate (evolve) biological-like brains in a machine with the purpose of discovering their structure and functionality, and then use this knowledge to better understand biological brains. This is also central to the entire field of Artificial Life, where the ultimate goal is to simulate living things in silico in order to discover out-of-reach mysteries, and to gain insight that helps us move forward in this journey. As discussed earlier, while deep neural networks have been shown to be a powerful tool in AI, it is usually very difficult to understand the algorithms and computations behind their performance [206, 102, 80]. In other words, fpr the most part we have no clue as to how these huge networks that consist of components that perform sophisticated computations perform the desired task. It also seems impossible to translate their computations and algorithms to biological nervous systems and as a result, they cannot advance the task of understanding how an organism performs this specific task. In Markov Brains, on the other hand, there are methods that can reveal the mechanisms or algorithms that the agent utilizes in order to perform a particular task. Obviously it is not always very easy or straight- forward to discover these computations, and it has been shown before that the evolved Markov Brain networks can be “epistemologically opaque” [116]. Yet, there are techniques that have been proposed and implemented that can unravel much about a Markov Brain’s underlying mechanisms. For example, in chapters 2 and 4 I present an analysis of the types of computational components and their frequency distributions that are used in visual motion detection and sound localization tasks (also see [184, 182]). I also used knockout assays to measure how critical these computational components are in these evolved networks. In chapter 4, I performed transfer entropy measurements that can show the flow of information between neurons of the network. Furthermore, in chapter 5 I propose and use a technique based on the analysis of state-space transitions in Markov Brains when performing an event 7 duration judgment task (also see [185]). While the use of such techniques is still in its infancy, their ability to demonstrate important characteristics of the network and their algorithms was shown to be promising and points to their capacity to be enhanced in the future. All said, the Markov Brains platform is a prominent substrate to study the evolution of intelligent behavior especially for the purpose of projects studied in this thesis. Furthermore, the Markov brains platform is recognized as one the recent specialized techniques in the neuroevolution community. 
For example, in a review on Evolutionary Algorithms, the authors categorize Markov brains as an innovation in the field of neuroevolution [173]. They write: “[Markov Brains] are showing some early promise especially in unsupervised learning” and attribute the success of the Markov brains to “being a more flexible substrate than ANNs, they could also lead to a more general understanding of the role recurrence plays in learning.” The Markov Brains platform was used in numerous studies before and has been shown to be a powerful tool for the study of evolution of intelligence, such as evolution of predator-prey interactions [139, 140, 141, 137], active categorical perception [116, 142], image classification [34, 142], the evolution of neural plasticity [169, 170], the evolution of cognitive representations [45, 116, 89, 90, 91], the evolution of decision making strategies [97], and the dynamic interplay between ecology and brain structure [30, 141]. 1.1.3 Cognitive Widgets: Fundamental Neural Circuits The nervous system is undoubtedly the most complex organ/system in animals. For example, the human brain (which is a part of the central nervous system) consists of around 100 billion neurons (with about 20 billion in the neocortex alone) that differ in their anatomy, physiology, and functionality, with approximately 100 trillion connections. Our knowledge of the brain is still in its infancy, with numerous open questions and unknowns, given that the study of nervous systems, i.e., neuroscience, only dates back to Santiago Ramon y Cajal’s seminal work in the 1890s. In fact, we still do not have a complete understanding of even much simpler central nervous systems like, for instance, an insect’s brain, with only a few hundred thousand neurons. While there has been substantial research in neuroscience and its related fields, we have only come to understand 8 neural circuits and their functions for significantly simpler tasks. In particular, as we learn more about these systems, the more we realize the necessity of engaging experts from other disciplines and employing more specialized techniques for specific problems. Here, I focus on the following neural circuits and attempt to answer questions regarding their structure, functionality, and their evolutionary origins: 1. Visual motion detection. Visual motion detection is one of the fundamental components of visual perception and the computations take place at a low level (close to sensory neurons) in nervous system. Perceiving moving objects in the environment is crucial to an animal from an evolutionary point of view since it can be critical for survival; for example, detecting predators, prey, or falling objects [143]. One of the standard motion detection models was proposed by Werner Reichardt and Bernhard Hassenstein in the 1950s, based on a delay-and-compare scheme [65]. In addition to the Reichardt detector, researchers have proposed other types of motion detection models, such as edge-based models [113] and spatial-frequency-based models [6]. However, most computational motion detection models are based on the delay- and-compare scheme [143]. While motion detection in mammals and in particular humans is more complicated in structure and function, it is expected to have significant similarities to the basic Reichardt detector circuitry [22], and thus the Reichardt detector “module” is a key component of all motion detection circuits. 
In chapter 2 of this thesis, I study visual motion detection circuits and the underlying neuronal architectures. In particular, I study the distribution of different types of logic gates used to perform motion detection, the size of the network (the number of neurons contributing to the computation), and the presence of redundant logic gates, and their total complexity (i.e., number of logic gates). Furthermore, I investigate the evolutionary significance in complexity variation between circuits by seeding the population with a handwritten Reichardt detector circuit as the ancestor. I then ask whether an increase or decrease in their circuit complexity 9 is observed even though the performance of these circuits could not improve. If we observe a decrease in circuit complexity, it would suggest that the hand-written Reichardt detector could be further optimized, and therefore, we may be able to find simpler neuronal circuits in biological neuronal circuits. On the contrary, if we observe an increase in the complexity of the circuits, it would imply that other factors such as historical contingency or mutational robustness may be important factors in the evolution of visual motion detection circuits. 2. Intraspecific collision avoidance strategy based on apparent motion cues. The visual system is a significant perceptual component of an animal’s cognitive system and provides it with information about its environment, for example when foraging for food, detecting predators or prey, and when searching for potential mates. Motion detection is one of the primary dimensions of visual systems [20] and plays a key role in decision making for most animals. In chapter 3 of this thesis, I study a specific type of behavior in Drosophila melanogaster (the common fruit fly), which is proposed as a collision-avoidance strategy based on visual motion cues. More precisely, I investigate the selective (i.e., evolutionary) pressures that might have given rise to this behavior. Fruit flies show an interesting behavior when perceiving two different types of optical flow in their retina, i.e., back-to-front and front-to-back motions. In a study published in 2009 by Branson et al., researchers investigated the walking trajectories of groups of fruit flies in a planar covered arena (so that they could only walk, not fly) using high-throughput recorded data to study the flies’ behavior [25]. One of the results of their analysis showed that female fruit flies stop walking when they see another fly moving from back-to-front in their visual field (an optical flow referred to as “regressive motion”) whereas they keep walking when they perceive conspecifics’ motion from front-to-back in their visual field (referred to as “progressive motion”). Later, in a study published in 2012, Zabala et al. [207] further studied this behavior and tested a hypothesis that suggested that flies stop walking when perceiving regressive motion, and coined the term “regressive motion saliency”. They used 10 a controllable fly-sized robot that interacted with a real fly in a planar arena. They used a robot instead of an actual fly in order to exclude other sensory cues such as image expansion (“looming,” see [163]) and pheromones. Their results provided rigorous support for the regressive motion saliency hypothesis. Subsequently, Chalupka et al. 
showed that a moving object (for example, another fly) that produces regressive motion in a fly’s retina will reach the intersection point first whereas the fly that reaches the intersection first always perceives progressive motion on its retina [33]. They suggested a hypothesis called “generalized regressive motion” that suggests this behav- ior is a strategy to avoid collisions similar to the rules that ship captains use when moving on intersecting paths (see, e.g., [110]). However, it is not evident a priori which selective pressures or environmental circumstances could give rise to this behavior. For example, it is unclear whether collision avoidance alone could be a significant enough evolutionary factor for this behavior. In chapter 3, I test whether collision avoidance can be a sufficient selective pressure for the evolution of this behavior. I also investigate the environmental conditions, such as the varying costs and benefits involved, in the evolution of the described behavior. I also explore how the interplay (and trade-offs) between the necessity to move and the avoidance of collisions can result in the evolution of regressive motion saliency in digital flies. 3. Sound localization. Sound localization is another one of the fundamental cognitive neural circuits that has been widely studied [130, 149]. Sound localization mechanisms in mammalian auditory systems function based on various cues such as interaural time difference, interaural level difference, etc. [128]. Interaural time difference (which is the main cue behind the sound localization mechanism) is the difference between the times at which sound reaches the two ears. One of the most prominent sound localization models was proposed by Jeffress [79], in which sound reaches the right ear and left ear at two possibly different times. These stimuli are then 11 processed in a sequence of delay components and reach an array of detector neurons. Each detector fires only if the two signals from different pathways, the left ear pathway and the right ear pathway, arrive at that neuron simultaneously. I used sound localization circuits as a benchmark to study how well transfer entropy analy- sis [165] can capture the information flow in neural circuits. Markov Brains have been shown to be a suitable platform to study the information-theoretic correlates of fitness and network structure in neural networks [45, 8, 164, 114, 85]. The Markov Brains platform enables us to analyze structure, function, and circuitry of hundreds of evolved neural circuits. As a result, I can perform statistical analysis on these evolved circuits (as opposed to studying only a single evolutionary outcome), for example, investigate the frequency of different types of relations, and further assess how crucial different operators are for each evolved task, by performing knockout experiments in order to measure an operator’s contribution to the task. 4. Time perception. Time perception refers to the subjective experience of time that can be measured by someone’s own perception of the duration of events. Time perception is a key component in our ability to deduce causation, to predict, infer, and forecast. As a result, time perception plays a key role in the survival of an organism by predicting and deciphering events in the world [134, 160]. The event duration perception is not objective, rather, we perceive temporal signals subjectively, and our perception is influenced by various factors such as attention [193, 37, 32, 108, 188]. 
A central hypothesis in time perception posits that the more attention is devoted to the temporal characteristics of an event, the longer it is perceived to last [193, 37, 32, 108, 188]. There are several competing models of time perception. In models such as Scalar Expectancy Theory (SET) [54], it is assumed that event duration perception is performed with computations similar to those of an internal clock [53, 54, 191]. Models like SET also assume that, in such an internal clock, the perceived duration of a stimulus depends on the amount of attention allocated to it, and that attention is distributed uniformly in time. On the contrary, in models such as Dynamic Attending Theory (DAT) [81, 82, 101], attention is not distributed uniformly in time; rather, the temporal structure of the stimulus may increase or decrease levels of attention over time. In particular, rhythmic stimuli entrain the brain and lead to periodic peaks and troughs of attention. Interval timing models such as DAT and SET and their computational counterparts usually take a top-down approach, meaning they are designed based on a set of rules so that they can describe behavioral/psychophysical data in duration perception [53, 81, 82, 44, 117]. In chapter 5 of this thesis, I take a bottom-up approach where I evolve a population of artificial brains consisting of lower-level components. In particular, I use Darwinian evolution to create artificial digital brains that are able to perform duration judgments in auditory oddball paradigms, similar to experiments performed by Fromboluti and McAuley [121]. I then study the evolved brains as though they are participants in a psychophysical experiment. For example, I investigate psychometric parameters of the evolved brains, such as their discrimination thresholds. I can also test these brains on stimulus patterns that they did not experience during evolution, such as arrhythmic stimuli. Furthermore, I can investigate the algorithms and computations involved in duration judgment, and analyze how these algorithms allocate attention to different parts of the stimuli. This analysis can demonstrate the similarities and differences between the evolved duration perception mechanisms and the underlying mechanisms of the SET and DAT models.

In this thesis, I studied only a few cognitive circuits, while there are many other well-studied visual and auditory neural circuits as well as cognitive components. In the future, it is imperative that we also investigate other modes of sensation such as olfaction and touch.

1.2 Outline

In the following chapters, I address various questions concerning neuronal circuits and present my findings. In Chapter 2 I show how the evolution of motion detection circuits in Markov brains can lead to a diversity of circuits with a variety of structures in gate compositions and with different levels of complexity. I also show that the complexity variation in the evolved brains’ circuitry is due to selection for mutational robustness. These results suggest that different species may evolve different circuits for similar neuronal functions. In Chapter 3 I present the evolution of collision avoidance in digital flies, test the “generalized regressive motion” hypothesis, and discuss the environmental conditions and selective pressures that could give rise to this behavior. In Chapter 4 I investigate whether transfer entropy measurements can infer the information flow in two different neuronal circuits: visual motion detection and sound localization.
In Chapter 5 I present the evolution of Markov brains that solve the event duration judgment task, and how the analysis of the underlying algorithms performed by digital brains can challenge existing models of time perception. In Chapter 6 I present a summary of my findings in completed projects and then I propose possible directions for future research. My proposal is the evolution of Markov Brains that perform image classification via saccadic eye movements. In the remainder of this section, I briefly explain the findings from my finished projects that are presented in more detail in Chapters 2-5. 1.2.1 Visual motion detection A central goal of evolutionary biology is to explain the origins and distribution of diversity across life. Beyond species or genetic diversity, we also observe diversity in the circuits (genetic or otherwise) underlying complex functional traits. However, while the theory behind the origins and maintenance of genetic and species diversity has been studied for decades, theory concerning the origin of diverse functional circuits is still in its infancy. It is not known how many different circuit structures can implement any given function, which evolutionary factors lead to different circuits, and whether the evolution of a particular circuit was due to adaptive or non-adaptive processes. Here, I use digital experimental evolution to study the diversity of neural circuits that encode motion detection in digital (artificial) brains. I find that evolution leads to an enormous diversity of potential neural architectures encoding motion detection circuits, even for circuits encoding the exact same function. Evolved circuits vary in both redundancy and complexity (as previously found 14 in genetic circuits) suggesting that similar evolutionary principles underlie circuit formation using any substrate. I also show that a simple (designed) motion detection circuit that is optimally-adapted gains in complexity when evolved further, and that selection for mutational robustness led this gain in complexity. 1.2.2 Collision avoidance mechanisms in Drosophila melanogaster Flies that walk in a covered planar arena on straight paths avoid colliding with each other, but which of the two flies stops is not random. High-throughput video observations, coupled with dedicated experiments with controlled robot flies have revealed that flies utilize the type of optic flow on their retina as a determinant of who should stop, a strategy also used by ship captains to determine which of two ships on a collision course should throw engines in reverse. I use digital evolution to test whether this strategy evolves when collision avoidance is the sole selective pressure. I find that the strategy does indeed evolve in a narrow range of cost/benefit ratios, for experiments in which the “regressive motion” cue is error free. I speculate that these stringent conditions may not be sufficient to evolve the strategy in real flies, pointing perhaps to auxiliary costs and benefits not modeled in our study. 1.2.3 Information flow in evolved in silico motion detection and sound localization circuits How cognitive neural systems process information is largely unknown, in part because of how difficult it is to accurately follow the flow of information from sensors via neurons to actuators. Measuring the flow of information is different from measuring correlations between firing neurons, for which several measures are available, foremost among them the Shannon information, which is an undirected measure. 
Several information-theoretic notions of “directed information” have been used to successfully detect the flow of information in some systems, in particular in the neuroscience community. However, recent work has shown that directed information measures such as transfer entropy can sometimes inadequately estimate information flow, or even fail to identify manifest directed influences, especially if neurons contribute in a cryptographic manner 15 to influence the effector neuron. Because it is unclear how often such cryptic influences emerge in cognitive systems, the usefulness of transfer entropy measures to reconstruct information flow is unknown. Here, I test how often cryptographic logic emerges in an evolutionary process that generates artificial neural circuits for two fundamental cognitive tasks (motion detection and sound localization). Besides counting the frequency of problematic logic gates, I also test whether transfer entropy applied to an activity time-series recorded from behaving digital brains can infer information flow, compared to a ground-truth model of direct influence constructed from connectivity and circuit logic. These findings suggest that transfer entropy will sometimes fail to infer directed information when it exists, and sometimes suggest a causal connection when there is none. However, the extent of incorrect inference strongly depends on the cognitive task considered. These results emphasize the importance of understanding the fundamental logic processes that contribute to information flow in cognitive processing, and quantifying their relevance in any given nervous system. 1.2.4 Evolution of event duration perception and implications on attentional entrainment While cognitive theory has advanced several candidate frameworks to explain attentional entrain- ment, the neural basis for the temporal allocation of attention is unknown. Here I present a new model of attentional entrainment guided by empirical evidence obtained using a cohort of 50 ar- tificial brains. These brains were evolved in silico to perform a duration judgment task similar to one where human subjects perform duration judgments in auditory oddball paradigms. I found that the artificial brains display psychometric characteristics remarkably similar to those of human lis- teners, and exhibit similar patterns of distortions of perception when presented with out-of-rhythm oddballs. A detailed analysis of mechanisms behind the duration distortion suggests that attention peaks at the end of the tone, which is inconsistent with previous attentional entrainment models. Instead, the suggested model of entrainment emphasizes increased attention to those aspects of the stimulus that the brain expects to be highly informative. 16 CHAPTER 2 EVOLUTION LEADS TO A DIVERSITY OF MOTION-DETECTION NEURONAL CIRCUITS One of the most astonishing aspects of life is the overwhelming amount of diversity that has existed throughout life’s history. Ever since Charles Darwin published On the Origin of Species, evolutionary biologists have tried to understand the processes that lead to biological diversity [41]. On the micro scale, the question of how genetic diversity is maintained within a population has been of interest to population geneticists [89, 105, 180, 12] for decades; work on this topic still continues to this day [58]. In a similar fashion, ecologists have long been interested in the ecological and evolutionary processes that lead to the origins [156, 135] and maintenance [162, 35] of species diversity. 
The rise of cheap sequencing technologies in recent years has led to the recognition of another characteristic of biological diversity, molecular diversity [187], or diversity in the sense that multiple genotypes can lead to the same phenotype [57]. In other words, evolution can lead to a diversity of genetic circuits across species [194]. The evolutionary principles that lead to molecular diversity in genetic systems has been well- explored. The relationship between genotype and phenotype must be many-to-one to allow for the existence of neutral evolutionary trajectories between genotypes. Computational studies of metabolic networks, gene regulatory networks, and RNA-structure networks [reviewed in [200]] all show evidence of neutral paths that conserve phenotypes between different genotypes. Many-to-one genotype-phenotype mappings are even present in artificial digital evolution systems [e.g., [100, 50, 99]] and evolutionary simulations of digital logic circuits [157]. Empirical studies of bio- logical systems suggest the existence of multiple genotypes encoding similar phenotypes, either through genetic analysis [195, 181], comparative genomics [39], or experimental evolution [106, 73]. However, the evolutionary reasons why populations evolve one genotype instead of another genotype, or which evolutionary processes lead to the evolution of different genotypes, are largely unexplored in biological systems due to the difficulty of deciphering every possible evolutionary 17 trajectory and process, and the waiting time required for many of these evolutionary events to occur [but see [106]]. This difficulty presents a prime opportunity for artificial life and digital evolution studies to perform “digital genetics" and test hypotheses for why some populations, but not others, evolve certain genotypic characteristics [1]. Genetic circuits are not the only biological network shaped by evolution. Neuronal circuits are also shaped by selective pressures, and much work has been devoted to understand those. Much of the literature, however, has focused on whether evolution optimizes the wiring patterns of a brain, or the efficiency of the circuitry [see, e.g., the discussion in chapter 7 of [175]]. For example, it is clear that the wiring pattern of the neuronal circuitry of the roundworm Caenorhabditis elegans is not optimal [7]. At the same time, there appear to be certain network motifs that are strongly favored in the worm brain [154], suggesting that evolution has a hand in optimizing computational efficiency. However, very little is known about the wiring diversity underlying circuits with the same function. According to the principles of evolvability and robustness discussed above, such diversity could be key for the adaptability of brains. In fact, both modeling [152] and empirical [56] studies have shown that neuronal circuits can vary in their internal parameters but lead to the same functional output [111]. And while many of these studies examine variation within one species [56], similar results have also been found between species, suggesting evolutionary mechanisms can also cause these differences [171]. This outcome is not surprising, as evolution and natural selection is expected to primarily act on the function, not the circuit encoding said function [171]. These results motivate the question as to how and why evolution leads to neuronal circuits with different characteristics for the same function. 
Here we use digital evolution to study the evolution of neuronal circuits for visual motion detection. Perception of moving objects in the environment is of utmost significance from an evolutionary standpoint since it can be critical to survival of animals (including humans); detecting predators, prey, or falling objects can pose a live or die question [143]. In the 1950s, Werner Reichardt along with Bernhard Hassenstein proposed a simple computational model [now known as the Reichardt detector], that is based on a delay-and-compare scheme [65]. The main idea behind 18 this model is that a moving object stimulates two adjacent receptors (or regions) in the retina at two different time points. In Fig. 2.1(A), an object (a star) is moving from left to right stimulating two adjacent receptors n1 and n2, at time points 𝑡 and 𝑡 + Δ𝑡. In the neural circuit illustrated in Fig. 2.1(A), which is a portion of the entire Reichardt detector circuit, 𝜏 functions as a temporal filter that delays the received stimulus from receptor n1. This delayed signal will then be multiplied (in the × neuron) with the stimulus received in n2 at 𝑡 + Δ𝑡. This multiplication result, therefore, detects motion from left to right. However, this half-circuit only detects motion in one direction. In the full Reichardt detector circuit shown in Fig. 2.1(B), the outcome of the multiplication from two similar computations, but in opposite directions, are subtracted. Thus, the result will be a positive value for left to right motion (also called preferred direction, PD), and negative for right to left motion, (termed the null direction, ND). Figure 2.1: (A) A half Reichardt detector circuit. An object (star) moving from left to right stimulating two adjacent receptors, n1 and n2, at time points 𝑡 and 𝑡 + Δ𝑡. (B) A full Reichardt detector circuit. In full Reichardt detector circuits, the results of the multiplications from each half circuit are subtracted. Beyond the Reichardt detector, other types of motion detection models were also proposed, e.g. edge-based models [113] and spatial-frequency-based models [6]. However, most computational motion detection models are based on the delay-and-compare scheme [143]. For example, the Barlow-Levick (BL) motion detection model [13] is similar to the Reichardt model in that it also 19 employs asymmetric temporal filtering of signals that are then fed to a non-linearity component, but they differ in the location of the filter and type of non-linearity component. While motion detection in mammals and in particular humans is expected to be far more complex, there are significant similarities to the basic Reichardt detector logic [22], and thus the Reichardt detector “module" of motion detection is likely a key component of all motion detection circuits. Using digital experimental evolution methods, we found that motion detection circuits can be encoded by a wide diversity of neuronal architectures. Evolved brains differ in the logic gates used to perform motion detection, in the wiring between these logic gates, in the presence of redundant logic gates, and in their total complexity (i.e., number of logic gates). We explored the evolutionary significance in complexity variation between brains by evolving brains using a handwritten optimal motion detection circuit as the ancestor. These brains also increased in complexity although no improvement in the performance of their circuit could occur. Instead, these brains evolved greater complexity due to selection for mutational robustness. 
These results suggest that different species may evolve different circuits for similar neuronal functions.

2.1 Methods

In this study, we use an agent-based model to study the evolution of computational visual motion detection circuits. In this model, agents embody neural networks known as “Markov brains” (MB) [69]. Markov brains have three different types of neurons that help the agent interact with the outside world: 1) sensory neurons, which receive information from the environment, 2) hidden neurons, which constitute the agent’s processing unit, and 3) decision (“motor”) neurons, which function as the actuators of the agent. In other words, sensory neurons are written to by the surrounding environment, hidden neurons process the received information, and the decision neurons specify the actions of the agent in its environment.

Markov brains are evolvable networks of neurons in which the neurons are connected via probabilistic/deterministic logic gates. In the experimental setup used in this study, a logic neuron is a binary variable whose state is either 0 or 1 (it is quiescent or it fires); these logic neurons are thought to represent the state of groups of biological neurons. The states of the neurons are updated in a Markov fashion, i.e., the probability distribution of the states of the neurons at time step 𝑡 + 1 depends only on the states of the neurons at time step 𝑡, as shown in Fig. 2.2(A). That figure shows a Markov brain with 11 neurons and two hidden Markov gates (HMGs) at two consecutive time steps 𝑡 and 𝑡 + 1. Hidden Markov gates determine how the states of the neurons at time step 𝑡 + 1 are updated given the states of the neurons at time 𝑡. For example, in Fig. 2.2(B), gate 1 takes the states of neurons 0, 2, and 6 as inputs and writes updated states to output neurons 6 and 7. Each hidden Markov gate has a probabilistic logic table that specifies the probability of every possible output given the states of the input (Fig. 2.2(C)). That figure shows the probability table of gate 1, with 8 rows for all possible input states and 4 columns for all possible output states (note that there are 2^3 = 8 possible input states for 3 binary inputs and, similarly, 2^2 = 4 possible output states for 2 binary outputs). Each entry in the table represents the probability of a specific output given a particular input. For instance, p_{53} represents the probability of obtaining the output state ⟨1, 1⟩, with decimal representation 3, given the input state ⟨1, 0, 1⟩, with decimal representation 5. As a result, the sum of the probabilities in each row must equal 1. In this work, we constrain hidden Markov gates to be deterministic; therefore, the output state is always the same for a given input (all probabilities in the table are either 0 or 1, and exactly one entry in each row equals 1).

Markov brains can evolve to perform a variety of tasks such as active categorical perception [115], swarming in predator-prey interactions [139], collision avoidance strategies using optical flow classification in fruit flies [186], and decision making strategies in humans [97]. In the evolutionary process, the connections of the networks and the underlying logic of the connected gates change (evolve), and therefore the agents adapt to their environment. More specifically, the number of gates, how each gate is connected to its input/output neurons, and the logic tables of the gates are subject to evolution. However, the total number of neurons and the number of neurons of each type (i.e., sensory neurons, hidden neurons, and decision neurons) do not change during evolution.
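To make the deterministic update rule described above concrete, the following is a minimal Python sketch of a single Markov brain update step. The class and function names, and the convention for combining multiple gates that write to the same neuron, are illustrative assumptions of this sketch and are not taken from the actual Markov brain implementation.

```python
# Minimal sketch of a deterministic Markov brain update (illustrative only;
# names and structure are hypothetical, not the actual MB implementation).

class DeterministicGate:
    """A 2-input, 1-output deterministic hidden Markov gate."""
    def __init__(self, in_ids, out_id, truth_table):
        self.in_ids = in_ids            # indices of the two input neurons
        self.out_id = out_id            # index of the single output neuron
        self.truth_table = truth_table  # 4 entries, one per input combination

    def output(self, states):
        # Encode the two binary inputs as a row index 0..3 and look up the output.
        row = 2 * states[self.in_ids[0]] + states[self.in_ids[1]]
        return self.truth_table[row]

def update(states, gates):
    """Compute neuron states at time t+1 from states at time t (Markov property)."""
    new_states = [0] * len(states)
    for g in gates:
        # If several gates write to the same neuron, OR their outputs together
        # (one possible convention; the real platform may resolve conflicts differently).
        new_states[g.out_id] |= g.output(states)
    return new_states

# Example: a single gate computing AND of neurons 0 and 2, writing to neuron 6.
gates = [DeterministicGate((0, 2), 6, truth_table=[0, 0, 0, 1])]
states_t = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # 11 neurons, as in Fig. 2.2(A)
print(update(states_t, gates))                 # neuron 6 fires at time t+1
```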
In our experimental setup, for instance, we use MBs with 16 neurons in which two neurons (neurons 1 and 2) are designated as sensory neurons and two neurons (neurons 15 and 16) are assigned as decision neurons, while the remaining 12 neurons are hidden neurons.

Figure 2.2: (A) A Markov brain with 11 neurons and 2 gates shown at two time steps 𝑡 and 𝑡 + 1. The states of neurons at time 𝑡 and the logic operations of gates determine the states of neurons at time 𝑡 + 1. (B) One of the gates of the MB whose inputs are neurons 0, 2, and 6 and whose outputs are neurons 6 and 7. (C) Probabilistic logic table of gate 1.

Figure 2.3: A Markov brain is encoded in a sequence of bytes that serves as the agent’s genome.

In order to evolve MBs, we apply a Genetic Algorithm (GA) to a population of MBs in which each MB is encoded in a genome as shown in Fig. 2.3. The genome of each MB is a sequence of numbers in the range [0, 255] (a sequence of bytes) that encodes hidden Markov gates (HMGs), their connections, and their logic. The arbitrary pair ⟨42, 213⟩ is chosen as the start codon for each gate. The next two bytes following the start codon encode the number of inputs and the number of outputs of the HMG, respectively. In our experimental setup, we constrained all HMGs to have exactly 2 inputs and 1 output; therefore, these two bytes are ignored in transcription. The subsequent (downstream) loci in the genome encode which neurons are connected to this HMG as inputs, which neuron is connected to its output, and finally the logic table of the HMG. In our experimental setup, we initialized each population with 100 genomes of 5,000 random bytes. We sprinkled those random bytes with four start codons in each genome to speed up initial evolution; thus, all genomes in the initial population have at least four random HMGs. As mentioned before, all HMGs in our setup are deterministic and have 2 inputs and 1 output. As a result, HMGs can only have 16 possible logic tables. We ran 100 replicates of this experiment for 10,000 generations with mutations, roulette wheel selection, and 5% elitism. The GA configuration is presented in more detail in Table 2.1.

Table 2.1: Genetic Algorithm configuration. We evolved 100 populations of 100 MBs for 10,000 generations with point mutations, deletions, and insertions. We used roulette wheel selection, with 5% elitism, and with no cross-over or immigration.
Population size: 100
Generations: 10,000
Initial genome length: 5,000
Initial start codons: 4
Point mutation rate: 0.5%
Gene deletion rate: 2%
Gene duplication rate: 5%
Elitism: 5%
Crossover: None
Immigration: None

The fitness function is designed to evolve MBs that function as a visual motion detection circuit. In doing so, two sets of stimuli are presented to the agent in two consecutive time steps and the agent classifies the input as either motion in the preferred direction (PD), stationary, or motion in the null direction (ND). Neurons 1 and 2 (the sensory neurons) represent two adjacent receptors, separated by a fixed distance, that can sense the presence or absence of a visual stimulus. The binary value of a sensory neuron becomes 1 when a stimulus is present, and it becomes (or remains) 0 otherwise (see Fig. 2.4). Thus, there are 16 possible sensory patterns that can be presented to the agent (2 binary neurons at 2 time steps). Among these 16 input patterns, 3 are PD, 3 are ND, and the other 10 are stationary patterns.
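For illustration, the 16 two-time-step sensory patterns can be enumerated and labeled as in the Python sketch below. The particular PD/ND assignment shown is an assumed example that merely reproduces the counts stated above (3 PD, 3 ND, 10 stationary); the experiments use the patterns defined in Fig. 2.4, which may assign individual patterns differently.

```python
# Enumerate the 16 two-time-step input patterns for two sensors and assign
# illustrative class labels. The specific PD/ND assignment below is an assumed
# example consistent with the counts in the text (3 PD, 3 ND, 10 stationary).
from itertools import product

def classify(pattern):
    # pattern is ((s1_t, s2_t), (s1_t+1, s2_t+1)) for the two sensory neurons.
    # Rightward (preferred-direction) edge motion: activity appears to shift
    # from sensor 1 toward sensor 2 between the two time steps.
    pd = {((1, 0), (0, 1)), ((1, 0), (1, 1)), ((1, 1), (0, 1))}
    nd = {((0, 1), (1, 0)), ((0, 1), (1, 1)), ((1, 1), (1, 0))}
    if pattern in pd:
        return "PD"
    if pattern in nd:
        return "ND"
    return "stationary"

patterns = [(a, b) for a in product((0, 1), repeat=2)
                   for b in product((0, 1), repeat=2)]
labels = [classify(p) for p in patterns]
print(len(patterns), {c: labels.count(c) for c in ("PD", "ND", "stationary")})
# -> 16 {'PD': 3, 'ND': 3, 'stationary': 10}
```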
Agents classify the sensory pattern with 2 decision neurons, neurons 15 and 16. We assigned the sum of the values of the decision neurons to represent the category of the sensory pattern: when both decision neurons fire (sum = 2), the sensory pattern is classified as PD; when only one of the decision neurons fires (sum = 1), the sensory pattern is classified as stationary; and when neither fires (sum = 0), the sensory pattern is classified as ND. We chose this encoding of the three classes of input pattern to facilitate the evolution of motion detection circuits. In preliminary experiments, we tried three different methods of encoding input pattern classes and found this one to evolve the fastest. In those preliminary experiments, we tried the following alternative encodings: assigning one neuron to each class (i.e., three decision neurons); assigning the decimal value of the pair of decision neurons to each class, i.e., 00 → ND, 01 → stationary, 10 → PD, ignoring 11; and finally assigning the sum of the values of the decision neurons to each class. For the last two encoding methods, we tried all possible permutations of encodings, and the one we chose consistently led to the best results.

Figure 2.4: Schematic examples of three types of input patterns received by the two sensory neurons at two consecutive time steps. Grey squares show presence of the stimuli in those neurons. (A) Preferred direction (PD). (B) Stationary stimulus. (C) Null direction (ND).

All agents of the population are evaluated on all 16 possible sensory patterns and gain a reward for each correct classification (there is no reward or penalty for incorrect classifications). The reward value for correct classification of each class is inversely proportional to that class’s frequency: the reward for PD and ND patterns is 10, and the reward for correct classification of stationary patterns is 3. However, in the results presented in the next section, all fitness values are normalized to take a maximum value of 100.

2.2 Results

After evolving 100 populations for 10,000 generations, we isolated one of the genotypes with the highest score from each population and analyzed its ability to function as a motion detection circuit. Seventy-five of the one hundred brains evolved a perfect motion detection circuit (correct classification of all 16 patterns); we used those brains for the rest of our analysis. A preliminary analysis of our evolved brains suggested that evolution led to a wide diversity of neuronal circuit architectures. Amongst our population of 75 brains, we found both relatively simple neuronal circuits (Fig. 2.5(A)) and more complex neuronal circuits (Fig. 2.5(B)), suggesting that not only does evolution lead to a large number of different motion detectors, but these detectors also vary in complexity (defined here as the number of gates composing a circuit).

Figure 2.5: Markov brains evolve alternative circuits to encode a motion detection circuit (duplicated logic gates with the same inputs and outputs are omitted). (A) Example of a simple evolved motion detection circuit. (B) Example of a complex evolved motion detection circuit. Gate symbols are US Standard.

To gain a better understanding of the diversity of neuronal circuits evolved in this study, we performed gate-knockout assays on all 75 brains.
We sequentially eliminated each logic gate and re-measured the mutant brain’s fitness, thus allowing us to estimate which gates were essential to the motion detection function (if mutant fitness decreased) and which gates were redundant to the motion detection function (if mutant fitness was equal to the ancestral fitness). There was a wide distribution in the number of essential logic gates, ranging from two logic gates to ten logic gates, with a mean of 4.82 gates (Fig. 2.6(A)). This result supports the idea that there is a wide diversity of possible motion detection circuits available to evolution. We also measured the number of redundant logic gates and found that our evolved brains possessed an even greater number of gates that had no apparent contribution to the circuit’s function (Fig. 2.6(B)), suggesting either that a large portion of the complexity of these motion detection circuits evolved neutrally, or that selection for redundancy and mutational robustness is involved.

Figure 2.6: Evolved motion detection circuits vary greatly in complexity. (A) Histogram of the number of essential gates (i.e., gates that resulted in a fitness loss when removed) for each evolved motion detection circuit. (B) Histogram of the number of redundant gates (i.e., gates that resulted in no fitness loss when removed) for each evolved circuit.

We also examined the types of logic gates that were either essential or redundant in each brain by recording the average number of times each gate type was found within each evolved brain. We found surprising similarities between the distributions of the average presence of each logic gate for essential gates (Fig. 2.7(A)) and for redundant gates (Fig. 2.7(B)). The six most-abundant logic gates in both the essential and the redundant gate distributions were NOR, OR-NOT, AND-NOT, NOT, COPY, and EQU. These results suggest either that evolved motion detection circuits may incorporate whichever gates are most easily evolved (in the sense that they interact with other gates without fitness trade-offs) or that they may have evolved the same redundant gates as their essential gates in order to encode robustness against mutations.

Figure 2.7: Distribution of specific gates used in evolved motion detectors. (A) Average number of essential logic gates of each type of logic gate per evolved brain. Error bars represent 95% confidence intervals. (B) Average number of redundant logic gates of each type of logic gate per evolved brain. Error bars represent 95% confidence intervals.

Multiple pieces of evidence suggest that the complexity of our evolved brains did not evolve solely to perform the motion-detection function. Our evolved brains are more complex than required to encode a motion detection circuit (Fig. 2.6). The large abundance of redundant gates suggests that these brains are either neutrally evolving increased complexity or evolving mutational robustness due to high mutation rates. The similarities in the distributions of both essential and redundant logic gates suggest either that certain gates arise due to their intrinsic abundance in the fitness landscape, or that they can compensate for mutations to otherwise essential gates. Therefore, to test the reason behind our evolved brains’ complexity, we hand-wrote a simple Reichardt detector with optimal individual fitness (Fig. 2.8(A)), evolved 100 populations under the same protocol as before, and repeated our knockout analysis.
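For reference, the gate-knockout assay used throughout this section can be sketched in Python as follows; the fitness function and brain representation here are assumed placeholders, not the actual analysis code.

```python
# Sketch of a single-gate knockout assay (illustrative; evaluate_fitness and the
# brain representation are assumed placeholders, not the platform's interfaces).

def knockout_assay(brain_gates, evaluate_fitness):
    """Classify each gate as essential or redundant by removing it and re-scoring."""
    baseline = evaluate_fitness(brain_gates)
    essential, redundant = [], []
    for i in range(len(brain_gates)):
        mutant = brain_gates[:i] + brain_gates[i + 1:]   # brain with gate i removed
        if evaluate_fitness(mutant) < baseline:
            essential.append(i)
        else:
            redundant.append(i)
    return essential, redundant

# Toy example: "fitness" is just the number of distinct output neurons driven,
# so a duplicated gate writing to an already-driven neuron counts as redundant.
toy_brain = [("AND", 6), ("OR", 7), ("AND", 6)]     # (gate type, output neuron)
toy_fitness = lambda gates: len({out for _, out in gates})
print(knockout_assay(toy_brain, toy_fitness))       # -> ([1], [0, 2])
```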
If the evolution of complexity was either non-adaptive or due to selection for increased redundancy and robustness, we would expect these simple brains to increase in complexity upon further evolution. However, if the motion detector circuit’s evolved complexity is due to difficult-to-break historical contingency, we would expect little change in the brains evolved from hand-written Reichardt detectors. (a) (b) (c) Figure 2.8: Evolution of a simple Reichardt detector leads to greater complexity. (A) Diagram of a hand-written Markov Brain encoding a simple Reichardt detector (B) Distribution of the number of essential gates for brains evolved from a hand-written ancestor. (C) Mutational sensitivity of evolved motion detectors. The results from the knockout analysis demonstrated that the brains evolved from a hand-written 28 Reichardt detector increased in complexity when evolved further (Fig. 2.8(B)), suggesting that the increased complexity seen in Fig. 6 was not due to historical contingency, but to other evolutionary factors. To test if these evolved brains were shaped by selection for mutational robustness, we measured the mutational sensitivity of each brain by calculating the average fitness loss from removing one logic gate and multiplying this loss by the total number of gates in each brain. Those evolved brains were less mutationally-sensitive (or more mutationally-robust) than their hand- written ancestor (Fig. 2.8(C)), suggesting that the additional gates evolved in order to increase the brain’s robustness to mutations. However, we should also note that some brains did evolve a greater mutational sensitivity, suggesting that either robustness was evolved beyond single-step mutations or that there is some role for non-adaptive evolutionary processes in driving circuit architecture. 2.3 Discussion We tested if a computational model could evolve a wide diversity of neuronal architectures, and studied evolutionary trends in the evolution of these neuronal architectures. We found that selection for motion detection does lead to a wide diversity of neuronal circuits even though each has the same overall function. Most brains are more complex than the standard model for motion detection: the Reichardt detector. Each brain uses many different logic-gate components, although some gates are more common than others. A large portion of the evolved complexity in these brains results from the evolution of redundant gates. We also showed that even hand-written Reichardt detectors increase in complexity when evolved further, suggesting that the large complexity is due to either non-adaptive evolution or selection for functional redundancy. Measurements of the evolved brains’ mutational sensitivity suggested they had indeed evolved mutational robustness, illustrating one additional selective pressure beyond basic functionality on the neuronal architecture of motion-detection circuits. We undertook this study to see if some of the trends detected in the evolution of genetic circuits occurred in the evolution of Markov brains [194]. As found in many other functional systems, including those based on biochemistry [200] and those based on various digital substrates [157, 50, 29 30], there is a wide variety of diverse neuronal architectures that can encode a motion-detection circuit that is logically equivalent to that of a Reichardt detector. Our results are in accordance with previous results that showed neuronal circuits with the same functional output could vary between species [171]. 
These results suggest that a diversity of neuronal architectures may exist for species across life. Our results also suggest that any system with interacting individual components that, when combined, lead to a functioning circuit may possess a diversity of circuits that provide the same function. While it is perhaps not surprising that our evolved digital brains are different from the default Reichardt detector encoding, we did not expect them to be much more complex. Thus, it is worth discussing how some of our experimental design decisions could have influenced these differences. One likely difference between our evolved brains and real brains is the lack of any fitness cost for larger brains in our model. If each neuron or logic gate was associated with a fitness cost, then one would intuitively expect the evolved brains to be simpler than what we found them to be. On the other hand, neuro-anatomical evidence has suggested that wiring length and connection cost do not appear to be minimized in brains [see also [68]]. Another difference between digital and biological brains is that we only selected on one trait here. The evolution of neuronal circuits is likely constrained by pleiotropic interactions with other functional circuits, as with genetic systems [174]. Finally, compared to biological systems, Markov brains evolved under of a very high mutation rate, something that is known to alter the evolution of genetic architecture towards mutational robustness [203]. It is likely that Markov brains would have evolved less-complex circuits with a decreased mutation rate, although the magnitude of this effect is not known. We envision the results we presented here as a first step in establishing Markov brains as a model system to study the potential neuronal architectures evolved by Darwinian natural selection. Some of the limitations discussed above present fruitful avenues for future work that may lead to further insights into the evolutionary potential of biological brains. Although we did not attempt a more-precise classification of our evolved circuits beyond their complexity and their specific logic 30 gates, we see this as a possible endeavor. If the addition of further selection pressures results in the evolution of simpler brains than those evolved here, this task should be achievable. Such studies should lead to a more predictable theory of the diversity of neuronal circuits. 31 CHAPTER 3 FLIES AS SHIP CAPTAINS? DIGITAL EVOLUTION UNRAVELS SELECTIVE PRESSURES TO AVOID COLLISION IN DROSOPHILA 3.1 Introduction How animals make decisions has always been an interesting, yet controversial, question to scientists [125] and philosophers alike. Animals obtain various types of sensory information from the environment and then process these information streams so as to take actions that benefit them in survival and reproduction. The visual system plays an important role in providing animals information about their environment, for example when foraging for food, detecting predators or prey, and when searching for potential mates. One of the primary components of visual information is motion detection. Motion is a fundamental perceptual dimension of visual systems [20] and is a key component in decision making in most animals. Here, we study a very particular type of motion detection and concomitant behavior (collision avoidance) in Drosophila melanogaster (the common fruit fly), and attempt to unravel the selective (i.e., evolutionary) pressures that might have given rise to this behavior. 
D. melanogaster shows a striking difference in behavior when exposed to two different types of optical flow. [25] recorded the interaction of groups of fruit flies in a planar covered arena (so that they could only walk, not fly) and used computer vision algorithms to analyze the walking trajectories in order to study fly behavior. Their analysis revealed that female fruit flies stop walking when they perceive another fly’s motion from back-to-front in their visual field (an optical flow referred to as “regressive motion”) whereas they keep walking when perceiving conspecifics moving from front-to-back in their visual field (referred to as “progressive motion,” see Figure 3.1). [207] further investigated this behavior and tested the “regressive motion saliency” hypothesis, suggesting that flies stop walking when perceiving regressive motion. They used a programmable fly-sized robot interacting with a real fly to exclude other sensory cues such as image expansion 32 Figure 3.1: An illustration of regressive (back-to-front, left) and progressive (front-to-back, right) optic flows in a fly’s retina. (“looming,” see [163]) and pheromones. Their results provide rigorous support for the regressive motion saliency hypothesis. Subsequently, [33] coined the term “generalized regressive motion” for optic flows in which images move clockwise on the left eye and conversely, counterclockwise on the right eye (see Figure 3.1). They presented a geometric analysis for two flies moving on straight, intersecting trajectories with constant velocities and showed that the fly that reaches the intersection first always perceives progressive motion on its retina, whereas the one that reaches the intersection later perceives regressive motion at all times before the other fly reaches the intersection. They went on to suggest that this behavior is a strategy to avoid collisions during locomotion similar to the rules that ship captains use when moving on intersecting paths (see, e.g., [110]). As intriguing as this hypothesis may seem, it is not clear a priori which selective pressures or environmental circumstances could give rise to this behavior. For example, it is unclear whether collision avoidance provides a significant enough fitness benefit. As a consequence, it is possible that the behavior has its origin in a completely different cognitive constraint that is fundamentally unrelated to collision avoidance, or to the rules that ship captains use to navigate the seas. While such questions are difficult to answer using traditional behavioral biology methods, Artificial Life offers unique opportunities to test these hypotheses directly. In this study, we tested whether collision avoidance can be a sufficient selective pressure for 33 the described behavior to evolve. We also investigated the environmental conditions under which this behavior could have evolved, in terms of the varying costs and benefits involved. By using an agent-based computational model (described in more detail below), we studied how the interplay (and trade-offs) between the necessity to move and the avoidance of collisions can result in the evolution of regressive motion saliency in digital flies. Digital evolution is currently the only technique that can study hypotheses concerning the selec- tive pressures necessary (or even sufficient) for the emergence of animal behaviors, as experimental evolution with animal lines of thousands of generations is impractical. 
In digital evolution, we can study the interplay between multiple factors such as selective pressures, environmental condi- tions, population size and structure, etc. For example, Olson et al. ([140]) used digital evolution to show that predator confusion is a sufficient condition to evolve swarming behavior, but they also found that collective vigilance can give rise to gregarious foraging behavior in group-living organisms [137]. In principle, any one hypothesis favoring the emergence of behavior can be tested in isolation, or in conjunction [137]. 3.2 Methods 3.2.1 Markov Networks We use an agent-based model to simulate the interaction of walking flies with moving objects (here, potentially conspecifics) in a two-dimensional virtual world. Agents have sensors to perceive their surrounding world (details below) and have actuators that enable them to move in the environment. Agent brains in our experiment have altogether twelve sensors, three internal processing nodes, and one output node (the actuator). The brain controlling the agent is a “Markov network brain” (MNB), which is a probabilistic controller that makes decisions based on sensory inputs and internal nodes [45]. Each node in the network (i.e., sensors, internal nodes, and actuators) can be thought of as a digital (binary) neuron that either fires (value=1), or is quiescent (value=0). Nodes of the network are connected via Hidden Markov Gates (HMGs) that function as probabilistic logic gates. Each HMG is specified by its inputs, outputs, and a state transition table that specifies the 34 Figure 3.2: Probabilistic logic gates in Markov network brains with three inputs and two outputs. One of the outputs writes into one of the inputs of this gate, so its output is “hidden.” Because after firing all Markov neurons automatically return to the quiescent state, values can only be kept in memory by actively maintaining them. Probability table shows the probability of each output given input values. probability of each output state based on input states (Figure 3.2). For example, in the transition table of Figure 3.2 (a three-input, two-output gate), the probability 𝑝 73 controls the likelihood that the output state is 3 (the decimal equivalent of the binary pattern 11, that is, both output neurons fire) given that the input happened to be state 7 (the decimal translation of 111, i.e., all inputs are active). MNBs can consist of any number of HMGs with any possible connection arrangement, given certain constrains (see for example [45]). Figure 3.3: An illustration of a portion of genome containing two genes that encode two HMGs. The first two loci represent start codon (red blocks), followed by two loci that determine the number of inputs and outputs respectively (green blocks). The next four loci specify which nodes are inputs of this gate (blue blocks) and the following four specify output nodes (yellow blocks). The remaining loci encode the probabilities of HMG’s logic table (cyan blocks). The number of gates, their connections, and how they work is subject to evolution and changes across individuals and through generations. For this purpose, the agent’s brains are encoded in a 35 genome, which is an ordered sequence of integers, each in the range [0,255], i.e., one byte. Each integer (or byte) is a locus in the genome and specific sequences of loci construct genes, where each gene codes for one HMG. 
The “start codon” for a gene (i.e., the sequence that determines the beginning of the gene) in our encoding is the pair (42, 213) (these numbers are arbitrary). Each gene encodes exactly one HMG, for example as shown in Figure 3.3. The gene specifies the number of inputs/outputs of the HMG, which nodes it reads from and writes to (the connectivity), and the probability table that determines the gate’s function. As shown in Figure 3.3, the first two bytes are the start codon, followed by one byte that specifies the number of inputs and one byte for the number of outputs. The bytes are modulated so as to encode the number of inputs and outputs unambiguously. For example, the byte encoding the number of inputs is an integer in [0, 255], whereas an HMG can take a maximum of four inputs; thus, we use a mapping function that generates a number in [1, 4] from the value of this byte. The next four bytes specify the inputs of the HMG, followed by another four bytes specifying where it writes to. The remaining bytes of the gene are mapped to construct the probabilistic logic gate table.

MNBs have been used extensively in the last five years to study the evolution of navigation [45, 84], the evolution of active categorical perception [116, 9], and the evolution of swarming behavior as noted earlier, as well as how visual cortices [34] and hierarchical groups [71] form. In this work, we force the gates to be deterministic rather than probabilistic (all values in the logic table are 0 or 1), which turns our HMGs into classical logic gates.

3.2.2 Experimental Configurations

We construct an initial population of 100 agents (digital flies), each with a genome initialized with 5,000 random integers containing four start codons (to jump-start evolution). Agents (and by proxy the genomes that determine them) are scored based on how they perform in their living environment. The population of genomes is updated via a standard Genetic Algorithm (GA) for 50,000 generations, where the next generation of genomes is constructed via roulette wheel selection combined with mutations (detailed GA specifications are listed in Table 3.1). To control for the effects of reproduction and similar effects, there is no crossover or immigration in our GA implementation. Each digital fly is put in a virtual world for 25,000 time steps, during which time its fitness score is evaluated. During each time step in the simulation, the agent perceives its surrounding environment, processes the information with its MNB, and makes movement decisions according to the MNB outputs. The sensory system of a digital fly is designed such that it can see surrounding objects within a limited distance of 250 units, in a 280° pixelated retina shown in Figure 3.4. The state of each sensor node is 0 (inactive) when it does not sense anything within that radius, and turns to 1 (active) if an object is projected at that position in the retina. Agents in this experiment have one actuator node that enables them to move ahead or stop, for active (firing) and non-active (quiescent) states respectively.

Table 3.1: Configurations for the GA and the environmental setup.
GA parameters:
Population size: 100
Generations: 50,000
Point mutation rate: 0.5%
Gene deletion rate: 2%
Gene duplication rate: 5%
Initial genome length: 5,000
Initial start codons: 4
Crossover: None
Immigration: None
Environment parameters:
Vision range: 250
Field of vision: 280°
Collision range: 60
Agent velocity: 15
Event time steps: 250
No. of events: 100
Moving reward: 0.004
Collision penalty: 1, 2, 3, 5, 10
Replicates: 20
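As an illustration of how such a pixelated retina can be computed, the Python sketch below maps a point object’s relative position onto one of 12 sensors covering a 280° field of view out to a range of 250 units. The angle conventions and the placement of the blind spot directly behind the agent are assumptions of this sketch rather than the exact layout used in the experiments.

```python
import math

# Sketch of the 12-pixel, 280-degree retina described above (illustrative;
# the exact sensor layout and angle conventions in the experiments may differ).
N_SENSORS = 12
FIELD_OF_VIEW = 280.0      # degrees
VISION_RANGE = 250.0       # units

def retina_state(obj_dx, obj_dy, heading_deg=0.0):
    """Return the 12 binary sensor states for a point object at (dx, dy)
    relative to the agent; the blind spot is assumed to be centered behind it."""
    sensors = [0] * N_SENSORS
    dist = math.hypot(obj_dx, obj_dy)
    if dist > VISION_RANGE:
        return sensors
    # Bearing of the object relative to the agent's heading, in (-180, 180].
    bearing = (math.degrees(math.atan2(obj_dy, obj_dx)) - heading_deg + 180.0) % 360.0 - 180.0
    if abs(bearing) > FIELD_OF_VIEW / 2:
        return sensors                      # inside the 80-degree blind spot
    # Map the visible arc [-140, +140] degrees onto sensor indices 0..11.
    idx = int((bearing + FIELD_OF_VIEW / 2) / (FIELD_OF_VIEW / N_SENSORS))
    sensors[min(idx, N_SENSORS - 1)] = 1
    return sensors

print(retina_state(100.0, 50.0))   # nearby object activates one sensor
print(retina_state(300.0, 0.0))    # out of range: all sensors remain 0
```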
In our experiment, the digital flies exist in an environment in which they must move to gain fitness, representing the fact that organisms must forage for resources, find mates, and avoid predators. Thus, the fitness function is set so that agents are rewarded for moving ahead at each update of the world, and are penalized for colliding with objects. The amount of fitness they gain for moving (the benefit) is characteristic of the environment, and we change it in different treatments. The penalty for collisions represents the importance of collision avoidance for their survival and reproduction, and we vary this cost as well. Each digital fly sees 100 moving objects (one at a time) during its lifetime, and we say that it experiences 100 “events.”

Figure 3.4: The digital fly and its visual field in the model. Flies have a 12 pixel retina that is able to sense surrounding objects in 280° within a limited distance (250 units). The red circle is an external object that can be detected by the agent within its vision field. Activated sensors are shown in red, while inactive sensors are blue. In (A) the object activates two sensors, in (B) the object is detected in one sensor, and in (C) the object is outside the range.

The penalty-reward ratio (PR) is the penalty for a collision divided by the reward for moving during the entirety of an event. So, for example, PR = 1 means the agent loses all the reward it gained by walking during the whole event if it collides with the object in that event:

\text{fitness} = \sum_{\text{events}} \left(\text{reward} - PR \times \text{collision}\right) ,   (3.1)

where reward ∈ [0, 1] reflects how many time steps the agent moved during the event, and collision is 1 if the agent collided with the object during that event and 0 otherwise. Our experiments are constructed such that all objects that produce regressive motion in the digital retina will collide with the fly if it keeps moving. The reason for biasing our experiments in this manner is explained in the following section.

3.2.3 Collision Probability in Events with Regressive Optic Flow

As mentioned earlier, Chalupka et al. ([33]) showed that for two flies moving on straight, intersecting trajectories with constant velocities, the fly that reaches the intersection first always perceives progressive motion on its retina, while the counterpart that reaches the intersection later perceives regressive motion at all times before the first fly reaches the intersection. However, this does not imply that all objects that produce regressive motion on a fly’s retina will necessarily collide with it. In this section we present a mathematical analysis to determine how often objects that produce regressive motion on the fly’s retina will eventually collide with the fly if it continues walking.

Figure 3.5: An illustration of a moving fly at the onset of the event.

Suppose a fly moves on a straight line with constant velocity 𝑽_fly and an object is also moving on a straight line with constant velocity 𝑽_obj (Figure 3.5(A)). The fly is able to perceive objects within distance R_vis, its vision range (Figure 3.5(A)). The object is assumed to be a point in the plane, and the distance between this point and the center of the visual field of the fly is defined to be the distance between them. We define “the onset of the event” as the first time the object is detected by the fly. At the onset of the event, the object is at distance R_vis from the fly, at a relative azimuthal angle α ∈ [0, π/2] (Figure 3.5(A)).
We assume that the object can be at any relative position 𝑹_vis = (R_vis, α) with equal probability (the probability distribution of α is uniform around the fly). Here and below, we represent vectors either in boldface or by the parameters that determine them within a planar polar coordinate system; thus, the vector 𝑹 is represented by (|𝑹|, φ), where R_x = R cos φ and R_y = R sin φ. The velocity of the object can be represented as 𝑽_obj = (V_obj, θ), where θ ∈ [−π/2, π/2] (note that 𝑽_obj is constant). We also assume that the velocity of the object can point in all directions with equal probability (the probability distribution of θ is uniform). The relative velocity of the object with respect to the fly is 𝑽_rel = 𝑽_obj − 𝑽_fly (Figure 3.5). Since both 𝑽_obj and 𝑽_fly are constant, 𝑽_rel is also a constant vector.

3.2.3.1 Proposition 1.

A moving object produces regressive motion on a fly’s retina if:

\theta > -\alpha + \arcsin\left(\frac{V_{\text{fly}}}{V_{\text{obj}}}\cos\alpha\right) .   (3.2)

3.2.3.2 Proof.

In order for the object to produce regressive motion on the retina, the relative velocity should point above the center point O. The relative velocity direction γ can be found from 𝑽_rel = (V_rel, γ) as

\gamma = \arctan\left(\frac{V_{\text{rel},y}}{V_{\text{rel},x}}\right) = \arctan\left(\frac{V_{\text{obj}}\sin\theta - V_{\text{fly}}}{V_{\text{obj}}\cos\theta}\right) .   (3.3)

The angle γ should be greater than the central angle (Figure 3.5(B)), that is, γ > −α. Replacing γ and simplifying, we obtain:

\theta > -\alpha + \arcsin(\nu\cos\alpha) , \quad \nu = \frac{V_{\text{fly}}}{V_{\text{obj}}} .   (3.4)

For smaller values of θ, the object produces progressive optic flow. We thus define θ_min = −α + arcsin(ν cos α) as the minimum angle θ that produces regressive motion on the retina.

3.2.3.3 Definition 1.

The object remains “observable” to the fly after the onset of the event if its relative velocity is directed toward the inside of the fly’s vision field (to the left of the tangent line δ₁ in Figure 3.5(B)).

3.2.3.4 Proposition 2.

The object remains observable to the fly if:

\theta < \arccos\left(-\frac{V_{\text{fly}}}{V_{\text{obj}}}\sin\alpha\right) - \alpha .   (3.5)

3.2.3.5 Proof.

According to the definition, the sufficient condition for observability is that γ should be less than the angle of the tangent line δ₁: γ < −α + π/2. Replacing γ and simplifying, we obtain

\theta < \arccos(-\nu\sin\alpha) - \alpha .   (3.6)

For greater values of θ, the object will move out of the vision range of the fly. Thus the maximum value that θ can take on is:

\theta_{\max} = \arccos(-\nu\sin\alpha) - \alpha .   (3.7)

In order for the object to produce regressive motion on the fly’s retina and also remain observable to the fly, the relative velocity must lie within the arc ψ (Figure 3.5(B)).

3.2.3.6 Definition 2.

The object collides with the fly if its distance to the fly is less than the “collision range” R_coll (Figure 3.5(B)).

3.2.3.7 Proposition 3.

An object that creates regressive optic flow on the fly’s retina and remains observable will collide with it if:

\theta < \phi + \arcsin(\nu\cos\phi) , \quad \phi = \arcsin\left(\frac{R_{\text{coll}}}{R_{\text{vis}}}\right) - \alpha .   (3.8)

3.2.3.8 Proof.

The relative velocity of such an object lies within the arc ψ. The object will collide with the fly if its relative velocity lies within the arc spanned by the angle β, i.e., below the tangent line to the collision circle (Figure 3.5(B)). This condition holds true if:

\gamma < \beta - \alpha , \quad \beta = \arcsin\left(\frac{R_{\text{coll}}}{R_{\text{vis}}}\right) .   (3.9)

Let ρ = R_coll/R_vis and φ = β − α. Replacing γ and rearranging gives:

\theta < \phi + \arcsin(\nu\cos\phi) .   (3.10)

For greater values of θ, the object produces regressive motion on the fly’s retina but does not collide with it. The threshold collision angle is therefore given by:

\theta_{\text{col}} = \phi + \arcsin(\nu\cos\phi) .   (3.11)
As mentioned, we assume that the probability distribution of the direction of the object's velocity, 𝜃, is uniform.

3.2.3.9 Definition 3.

For an object at initial position 𝛼, the probability Πcoll is the range of velocity directions 𝜃 for which the object collides with the fly, divided by the range of directions for which it creates regressive optic flow on the fly's retina (see Figure 3.5(B)):

$$\Pi_{\mathrm{coll}}(\alpha, \nu, \rho) = \frac{\theta_{\mathrm{col}} - \theta_{\mathrm{min}}}{\theta_{\mathrm{max}} - \theta_{\mathrm{min}}} . \qquad (3.12)$$

Integrating this function over the range of possible initial relative positions, the probability that an event results in a collision, given that the object produces regressive motion on the fly's retina, can be found as:

$$\Pi_{\mathrm{coll}}(\nu, \rho) = \int_{\alpha_{\mathrm{min}}}^{\alpha_{\mathrm{max}}} \Pi_{\mathrm{coll}}(\alpha, \nu, \rho)\, d\alpha , \qquad (3.13)$$

where 𝛼min is either 0 or the minimum value of 𝛼 for which there exists a 𝜃 with which the object can produce regressive motion on the fly's retina, and 𝛼max is either 𝜋/2 or the maximum value of 𝛼 for which there exists a 𝜃 with which the object remains observable to the fly.

We calculated the integral (3.13) numerically and show the results in Figure 3.6 for different values of the fly-object velocity ratio 𝜈 and different collision range-vision range ratios 𝜌.

Figure 3.6: Probability of collision Πcoll(𝜈, 𝜌) with an object that creates regressive motion on the retina as a function of the ratio of collision radius to vision radius 𝜌, for different fly-object velocity ratios 𝜈.

As can be seen from Figure 3.6, for 𝑅vis = 60 mm [207] and 𝑅coll = 15 mm (our assumption), the collision probability is around 0.2-0.3. This implies that if encounters are created randomly, regressive motion on the retina is a poor predictor of collision, and as a consequence it is unreasonable to expect that digital evolution will produce collision avoidance in response, as only 1 in 5 to 1 in 3 regressive motions actually lead to collisions. This was borne out in experiments, and we thus decided to bias the events in such a manner that all events that leave a regressive motion signature in the retina lead to collision. Note that this is not necessarily an unrealistic assumption, as we have not analyzed a distribution of realistic "events" (such as is available in the data set of [25]). It could very well be that the way real flies approach each other differs from the uniform distributions that went into the mathematical analysis presented here.

3.3 Results

We conducted experiments with five different fitness functions representing different environments. Environments differ in the amount of fitness individuals gain when moving and in the penalty incurred by a collision. Evolved agents use various strategies to avoid collisions and maximize the travelled distance, but one of the most successful strategies they use is indeed to categorize visual cues into regressive and progressive optic flows. We find that agents categorize these visual cues only in some regions of the retina: the regions in which collisions take place more frequently. They then use this information to make a movement decision: they keep moving when seeing an object creating progressive optic flow on their retina, and stop when the object creates regressive optic flow on their retina. However, they do not stop for the entire duration of the event, i.e., the whole time they perceive regressive optic flow. Rather, they stop during only a portion of the event, which helps them avoid a collision with the object while maximizing their walking duration and hence gaining higher fitness.
The strategy of using regressive motion as a cue for collision [33], similar to the behavior observed in fruit flies [207], evolves in our experimental setup under some environmental conditions (discussed below). We refer to this strategy as regressive-collision-cue (RCC) and define it in our experimental setup as follows: 1) the moving object produces regressive motion on the agent's retina during an event and the agent stops at least for some time during that event, or 2) the moving object produces progressive motion on the agent's retina during an event and the agent does not stop during that event. The number of events (out of 100) in which the agent uses this strategy is termed the "RCC value."

We now discuss the results of an experiment in which the RCC strategy has evolved. We take the most successful agent at the end of that experiment and analyze its behavior. This agent evolved in an environment with a penalty-reward ratio of 2, meaning that the penalty for each collision equals twice the maximum reward the agent can gain in a single event. Figure 3.7 shows whether the agent stopped during an event (the stop probability, blue triangles) as a function of the angular velocity of the image on the agent's retina for 100 events. In that figure, the angular velocity of the image on the agent's retina is negative for regressive optic flow and positive for progressive events. Simulation units are converted to plotted values (in deg/s and mm/s) by matching the dimensionless quantities 𝜈 and 𝜌 between simulation and actual values: 𝑅vis = 60 mm [207], 𝑉fly = 20 mm/s [207], 𝑅coll = 15 mm (our assumption).

We can see from the figure that, out of all 100 events, the agent failed to stop during one event with regressive motion, while it stopped during two progressive events. In the remaining events the agent accurately uses the RCC strategy (resulting in an RCC value of 97).

Figure 3.7: The stop probability of the evolved agent vs. the angular velocity of the image on its retina for 100 events. Positive values of angular velocity show progressive motion events and negative angular velocities stand for regressive motion events. The average velocity of the agent during each event is also shown.

The average velocity of the agent during each event is also shown (solid orange circles), which reflects the number of time steps the agent moves during that event (and thus indirectly how often it stops). For progressive motions, the stop probability is zero (the agent continues to move during the event) and thus the velocity of the agent is maximal during that event. For regressive optic flow (negative angular velocities), the average velocity during each event is less than maximal, and it is higher for extreme angular velocities, as the agent only needs to stop for shorter durations to avoid collisions. In order to quantitatively analyze how using regressive motion as a collision cue helps agents gain more fitness, we traced this particular agent's evolutionary line of descent (LOD) by following its lineage backwards for 50,000 generations, mutation by mutation, until we reached the random agent that we used to seed the initial population (see [104] for more details on how to construct evolutionary lines of descent for digital organisms). Figure 3.8 shows the fitness and the RCC value vs. generation for this agent's LOD.
It is evident from these results that evolving this strategy benefits agents in gaining fitness compared to the rest of the population in this environment: high peaks of fitness occur at high RCC values and, conversely, the fitness drops as the RCC value decreases.

Figure 3.8: Fitness and regressive-collision-cue (RCC) value on the line of descent for an agent that evolved RCC as a strategy to avoid collisions. Only the first 20,000 generations are shown, for every 500 generations.

Nevertheless, this strategy does not evolve all the time. Figure 3.9 shows the fitness and RCC for all 20 replicates in the environment with a penalty-reward ratio of 2. We can see that the mean fitness of all 20 replicates is around 20% less than the fitness of the agent that evolved the RCC strategy. The mean RCC value for all 20 replicates is also ≈ 20% less than that of an agent that evolved the RCC strategy.

Figure 3.9: Mean values of fitness and regressive-collision-cue (RCC) over all 20 replicates vs. evolutionary time in the line of descent in the environment with penalty-reward ratio of 2. Standard error lines are shown with shaded areas around mean values. Only the first 20,000 generations are shown, for every 500 generations.

The difficulty of evolving the RCC strategy is not only reflected in the small number of replicates in which this behavior evolves in a given environment (we also tried running the experiment for longer evolutionary times, but the results do not change significantly). Environmental conditions also play a key role in the evolution of this behavior. Figure 3.10 shows the RCC value distribution for 20 replicates in five different environments. In order to calculate the RCC value in each replicate, we took the average of the RCC value over the last 1,000 generations on the line of descent to compensate for fluctuations. We observe that the RCC strategy only evolves in a narrow range of penalty-reward ratios, namely for PR=2 and PR=3. According to Figure 3.10, higher values of the penalty discourage the agents from walking in the environment (they simply choose to remain stationary), and therefore prevent them from exploring the fitness landscape. Lower values of the penalty, on the other hand, result in indifference to collisions, and thus the optimal strategy (likely a local optimum) in these environments is to keep walking and ignore all collisions. For lower values of the penalty, the RCC value is ≈ 55, which means the agents evolve to stop in obvious cases that end up in collision (if they simply kept moving, the RCC value would be 50).

3.4 Discussion

We used an agent-based model of flies equipped with MNBs that evolve via a GA to study the selective pressures and environmental conditions that can lead to the evolution of collision avoidance strategies based on visual information. We specifically tested cognitive models that invoke "regressive motion saliency" and "regressive motion as a cue for collision" to understand how flies avoid colliding with each other in two-dimensional walks. We showed that it is possible to configure the experiment in such a manner that the "regressive-collision-cue" (RCC) strategy evolves to avoid collisions. However, the conditions under which the RCC strategy evolved in our experiments are limited: the strategy only evolved in a narrow range of environmental conditions, and even in those environments it does not evolve all the time.
In addition, we showed from general principles that only a small percentage of events in which an agent perceives regressive optical flow eventually leads to a collision, so that RCC as a sole strategy is expected to have a large false-positive rate, leading to unnecessary stops. As discussed in the Methods section, our experimental implementation is biased in such a way that all regressive motion events lead to a collision if the agent does not stop during that event. If the moving object's velocity direction is distributed uniformly at random over all directions, the probability that a regressive event ends up in a collision is rather low (≈ 20% in our implementation). Because the false-positive rate of using regressive optical flow as the only predictor of collisions is liable to thwart the evolution of an RCC strategy, we biased our setup in such a way that the false-positive rate is zero, a bias that does not significantly influence the outcome of our experiments.

Consider an environment in which only a percentage of events with regressive motion end up in collision. This is similar to an environment with a lower penalty for collisions (as long as the strategy evolves at all), since the agent's fitness is scored at the end of its lifetime (all 100 events), not during each event. However, there is a difference between a lower percentage of collisions in regressive events and a lower penalty for collisions: a lower probability of collision in regressive-motion events is equivalent to a higher amount of noise in the cue that the agent takes from the environment, compared to the case of lower penalties for collision. In other words, if 100% of all regressive motion events lead to collisions, the agent associates regressive motion events with collisions with certainty. Thus, implementing the experiments with 100% collisions in regressive motion events is tantamount to eliminating the noise in sensory information, which generally aids evolution. Compensating for noise in sensory information could also be achieved if we scored agents in every single event and informed them about their performance in that event (feedback learning). We did not use feedback learning here, but plan to do so in future experiments.

We conclude that the evolution of "regressive motion saliency" is unlikely to have happened only due to collision avoidance as the selective pressure. It is important to remember that walking is not the most frequent activity in fruit flies. Further, flies do not usually live in high-density colonies and therefore do not find themselves on collision courses very often. It may be the case that components of this strategy (namely, categorizing optic flow as regressive or progressive) evolved under different selective pressures entirely unrelated to the present test situation, and were later co-opted to enhance collision avoidance with conspecifics while moving (a type of exaptation). For example, detecting predators is a strong selective pressure in the evolution of visual motion detection, including the categorization of that cue so as to take appropriate actions. It may be interesting to study the behavior of flies interacting with animals or objects that are not perceived as conspecifics.

Figure 3.10: RCC value distribution in environments with different penalty-reward ratios. Each box plot shows the RCC value averaged over the last 1000 generations on the line of descent for 20 replicates.
CHAPTER 4

CAN TRANSFER ENTROPY INFER INFORMATION FLOW IN NEURONAL CIRCUITS FOR COGNITIVE PROCESSING?

4.1 Introduction

When searching for common foundations of cortical computation, more and more emphasis is being placed on information-theoretic descriptions of cognitive processing [148, 161, 3, 136, 201]. One of the core tasks in the analysis of cognitive processing is to follow the flow of information within the nervous system, by finding cause-effect components. Indeed, understanding causal relationships is considered to be fundamental to all natural sciences [27]. However, inferring causal relationships and separating them from mere correlations is difficult, and the subject of ongoing research [60, 145, 146, 179, 10]. The concept of Granger causality is an established statistical measure that aims to determine directed (causal) functional interactions among components or processes of a system. Schreiber [165] described Granger causality in terms of information theory by introducing the concept of transfer entropy (TE). The main idea is that if a process 𝑋 is influencing process 𝑌, then an observer can predict the future state of 𝑌 more accurately given the history of both 𝑋 and 𝑌 (written as $X_t^{(k)}$ and $Y_t^{(\ell)}$, where 𝑘 and ℓ determine how many states from the past of 𝑋 and 𝑌 are taken into account) compared to only knowing the history of 𝑌. According to Schreiber, the transfer entropy TE_{𝑋→𝑌} quantifies the flow of information from process 𝑋 to 𝑌:

$$\mathrm{TE}_{X\to Y} = I(Y_{t+1} : X_t^{(k)} \mid Y_t^{(\ell)}) = H(Y_{t+1} \mid Y_t^{(\ell)}) - H(Y_{t+1} \mid Y_t^{(\ell)}, X_t^{(k)}) = \sum_{y_{t+1}} \sum_{x_t^{(k)}} \sum_{y_t^{(\ell)}} p(y_{t+1}, x_t^{(k)}, y_t^{(\ell)}) \log \frac{p(y_{t+1} \mid x_t^{(k)}, y_t^{(\ell)})}{p(y_{t+1} \mid y_t^{(\ell)})} . \qquad (4.1)$$

Here, as before, $X_t^{(k)}$ and $Y_t^{(\ell)}$ refer to the histories of the processes 𝑋 and 𝑌, while 𝑌𝑡+1 refers to the variable at 𝑡 + 1 only. Further, $p(y_{t+1}, x_t^{(k)}, y_t^{(\ell)})$ is the joint probability of 𝑌𝑡+1 and the histories $X_t^{(k)}$ and $Y_t^{(\ell)}$, while $p(y_{t+1} \mid x_t^{(k)}, y_t^{(\ell)})$ and $p(y_{t+1} \mid y_t^{(\ell)})$ are conditional probabilities.

The transfer entropy (4.1) is a conditional mutual information, and quantifies what the process 𝑌 at time 𝑡 + 1 knows about the process 𝑋 up to time 𝑡, given the history of 𝑌 up to time 𝑡 (see [23] for a thorough introduction to the subject). Specifically, TE_{𝑋→𝑌} measures "how much uncertainty about the future course of 𝑌 can be reduced by the past of 𝑋, given 𝑌's own past." Transfer entropy reduces to Granger causality for so-called "auto-regressive processes" [14] (which encompass most biological dynamics), and has become one of the most widely used directed information measures, especially in neuroscience (see [199, 202, 201, 23] and references cited therein). While transfer entropy is sometimes used to infer causal influences between subsystems, it is important to point out that inferring causal relationships is different from inferring information flow [107]. In complex systems (for example, in the computations that a brain performs to choose the correct action given a particular sensory experience), events in the sensory past can causally influence decisions significantly distant in time, and capturing such influences using the transfer entropy concept requires a careful analysis in which not only the history lengths 𝑘 and ℓ used in Equation (4.1) must be optimized, but false influences due to linear mixing of signals (which can mimic causal influences) must also be corrected for [199, 23].
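To make the quantity in Equation (4.1) concrete, the following Python sketch estimates the pairwise transfer entropy with history lengths 𝑘 = ℓ = 1 from a discrete (binary) time series by counting joint state frequencies. It is an illustrative plug-in estimator, not the analysis code used in this dissertation, and the toy example at the end is an assumption made purely for demonstration.

import numpy as np
from collections import Counter

def transfer_entropy(x, y):
    """Plug-in estimate of TE_{X->Y} (Eq. 4.1) with history lengths k = l = 1.

    x, y : 1-D sequences of discrete states (e.g., binary neuron recordings).
    Returns the transfer entropy in bits.
    """
    triples = Counter(zip(y[1:], x[:-1], y[:-1]))    # (y_{t+1}, x_t, y_t)
    n = sum(triples.values())
    pairs_xy = Counter(zip(x[:-1], y[:-1]))          # (x_t, y_t)
    pairs_yy = Counter(zip(y[1:], y[:-1]))           # (y_{t+1}, y_t)
    singles_y = Counter(y[:-1])                      # (y_t)
    te = 0.0
    for (y1, xt, yt), c in triples.items():
        p_joint = c / n
        p_cond_xy = c / pairs_xy[(xt, yt)]             # p(y_{t+1} | x_t, y_t)
        p_cond_y = pairs_yy[(y1, yt)] / singles_y[yt]  # p(y_{t+1} | y_t)
        te += p_joint * np.log2(p_cond_xy / p_cond_y)
    return te

# Toy usage: Y copies X with a one-step delay, so TE_{X->Y} should be close to 1 bit.
rng = np.random.default_rng(1)
x = rng.integers(0, 2, 10_000)
y = np.empty_like(x)
y[0] = 0
y[1:] = x[:-1]
print(round(transfer_entropy(x, y), 3))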
In some sense, inferring information flow is a much simpler task than finding all causal influences, as we need only to identify (and quantify) the sources of information transferred to a particular variable. More precisely, for this application the pairwise transfer entropy is used to find candidate sources (in the immediate past) that account for the entropy of a particular neuron.

Figure 4.1: (A) A network where processes 𝑋 and 𝑌 influence the future state of 𝑍, 𝑍𝑡+1 = 𝑓(𝑋𝑡, 𝑌𝑡). (B) A feedback network in which processes 𝑌 and 𝑍 influence the future state of 𝑍, 𝑍𝑡+1 = 𝑓(𝑌𝑡, 𝑍𝑡).

Using transfer entropy to search for and detect directed information was shown to lead to inaccurate assessments in simple case studies [76, 77]. For instance, James et al. [76] presented two examples in which TE underestimates the flow of information from inputs to output in one example, and overestimates it in the other. In the first example, they define a simple system with three binary variables 𝑋, 𝑌, and 𝑍 where 𝑍𝑡+1 = 𝑋𝑡 ⊕ 𝑌𝑡 (⊕ is the exclusive OR logic operation) and variables 𝑋 and 𝑌 take states 0 or 1 with equal probabilities, i.e., 𝑃(𝑋 = 0) = 𝑃(𝑋 = 1) = 𝑃(𝑌 = 0) = 𝑃(𝑌 = 1) = 0.5 (this 2-to-1 relation is schematically shown in Figure 4.1A). In this network, TE_{𝑋→𝑍} = TE_{𝑌→𝑍} = 0, whereas the entropy of the process 𝑍 is 𝐻(𝑍) = 1 bit, and variables 𝑋 and 𝑌 certainly influence the future state of 𝑍. In this example, the entropy of 𝑍 can be reduced by 1 bit, but TE does not attribute this entropy to either variable 𝑋 or 𝑌, and as a consequence TE underestimates the flow of information from 𝑋 and 𝑌 to 𝑍. In the other example, they define a system with two binary variables 𝑌 and 𝑍, where 𝑍𝑡+1 = 𝑌𝑡 ⊕ 𝑍𝑡 and, similar to the previous example, 𝑃(𝑌 = 0) = 𝑃(𝑌 = 1) = 𝑃(𝑍 = 0) = 𝑃(𝑍 = 1) = 0.5 (this feedback loop relation is schematically shown in Figure 4.1B). In this scenario, TE_{𝑌→𝑍} = 1 bit, which implies that the entire 1 bit of entropy in 𝑍 is coming from process 𝑌. However, this is not correct, since 𝑌 and 𝑍 contribute equally to determining the future state of 𝑍. In this example, TE overestimates the information flow from process 𝑌 to 𝑍. It is also noteworthy that in this example the processed information (defined as 𝐼(𝑍𝑡 : 𝑍𝑡+1)) vanishes, which again fails to correctly identify the other source, 𝑍𝑡, from which the information is coming.

As acknowledged by the authors in [76], expecting that the entropy of the output 𝐻(𝑍𝑡+1) is given simply by the sum of the transfer entropies from each of the inputs independently is a naive interpretation of information flow. Indeed, this is generally not the case, even if the two sources are uncorrelated. Consider for example the first system described above, in which 𝑍𝑡+1 = 𝑓(𝑋𝑡, 𝑌𝑡). Suppose 𝑓 is a deterministic function of 𝑋𝑡 and 𝑌𝑡, in which case the conditional entropy 𝐻(𝑍𝑡+1|𝑋𝑡, 𝑌𝑡) = 0. Then the entropy 𝐻(𝑍𝑡+1) decomposes into the sum of an unconditional and a conditional transfer entropy,

$$H(Z_{t+1}) = \mathrm{TE}_{Y\to Z} + \mathrm{TE}_{X\to Z \mid Y_t} , \qquad (4.2)$$

where the conditional transfer entropy is defined as (see [23], section 4.2.3)

$$\mathrm{TE}_{Y\to Z \mid X_t} = I(Y_t : Z_{t+1} \mid Z_t, X_t) . \qquad (4.3)$$
Using this definition, it is easy to show that

$$\mathrm{TE}_{Y\to Z} = \mathrm{TE}_{Y\to Z \mid X_t} + I(X_t : Y_t : Z_{t+1} \mid Z_t) , \qquad (4.4)$$

and Equation (4.2) can be rewritten in terms of transfer entropies only, or else conditional transfer entropies only, as

$$H(Z_{t+1}) = \mathrm{TE}_{Y\to Z \mid X_t} + \mathrm{TE}_{X\to Z \mid Y_t} + I(X_t : Y_t : Z_{t+1} \mid Z_t) = \mathrm{TE}_{Y\to Z} + \mathrm{TE}_{X\to Z} - I(X_t : Y_t : Z_{t+1} \mid Z_t) . \qquad (4.5)$$

In light of Equation (4.5), it then becomes clear that the naive sum of the transfer entropies TE_{𝑋→𝑍} and TE_{𝑌→𝑍} (or the naive sum of conditional transfer entropies) must fail to account for the entropy of 𝑍 whenever the term 𝐼(𝑋𝑡 : 𝑌𝑡 : 𝑍𝑡+1 | 𝑍𝑡) is non-zero, and therefore will fail to fully and accurately quantify the information transferred from sources 𝑋 and 𝑌. The error in the information flow estimate when using transfer entropy is thus simply given by the absolute value of 𝐼(𝑋𝑡 : 𝑌𝑡 : 𝑍𝑡+1 | 𝑍𝑡) (the same holds when using conditional transfer entropies).

Now consider the second example system with a feedback loop, in which 𝑍𝑡+1 = 𝑓(𝑌𝑡, 𝑍𝑡), and again suppose 𝑓 is a deterministic function, which implies 𝐻(𝑍𝑡+1|𝑌𝑡, 𝑍𝑡) = 0. In this case, there is a similar information decomposition that now involves a shared entropy 𝐼(𝑌𝑡 : 𝑍𝑡 : 𝑍𝑡+1):

$$I(Y_t : Z_{t+1}) = \mathrm{TE}_{Y\to Z} + I(Y_t : Z_t : Z_{t+1}) . \qquad (4.6)$$

Here, the entropy 𝐻(𝑍𝑡+1) can be written in terms of transfer entropy and processed information (recall that 𝐻(𝑍𝑡+1|𝑍𝑡, 𝑌𝑡) = 0):

$$H(Z_{t+1}) = \mathrm{TE}_{Y\to Z} + I(Z_t : Z_{t+1}) . \qquad (4.7)$$

While Equation (4.7) shows that the sum of the transfer entropy TE_{𝑌→𝑍} and the processed information 𝐼(𝑍𝑡 : 𝑍𝑡+1) accounts for all the entropy of 𝑍𝑡+1, these two terms do not always individually identify the sources of information flow correctly. For instance, we have seen that in the second example (where 𝑍𝑡+1 = 𝑌𝑡 ⊕ 𝑍𝑡) the processed information 𝐼(𝑍𝑡 : 𝑍𝑡+1) vanishes even though variable 𝑍𝑡 most definitely influences the state of variable 𝑍𝑡+1. As discussed earlier, all the information transferred to 𝑍𝑡+1 in that case is attributed to variable 𝑌𝑡. Note that the processed information can be written as

$$I(Z_t : Z_{t+1}) = I(Z_t : Z_{t+1} \mid Y_t) + I(Z_{t+1} : Z_t : Y_t) , \qquad (4.8)$$

where 𝐼(𝑍𝑡 : 𝑍𝑡+1 | 𝑌𝑡) = 1 and 𝐼(𝑍𝑡+1 : 𝑍𝑡 : 𝑌𝑡) = −1. Note that for the most general case, where the function 𝑓 can be non-deterministic and the network may or may not contain a feedback loop, the full entropy decomposition can be written as

$$H(Z_{t+1}) = \mathrm{TE}_{Y\to Z \mid X_t} + \mathrm{TE}_{X\to Z \mid Y_t} + I(X_t : Y_t : Z_{t+1} \mid Z_t) + I(Z_{t+1} : Z_t) + H(Z_{t+1} \mid X_t, Y_t, Z_t) . \qquad (4.9)$$

There is also another key factor in the examples described above that results in misestimating information flow when using transfer entropy. In both examples, the input-to-output relation is implemented by an XOR function. For instance, in the first example (𝑍𝑡+1 = 𝑋𝑡 ⊕ 𝑌𝑡), the transfer entropy TE_{𝑋→𝑍} considers 𝑋 in isolation and independent of variable 𝑌. We should make it clear that it is not the formulation of TE that is at the origin of mis-attributing the sources of the transferred information. Rather, by definition Shannon's mutual information, 𝐼(𝑋 : 𝑌) = 𝐻(𝑋) + 𝐻(𝑌) − 𝐻(𝑋, 𝑌), is dyadic, and cannot capture polyadic correlations where more than one variable influences another. Consider for example a similar but time-independent process between binary variables 𝑋, 𝑌, and 𝑍 where 𝑍 = 𝑋 ⊕ 𝑌. As is well known, the mutual information between 𝑋 and 𝑍, and also between 𝑌 and 𝑍, vanishes: 𝐼(𝑋 : 𝑍) = 𝐼(𝑌 : 𝑍) = 0 (this corresponds to the one-time pad, or Vernam cipher [168], a common method of encryption that takes advantage of the fact that 𝐼(𝑋 : 𝑌 : 𝑍) = −1).
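These relations are easy to verify numerically. The following Python sketch (illustrative only; it assumes the uniformly distributed, independent binary inputs of the examples above) computes the relevant quantities for the XOR gate of Figure 4.1A and reproduces TE_{𝑋→𝑍} = TE_{𝑌→𝑍} = 0, 𝐻(𝑍𝑡+1) = 1 bit, and a triplet information of −1 bit, consistent with Equation (4.5).

import itertools
from math import log2

def entropy(dist):
    """Shannon entropy (bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    """Marginalize a joint distribution {state tuple: prob} onto coordinates idx."""
    out = {}
    for state, p in joint.items():
        key = tuple(state[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

# Joint distribution over (x_t, y_t, z_{t+1}) for z_{t+1} = x_t XOR y_t with
# uniform, independent binary inputs (Fig. 4.1A). Since Z_{t+1} does not depend
# on Z_t here, the conditioning on Z_t can be dropped and each pairwise TE
# reduces to a mutual information between one input and the output.
joint = {(x, y, x ^ y): 0.25 for x, y in itertools.product([0, 1], repeat=2)}

h_z = entropy(marginal(joint, [2]))                                          # H(Z_{t+1})
te_x = entropy(marginal(joint, [0])) + h_z - entropy(marginal(joint, [0, 2]))  # TE_{X->Z}
te_y = entropy(marginal(joint, [1])) + h_z - entropy(marginal(joint, [1, 2]))  # TE_{Y->Z}
i_xyz = (entropy(marginal(joint, [0])) + entropy(marginal(joint, [1])) + h_z
         - entropy(marginal(joint, [0, 1])) - entropy(marginal(joint, [0, 2]))
         - entropy(marginal(joint, [1, 2])) + entropy(joint))                # I(X:Y:Z_{t+1})

print(h_z, te_x, te_y, i_xyz)   # 1.0 0.0 0.0 -1.0, so Eq. (4.5): 0 + 0 - (-1) = 1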
Thus, while the TE formulation aims to capture a directed dependency of information, Shannon information measures only the undirected (correlational) dependency of two variables. As a consequence, problems with TE measurements in detecting directed dependencies are unavoidable when using Shannon information, and do not stem from the formulation of transfer entropy [165] or of similar measures, such as causation entropy [179], designed to capture causal relations. Note that methods such as partial information decomposition have been proposed to take into account the synergistic influence of a set of variables on others [204]. However, such higher-order calculations are more costly (possibly exponentially so) and require significantly more data in order to perform accurate measurements.

Given the observed error in measuring information flow using TE due to logic gates that encrypt, we now set out to examine how well TE measurements capture information flow when the function is implemented with Boolean functions other than XOR. In particular, we examine every first-order Markov process 𝑍𝑡+1 = 𝑓(𝑋𝑡, 𝑌𝑡) where the function 𝑓 is implemented by one of the 16 possible 2-to-1 binary relations (Figure 4.1A), and quantify the error in the information transfer estimate for each of them. Similar to the previous examples, the state of variable 𝑍 is independent of its past, and the inputs 𝑋 and 𝑌 take states 0 and 1 with equal probabilities, i.e., 𝑃(𝑋 = 0) = 𝑃(𝑋 = 1) = 𝑃(𝑌 = 0) = 𝑃(𝑌 = 1) = 0.5. Table 4.1 shows the results of transfer entropy measurements for all possible 2-to-1 logic gates and the error that would occur if TE measures are used to quantify the information flow from inputs to outputs. This error is the sum of the misestimations in information flow quantified by the pairwise transfer entropies TE_{𝑋→𝑍} and TE_{𝑌→𝑍}. As we discussed before, for the XOR relation the transfer entropies TE_{𝑋→𝑍} = TE_{𝑌→𝑍} = 0 and 𝐻(𝑍𝑡+1) = 1 bit, which means that TE misestimates the information flow from inputs 𝑋 and 𝑌 by 1 bit (XNOR behaves exactly the same). We find that in all other polyadic relations, where both 𝑋 and 𝑌 influence the future state of 𝑍, TE_{𝑋→𝑍} and TE_{𝑌→𝑍} capture part of the information flow from inputs to outputs, but TE_{𝑋→𝑍} + TE_{𝑌→𝑍} is less than the entropy of the output 𝑍 by 0.19 bits (TE_{𝑋→𝑍} + TE_{𝑌→𝑍} = 0.62, 𝐻(𝑍) = 0.81). In the remaining six relations, where only one of the inputs or neither of them influences the output, the transfer entropies correctly capture the information flow. The difference between the sum of transfer entropies, TE_{𝑋→𝑍} + TE_{𝑌→𝑍}, and the entropy of the output 𝐻(𝑍) in the XOR and XNOR relations stems from the fact that 𝐼(𝑋 : 𝑌 : 𝑍) = −1, the tell-tale sign of encryption. Furthermore, while the other polyadic gates do not implement perfect encryption, they still encrypt partially, as 𝐼(𝑋 : 𝑌 : 𝑍) = −0.19, which we call obfuscation. It is this obfuscation that is at the heart of the TE error shown in Table 4.1.
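The entries in the left half of Table 4.1 (below) can be reproduced with a few lines of Python. The sketch that follows is an illustrative reconstruction under the same assumptions (uniform, independent binary inputs, with 𝑍 independent of its own past); it enumerates all 16 2-to-1 gates, computes 𝐻(𝑍𝑡+1) and the two pairwise transfer entropies (which here reduce to 𝐼(𝑋𝑡 : 𝑍𝑡+1) and 𝐼(𝑌𝑡 : 𝑍𝑡+1)), and reports the TE error 𝐻(𝑍𝑡+1) − TE_{𝑋→𝑍} − TE_{𝑌→𝑍}.

from math import log2
from itertools import product

def H(probs):
    """Shannon entropy (bits) of an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(pairs):
    """I(A:B) in bits from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in pairs.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    return H(pa.values()) + H(pb.values()) - H(pairs.values())

# Each gate is the output tuple for inputs (x, y) = (0,0), (0,1), (1,0), (1,1),
# matching the ordering used in Table 4.1.
for gate in product([0, 1], repeat=4):
    pz, pxz, pyz = {}, {}, {}
    for (x, y), z in zip(product([0, 1], repeat=2), gate):
        pz[z] = pz.get(z, 0) + 0.25
        pxz[(x, z)] = pxz.get((x, z), 0) + 0.25
        pyz[(y, z)] = pyz.get((y, z), 0) + 0.25
    h_z = H(pz.values())
    te_x, te_y = mutual_information(pxz), mutual_information(pyz)
    print(gate, round(h_z, 2), round(te_x, 2), round(te_y, 2),
          "TE error:", round(h_z - te_x - te_y, 2))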
We repeated similar calculations for the case of a feedback loop network where 𝑍𝑡+1 = 𝑓(𝑌𝑡, 𝑍𝑡) (Figure 4.1B) and the function 𝑓 can be any one of the 16 logic relations shown in Table 4.1. These simple calculations show that in all 16 relations, including XOR and XNOR, the sum of the transfer entropies, TE_{𝑌→𝑍} + 𝐼(𝑍𝑡+1 : 𝑍𝑡) (the transfer entropy of a variable to itself reduces to the processed information 𝐼(𝑍𝑡+1 : 𝑍𝑡)), is equal to the entropy of the output 𝑍𝑡+1, as was shown in Equation (4.7). However, in the XOR and XNOR relations transfer entropy incorrectly attributes all the information to one of the input variables, and no influence is attributed to the other. Furthermore, in the polyadic relations other than XOR and XNOR, the transfer entropies TE_{𝑌→𝑍} and 𝐼(𝑍𝑡+1 : 𝑍𝑡) differ in value even though variables 𝑌𝑡 and 𝑍𝑡 equally influence the state of the output 𝑍𝑡+1, which is why the TE error in these relations is 0.19 bits.

Table 4.1: Transfer entropies and information in all possible 2-to-1 binary logic gates with or without feedback. The logic of the gate is determined by the value 𝑍𝑡+1 (second column) as a function of the inputs 𝑋𝑡𝑌𝑡 = (00, 01, 10, 11). 𝐻(𝑍𝑡+1) is the Shannon entropy of the output assuming equal-probability inputs, and TE_{𝑋→𝑍} is the transfer entropy from 𝑋 to 𝑍. In 2-to-1 gates without feedback, the transfer entropies TE_{𝑋→𝑍} and TE_{𝑌→𝑍} reduce to 𝐼(𝑋𝑡 : 𝑍𝑡+1) and 𝐼(𝑌𝑡 : 𝑍𝑡+1), respectively. Similarly, the transfer entropy of a process to itself is simply 𝐼(𝑍𝑡 : 𝑍𝑡+1), which is the information processed by 𝑍.

                               2-to-1 network, 𝑍 = 𝑓(𝑋, 𝑌)          2-to-1 feedback loop, 𝑍 = 𝑓(𝑌, 𝑍)
gate      𝑍𝑡+1        𝐻(𝑍𝑡+1)   TE_{𝑋→𝑍}  TE_{𝑌→𝑍}  TE error       TE_{𝑌→𝑍}  𝐼(𝑍𝑡 : 𝑍𝑡+1)  TE error
ZERO      (0,0,0,0)   0.0       0.0       0.0       0.0            0.0       0.0           0.0
AND       (0,0,0,1)   0.81      0.31      0.31      0.19           0.5       0.31          0.19
AND-NOT   (0,0,1,0)   0.81      0.31      0.31      0.19           0.5       0.31          0.19
AND-NOT   (0,1,0,0)   0.81      0.31      0.31      0.19           0.5       0.31          0.19
NOR       (1,0,0,0)   0.81      0.31      0.31      0.19           0.5       0.31          0.19
COPY      (0,0,1,1)   1.0       1.0       0.0       0.0            1.0       0.0           0.0
COPY      (0,1,0,1)   1.0       0.0       1.0       0.0            0.0       1.0           0.0
XOR       (0,1,1,0)   1.0       0.0       0.0       1.0            1.0       0.0           1.0
XNOR      (1,0,0,1)   1.0       0.0       0.0       1.0            1.0       0.0           1.0
NOT       (1,0,1,0)   1.0       0.0       1.0       0.0            0.0       1.0           0.0
NOT       (1,1,0,0)   1.0       1.0       0.0       0.0            1.0       0.0           0.0
OR        (0,1,1,1)   0.81      0.31      0.31      0.19           0.5       0.31          0.19
OR-NOT    (1,0,1,1)   0.81      0.31      0.31      0.19           0.5       0.31          0.19
OR-NOT    (1,1,0,1)   0.81      0.31      0.31      0.19           0.5       0.31          0.19
NAND      (1,1,1,0)   0.81      0.31      0.31      0.19           0.5       0.31          0.19
ONE       (1,1,1,1)   0.0       0.0       0.0       0.0            0.0       0.0           0.0

Given that pairwise TE measurements (not taking into account higher-order conditional transfer entropies) only fail to correctly identify the sources of information flow in cryptographic gates, and show partial errors in quantifying information flow in polyadic relations, we now set out to determine how often these relations appear in networks that implement basic cognitive tasks, and how much error is introduced when measuring information flow using transfer entropy. If the total error in transfer entropy measurements of information flow in cognitive networks is significant, an analysis of pairwise directed information among neural components (neurons, voxels, cortical columns, etc.) using this concept is bound to be problematic. If, however, these errors are reasonably low within biological control structures because cryptographic logic is rarely used, then treatments using the TE concept can largely be trusted.

To answer this question, we use a new tool in computational cognitive neuroscience, namely computational models of cognitive processing that can explain task performance in terms of plausible dynamic components [93]. In particular, we use Darwinian evolution to evolve artificial digital brains (also known as Markov Brains or MBs [69]) that can receive sensory stimuli from the environment, process this information, and take actions in response. (In the following we refer to digital brains as "Brains", while biological brains remain "brains".)
We evolve Markov Brains that perform two different cognitive tasks whose circuitry has been thoroughly studied in neuroscience: visual motion detection [21] and sound localization [130, 149]. Markov Brains have been shown to be a powerful platform for unraveling the information-theoretic correlates of fitness and network structure in neural networks [45, 8, 164, 114, 85]. This computational platform enables us to analyze the structure, function, and circuitry of hundreds of evolved digital Brains. As a result, we can obtain statistics on the frequency of different types of relations in evolved circuits (as opposed to studying only a single evolutionary outcome), and further assess how crucial different operators are for each evolved task by performing knockout experiments that measure an operator's contribution to the task. In particular, we first investigate the composition of different types of logic gates in networks evolved for the two cognitive tasks, and then theoretically estimate how accurate transfer entropy measures could be when applied to quantify the pairwise information flow from one neuron to another in such simple cognitive networks. We then use transfer entropy measures as a statistic to identify information flow between neurons of evolved circuits, using the time series of neural recordings obtained from behaving Brains engaged in their task, and evaluate how successful transfer entropy is in detecting this flow. While the artificial evolution of control structures ("artificial Brains") is not a substitute for the analysis of information flow in biological brains, this investigation should provide some insights into how accurate (or inaccurate) transfer entropy measures could be.

4.2 Materials and Methods

4.2.1 Markov Brains

Markov Brains (MBs) are evolvable networks of binary neurons (they take the value 0 for a quiescent neuron, or 1 for a firing neuron) in which neurons are connected via probabilistic or deterministic logic gates (in this work, we constrain MBs to use only 2-to-1 deterministic logic gates). The states of the neurons are updated in a first-order Markov process, i.e., the probability distribution of the states of the neurons at time step 𝑡 + 1 depends only on the states of the neurons at time step 𝑡. This does not imply that Markov Brains are memoryless, because the state of one neuron can be stored by repeatedly writing into its own (or another) neuron's state variable [45, 114, 69]. The connectivity and the underlying logic of the MB's neuronal network are encoded in a genome. Thus, we can evolve populations of MBs using a Genetic Algorithm (GA) [127] to perform a variety of cognitive tasks (for a more detailed description of Markov Brain function and implementation see [69]). In the following sections, we describe two fitness functions designed to evolve motion detection and sound localization circuits in MBs.

4.2.2 Motion Detection

The first fitness function is designed to evolve MBs that function as a visual motion detection circuit. Reichardt and Hassenstein proposed a circuit model of motion detection that is based on a delay-and-compare scheme [65]. The main idea behind this model is that a moving object is sensed by two adjacent receptors on the retina at two different time points. Figure 4.2 shows the schematic of a Reichardt detector in which the 𝜏 components delay the stimulus and the × components multiply the signals, i.e., a × component fires if the signals from the receptor and the delay component arrive at the same time. The results of the multiplication units for the two different directions are then subtracted, so that high values denote motion in one direction (the "preferred direction", PD), low values denote the opposite direction (the null direction, ND), and intermediate values encode a stationary stimulus.
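A minimal Python sketch of this delay-and-compare scheme is shown below. It is only an illustration of the principle, with a one-step delay and binary stimuli as simplifying assumptions, and is not the Markov Brain implementation evolved in this chapter.

import numpy as np

def reichardt_response(left, right):
    """Delay-and-compare (Hassenstein-Reichardt) motion detector sketch.

    left, right : binary arrays of stimulus values at two adjacent receptors.
    Returns an array of responses: +1 for preferred-direction motion,
    -1 for null-direction motion, and 0 for a stationary stimulus.
    """
    left = np.asarray(left)
    right = np.asarray(right)
    delayed_left = np.roll(left, 1)     # tau component: delay the left channel by one step
    delayed_right = np.roll(right, 1)   # tau component: delay the right channel by one step
    delayed_left[0] = delayed_right[0] = 0
    # "x" components: coincidence of the delayed signal from one receptor with
    # the direct signal from the other; the two pathways are then subtracted.
    return delayed_left * right - delayed_right * left

# A stimulus moving from the left receptor to the right receptor (PD):
left_stim  = np.array([1, 0, 0])
right_stim = np.array([0, 1, 0])
print(reichardt_response(left_stim, right_stim))   # response of +1 at the second step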
Figure 4.2: (A) A Reichardt detector circuit. In this circuit, the results of the multiplications from each pathway are subtracted to generate the response. The circuit's outcome for PD is +1, for ND is −1, and for stationary patterns is 0. (B) Schematic examples of three types of input patterns received by the two sensory neurons at two consecutive time steps. Grey squares show the presence of the stimuli in those neurons. The sensory pattern shown here for PD is 10 at time 𝑡 and 01 at time 𝑡 + 1, which we write as 10 → 01. Patterns 11 → 01 and 00 → 10 also represent PD. Similarly, pattern 01 → 10 is shown as an example of ND, but patterns 11 → 10 and 01 → 11 are also instances of ND.

The experimental setup for the evolution of motion detection circuits is similar to the setup previously used in [184]. In that setup, two sets of inputs are presented to an MB at two consecutive times, and the Brain classifies the input as preferred direction (PD), stationary, or null direction (ND). After the first set of inputs, i.e., at time 𝑡 in Figure 4.2B, a Markov Brain is updated once, and after the second set of inputs (at 𝑡 + 1) it is updated two times, which simulates the two operations performed after delaying one of the inputs, namely multiplication and subtraction. The value of a sensory neuron becomes 1 when a stimulus is present, and 0 otherwise (see Figure 4.2B). Thus, 16 possible sensory patterns can be presented to the MB to classify, among which 3 input patterns are PD, 3 are ND, and the other 10 are stationary patterns. Two neurons are assigned as output neurons of the motion detection circuit. The sum of the binary values of these neurons represents the output of the motion detection circuit (0: ND, 1: stationary stimulus, 2: PD), while in the Reichardt detector circuit shown in Figure 4.2A the output corresponding to ND is −1, stationary is 0, and PD is +1.

4.2.3 Sound Localization

The second fitness function is designed to evolve MBs that function as a sound localization circuit. Sound localization mechanisms in mammalian auditory systems rely on several cues, such as the interaural time difference and the interaural level difference [128]. The interaural time difference (which is the main cue behind the sound localization mechanism) is the difference between the times at which a sound reaches the two ears. Figure 4.3A shows a simple schematic of the sound localization model proposed by Jeffress [79], in which sound reaches the right ear and the left ear at two possibly different times. These stimuli are then delayed in an array of delay components and travel to an array of detector neurons (marked with different colors in Figure 4.3A). Each detector only fires if the two signals from the different pathways, the left-ear pathway (shown at the bottom) and the right-ear pathway (shown at the top), reach that neuron simultaneously.

Figure 4.3: (A) Schematic of 5 sound sources at different angles with respect to a listener (top view) and the Jeffress model of sound localization. (B) Schematic examples of 5 time sequences of input patterns received by the two sensory neurons (receptors of the two ears) at three consecutive time steps. Black squares show the presence of the stimuli in those neurons.
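The coincidence-detection idea behind the Jeffress model can be sketched in a few lines of Python. This is a conceptual illustration only; the delay-line length and the mapping from interaural lag to detector index are assumptions made for the example, and this is not the evolved Markov Brain circuit.

import numpy as np

def jeffress_detectors(left_ear, right_ear, max_lag=2):
    """Coincidence detection over a delay line (Jeffress-style sketch).

    left_ear, right_ear : binary spike trains from the two ears.
    Returns one coincidence count per candidate interaural lag in
    [-max_lag, ..., +max_lag]; the lag whose detector fires most often
    encodes the azimuthal angle of the sound source.
    """
    left_ear = np.asarray(left_ear)
    right_ear = np.asarray(right_ear)
    counts = []
    for lag in range(-max_lag, max_lag + 1):
        # Delay one pathway relative to the other and count coincidences.
        shifted = np.roll(right_ear, lag)
        if lag > 0:
            shifted[:lag] = 0
        elif lag < 0:
            shifted[lag:] = 0
        counts.append(int(np.sum(left_ear & shifted)))
    return counts

# Sound arriving at the right ear one step before the left ear:
left  = np.array([0, 1, 1, 0, 1])
right = np.array([1, 1, 0, 1, 0])
print(jeffress_detectors(left, right))   # the detector for lag = +1 wins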
In our experimental setup, two sequences of stimuli are presented to two different sensory neurons (neurons 𝑁0 and 𝑁1) that represent the receptors of the two ears. The stimuli in the two sequences are lagged or advanced with respect to one another (as shown in Figure 4.3B). The agent receives these sequences and must identify from which of 5 different angles the sound is coming. The binary value of a sensory neuron becomes 1 when a stimulus is present (shown as black blocks in Figure 4.3B) and 0 otherwise (shown as white blocks in Figure 4.3B). Markov Brains are updated once after each time step in the experiment. Similar to the schema shown in Figure 4.3A, Markov Brains have five designated output neurons (𝑁11-𝑁15), and each neuron corresponds to one of the sound sources placed at a specific angle. The colors of the detector neurons (𝑁11-𝑁15) in Figure 4.3B match the angle of each sound source in Figure 4.3A.

4.3 Results

For the motion detection (MD) and sound localization (SL) tasks, we evolved 100 populations each for 10,000 generations, allowing all possible 2-to-1 (deterministic) logic gates as primitives. At the end of each evolutionary run, we isolated one of the genotypes with the highest score from each population to generate a representative circuit.

4.3.1 Gate Composition of Evolved Circuits

Out of 100 populations evolved on the motion detection task, 98 led to circuits that perform motion detection with perfect fitness. The number of gates in evolved Brains varies tremendously, with a minimum of four and a maximum of 17 (mean=7.92, SD=2.48). The frequency distribution of the types of logic gates per individual Brain is shown for these 98 perfect circuits in Figure 4.4A (in this figure, AND-NOT is an asymmetric AND operation in which one of the variables is negated, for example 𝑋′ · 𝑌; similarly, OR-NOT is an asymmetric OR operation, e.g., 𝑋 + 𝑌′). To gain a better understanding of the distribution of logic gates and how they compose the evolved motion detection circuits, we performed gate-knockout assays on all 98 Brains. We sequentially eliminated each logic gate (along with all the input and output connections of that gate) and re-measured the mutant Brain's fitness, allowing us to estimate which gates were essential to the motion detection function (if there is a drop in the mutant Brain's fitness) and which gates were redundant (if the mutant Brain's fitness remains perfect). The frequency distribution of each type of logic gate per individual Brain for essential gates is shown for the 98 perfect Brains in Figure 4.4B.

For the sound localization task, 71 evolution experiments out of 100 resulted in Markov Brains with perfect fitness. The minimum number of gates was six, with a maximum of 15 (mean=9.14, SD=1.77). Figure 4.4A shows the frequency distribution of the types of logic gates per Brain for these 71 perfect Brains. We also performed a knockout analysis on all 71 evolved sound localization Brains. The frequency distribution of each type of logic gate per individual Brain for essential gates is shown for the 71 perfect Brains in Figure 4.4B. These results demonstrate that the gate-type compositions and circuit structures in evolved Brains for the motion detection (MD) and sound localization (SL) tasks are significantly different.
The total number of logic gates (ignoring duplicates) in the SL task (9.14 gates per Brain, SD=1.77) is greater than the total number of gates in the MD task (7.92 gates per Brain, SD=2.48). Moreover, the number of essential gates in SL (7.13 gates per Brain, SD=1.24) is also greater than the number of essential gates in MD (5.23 gates per Brain, SD=1.31).

Figure 4.4: Frequency distribution of all, as well as essential, gates in evolved Markov Brains that perform the motion detection or sound localization task perfectly. (A) All gates. (B) Essential gates.

4.3.2 Transfer Entropy Misestimates Caused by Encryption or Polyadicity

As discussed earlier, transfer entropy measures may misestimate the information flow from input to output and may fail to correctly identify the source of information. Table 4.1 gave a detailed analysis of transfer entropy measurements and their misestimates, rooted either in the polyadic or the encrypting nature of the gate, for all possible 2-to-1 logic gates. Given the gate distributions of the evolved circuits for the motion detection and sound localization tasks, along with the misestimate values calculated in Table 4.1, we can estimate the error that would occur when using transfer entropy to quantify the pairwise information flow from source neurons (i.e., input neurons of gates) to receiver neurons (i.e., output neurons of gates). We can similarly estimate what fraction of the information flow from inputs to outputs would be correctly quantified by the transfer entropy in the evolved circuits. Recall that in the results presented in Table 4.1, calculations were performed assuming that the input bits take values 0 or 1 with equal probability 0.5. Of course, we cannot generally assume this for the input bits of every logic gate in an evolved network. As a consequence, this analysis only approximates the information flow misestimates of the full network.

In our analysis, we only evaluated the contribution of gates deemed essential via the knockout test. For these essential gates, we summed the pairwise information flow misestimates as well as the correct information flow attributions in each evolved Brain. The mean values of the calculated misestimates of information flow, as well as of the correct measurements, with their 95% confidence intervals, are shown in Figure 4.5A for the 98 evolved circuits that perform the motion detection task and the 71 evolved sound localization Brains. In Figure 4.5B, we normalized the misestimates and correct measurements by dividing by the number of essential gates in each Brain, and averaged them across Brains. It is worth noting that the calculated information flow misestimates shown in these plots only reflect the misestimates that originate from the polyadic or encrypting nature of the gates, since they are based only on the network structure and gate composition of each Brain, together with the analytical results presented in Table 4.1; they do not take into account the errors that could occur as a result of factors such as sampling errors in the dataset or structural complexities in the network, such as recurrent or transitive relations [11, 179, 10]. Along the same line of reasoning, the calculated values of correct measurements represent the correct information flows that could be measured by transfer entropy in the absence of the aforementioned sources of error.
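The per-Brain estimate described above amounts to a weighted sum over the essential gates, using the per-gate values from Table 4.1. The following Python sketch is only an illustration of that bookkeeping; the gate counts in the example are made up for demonstration, and the per-gate values assume the maximum-entropy inputs of Table 4.1.

# Per-gate TE error and correctly attributed flow (bits), taken from the
# no-feedback columns of Table 4.1 (maximum-entropy inputs assumed).
PER_GATE = {            # gate type: (TE error, correctly captured flow)
    "XOR":     (1.0, 0.0),
    "XNOR":    (1.0, 0.0),
    "AND":     (0.19, 0.62),
    "NAND":    (0.19, 0.62),
    "OR":      (0.19, 0.62),
    "NOR":     (0.19, 0.62),
    "AND-NOT": (0.19, 0.62),
    "OR-NOT":  (0.19, 0.62),
    "COPY":    (0.0, 1.0),
    "NOT":     (0.0, 1.0),
    "ZERO":    (0.0, 0.0),
    "ONE":     (0.0, 0.0),
}

def brain_misestimate(essential_gate_counts):
    """Sum per-gate TE misestimates and correct attributions over one Brain."""
    error = sum(n * PER_GATE[g][0] for g, n in essential_gate_counts.items())
    correct = sum(n * PER_GATE[g][1] for g, n in essential_gate_counts.items())
    return error, correct

# Hypothetical essential-gate composition of one evolved Brain:
example_brain = {"AND": 2, "OR": 1, "XOR": 1, "COPY": 2}
print(brain_misestimate(example_brain))   # (0.19*3 + 1.0, 0.62*3 + 2.0) = (1.57, 3.86)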
These results further reveal that the circuit structures and gate-type compositions in the two tasks are significantly different, and that this structural difference leads to different outcomes when transfer entropy measures are used to detect pairwise information flows. Transfer entropy can potentially capture 3.31 bits (SE = 0.10) of information flow correctly in evolved motion detection circuits (0.64 bits per gate, SE = 0.014), and 3.95 bits (averaged across 71 Brains, SE = 0.14) in evolved sound localization circuits (0.55 bits per gate, SE = 0.014). However, the information flow misestimate when using transfer entropy in evolved sound localization circuits is 2.39 bits (averaged across 71 Brains, SE = 0.12), which is significantly higher than the misestimate in evolved motion detection circuits, 1.33 bits (averaged across 98 Brains, SE = 0.085). The information flow misestimate in evolved motion detection circuits is 0.25 bits per gate (SE = 0.014), whereas it is 0.34 bits per gate (SE = 0.016) in evolved sound localization circuits. These findings show that the accuracy of transfer entropy measurements for detecting information flow in digital neural networks can vary significantly from one task to another.

Figure 4.5: Exact measures and misestimates of information flow by transfer entropy on essential gates of perfect circuits for the motion detection and sound localization tasks. Columns show mean values and 95% confidence intervals of misestimates and exact measures (A) per Brain, and (B) per gate.

4.3.3 Transfer Entropy Measurements from Recordings of Evolved Brains

In the previous section we estimated errors in information flow attribution using the error that each particular logic gate in Table 4.1 entails, and then calculated the total error using the gate-type distribution for each cognitive task. However, as mentioned earlier, this approach only gives a crude estimate of flow, because in the evolved cognitive circuits the neurons (and therefore the logic gates) are not independent, and their inputs are not in general maximum entropy. Here we use a different approach to assess the accuracy of transfer entropy measurements in identifying inter-neuronal relations of evolved Markov Brains: we record the neural activities of an evolved Brain while it performs a particular cognitive task, similar to the neural recording ("brain mapping") performed on behaving animals. We collect the recordings in all possible trials for each cognitive task and create a dataset for each evolved Brain for that cognitive task. More precisely, for Brains that evolved to perform the motion detection task we record neural firing patterns in 16 different trials. At the beginning of each trial, the Brain is in a state in which all neurons are quiescent. Then the Brain is updated three times, so we record the Brain's neural activity at 4 consecutive time steps (including the initial state). As a result, the recordings dataset of a Brain that performs motion detection consists of 64 snapshots of the Brain, i.e., the binary state of each neuron. Similarly, a Brain that performs sound localization is recorded during five different trials, and during each trial the Brain is recorded at four consecutive time steps. This results in a recording dataset of size 20 for each evolved Brain.
Note that these evolved Brains are deterministic; thus, if a Brain is recorded in the same trial multiple times, its behavior and neural activities remain exactly the same, and recording a Brain once per trial is therefore sufficient. We then use these recordings to measure the transfer entropy for every pair of neurons, TE_{𝑁𝑖→𝑁𝑗}, in the network. These transfer entropy measures can be used as a statistic to test whether a neuron 𝑁𝑖 causally influences another neuron 𝑁𝑗. Figure 4.6A shows the result of TE calculations performed on the neural recordings of a Markov Brain evolved for the sound localization task. To test the accuracy of the TE prediction, we construct an influence map for each neuron of the Markov Brain that shows which other neurons are influenced by that particular neuron. Such a mapping also determines the receptive field of each neuron, which specifies which other neurons influence a particular neuron.

Markov Brains evolve complex networks in which multiple logic gates can write to the same neuron, and as a result it is not straightforward to deduce input-output relations among neurons. Indeed, it was previously argued that even armed with complete knowledge of a given system, finding the causal relations among the components of the system may be a very difficult task [145, 144, 63]. To create our "ground truth" model of direct influence relations, we take into consideration two different components of a Brain's network. First, we take into account the input neurons of a gate and its output neuron, while also taking into consideration the type of the logic gate. For example, in the case of a 'ZERO' gate, where the output is always 0, we do not interpret this connection to reflect information flow (as there is no entropy in the output). Second, we analytically extract the binary state of each neuron as a Boolean function of all other neurons using a logic table of the entire Brain (a logic table of size 2¹⁶, for 16 neurons). This helps us rule out neurons that are connected as inputs to a logic gate while not actually contributing to the output neuron of that gate. Note that this procedure is especially helpful in cases where more than one logic gate writes into a neuron (the ultimate result is then the bitwise OR of all incoming signals, since any non-zero signal makes the neuron fire, i.e., sets its state to 1). Figure 4.6B shows an example of the "ground truth" influence map of neurons for a Brain evolved for sound localization. Each row of this plot shows the influence map of the corresponding neuron, and each column represents the receptive field of that neuron. Note that in this plot the values are binary, i.e., they are either 0 or 1, specifying whether a source neuron influences a destination neuron, whereas the TE measurements vary in the range [0, 1] bits. Keep in mind that this influence map is only an estimate of information flow gathered from the gate logic and connectivity shown in Figure 4.6C.

Figure 4.6: (A) Transfer entropy measures from neural recordings of a Markov Brain evolved for sound localization. (B) Influence map (also receptive field) of neurons derived from a combination of the logic-gate connections and the Boolean logic functions for the same evolved Markov Brain, shown in (C). (C) The logic circuit of the same evolved Markov Brain; neurons 𝑁0 and 𝑁1 are sensory neurons, and neurons 𝑁11-𝑁15 are actuator (or decision) neurons.
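Given a recordings dataset of the kind described above, the pairwise measurements can be organized as a matrix over all neuron pairs. The following Python sketch is an illustration only; it assumes the recordings are stored as one binary array per trial (a data layout chosen here for the example) and applies a plug-in estimator with history lengths 𝑘 = ℓ = 1, as in the sketch of Section 4.1.

import numpy as np
from collections import Counter
from math import log2

def pairwise_te_matrix(trials):
    """Pairwise transfer entropies TE_{N_i -> N_j} from neural recordings.

    trials : list of arrays, each of shape (time_steps, n_neurons), holding the
             binary neuron states recorded during one trial.
    Returns an (n_neurons x n_neurons) array of TE values in bits, pooling
    consecutive-time-step transitions across trials (history lengths k = l = 1).
    """
    n_neurons = trials[0].shape[1]
    te = np.zeros((n_neurons, n_neurons))
    for i in range(n_neurons):          # source neuron N_i
        for j in range(n_neurons):      # destination neuron N_j
            triples = Counter()
            for trial in trials:
                x, y = trial[:, i], trial[:, j]
                triples.update(zip(y[1:], x[:-1], y[:-1]))   # (y_{t+1}, x_t, y_t)
            total = sum(triples.values())
            pair_xy, pair_yy, single_y = Counter(), Counter(), Counter()
            for (y1, xt, yt), c in triples.items():
                pair_xy[(xt, yt)] += c
                pair_yy[(y1, yt)] += c
                single_y[yt] += c
            for (y1, xt, yt), c in triples.items():
                p_cond_x = c / pair_xy[(xt, yt)]              # p(y_{t+1} | x_t, y_t)
                p_cond = pair_yy[(y1, yt)] / single_y[yt]     # p(y_{t+1} | y_t)
                te[i, j] += (c / total) * log2(p_cond_x / p_cond)
    return te

# Toy usage with a hypothetical 3-neuron Brain recorded over four 4-step trials:
rng = np.random.default_rng(0)
toy_trials = [rng.integers(0, 2, size=(4, 3)) for _ in range(4)]
print(np.round(pairwise_te_matrix(toy_trials), 2))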
In order to compare TE measurements with influence maps, we first assume that any non-zero value of TE_{𝑁𝑖→𝑁𝑗} implies that there is some flow of information from neuron 𝑁𝑖 to 𝑁𝑗. We then evaluate how well TE measurements detect the information flow among neurons based on this assumption. In particular, for each evolved Brain we count 1) the number of existing pairwise information flows between neurons that are correctly detected by TE (hits), 2) the number of relations that are present in the influence map but were not detected by TE (misses), and 3) the number of pairwise information flows detected by TE measurements that, according to the influence map, do not exist (false alarms). Figures 4.7A and B show the performance of TE measurements in detecting information flow in Brains evolved for motion detection and sound localization, respectively (averaged across the best-performing Brains, with 95% confidence intervals). We observe that the number of false alarms in motion detection (mean = 19.0, SE = 0.86) is greater than the number of hits (mean = 6.8, SE = 0.20). Similarly, in sound localization the number of false alarms (mean = 45.1, SE = 1.63) is also greater than the number of hits (mean = 10.1, SE = 0.31), but significantly more so. This again underscores that the accuracy of transfer entropy measures strongly depends on the characteristics of the task that is being solved.

In the results shown in Figure 4.7 we assumed that any value of transfer entropy greater than 0 implies information flow. This assumption can be relaxed such that only transfer entropy values greater than a particular threshold imply information flow. We calculated TE measurement performance for a variety of threshold values in the range [0, 1]. The results are presented as receiver operating characteristic (ROC) curves that show hit rates as a function of false-alarm rates, along with their 95% confidence intervals, in Figures 4.7C and D for motion detection and sound localization, respectively [109]. In these plots, the dashed line shows a fitted ROC curve assuming Gaussian distributions for 𝑝(TE | information flow is present) and 𝑝(TE | information flow is not present). The resulting ROC function is

$$f(x) = \tfrac{1}{2}\,\mathrm{erfc}\!\left(\frac{\mu_1 - \mu_2}{\sqrt{2}\,\sigma_2} + \frac{\sigma_1}{\sigma_2}\,\mathrm{erfc}^{-1}(2x)\right) ,$$

where erfc is the complementary error function, erfc⁻¹ is its inverse, and (𝜇1, 𝜎1) and (𝜇2, 𝜎2) are the means and standard deviations of the two Gaussians. In the ROC plots, the data point with the highest hit rate (the right-most data point) is the normalized result shown in Figure 4.7A, B, that is, the analysis with a vanishing threshold. Note also that the data in Figure 4.7 represent hit rates against false-alarm rates for thresholds spanning the entire range [0, 1], implying that hit rates cannot be increased any further unless we assume there is an information flow between every pair of neurons (hit rate = false-alarm rate = 1). The false-alarm rates in the ROC curves are actually fairly low in spite of the significant number of false alarms we see in Figure 4.7A, B. This is due to the fact that the number of existing pairwise information flows in a Brain network is much smaller than the number of non-existing flows between pairs of neurons (the influence map matrices are sparse). Thus, when dividing the number of false alarms by the total number of non-existing information flows, the false-alarm rate is low.
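The hit and false-alarm rates underlying these ROC curves can be obtained by sweeping a threshold over the TE matrix and comparing against the binary influence map. The following Python sketch is illustrative only; te_matrix and influence_map stand in for the quantities described above, and the toy values are assumptions made for the example.

import numpy as np

def roc_points(te_matrix, influence_map, thresholds):
    """Hit rate vs. false-alarm rate of thresholded TE against the influence map.

    te_matrix     : array of pairwise TE values (bits)
    influence_map : binary ground-truth array of the same shape
    thresholds    : iterable of threshold values in [0, 1]
    """
    present = influence_map.astype(bool)   # pairs with an actual information flow
    absent = ~present                      # pairs without one
    points = []
    for thr in thresholds:
        detected = te_matrix > thr
        hit_rate = np.sum(detected & present) / present.sum()
        fa_rate = np.sum(detected & absent) / absent.sum()
        points.append((fa_rate, hit_rate))
    return points

# Toy usage with hypothetical values for a 3-neuron circuit:
te = np.array([[0.0, 0.8, 0.1],
               [0.0, 0.0, 0.6],
               [0.3, 0.0, 0.0]])
truth = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])
for fa, hit in roc_points(te, truth, [0.0, 0.25, 0.5]):
    print(round(fa, 2), round(hit, 2))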
Figure 4.7: Transfer entropy performance in detecting relations among neurons of evolved (A) motion detection circuits and (B) sound localization circuits. Presented values are averaged across the best-performing Brains, with 95% confidence intervals. Receiver operating characteristic (ROC) curves representing TE performance with different thresholds for detecting relations among neurons in evolved (C) motion detection and (D) sound localization circuits.

4.4 Discussion

We used an agent-based evolutionary platform to create digital Brains so as to quantitatively evaluate the accuracy of transfer entropy measurements as a proxy for measuring information flow. To this end, we measured the frequency and significance of cryptographic and polyadic 2-to-1 logic gates in evolved digital Brains that perform two fundamental and well-studied cognitive tasks: visual motion detection and sound localization. We evolved 100 populations for each of the cognitive tasks and analyzed the Brain with the highest fitness at the end of each run. Markov Brains evolved a variety of neural architectures that vary in the number of neurons, the number of logic gates, and the types of logic gates used to perform each of the cognitive tasks. In fact, both modeling [152] and empirical [56] studies have shown that a wide variety of internal parameters in neural circuits can result in the same functionality [111]. Thus, it would be informative, and perhaps necessary, to examine a variety of circuits that perform the same cognitive task [184].

An analysis of the evolved Brains suggests that selecting for different cognitive tasks leads to significantly different gate-type distributions. Using the error estimate for each particular gate due to encryption or polyadicity, we used the gate-type distributions for each cognitive task to estimate the total error in information flow stemming from using transfer entropy as a statistic. The transfer entropy misestimate was 1.33 bits (SE = 0.08) per Brain on average for Brains evolved for motion detection, whereas in evolved Brains performing sound localization the misestimate was significantly higher: 2.39 bits (SE = 0.12) per Brain on average. More importantly, the inherent differences between the two tasks result in different levels of accuracy when using transfer entropy measures to identify information flow between neurons. It is important to note that in calculating these misestimates, we only accounted for the misestimates that result from TE measurements in polyadic or cryptographic gates. However, we commonly face several other challenges when applying the transfer entropy concept to components of nervous systems (neurons, voxels, etc.). These challenges range from intrinsic noise in neurons to the inaccessibility of recording data for larger populations of neurons, which we discuss in more detail later.

We also tested how well transfer entropy can identify the existence of information flow between any pair of neurons using the statistics of neural recordings at two subsequent time points only. Because a perfect model for the "ground truth" of information flow is difficult (if not impossible) to establish, we use an approximate ground truth that uses the connectivity of the network, along with information from the (simplified) logic functions, to provide a comparison.
We find that TE captures many of the connections established by the ground truth model, with a true positive rate (hit rate) of 73.1% for motion detection and 78.7% for sound localization (assuming any non-zero value of transfer entropy implies information flow). The TE measurements miss some relations from the established ground truth while also providing demonstrably false positives, with a false-alarm rate of 7.7% in motion detection and 18.5% for sound localization. For example, some of the information flow estimates in Figure 4.6 manifestly reverse the actual information flow, suggesting a backwards flow that is causally impossible. Such erroneous backwards influence is possible, for example, when the signal has a periodicity that creates accidental correlations with significant frequency. Besides these false positives, the false negatives (missed inferences) are due to the use of information-hiding (cryptographic or obfuscating) relations, as discussed earlier.

It is noteworthy that in the transfer entropy measurements we performed, we benefited from multiple factors that are commonly great challenges in TE analysis of biological neural recordings. First, our TE measurement results were obtained using error-free recordings of noise-free neurons, while biological neurons are intrinsically noisy. We were also able to use the recordings from every neuron in the network, which presumably results in more accurate estimates. In contrast, in biological networks we only have the capacity to record from a finite number of neurons which, in turn, constrains our understanding of how information flows in the network. Furthermore, by focusing only on information flow from one time step to the next we can evade the complex issues posed by estimating causal influence, which requires finding optimal time delays in transfer entropies. For example, while a signal may influence a neuron's firing three time steps after it was perceived by a sensory neuron, it must be possible to follow this influence step-by-step in a first-order Markov process, as causal signals must be relayed physically (no action-at-a-distance). As a consequence, when using transfer entropy to detect and follow information flow, we can restrict ourselves to history lengths of 1 ($k = \ell = 1$), which significantly simplifies the analysis [107]. Furthermore, complications arising from discretizing continuous signals [199] do not arise, nor is there a choice in embedding the signal, as all our neurons have discrete states.

In principle, extending the history lengths (from $k = \ell = 1$ to higher) may be used to reduce false positives in entropy estimates (even for a first-order Markov process), for the simple reason that the higher dimensionality of state space reduces accidental correlations, given a finite sample set. However, such an increase in dimensionality has a drawback: it makes the detection of true positives more difficult (it increases the rate of false negatives) unless the dataset size is also increased. In many dynamical systems such an increase in data size is not an issue, but it may be very difficult (if not impossible) for smaller systems such as the simple cognitive circuits that we evolve. For those, the number of different "sensory experiences" is extremely limited, and increasing the dataset size does not solve the problem because it would simply repeat the same data.
In other words, unlike for large probabilistic systems, where generating longer time series will almost invariably exhaustively sample the probability space, this is not the case for motion detection and sound localization. For such "small" systems, increasing the history lengths reduces false positives, but increases false negatives at the same time.

Finally, in order to precisely calculate transfer entropy from Equation (4.1), the summation should be performed over all possible states of the variables $X_t$, $Y_t$, $Y_{t+1}$. Using only a subset of those states when calculating the entropy estimate may result in false positives, as well as false negatives. This is another common source of inaccuracy in TE measurements of neural recordings. Here we were able to generate neural recording data for all possible sensory input patterns and included them in our dataset, yet we still observe the described shortcomings in our results. This brings up another important point, namely that even if we introduce every possible sensory pattern to the network, we do not necessarily observe every possible neural firing pattern in the network, and as a result, we do not necessarily sample the entire set of variable states ($Y_{t+1}$, $Y_t$, $X_t$).

4.5 Conclusions

Our results imply that using pairwise transfer entropy has its limitations in accurately estimating information flow, and its accuracy may depend on the type of network or cognitive task it is applied to, as well as the type of data that is used to construct the measure. Higher-order conditional transfer entropies or more sophisticated measures such as partial information decomposition [204] may be able to alleviate those errors, at the expense of significant computational investments. We also find that simple networks that respond to a low-dimensional set of stimuli (such as the two example tasks investigated here) lead to problems in inferring information flow simply because transfer entropy estimates will be prone to sampling errors.

These findings highlight the importance of understanding the frequency and types of fundamental processes and relations in biological nervous systems. For example, one approach would be to examine transfer entropy in known systems, especially in simple biological neural networks, in order to shed light on the strengths and deficiencies of current methods. Performing an information flow analysis on brains in vivo will remain a daunting task for the foreseeable future, but advances in the evolution of digital cognitive systems may allow us a glimpse of the circuits in biological brains, and perhaps guide the development of other measures of information flow.

CHAPTER 5

MECHANISM OF DURATION PERCEPTION IN ARTIFICIAL BRAINS SUGGESTS NEW MODEL OF ATTENTIONAL ENTRAINMENT

5.1 Introduction

Our ability to deduce causation, and to predict, infer, and forecast, is linked to our perception of time. This activity of the brain refers to an inductive process that integrates information about the past and present to calculate the most likely future event [29]. Without a doubt, this ability is key to the survival and prosperity of an organism equipped with such a brain, allowing it to predict and decipher events in the world [134, 160] as well as the actions of other such organisms. A typical experimental procedure in the study of time perception is comparative duration judgement, in which subjects are asked to compare and judge the duration of events.
Generally, duration judgements display the scalar property, which implies that the probability distribution of judgements is scale invariant [53]. However, we do not perceive time objectively. Rather, the experience of temporal signals is highly subjective, and is influenced by non-temporal perception, attention, and memory [26, 118]. An example of non-temporal perception is the saliency of a stimulus (how much it stands out against a background), which may affect how it is perceived. Attention is another variable that can shape time perception [193, 37, 32, 108, 188]. Because our cognitive bandwidth is limited, we cannot pay attention to all sources of information equally [132]. Rather, a sophisticated mechanism selects which stimuli are attended to, and how much attention is allocated to them. A central hypothesis is that the more attention is devoted to the duration of an event, the longer it is perceived to last [193, 37, 32, 108, 188]. Proposed models of time perception that support this hypothesis, such as Scalar Expectancy Theory (SET) [54], usually assume that duration perception is performed with some sort of internal clock [53, 54, 191]. In that model, the onset of an event triggers a switch that starts measuring the accumulation of pulses generated by a pacemaker, and the end of the event triggers the stop switch. The effective rate of pulse accumulation, in turn, is modulated by the attention given to the stimulus. In SET, the amount of attention allocated to the stimulus is uniformly distributed in time. By contrast, in models such as Dynamic Attending Theory (DAT) [81, 82, 101], the temporal structure of the signal within which the stimulus is embedded may increase or decrease levels of attention over time. In particular, rhythmic backgrounds can entrain the brain so that it expects stimuli to occur periodically, which leads to peaks and troughs of attention. Consequently, models of attentional entrainment based on DAT posit that attentional rhythms internal to the cognitive architecture are synchronised by external rhythms, so that the external stimuli can then lead to enhanced processing of events that occur precisely when they are expected to occur [120, 122, 123].

Previous studies have provided support for DAT and related entrainment models, for example by showing that events that occur at rhythmically expected time points can be discriminated more easily than those that occur unexpectedly [101, 122, 83, 124, 129]. In a recent study, McAuley and Fromboluti provided additional support for DAT and related entrainment models by studying the role of attentional entrainment in event duration perception [121]. In that work, they used an auditory oddball paradigm in which a deviant tone (oddball) is embedded within a sequence of otherwise identical rhythmic tones (standard tones). Their results demonstrated that manipulations of oddball tone onset can lead to distortions in oddball tone duration perception. In particular, they observed a systematic underestimation of the duration of oddball tones that came early with respect to the rhythm of the sequence, and an overestimation of oddball duration in trials where oddballs arrived late with respect to the rhythm of the sequence.
Interval timing models such as DAT and SET and their computational counterparts usually take a top-down approach by engineering networks of high-level computational components that describe behavioural/psychophysical data in duration perception [53, 81, 82, 44, 117] (see also references in [5, 64, 61]). Some studies have employed more elaborate models that consist of neuron-scale components [28, 86]. Here, we take a bottom-up approach in which evolution leads to a population of diverse computational networks (artificial brains) consisting of lower-level components. These brains may differ in their components and possibly in their behaviours (higher-level computations). These modern computational methods have opened a new path towards understanding perception: the recreation, in silico, of neural circuitry that implements behaviour similar to human performance. While this capacity is still in its infancy and can therefore only emulate humans on fairly simple tasks (such as attentional entrainment), the usefulness of this tool for a future "experimental robotic psychology" [19, 2] is evident.

In this study, we use Darwinian evolution to create artificial digital brains (also known as Markov Brains [69], see Methods) that are able to perform duration judgements in auditory oddball paradigms. (Here and below, to avoid confusion, we use "Brain" with a capital B to denote artificial brains, while biological brains remain just "brains".) Markov Brains are networks of variables with discrete states that undergo transitions evoked by sensory, behavioural, or internal states, and they are capable of stochastic decisions. As such, they are abstract representations of micro-circuit cortical models [62], except that their dynamics are not programmed. We run 50 replicates of the evolutionary experiment (i.e., 50 different populations) and from each pick the best-performing Brain. These evolved Brains display behavioural characteristics that are similar to those of human subjects: for example, their discrimination thresholds satisfy Weber's law. In fact, these 50 Brains can be thought of as participants in a cognitive experiment. We then test these Brains against auditory oddball paradigms that they have never experienced before, in which the oddball tone may come early or late with respect to the rhythm of the sequence (similar to the first series of experiments in Ref. [121]). The evolved Brains show distortions in the perception of early/late tones similar to what was reported in human subjects [121]. We then analyse the algorithms and computations involved in duration judgement in order to discover how these algorithms result in systematic distortions in the perception of early/late oddballs. Our findings demonstrate that the computations involved in duration judgements and their distortions are quite different from those posited by existing time perception theories such as scalar expectancy theory (SET) or dynamic attending theory (DAT), and suggest a new theory of perception in which attention to uncertain parts of the stimulus plays the central role, whereas predictable parts require less attention (i.e., less processing) because they are expected [78]. This is consistent with recent findings that predictability of stimuli results in more rapid recognition [119].
We close with speculations that suggest a broader view in which all cognitive processing can be understood in terms of context-dependent prediction algorithms that pay attention only to those parts of the signal that are predicted to have the highest uncertainty, and are therefore likely to be informative.

5.2 Results

We evolve Markov Brains that are capable of duration judgements of an oddball tone placed in a rhythmic sequence of identical tones (standard tones), with a variety of standard tone durations and inter-onset-intervals (IOIs) (Fig. 5.1 shows a schematic of the auditory oddball paradigm). We ran 50 replicates of the evolution experiment for 2,000 generations and from each population picked the Brain with the highest performance at the end of the run. The best-performing Brains of all 50 populations achieve 98.0% fitness on average (see Fig. 5.10).

Figure 5.1: A schematic of the auditory oddball paradigm in which an oddball tone is placed within a rhythmic sequence of tones, i.e., standard tones. Standard tones are shown as grey blocks and the oddball tone is shown as a red block. Oddball tone duration may be longer or shorter than the standard tones.

5.2.1 Discrimination thresholds of evolved Markov Brains comply with Weber's law

We used the averaged responses of the evolved Brains to generate psychometric curves as follows. For each (IOI, standard tone) pair we averaged the decision responses of the 50 evolved Brains. Using these averaged responses, we generated psychometric curves corresponding to each standard tone as prescribed by [109] and calculated the point of subjective equality (PSE) and the just noticeable difference (JND). The PSE measures the duration for which Markov Brains respond longer (or shorter) 50% of the time which, in essence, marks the duration of the oddball that is perceived to be equal to the standard tone. The JND measures the sensitivity of the discrimination, or discrimination threshold, for a standard tone. In other words, the JND reflects the slope of the psychometric curve, where a steeper slope indicates higher discrimination sensitivity, or a lower discrimination threshold. The PSE reflects the accuracy of the perception while the JND indicates its precision. The values of PSE, JND, and their standard deviations are presented for all inter-onset-intervals and standard tones in Table 5.1.

Table 5.1: This table contains the point of subjective equality (PSE), the just noticeable difference (JND), and their standard deviations (SD), as well as the relative JNDs and the constant error (CE) of on-time oddballs, for all inter-onset-intervals and standard tones. Responses are averaged across all 50 Brains to generate psychometric curves.
(IOI, std tone)   PSE     PSE SD   JND     JND SD   relative JND   CE
(10, 5)           4.89    0.073    0.335   0.050    0.067          -0.109
(11, 5)           5.09    0.068    0.292   0.039    0.058           0.092
(12, 6)           5.92    0.083    0.458   0.051    0.076          -0.077
(13, 6)           6.27    0.072    0.339   0.050    0.057           0.265
(14, 7)           6.80    0.080    0.404   0.050    0.058          -0.204
(15, 7)           7.05    0.072    0.416   0.041    0.059           0.049
(16, 8)           7.76    0.067    0.380   0.037    0.047          -0.242
(17, 8)           8.05    0.064    0.402   0.038    0.050           0.051
(18, 9)           8.58    0.072    0.372   0.049    0.041          -0.417
(19, 9)           9.01    0.081    0.469   0.048    0.052           0.012
(20, 10)          9.76    0.078    0.403   0.052    0.040          -0.240
(21, 10)         10.45    0.093    0.442   0.045    0.044           0.448
(22, 11)         11.19    0.109    0.655   0.085    0.060           0.192
(23, 11)         11.99    0.116    0.756   0.075    0.069           0.993
(24, 12)         13.04    0.119    0.829   0.071    0.069           1.036
(25, 12)         13.82    0.128    0.900   0.079    0.075           1.819

According to Weber's law [47], the discrimination threshold (e.g., the JND) varies in proportion to the standard stimulus; therefore, the values of the relative JND, defined as the JND divided by the standard tone, should remain constant. Getty showed that empirical results on duration perception in the range of 80 msec to 2 seconds are explained very well by Weber's law [52]. Fig. 5.2A shows the psychometric curves generated from the averaged responses of all 50 Brains for every (IOI, standard tone). In this figure, durations are normalised by the standard tone. Psychometric curves for different standard tones overlap, which shows that the relative JNDs in all these trials are similar and confirms that they are in accordance with Weber's law.

Figure 5.2: (A) Psychometric curves generated from averaged responses of 50 evolved Brains for every (inter-onset-interval, standard tone). Oddball durations on the x-axis are normalised by the standard tone to lie in the range (-1, 1). (B) Relative JND values and their 95% confidence intervals as a function of (inter-onset-interval, standard tone). The dashed line shows the average value of the relative JNDs. (C) Constant errors, the difference between PSE and standard tone, and their 95% confidence intervals as a function of (inter-onset-interval, standard tone). The dashed line shows CE=0.

Fig. 5.2B shows relative JNDs as a function of standard tones. All relative JNDs are in the range between 0.04 and 0.07, with mean = 0.06 and standard deviation = 0.01, similar to the values found in [52]. The difference between the PSE and the standard tone, also known as the constant error (CE), shows the deviation of the perceived duration of a tone from its actual duration. The values of CE are shown for every (IOI, standard tone) in Fig. 5.2C, and we observe that for longer IOIs the CE values start to deviate slightly from zero. This deviation in PSE values for longer tone durations is also observed in human subjects [52]. However, this deviation of CEs for longer tones in Markov Brains differs from that of human subjects in that CE values in human subjects start decreasing for longer durations (they are negative), whereas in Markov Brains CE values increase (they are positive). This difference can be attributed to the fact that in the experiments described in Ref. [52] subjects do not receive any feedback about their performance in duration judgements, whereas Darwinian evolution provides feedback implicitly via selection. The mechanisms behind the distortion in duration perception for longer IOIs are explained in more detail in Additional Experiments and Analysis.
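For illustration, a minimal sketch of how the PSE and JND can be read off a fitted psychometric curve (Python; a cumulative-Gaussian fit and "JND as half the 25-75% spread" are one common convention, not necessarily the exact prescription of Ref. [109]; the example numbers are invented):

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def psychometric(durations, p_longer):
    """Fit a cumulative-Gaussian psychometric curve and return (PSE, JND).

    durations : oddball durations (time steps)
    p_longer  : fraction of 'longer' responses, averaged across Brains
    """
    f = lambda d, mu, sigma: norm.cdf(d, loc=mu, scale=sigma)
    (mu, sigma), _ = curve_fit(f, durations, p_longer, p0=[np.mean(durations), 1.0])
    pse = mu                        # 50% point: perceived equal to the standard tone
    jnd = sigma * norm.ppf(0.75)    # half the 25%-75% interquartile spread
    return pse, jnd

# Example (made-up response fractions): standard tone = 5, IOI = 10
durations = np.array([1, 2, 3, 4, 6, 7, 8, 9])
p_longer = np.array([0.0, 0.02, 0.10, 0.35, 0.70, 0.95, 0.99, 1.0])
pse, jnd = psychometric(durations, p_longer)
relative_jnd = jnd / 5     # Weber fraction relative to the standard tone
constant_error = pse - 5   # CE = PSE - standard tone
```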
5.2.2 Evolved Brains show systematic duration perception distortion patterns similar to human subjects

In the next step, we tested the evolved Markov Brains with stimuli that they had never experienced during evolution, namely oddballs that arrive early or late with respect to the rhythm of the sequence of tones (termed "test trials"). In the trials used during evolution ("training trials"), oddballs always occurred in sync with the rhythmic tones (on-time oddballs). The test trials included all possible oddball durations but also all possible oddball onsets, meaning oddballs were delayed or advanced by as many time steps as possible as long as they did not interfere with the following or preceding tone. We then used the average response of the 50 Brains to generate psychometric curves for early/late oddballs, and to calculate PSE values. We used the PSE values to calculate the duration distortion factor (DDF), defined as the ratio of the point of objective equality (the standard tone) to the point of subjective equality (PSE). Fig. 5.3 shows the DDF as a function of the onset of the oddball for all IOIs. In this plot, negative onset values stand for early oddballs and positive onset values represent late oddballs.

Figure 5.3: Duration distortion factors (DDF) and their 95% confidence intervals as a function of the onset of the oddball for all (IOI, standard tone) pairs. Negative onset values represent early oddballs and positive onset values represent late oddballs. A DDF greater than 1 shows an overestimation of the duration of the oddball and a DDF less than unity shows an underestimation of the duration of the oddball. The dashed line indicates DDF=1 and the dotted line shows the DDF for the on-time oddball tone.

A DDF greater than one shows an overestimation of the duration of the oddball, whereas a value less than unity reflects an underestimation of the duration of the oddball. Just as was observed with human subjects [121], late oddballs are perceived as longer and early oddballs are perceived as shorter compared to the standard tone. In addition, the more delayed (early) the oddball tone, the more its duration is overestimated (underestimated) compared to the standard tone, which is again consistent with the results presented in experiment 2 of Ref. [121].

5.2.3 Algorithmic analysis of the duration judgement task in Markov Brains

The logic circuits of evolved Markov Brains are complicated and defy analysis in terms of causal logic.
As observed before, these networks turn out to be "epistemologically opaque" [116], in the sense that their evolved logic does not easily fit into the common logical narratives we are familiar with. Rather than focus on the Boolean logic of Markov Brains, we here focus on their state space [55, 166]. In particular, we investigate the state transitions and how these transitions unfold in time, in order to discover the computations that are at the basis of the observed behaviour [16].

5.2.3.1 Temporal information about stimuli is encoded in sequences of Markov Brain states

Evolved Brains display periodic neural activation patterns in response to rhythmic auditory signals (this is, by definition, entrainment). These periodic neural firing patterns translate to loops in state transition diagrams (see Methods for more details on state transitions in Markov Brains). In each trial, the first few tones an evolved Brain listens to typically shift the Brain's activation pattern towards a region in state space that is associated with this rhythm. More precisely, the opening tones transition the Brain to a sequence of states that form a loop in the state-to-state diagram, and the Brain remains in that loop as long as the stimulus is repeated. Fig. 5.4A shows an instance of a Markov Brain state transition diagram when listening to rhythmic tones with IOI=10 and standard tone=5 in the absence of an oddball. The state of the Brain is calculated from equation (5.2). Supplementary Movie 1 shows the state-to-state transitions as the Brain listens to a sequence of standard tones. This sequence of Brain states encodes the contextual information about the stimuli, that is, the sequence forms an internal representation of the rhythm and the standard tone. More importantly, this sequence produces an expectation of future inputs that enables the Brain to compare the input it has sensed with future inputs. In particular, when the Brain receives the oddball, it usually transitions out of this loop to follow a different trajectory in state space (see for example Fig. 5.4B) to judge the oddball duration, which is a comparison mechanism between the standard tone (what is expected) and the oddball.

Figure 5.4: State-to-state transition diagram of a Markov Brain for IOI=10 and standard tone=5, with oddball tones of duration 5 and 6 shown in (A) and 4 shown in (B). Before the stimulus starts, all neurons in the Brain are quiescent, so the initial state of the Brain is 0. The stimulus presented to the Brain is a sequence of ones (representing the tone) followed by a sequence of zeros (denoting the intermediate silence). The stimulus at each time step is shown as the label of the transition arrow in the directed graph. The input sequence is shown for the standard and oddball sequences at the bottom of the state-to-state diagrams.
(A) State-to-state transition diagram of a Markov Brain when exposed to a standard tone of length 5, as well as a longer oddball tone of length 6. This Brain judges an oddball tone of duration 6 by following the same sequence of states as the original loop, because the transition from state 485 to 1862 occurs irrespective of the sensory input value, 0 or 1. This Brain correctly issues the judgement "longer" from state 3911, indicated by the red triangle at the end of the time interval (see Supplementary Movie 1 and Supplementary Movie 2 for the standard tone and the longer oddball tone, respectively). (B) The state-to-state transition diagram of the same Brain when presented with a shorter oddball tone of length 4. The decision state is marked with a down-pointing blue triangle. Once the Brain is entrained to the rhythm of the stimulus, the shorter oddball throws the Brain out of this loop. The exit from the loop transitions this Brain onto a different path. After four ones the Brain transitions to state 359 (instead of continuing to 485), and then continues along a path where it correctly judges the stimulus to be "shorter" in state 2884 (see also Supplementary Movie 3).

Fig. 5.5 shows that in most trials (77.6% of the trials) Brains evolve loops of the same size as the period of the rhythmic tones (the IOI), but some Brains have loops whose length is a multiple of the IOI. In this figure, the size of each marker is proportional to the number of Brains that evolve a particular loop length for each IOI. Also, further analysis shows that in 93.6% of trials, evolved Brains transition out of these loops at the exact time point where there is a mismatch between the oddball and the standard tone.

Figure 5.5: The distribution of loop sizes of the 50 evolved Brains for each inter-onset-interval (IOI). The size of the markers is proportional to the number of Brains (out of 50) that evolve a particular loop length for each IOI. The dashed line shows the identity function.

5.2.4 Algorithmic analysis of distortions in duration judgements: Experience and perception during misjudgements of early/late oddballs

The similarity of the behavioural characteristics of event duration perception between Markov Brains and human subjects appears to imply a fundamental similarity between the underlying computations and algorithms. In the following, we present brief definitions of concepts such as attention, experience, and perception in terms of state transitions in deterministic finite-state machines that are later used in our analysis (in Methods we present more formal definitions of these concepts and the reasoning behind them).

1) Attention to a stimulus: when a Brain is in state $S_t$ and transitions to state $S_{t+1}$ regardless of the stimulus (zero or one), we say the Brain does not pay attention to the input stimulus. More generally, a Brain pays less attention to an input stimulus or a sequence of stimuli if that input does not affect the state of the Brain at a later time, $S_{t+k}$.

2) Perception of a trial: the state of the Brain at the end of the oddball tone interval (when it issues the longer/shorter decision) is the Brain's perception of the tone sequence.

3) Experience of the stimuli: the temporal sequence of Brain states when exposed to a sequence of input stimuli constitutes the Brain's experience.

We first hypothesised that early or late oddball tones drive the Brain into states that it had never visited before (as these Brains had never previously experienced early or late oddball tones) and that these new states are responsible for misjudgements of early or late oddballs.
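To make the loop and new-state bookkeeping concrete, a small sketch follows (Python; the transition dictionary and function names are hypothetical, and treating the first repeated state as the start of an entrainment loop is only valid under the assumption that the stimulus is periodic from that point on):

```python
def run_brain(transition, stimulus, s0=0):
    """Replay a deterministic Brain: transition[(state, input_bit)] -> next state."""
    states = [s0]
    for bit in stimulus:
        states.append(transition[(states[-1], bit)])
    return states

def entrainment_loop(states):
    """Return one period of the loop the Brain settles into (the states between
    the first repeated state and its reoccurrence), assuming a periodic stimulus."""
    seen = {}
    for t, s in enumerate(states):
        if s in seen:
            return states[seen[s]:t]
        seen[s] = t
    return []

def new_states(test_states, training_states):
    """States visited in a test trial that never occurred during training trials."""
    return set(test_states) - set(training_states)
```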
When exposed to late or early oddballs, Brains visited on average 22.26 (SE=4.33) new states across the 50 evolved Brains, approximately 32% of the number of states they visited during trials with on-time oddballs, which is 69.80 states on average (SE=5.07). We then tested how often these new states are the decision states in misjudgements of out-of-time oddball tones. Our tests show that in such misjudgements, the Brain state at the decision time point is almost never a new state that has not appeared before (this happened in one test trial for one Brain out of 56,250 different test trials across all 50 Brains). Given that during misjudgements of out-of-rhythm oddballs the decision state is a state that had previously occurred during evolution, we test whether there is any connection between Brain states during these misjudgements and Brain states in training trials. In other words, we investigate how the experience during a misjudgement relates to the experiences the Brain had in its evolutionary history. In the next two sections, we address these questions by separately focusing on the perception and the experience of Markov Brains during misjudgements of out-of-rhythm oddball tones.

5.2.4.1 The onset of the tone does not alter a Brain's perception of the tone

Our null hypothesis is that the perception of an out-of-rhythm oddball tone may be any one of the states that the Brain has traversed in training trials, with equal probability. In any of these Brain states, the decision neuron will be either quiescent or firing, so we call the set of states with a quiescent decision neuron "shorter-judging states", denoted $S_{\mathrm{Sh}}$, and the set of states with a firing decision neuron "longer-judging states", denoted $S_{\mathrm{Lo}}$. Thus, the probability that a Brain at decision time is in any particular one of the shorter-judging states, for example, is calculated by

$\mathrm{Prob}(S_{\mathrm{decision}} \in S_{\mathrm{Sh}}) = \frac{1}{|S_{\mathrm{Sh}}|}$,   (5.1)

where $|S_{\mathrm{Sh}}|$ is the cardinality of the set of shorter-judging states, and similarly, $\mathrm{Prob}(S_{\mathrm{decision}} \in S_{\mathrm{Lo}}) = \frac{1}{|S_{\mathrm{Lo}}|}$.

We develop an alternative hypothesis that captures possible associations between the experience and perception during a misjudgement of an out-of-rhythm oddball and the experiences and perceptions the Brain had in training trials. In order to discover such possible associations, for any given misjudgement of an early or late oddball we limit our search domain to training trials with the same inter-onset-interval and standard tone as the misjudgement trial. In the next step, we search for correlations between the perception and various oddball tone properties, namely its 1) onset (the time step at which the oddball begins, $T_{\mathrm{init}}$), 2) duration ($\Delta T$), and 3) ending time point (the time point at which the oddball ends, $T_{\mathrm{fin}}$). To this end, we calculated the information shared between the perception and the oddball tone properties (see Methods for a detailed explanation of the information computation procedures). Fig. 5.6A shows the information shared between the perception (decision state $S_{\mathrm{decision}}$) of the Brains and 1) the oddball ending time (shown in grey), 2) the oddball onset (shown in blue), and 3) the oddball duration, for each inter-onset-interval and standard tone. These results show that the oddball ending time point is a better predictor of the perception than the oddball tone onset or its duration.
Note also that the information shared between the perception and the oddball ending time point remains consistent across all IOIs and standard tones, whereas the information shared between perception and oddball duration, and between perception and onset, decreases monotonically as the IOI and standard tone increase. Building on these results, we propose the following alternative hypothesis: during a misjudgement of an early or late oddball, a Brain goes through a state sequence that is reminiscent of experiences it had during trials with the same IOI and standard tone, and with on-time oddballs that end at the same time point as the early or late oddball (an example scenario is shown in Fig. 5.6B). In order to test this alternative hypothesis, we perform another test to measure how often the perception in a misjudgement of an early or late oddball is identical to perceptions in similar training trials.

Figure 5.6: (A) The mutual information between perception, i.e., the decision state of the Brain, and 1) the oddball tone ending time step (shown in black), 2) the oddball tone duration (shown in red), and 3) the oddball tone onset (shown in blue), with their 95% confidence intervals. (B) Sequence of inputs for a standard tone, an on-time longer oddball tone that is correctly judged as longer, and a shorter late oddball tone that is misjudged as longer; sequence of inputs for a standard tone, an on-time shorter oddball tone that is correctly judged as shorter, and a longer early oddball tone that is misjudged as shorter; and sequences of Brain states along with input sequences for on-time longer oddballs and shorter late oddballs. (C) The fraction of misperceived out-of-time oddball tones that resulted from having the same perception in on-time and out-of-time stimuli with the same oddball end points (left data point), compared to the null hypothesis: the likelihood that a Brain's misjudgement would be issued from any one of the states in the set of "shorter-judging" or "longer-judging" states (middle and right data points, respectively).

Consider for example a trial with IOI=10 and standard tone=5, with a late oddball tone (onset=2) that is shorter than the standard tone (duration=4), as shown in Fig. 5.6B. When a Brain misjudges this oddball as "longer" (with $S_{\mathrm{decision}} = 3911$ as shown in Fig. 5.6B), we search for instances in the set of training trials (with on-time oddballs) with IOI=10 and standard tone=5 in which that Brain issued a correct "longer" decision for an oddball that ended at the same time point as the late shorter oddball (as shown in Fig. 5.6B). The same analysis can be performed for misjudgements of early oddballs that are longer than the standard tone (Fig. 5.6B). We count the number of such instances for each Brain and divide the result by the total number of its misjudgements of out-of-rhythm oddball tones. Fig. 5.6C (left data point) shows the result of this analysis for all 50 Brains. This result shows that in the vast majority of cases (median 69.5%), the misjudged out-of-rhythm oddball and the on-time oddballs that end at the same time point are perceived the same.
In other words, the misjudgement is due to the Brain paying less attention to the onset of the tone, meaning that the onset of the oddball does not affect the ultimate state from which the decision is issued. The middle and right data points show the probabilities calculated from equation (5.1) under the null hypothesis, which measure how likely it is for a Brain to end up, by chance, in any particular "shorter-judging" or "longer-judging" state at decision time. Our statistical analysis shows that having the same decision state in out-of-rhythm and on-time oddballs (with the constraints explained above) is significantly more likely than landing by chance in a particular "shorter-judging" state (median=0.695 vs. median=0.069, Mann-Whitney $U = 2494.0$, $n = 50$, $p = 5.03 \times 10^{-18}$, one-tailed) or "longer-judging" state at decision time (median=0.695 vs. median=0.023, Mann-Whitney $U = 2500.0$, $n = 50$, $p = 3.51 \times 10^{-18}$, one-tailed); therefore, we reject the null hypothesis in favour of the alternative hypothesis. Based on these findings, we conclude that during misjudgements of early or late oddball tones, Markov Brains pay more attention to the end point of the oddball and less attention to the oddball duration or its onset. This is presumably because during evolution tones are always rhythmic, and Brains that entrain to the rhythm expect the oddball to be on-time. As a result, Brains pay more attention to when the oddball ends, which is a more informative component of the stimulus than its onset, which during evolution had no variation and hence no uncertainty.

5.2.4.2 Experience of early or late oddballs is similar to adapting entrainment to a phase change

Here we investigate the entire sequence of Brain states (the Brain's experience of the stimuli) for those instances found in the previous section in which the perception of the Brain in a misjudgement of an early/late oddball was the same as the perception of a shorter/longer on-time tone with the same end point as the out-of-time oddball. In order to compare two experiences, we use two different measures (experience comparison is a form of representational similarity analysis, see for example [96, 95]). First, we find the longest common sub-sequence that includes the decision state. In other words, we start from the decision state in the on-time and out-of-time sequences (note that the decision state is the same in both sequences), trace back the transitions in both sequences, and count the number of states that are identical in both sequences until the first mismatch occurs. The length of the identical portion of the two sequences is then normalised by the total length of one sequence (recall that the lengths of both sequences are the same) to lie in the range (0,1]; we term this normalised length of the identical portion of the experiences the similarity depth, since it measures how deeply the on-time and out-of-time oddball experiences are identical. We note that because the perception of the tone is the same in these trials, the similarity depth must be greater than zero. Second, we use the Jaccard index, which measures the overall similarity of the sequences by comparing states at the same positions in the two sequences. Fig. 5.7A shows the distributions of the similarity depth and the total similarity of experiences. Fig. 5.7B shows the distribution of the difference between the similarity depth and the total similarity. The difference between the two measures is zero in 91.5% of the cases, which implies that the experiences are almost always entirely different up to the point where they become identical.
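A sketch of the two similarity measures as described above (Python; the function names are hypothetical, and the position-wise reading of the Jaccard-style measure is an assumption; the dissertation's exact computation may differ):

```python
def similarity_depth(on_time, out_of_time):
    """Fraction of the trailing portion of two equal-length state sequences that
    is identical, counted backwards from the (shared) decision state."""
    assert len(on_time) == len(out_of_time)
    depth = 0
    for a, b in zip(reversed(on_time), reversed(out_of_time)):
        if a != b:
            break
        depth += 1
    return depth / len(on_time)

def total_similarity(on_time, out_of_time):
    """Fraction of positions at which the two sequences hold the same state
    (a position-wise 'Jaccard'-style similarity)."""
    matches = sum(a == b for a, b in zip(on_time, out_of_time))
    return matches / len(on_time)
```

When the two measures coincide (as they do in 91.5% of cases), all matching states sit in one contiguous block that ends at the decision state, which is exactly the "entirely different up to the point where they become identical" pattern described above.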
We observe a wide variety in these similarity measures, which shows that Brains do not traverse the exact same trajectory they did during an on-time trial; rather, the early or late oddball initially throws the Brain out of this trajectory, but later the Brain returns to states it experienced during an on-time oddball with the same end point. In other words, the onset of the out-of-time oddball is noticed; however, since the Brains are entrained to the rhythm and expect the oddball to be on-time, their computation of duration relies more on their expectation than on the actual start point of the oddball. This mechanism is reminiscent of adapting to phase changes in entrainment to rhythmic stimuli.

Figure 5.7: (A) Distribution of the similarity depth of experiences (sequences of states) of on-time and early/late oddball tones in trials in which the onset does not change the perception of the tone in Markov Brains. A similarity depth of one implies that the experiences are identical throughout the tone perception. (B) The distribution of the difference between the total similarity and the similarity depth in each trial.

5.3 Discussion

This study was aimed at elucidating the neural (mechanistic) underpinnings of perception, by evolving digital Brains that perform duration judgements of tones presented in a rhythmic sequence, and that were later subjected to out-of-rhythm oddball tones to quantify the distortions in duration judgement that occur as a response to the onset manipulation. We found that evolved Markov Brains display a capacity to discriminate tone length that is remarkably similar to people's ability to distinguish changes (quantified by Weber's law), to the extent that the observed relative JND of Markov Brains was in the same range (6-10%) as in some of the experiments in [52, 121]. Furthermore, evolved Markov Brains exhibit a systematic distortion in the perceived event duration of out-of-rhythm oddball tones that is also similar to what was observed in a human subjects study previously conducted by one of the authors. But while the conclusion of [121] was that the experiments supported the dynamic attending theory (DAT) of attentional entrainment (which, we recall, posits that entrainment creates peaks of attention that coincide with the start of each tone), we here find instead that Markov Brains pay attention to the end of the signal, and pay less attention to the onset.

From the point of view of Bayesian inference [92], a model of cognition that focuses attention on those parts of the signal that carry most of the uncertainty (the end of the stimulus) makes eminent sense. After all, Brains that have experienced only on-time stimuli should take the rhythmic nature of stimuli for granted: there is no need to pay attention to predictable stimuli. In fact, this view of cognition is fully consistent with the Hierarchical Temporal Memory (HTM) model of neo-cortical computation [67], which is based on the idea that brains are prediction machines. This model of attention differs from common models of visual processing and attention such as visual and auditory saliency [74, 87], because in those models only the contrast of the stimulus with the background is considered for saliency, not the value of the information it contains. The model is consistent, however, with neurophysical models in which temporal anticipation improves perception but does not affect the spontaneous firing rate [78, 119], which is associated with attention in visual processing [192].
The present work suggests a model of cognition in which the stimulus not only entrains the cognitive apparatus, but also conditions the brain to expect only a small subset of possible future states. From this point of view, any temporal history of stimuli leads to predictions that, most of the time, will come to pass, unless the environment has changed in a way that necessitates further attention. In particular, our findings suggest that both DAT and SET are incomplete models of time perception: DAT unduly emphasises attention peaks at the beginning of each tone in the sequence, while SET uses the onset and the end of the tone to start and stop a clock, contrary to our (admittedly digital) evidence.

The results presented here open up a number of different questions and avenues for future exploration. Can the theory of dynamical entrainment we present here be meaningfully tested in human experiments, by focusing on those predictions that distinguish it from established theories such as SET and DAT? Does this theory also explain observations in different sensory modalities such as vision? A program in which empirical studies using human subjects are coupled with sophisticated digital experimentation might provide an answer, and open up avenues for a detailed mechanistic understanding of the complexities of perception. Ultimately, this opens up the possibility of explaining phenomenological concepts such as attention, perception, and memory in terms of the state-space dynamics of cortical networks.

5.4 Methods

The use of mathematical and computational methods for the study of behaviour is growing, especially due to the unprecedented increase in our computational power [94]. Computational methods in particular enable us to perform a large number of "experiments" in silico, with parameters varying over a wide range, in a reasonably short time. Such experiments allow us to explore parameter space more broadly and to make predictions about conditions that have not been tested before and, more importantly, are currently beyond the reach of our empirical power. Naturally, for such computational experiments to have any explanatory power, they must be validated thoroughly with behavioural data. In this work, we use an agent-based model in which agents are controlled by artificial neural networks (ANNs) that differ in many important aspects from the more common ANN methods. Because the logic of these networks is determined by logic gates with the Markov property, we refer to these neural networks as Markov Brains [45]. Below, we describe the structure, function, and encoding of Markov Brains, but see [69] for a full description of their properties and how they are implemented. Markov Brains have been shown to be well-suited for modelling different types of behaviour observed in nature, from simple elements of cognition such as motion detection [184] and active categorical perception [116, 186], to swarming in predator-prey interactions [140], foraging [138], and decision-making strategies in humans [98].

5.4.1 Markov Brains

Markov Brains are networks of variables connected via probabilistic or deterministic logic gates with the Markov property. While we often term these variables "neurons", the state of a variable is more akin to a binary firing rate, that is, each neuron is a binary random variable (i.e., a bit) that may take two values: 0 for quiescent and 1 for firing. Fig. 5.8A shows a schematic of a simple Brain consisting of 12 neurons (labelled 0-11) at two subsequent time points $t$ and $t+1$.
The states of the neurons in this example are updated via two logic gates. Fig. 5.8B shows a gate that takes inputs from neurons 0, 2, and 6 and writes its output into neurons 6 and 7. This logic gate produces the output states of neurons 6 and 7 at time $t+1$ given the input states at time $t$. Each gate is defined by a probabilistic logic table in which the probability of each output pattern for a given input is specified. For example, in the probability table shown in Fig. 5.8C, $p_{52}$ specifies the probability of obtaining the output state $(N_6, N_7) = (1, 0)$ (a state with decimal representation '2') given the input state $(N_0, N_2, N_6) = (1, 0, 1)$ (decimal representation '5'), that is, $p_{52} = P(N_0, N_2, N_6 = 1, 0, 1 \to N_6, N_7 = 1, 0)$. Since this gate takes 3 inputs, $2^3$ possible inputs can occur, which are shown in eight rows. Similarly, this probabilistic table has four columns, one for each of the $2^2$ possible outputs. The sum of the probabilities in each row must equal 1: $\sum_j p_{ij} = 1$. When using deterministic logic gates (as in this study), all the conditional probabilities $p_{ij}$ are zeros or ones. In general, Markov Brains can contain an arbitrary number of gates, with any possible connection patterns and arbitrary probability values in the logic tables [69]. As is clear from this example, we do not implement the update of the Brain state using probabilities that are conditional on the environmental state $\vec{E}_t$; rather, we update the joint state $(\vec{E}_t, \vec{S}_t)$.

Figure 5.8: (A) A simple Markov Brain with 12 neurons and two logic gates at two consecutive time steps $t$ and $t+1$. (B) Gate 1 of (A) with 3 input neurons and 2 output neurons. (C) Underlying probabilistic logic table of gate 1. (D) Markov Network Brains are encoded using sequences of numbers (bytes) that serve as the agent's genome. This example shows two genes that specify the logic gates shown in (A), so that, for example, the byte value '194' that specifies the number of inputs $N_{\mathrm{in}}$ to gate 1 translates to '3' (the number of inputs for that gate).

In Markov Brains, a subset of the neurons is designated as sensory neurons that receive inputs from the environment. Similarly, another subset of neurons serves as actuator neurons (or decision neurons) that enable agents to take actions in their environment. In principle, an optimal Brain is designed in such a manner that a particular sequence of inputs (a time series of environmental states $\vec{\Sigma}_t = (\vec{\sigma}_1, \vec{\sigma}_2, \ldots, \vec{\sigma}_t)$) leads to a Brain state $\vec{S}_t$ that triggers the optimal response in that environment. Rather than using an optimisation procedure that maximises an agent's performance over the probabilities $P(\vec{S}_t \to \vec{S}_{t+1} \,|\, \vec{E}_t)$, we use an evolutionary process in which a Brain's entire network is encoded in a genome [208] and optimisation occurs through the evolution of a population of such genomes using a "Genetic Algorithm" (GA, see for example [127]). In particular, each gene specifies a gate's connectivity and its underlying logic, as shown in Fig. 5.8D. This evolutionary approach is explained in more detail in the following section.

5.4.2 Evolution of Markov Brains

Markov Brains can evolve to perform a variety of tasks representing different types of behaviours observed in nature. Selecting for any desirable task leads to the evolution of the network connections and logic-gate properties that enable the agents to succeed in their environment.
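Before turning to the genome encoding, a sketch of the gate update described in the previous section may help (Python; this is an illustrative toy, not the actual Markov Brain implementation: the gate wiring follows the Fig. 5.8 example, but the logic table here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Gate 1 of Fig. 5.8: reads neurons 0, 2, 6 and writes neurons 6, 7.
# Row i of `table` gives p(output state j | input state i); for a
# deterministic gate, each row is all zeros except for a single one.
inputs, outputs = [0, 2, 6], [6, 7]
table = np.zeros((2 ** len(inputs), 2 ** len(outputs)))
table[np.arange(8), rng.integers(0, 4, size=8)] = 1.0   # arbitrary deterministic logic

def update(neurons):
    """Apply the gate to the neuron vector at time t and return the vector at t+1."""
    new = neurons.copy()
    i = int("".join(str(neurons[k]) for k in inputs), 2)   # input pattern -> row index
    j = rng.choice(2 ** len(outputs), p=table[i])          # sample an output column
    bits = format(j, f"0{len(outputs)}b")                  # column index -> output bits
    for k, b in zip(outputs, bits):
        new[k] = int(b)
    return new

state_t = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]             # 12 neurons (cf. Fig. 5.8A)
state_t1 = update(state_t)
```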
Each genome is a sequence of numbers between 0 and 255 (bytes) that represents a set of genes encoding the logic and connectivity of the network. The arbitrary pair of bytes ⟨42, 213⟩ represents the "start codon" for each gate (Fig. 5.8D), while the downstream loci instruct the compiler how to construct the network, by encoding how many inputs and outputs define each logic gate, where the inputs come from (that is, which neuron or neurons), and where the gate writes to. In this manner, by "expressing" each gene, the network is fully determined via the connections between neurons and the logic those connections entail. Once a Brain is constructed, it is implanted in an agent whose performance is evaluated in an artificial environment that selects for the task. Those agents that perform best are rewarded with a differential fitness advantage. As these genomes are subject to mutation, heritability, and selection, they evolve in a purely Darwinian fashion (albeit asexually). The Genetic Algorithm specification details are shown in Table 5.2.

Table 5.2: Genetic Algorithm configuration. We evolved 50 populations of Markov Brains for 2,000 generations with point mutations, deletions, and insertions. We used roulette wheel selection, with 5% elitism, and with no cross-over or immigration.
Population size: 100
Generations: 2,000
Initial genome length: 5,000
Point mutation rate: 0.5%
Gene deletion rate: 2%
Gene duplication rate: 5%
Elitism: 5%

The population of Markov Brains evolves to judge the duration of an oddball tone ("longer" or "shorter") in multiple trials with different IOIs and oddball durations. The full set of (IOI, standard tone) pairs, the possible oddball tone durations, and the total number of trials for each pair used in the evolution is shown in Table 5.3. All told, there are 1,472 possible trials. However, agents are only evaluated on a subset of trials in every generation. This sampling increases the efficiency of evolution [24], helps to avoid overfitting, and enhances the generalisation of learning [209]. In each generation, we randomly pick 22 trials from each (IOI, standard tone) pair (each row in Table 5.3) to form the evaluation subset: 11 trials with a longer oddball and 11 trials with a shorter oddball, so as to prevent biasing Brains toward one response or the other. All agents of the population are then evaluated on that same subset of trials, which comprises 352 trials.

5.4.3 Experimental Setup

The Brains we evolve can have up to 16 neurons, of which one serves as the sensory neuron and one delivers the decision (the "actuator" neuron). The remaining 14 neurons can be used for computation and signal transduction, but how many of them are actually used is determined by evolution. The population of Markov Brains evolves to judge the duration of a deviant tone (oddball) within a rhythmic sequence of otherwise identical tones, similar to the experiments in [121] (see Fig. 5.9). In each trial, agents listen to a sequence of nine tones with a constant inter-onset-interval (IOI). An oddball that is either shorter or longer in duration than the other eight tones (standard tones) is embedded within this sequence. Markov Brains sense the stimulus in one of their neurons (here, neuron 0, see Fig. 5.9). Agents must decide whether the oddball stimulus is longer or shorter than the standard tones. The agent is rewarded for correct duration judgements and does not gain any reward or incur a penalty for incorrect judgements.
One neuron (neuron 15) in the Markov Brain is designated for delivering the decision ("longer" or "shorter").

Figure 5.9: (A) A schematic of the auditory oddball paradigm in which an oddball tone is placed within a rhythmic sequence of tones, i.e., standard tones. Standard tones are shown as grey blocks and the oddball tone is shown as a red block. (B) The oddball auditory paradigm converted to a sequence of binary values, as sensed by the input neuron of a Markov Brain. When a stimulus is present, a sequence of '1's (shown by black blocks) is supplied to the sensory neuron, while during silence a sequence of '0's is fed to the sensory neuron. Each block shows one time step of the sequence experienced by the Brain.

For the purpose of fitness evaluation, agents are evaluated in several trials with different inter-onset-intervals (IOIs), different standard tones, a wide range of oddball durations, and with oddballs placed in different positions in the sequence. Standard tones range from 5 time steps to 12 time steps. The IOI is approximately twice the standard tone, and ranges from 10 to 25. Oddball durations can take any value from the shortest possible duration (1 time step) up to the IOI minus 1, to avoid interfering with the next tone. During evolution, agents are not evaluated with oddball tones of the same duration as the standard tone, since such a tone is neither shorter nor longer than the standard tone. Oddballs can occur in the 5th, 6th, 7th, or 8th position, exactly as in the protocol of [121]. Our standard tones would be comparable in duration to those used in [121] if a digital time step is represented by a physical signal of about 70 msec duration. The set of all IOIs, standard tones, possible oddball-tone durations, and the total number of trials for each (IOI, tone) pair is given in Table 5.3. All agents of the population are then evaluated on that same subset of trials, half of which have a longer oddball and the other half a shorter oddball, to avoid creating a bias in the agents' judgements. This subset of randomly picked trials consists of 512 trials (out of a total of 2,852 trials): 22 trials for each (inter-onset-interval, standard tone) (see Table 5.3).

5.4.4 Discrete time in Markov Brains

The logic of Markov Brains is implemented by probabilistic or deterministic logic gates that update the Brain states from time $t$ to time $t+1$, which implies that time is discretised not only for Brain updates, but for the environment as well. Whether the brain perceives time discretely or continuously is a hotly debated topic [197], but for common visual tasks such as motion perception [198] discrete sampling of visual scenes can be assumed. For Markov Brains, the discreteness of time is a computational necessity. Because no other states (besides the neurons at time $t$) influence a Brain's state at time $t+1$, the gates possess the Markov property (hence the name of the networks). Note that even though the Markov property is usually referred to as the "memoryless" property of stochastic systems, this does not imply that Markov Brains cannot have memory. Rather, memory can be explicitly implemented by gates whose outputs are written into the inputs of other gates, or even of the same gate, i.e., back to itself [45, 116].
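For concreteness, a sketch of how the binary trial stimulus of Fig. 5.9B (Section 5.4.3) can be generated for on-time oddballs (Python; the function name is hypothetical, and it assumes each tone occupies exactly one IOI of time steps, tone followed by silence, as the figure suggests):

```python
def oddball_stimulus(ioi, standard, oddball_duration, oddball_position, n_tones=9):
    """Binary stimulus for one on-time trial (cf. Fig. 5.9B): nine tones with a
    constant inter-onset-interval; the tone at `oddball_position` (1-indexed)
    has a deviant duration.  '1' = tone present, '0' = silence."""
    bits = []
    for tone_index in range(1, n_tones + 1):
        duration = oddball_duration if tone_index == oddball_position else standard
        bits.extend([1] * duration + [0] * (ioi - duration))
    return bits

# Example: IOI = 10, standard tone = 5, a longer oddball (duration 7) in 6th position
trial = oddball_stimulus(ioi=10, standard=5, oddball_duration=7, oddball_position=6)
```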
5.4.5 Markov Brains as finite state machines

Because the Brains we evolve are deterministic, they effectively represent deterministic finite-state automata (DFAs). There is a considerable literature covering the mathematics of DFAs (see, for example, [72]), but very little of it is applicable to the automata we evolve here. For example, realistic evolved automata are unlikely to have absorbing states, their stationary distributions are irrelevant, and their transition graphs may contain both cyclic and acyclic parts.

Table 5.3: Complete set of all inter-onset-intervals, standard tones, and oddball durations used for the evolution of duration judgement. Oddballs can occur in either the 5th, 6th, 7th, or 8th position in the rhythmic sequence, and oddball durations are always either shorter or longer than the standard tone. The total number of trials for each pair ⟨IOI, standard tone⟩ is four times (IOI - 2) (excluding oddball duration = standard tone and oddball duration = IOI), because the oddball can appear in four different positions within the rhythmic sequence.

    (IOI, standard tone)   Oddball tone durations           Possible trials   Evaluation trials
    (10, 5)                {1, 2, 3, 4}, {6, 7, 8, 9}       32                22
    (11, 5)                {1, ..., 4}, {6, ..., 10}        36                22
    (12, 6)                {1, ..., 5}, {7, ..., 11}        40                22
    (13, 6)                {1, ..., 5}, {7, ..., 12}        44                22
    (14, 7)                {1, ..., 6}, {8, ..., 13}        48                22
    (15, 7)                {1, ..., 6}, {8, ..., 14}        52                22
    (16, 8)                {1, ..., 7}, {9, ..., 15}        56                22
    (17, 8)                {1, ..., 7}, {9, ..., 16}        60                22
    (18, 9)                {1, ..., 8}, {10, ..., 17}       64                22
    (19, 9)                {1, ..., 8}, {10, ..., 18}       68                22
    (20, 10)               {1, ..., 9}, {11, ..., 19}       72                22
    (21, 10)               {1, ..., 9}, {11, ..., 20}       76                22
    (22, 11)               {1, ..., 10}, {12, ..., 21}      80                22
    (23, 11)               {1, ..., 10}, {12, ..., 22}      84                22
    (24, 12)               {1, ..., 11}, {13, ..., 23}      88                22
    (25, 12)               {1, ..., 11}, {13, ..., 24}      92                22

We define the state of a Markov Brain as the vector of states of all neurons except the sensory ones [166, 62, 159]: \vec{S}_t = (N_p, N_{p+1}, ..., N_{n-1}), where N_i is the state of the i-th neuron, p is the number of sensory (or peripheral) neurons, (N_0, N_1, ..., N_{p-1}) is the state vector of the sensory neurons, and n is the total number of neurons. We abbreviate the Brain state using the decimal translation of the state vector:

    S_t = \sum_{i=p}^{n-1} N_i(t) \times 2^i .        (5.2)

The Brain state can be thought of as a snapshot of the entire Brain that contains information about the activity (firing rate) of all neurons at that particular point in time. Markov Brains go through discrete states as the agents they control behave, reminiscent of what has been observed in monkeys performing a localisation task [166]. In our experimental setup, Markov Brains have 16 neurons in total, so n = 16. One of the neurons senses the stimulus, i.e., p = 1, so Eq. (5.2) can be written as S_t = \sum_{i=1}^{15} N_i(t) \times 2^i, which means the Brain can be in at most 2^{15} = 32,768 different states. We also denote the sensory input at time t as σ_t, and define the sequence of sensory inputs from time t_0 to t_1 as Σ(t_0 : t_1) = (σ_{t_0}, σ_{t_0+1}, ..., σ_{t_1}). The initial Brain state is always 0, since all neurons are quiescent at the outset. State-to-state transitions of an evolved Brain can be represented as a mapping from the state of the Brain and the sensory input to the future state of the Brain.
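Before formalising these transitions, note that the Brain-state label of Eq. (5.2) is straightforward to compute from a vector of neuron states; a minimal sketch (the helper name is hypothetical):

    def brain_state(neurons, p=1):
        """Decimal label of a Brain state (Eq. 5.2): sensory neurons 0..p-1 are
        excluded, and hidden/actuator neuron i contributes N_i * 2**i."""
        return sum(n << i for i, n in enumerate(neurons) if i >= p and n)

    # 16 neurons, all quiescent except neurons 1 and 4  ->  2**1 + 2**4 = 18
    N = [0] * 16
    N[1] = N[4] = 1
    print(brain_state(N))   # 18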
Formally, the set of all transitions of the Brain over all states visited in trials can be viewed as a function T that takes the current state of the Brain S_t as well as the sensory input σ_t (in our experimental setup, just one bit) as its input, and returns the future state of the Brain, S_{t+1}, as its output:

    \mathcal{T} : (S_t, \vec{\sigma}_t) \mapsto S_{t+1}, \quad \text{or} \quad S_{t+1} = \mathcal{T}(S_t, \vec{\sigma}_t) .        (5.3)

We restrict the domain of the variable S_t to those Brain states that actually occur during training (i.e., evolution) or test trials (early/late oddball tones). This function can be illustrated as a directed graph in which Brain states are represented by nodes (labelled by the decimal translation of the Brain state, see Eq. (5.2)) and edges represent transitions, labelled by the stimulus σ that drives them (see [69] for a more detailed exposition of state-to-state diagrams).

5.4.6 Attention, experience, and perception in Markov Brains

We describe Markov Brains in terms of functions that take (S_t, σ_t) as the input and return S_{t+1} as the output.

Definition 1. If the Brain transitions from a particular state S_t to the same state S_{t+1} for all possible values of σ_t, we say that the Brain does not pay attention to the sensory input σ_t in state S_t.

Note that it is possible for the Brain not to pay attention to parts of the sensory input σ_t, namely when the transition from S_t to S_{t+1} occurs independently of specific components of the vector σ_t. We emphasise that when the Brain does not pay attention to a sensory input in one transition, this does not imply that the stimulus is not sensed. Rather, it implies that even though the stimulus is sensed, its value does not affect the Brain's computation when in state S_t. It is crucial here that this definition of attention to a stimulus depends not only on the stimulus itself but also on the context in which it is sensed; this context is represented by the state S_t the Brain has reached. Because the Brain has reached the state S_t as a consequence of the temporal sequence of states traversed, this context is in fact historical. Note also that the Brain state encompasses the actuator neuron (the decision neuron); therefore, "not paying attention" is reflected in an agent's behaviour as well as in the Brain's computations on sensory information. In a sense, the definition implies that an event that the Brain does not pay attention to should not alter its experience of the world, a concept that we now define.

Definition 2. We define the Brain's experience of the environment (which is sensed as a sequence of sensory inputs Σ(0 : t)) as the sequence of Brain states it traverses, i.e., as χ(0 : t) = (S_0, S_1, S_2, ..., S_t).

This definition implies that the experiences of different individual Brains can differ when they encounter the exact same sensory sequence; hence, experience is subjective [190, 189]. Furthermore, an agent may have experiences in which it does not take any action on its environment (does not make any physical changes to itself or to the world); dreaming or thinking are instances of such experiences in humans [172, 190, 189]. However, if the agent takes any actions in its environment, those actions become part of the experience by definition. For example, in our experimental setup Brains can only "take an action" in one particular time step of a trial. As a result, a sequence of states that excludes that time step is still an experience, but one that does not involve any actions by the agent.
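The transition function T of Eq. (5.3) and Definitions 1 and 2 can be operationalised directly on recorded state trajectories. The following sketch uses toy states and a one-bit input; all names and data are illustrative assumptions, not the analysis code used in this thesis.

    from collections import defaultdict

    def transition_table(recordings):
        """Partial transition function T of Eq. (5.3), built from recorded trials.
        Each recording is (states, inputs) with len(states) == len(inputs) + 1;
        the domain is restricted to states that actually occur."""
        T = {}
        for states, inputs in recordings:
            for t, sigma in enumerate(inputs):
                T[(states[t], sigma)] = states[t + 1]
        return T

    def ignores_input(T, state):
        """Definition 1 for a one-bit input: both input values map `state` to the
        same next state (assumes both entries were recorded)."""
        return T.get((state, 0)) == T.get((state, 1))

    def state_diagram(T):
        """Edges of the state-to-state diagram, labelled by the driving stimulus."""
        edges = defaultdict(list)
        for (s, sigma), s_next in T.items():
            edges[s].append((sigma, s_next))
        return dict(edges)

    # Toy recording of one trial: visited states and the one-bit input at each step.
    states = [0, 5, 12, 5, 12, 7]
    inputs = [1, 1, 0, 1, 1]
    T = transition_table([(states, inputs)])
    print(ignores_input(T, 12), state_diagram(T))   # False: state 12 pays attention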
It is also crucial to understand that the experience of the environment that is represented within Brain states is not just a naive projection of the world onto the Brain; rather, it contains integrated information about the relevant aspects of the environment (cues), while ignoring unimportant details (noise). In a very real sense, a Brain separates signal from noise, information from entropy [177]. In general, two different input sequences Σ_1(0 : t) and Σ_2(0 : t) will result in the Brain having two different experiences χ_1(0 : t) and χ_2(0 : t), but not necessarily. If the experiences χ_1(0 : t) and χ_2(0 : t) are exactly the same, it means (according to Definitions 1 and 2) that the Brain does not pay attention to the inputs during those transitions in which Σ_1 and Σ_2 differ. While in Definition 1 we only considered the Brain's transition at a single time step, we can also look at the sequence of future Brain states to discover how sensory inputs affect the Brain's computations and transitions multiple time steps after the input is sensed. Now consider two input sequences Σ_1(0 : t) and Σ_2(0 : t) that differ in the time steps (0 : t'), where t' < t, and suppose that Σ_1(0 : t) and Σ_2(0 : t) result in two different experiences χ_1(0 : t) and χ_2(0 : t). The effect of the sub-sequence Σ(0 : t') can be gauged by how different the resulting experiences χ_1(0 : t) and χ_2(0 : t) are. For example, if two input sequences Σ_1(0 : t') and Σ_2(0 : t') (during the interval 0 : t' where they differ) throw the Brain into two different regions of state space, and therefore give rise to completely different experiences, then those inputs disturb the experiences substantially. If, by contrast, Σ_1(0 : t') and Σ_2(0 : t') only result in different experiences temporarily (for example, during 0 : t'), while χ_1 and χ_2 become similar or identical later, then the differences in inputs are less disruptive to the Brain's experience. In particular, if the experiences have identical states at decision time t_d (assuming t_d ∈ [0 : t]), the differences in sensory inputs impact the experiences χ_1 and χ_2 even less. We emphasise that the Brain state at the point of decision is key, because at this time point in the trial the state of the Brain specifies the Brain's judgement and, more importantly, represents the path traversed in state space to reach this state. Consequently, we use the Brain state at decision time to define what it means to "perceive" a sensory input sequence.

Definition 3. If a Brain encounters two different input sequences Σ_1(0 : t) and Σ_2(0 : t), yet ends up in the same state S_t at decision time t in both cases, we say that the Brain had the same perception of the sensory sequences Σ_1(0 : t) and Σ_2(0 : t).

By this definition, "having the same perception" is a superset of "having the exact same experience" when encountering two different sensory sequences. As discussed earlier, if the Brain has the exact same experience when exposed to two different input sequences, it clearly does not pay attention to the sub-sequence of the inputs that is not common between the two sequences. In the same vein, how similar the experiences are for two different input sequences correlates with how little the Brain pays attention to those parts of the input sequences that are not the same. This correlation captures the idea that there are different levels of "not paying attention" to a phenomenon in the environment.
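Definition 3 likewise reduces to a simple check on the state reached at decision time. The toy sketch below (hypothetical transition table and helper names) shows two input sequences that produce different experiences yet the same perception:

    def walk(T, inputs, s0=0):
        """Experience (Definition 2): the Brain states traversed for an input sequence."""
        states = [s0]
        for sigma in inputs:
            states.append(T[(states[-1], sigma)])
        return states

    def same_perception(T, inputs1, inputs2, t_decision=-1, s0=0):
        """Definition 3: identical Brain state at decision time for two input sequences."""
        return walk(T, inputs1, s0)[t_decision] == walk(T, inputs2, s0)[t_decision]

    # Toy deterministic transition table over a one-bit input.
    T = {(0, 0): 0, (0, 1): 1, (1, 0): 2, (1, 1): 3,
         (2, 0): 4, (2, 1): 4, (3, 0): 4, (3, 1): 4, (4, 0): 4, (4, 1): 4}
    print(walk(T, [1, 0, 0]), walk(T, [1, 1, 0]))      # [0, 1, 2, 4] vs [0, 1, 3, 4]
    print(same_perception(T, [1, 0, 0], [1, 1, 0]))    # True: both end in state 4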
At the same time, it becomes clear that events that evoke the same perception (and thus similar experiences) must overlap in the significant parts of the sensory input. In this manner, the state of the Brain, being specific to the path in state space that leads to it, can encode "involuntary memory", in the same way that Marcel Proust's memories of the past [153] are triggered by the taste of a Madeleine dipped in Linden tea.

5.4.7 Information shared between perception and the oddball tone

Here we describe the procedure used to calculate the information shared between perception (the Brain state at the decision time step) and different oddball tone properties, such as its duration, onset, and ending time step. Markov Brains are tested against oddball tones varying in duration as well as in onset with respect to the rhythm of the sequence. For each individual Brain we create an ensemble of trials with the same inter-onset-interval and standard tone, in which the oddball tones differ in duration, onset, or both. We can then calculate the information shared between the perception of each individual Brain and the oddball properties, for a given inter-onset-interval and standard tone, using the standard Shannon information [38]:

    I(S_d : T_{ob}) = \sum_{s_d, t_{ob}} p(s_d, t_{ob}) \log \frac{p(s_d, t_{ob})}{p(s_d)\, p(t_{ob})} ,        (5.4)

where S_d denotes the Brain state at decision time (which we defined as perception) and T_{ob} denotes an oddball property, for example the oddball duration. The shared information between the perception and the oddball properties (duration, onset, and ending time step) captures the correlation between the perception of the Brain and each of the oddball properties. It is noteworthy that perception occurs after the oddball tone has arrived and terminated. Thus, the information in Eq. (5.4) measures how well each of the oddball tone properties can predict how the Brain perceives the tone.
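Eq. (5.4) can be estimated directly from an ensemble of trials by counting joint and marginal frequencies of the decision-time state and the oddball property. A minimal sketch with made-up data (base-2 logarithm, so the result is in bits; the function name is hypothetical):

    from collections import Counter
    from math import log2

    def shared_information(decision_states, oddball_props):
        """I(S_d : T_ob) of Eq. (5.4), estimated from paired samples
        (one (decision state, oddball property) pair per trial)."""
        n = len(decision_states)
        p_joint = Counter(zip(decision_states, oddball_props))
        p_s = Counter(decision_states)
        p_t = Counter(oddball_props)
        info = 0.0
        for (s, t), c in p_joint.items():
            p_st = c / n
            info += p_st * log2(p_st * n * n / (p_s[s] * p_t[t]))
        return info

    # Perception perfectly determined by oddball duration:
    print(shared_information([7, 7, 9, 9], [4, 4, 6, 6]))   # 1.0 bit
    # Perception unrelated to oddball duration:
    print(shared_information([7, 9, 7, 9], [4, 4, 6, 6]))   # 0.0 bit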
5.5 Additional Experiments and Analysis

Figure 5.10: (A) Mean fitness across all 50 lineages and its 95% confidence interval as a function of generation, shown every 20 generations. (B) Mean fitness (and 95% confidence intervals) of the best agents picked from each of the 50 populations after 2,000 generations, as a function of (inter-onset-interval, standard tone).

Figure 5.11: State-to-state transition diagram of a Markov Brain for IOI = 10 and standard tone = 5, with oddball tones of duration 4 and 6 whose onsets can be 2 time steps early or 2 time steps late.

5.5.1 Fitness landscape structure and historical contingencies result in Markov Brains using smaller regions of state space in trials with longer IOIs

In the main text we described how judgement accuracy deteriorates as the IOI (and therefore the tone lengths) increases. More specifically, even though the relative JND values remain in the same range for different IOIs and standard tones (see Fig. 2B in the main text), the PSE values start to deviate from the standard tone, leading to higher values of the "constant error" (CE), that is, the difference between PSE and POE (see Fig. 2C in the main text). Here we show that 1) the deviations of the PSEs for longer IOIs result from the structure of the fitness landscape and from historical contingencies (see, for example, [59, 17]), and 2) the mechanistic basis of these deviations is associated with the size of the state space Markov Brains use to encode stimulus characteristics.

As discussed before, Markov Brains display periodic firing patterns in response to rhythmic stimuli. These periodic patterns result in the formation of loops in their state transitions. This is the dominant mechanism by which Brains evolve to entrain to rhythmic stimuli and to encode temporal characteristics of the stimuli (i.e., the rhythm and the standard tone's duration). The distribution of the periods of these firing patterns, that is, of the lengths of the loops in the state-transition diagrams, is shown again in Fig. 5.12A. Since the first four standard tones are provided so that Brains entrain to the rhythm, we measured the period of the state transitions after the first four intervals, without an oddball tone. We also measured the number of distinct states each Brain visits during these periodic state transitions. Fig. 5.12B shows the distribution of the number of distinct states visited while traversing loops during entrainment, for the 50 evolved Brains and for each IOI.

Figure 5.12: (A) The distribution of loop sizes of the 50 evolved Brains for each inter-onset-interval (IOI). The size of the markers is proportional to the number of Brains (out of 50) that evolve a particular loop length for each IOI. (B) The distribution of the number of distinct states in loops visited by Markov Brains in a sequence of rhythmic standard tones, as a function of IOI. The dashed line shows the identity function.

Note that these data represent the number of distinct states in possibly multiple loops; it is therefore possible for a Brain to visit more states than the IOI. Note also that in traversing a loop once (in one period of the sequence) it is possible to visit some Brain states more than once. For example, the sequence 6, 3, 1, 1, 6, 3, 1, 1, ... has a period of 4, but only three distinct states are visited.
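The loop period and the number of distinct states visited during entrainment (the quantities plotted in Fig. 5.12) can be read off a recorded state trajectory. A minimal sketch that reproduces the 6, 3, 1, 1 example above; the function name and the skip parameter are illustrative (in the analysis the skip corresponds to the first four intervals):

    def entrainment_loop(states, skip):
        """Period of the periodic part of a state trajectory and the number of
        distinct states it visits, measured after the first `skip` time steps."""
        tail = states[skip:]
        for period in range(1, len(tail)):
            if all(tail[t] == tail[t + period] for t in range(len(tail) - period)):
                return period, len(set(tail[:period]))
        return len(tail), len(set(tail))

    # The sequence 6, 3, 1, 1, 6, 3, 1, 1, ... has period 4 but only 3 distinct states.
    traj = [0, 9, 6, 3, 1, 1, 6, 3, 1, 1, 6, 3, 1, 1]
    print(entrainment_loop(traj, skip=2))   # (4, 3)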
These results indicate that the number of distinct states visited by evolved Brains, i.e., the size of the state space used to encode temporal information, starts to plateau for longer IOIs. The duration judgement task in trials with longer IOIs and standard tones is inherently more difficult (see Fig. 5.10B), for two reasons. First, longer rhythms and durations require more memory and computation to encode temporal information; second, the number of possible oddball tones (in the range [1, IOI - 1]) is greater for longer IOIs than for shorter ones. As a result, Markov Brains need to use progressively larger regions of their state space to encode the temporal information and, moreover, need more evolutionary time to learn a larger number of patterns. However, state-space size does not grow linearly with IOI but rather begins to plateau (Fig. 5.12B), which in turn leads to less accurate duration judgements in trials with longer rhythms and to a systematic increase in PSE and CE values. This plateau in the utilisation of state space occurs not because of limitations in the Markov Brains' capacity but because of historical contingencies in evolution. More specifically, the fitness landscape is structured in such a way that Markov Brains evolve to perform the duration judgement task for shorter IOIs earlier in the evolutionary course. As a consequence, the algorithms that emerge later in evolution to perform the task for longer IOIs are built upon those that evolved earlier. To provide further support for these claims, we conducted a series of additional experiments. In the following sections we present results for the evolution of Markov Brains performing duration judgement in experimental setups that differ slightly from the original setup used in the main text.

5.5.1.1 Longer evolutionary time does not resolve systematic behavioural distortions in longer rhythms/standard tones

In the first set of additional experiments, we continued the runs presented in the main text (originally 2,000 generations) for a longer evolutionary time, namely 10,000 generations. Fig. 5.13 shows the fitness values of the best-performing agents, averaged across the 50 runs, as a function of IOI and colour-coded by evolutionary time. The average fitness values increase for all IOIs and standard tones as evolution proceeds; however, we still observe the same pattern of performance dropping as the IOI increases. Fig. 5.14 shows the CE values as a function of (IOI, standard tone) at different evolutionary time points. These results show that the constant errors for longer IOIs decrease with evolutionary time, but this decrease slows down considerably and, more importantly, a similar trend of CE versus (IOI, standard tone) is observed at all generations.

Figure 5.13: Mean fitness across all 50 lineages and its 95% confidence interval, colour-coded by evolutionary time, as a function of (inter-onset-interval, standard tone).

Fig. 5.15 shows the number of distinct states used to encode the temporal information for each IOI at different evolutionary time points. After 100 generations, the distributions of state-space size for shorter rhythms (IOIs 10-14) peak at the IOI (the identity function, shown as a dashed line), but as the IOI increases the peaks of the distributions start to deviate from the identity line and spread more widely. As evolution progresses, the distributions for a larger number of IOIs peak at the identity function, but in all panels of Fig. 5.15 (after different numbers of generations) the distributions that deviate from the identity line correspond to the longest IOIs. For example, after 2,000 generations the distributions for IOIs 23-25 are farthest from the identity line, and after 10,000 generations this is the case for IOIs 24 and 25. Recall that we observed a similar pattern in the CE values: at the beginning of evolution the CEs for shorter IOIs are around 0 but deviate from 0 for longer IOIs, and as the populations evolve further the CEs for more and more IOIs approach 0. Note that the size of the state space corresponding to each rhythm indicates how accurately the representation of that rhythm is encoded in the Brain.
And clearly, for longer IOIs Markov Brains do not use as accurate an encoding; therefore, their performance drops for longer IOIs and the CE values start to increase systematically. Here we investigate in more depth the correlation between the CE values and the size of the state space used by Markov Brains to encode temporal information. As discussed before, the optimal number of distinct states used to encode the stimulus characteristics is the length of the rhythm, i.e., the IOI. When the number of distinct states used to encode the rhythm length is smaller than the IOI, different time points within that interval have the same representation in the Brain, because the Brain must visit some state(s) more than once (at different time points). For example, consider a Brain that is entrained to a rhythm and is traversing a loop in state space. An oddball tone results in the Brain exiting that loop (we showed such an example in the main text). In this case, if the exit from the loop occurs from a repeated state in that loop, the Brain's experiences of oddballs that end at different time points would be exactly the same. Alternatively, when the number of distinct states visited while traversing the loops is greater than the IOI, the period of that loop is not the IOI but a multiple of it. This may also result in less accurate performance in the duration judgement task, for example in the judgement of oddballs of the same duration that occur in different positions (recall that oddball tones can occur at the 5th, 6th, 7th, or 8th position). In Fig. 5.12B we observed that, from early in evolution, the distribution of the number of distinct states in loops peaks at the IOI for shorter IOIs, and that increasingly many distributions move towards the IOI and accumulate around it.

Figure 5.14: Constant errors and their 95% confidence intervals for the 50 best-performing Brains as a function of (inter-onset-interval, standard tone) at different evolutionary times. The dashed line shows zero constant error.

Let \vec{D}_{IOI} = (d_{IOI}^1, d_{IOI}^2, d_{IOI}^3, ..., d_{IOI}^N), where d_{IOI}^i represents the number of distinct states the i-th Brain uses in its loops for a particular IOI, and N = 50 since we have 50 evolved Brains. Thus, each distribution in Fig. 5.15 can be represented by a vector \vec{D}_{IOI}.
We now calculate the distance of each distribution from the IOI as

    \delta_{IOI} = \| \vec{D}_{IOI} - IOI \|_0 = \lim_{p \to 0} \sum_i | d_{IOI}^i - IOI |^p ,        (5.5)

in which ‖·‖_0 denotes the ℓ0-norm of the vector (d_{IOI}^1 - IOI, d_{IOI}^2 - IOI, ..., d_{IOI}^N - IOI). In fact, δ_IOI simply reflects how many of the 50 Brains do not use exactly IOI distinct states in their loops. We calculated δ_IOI for each IOI and at different points in evolutionary time, and then normalised these δ_IOI by the maximum δ_IOI value. Fig. 5.16 shows the absolute CE values as a function of the normalised δ_IOI. Each data point shown in grey represents δ_IOI calculated for a distribution at a specific evolutionary time and a particular IOI in Fig. 5.15. We used a non-linear regression analysis [15] to find the correlation between CE and δ_IOI. Since a large number of data points fall around CE = 0 and in the lower range of δ_IOI (which is not surprising, since most trials result in CEs that are not significantly different from 0), we binned these data with a constant bin size. The mean values of the binned data and their standard deviations, as well as the fitted function, are also shown in Fig. 5.16. We tested three different kernel functions for the regression analysis: 1) a quadratic function, 2) a ramp function, and 3) a softplus function (f(x) = log(1 + e^x), a differentiable approximation of the ramp function). Table 5.4 shows the regression analysis results for the three kernel functions, which we compare using the Bayesian information criterion (BIC) [155]. These results show that the softplus function describes the pattern in the data better than the quadratic and ramp functions. This pattern can be interpreted as follows: there is no significant change in the CE values over a range of small δ_IOI, but beyond some threshold the CEs start to increase linearly with δ_IOI.

Figure 5.15: The distribution of the number of distinct states used to encode the rhythm and the standard tone duration, i.e., the number of distinct states in each loop, as a function of inter-onset-interval at different evolutionary times. The size of each circle is proportional to the likelihood of that loop size. The dashed line shows the identity function.

Figure 5.16: Absolute constant errors (CE), shown in grey, as a function of δ_IOI, together with the binned data and the fitted softplus curve.

Table 5.4: Non-linear regression analysis used to explain the correlation between the constant errors (CE) and δ_IOI, which is a function of the number of distinct states used to encode the stimuli. Shown are the residual sum of squares (RSS), the Bayesian information criterion (BIC), and BIC differences; a BIC difference > 10 provides very strong support for one model over the other [155].

    function    RSS    BIC       dBIC vs quadratic   dBIC vs ramp   dBIC vs softplus
    quadratic   6.49   -48.29    0                   -              -
    ramp        2.41   -83.02    34.73               0              -
    softplus    1.90   -91.39    43.10               8.37           0
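The quantities behind Table 5.4 can be sketched in a few lines: δ_IOI of Eq. (5.5) is a simple count, and the kernel fits can be compared with a Gaussian BIC. The parametrisation of the softplus kernel, the starting values, and the data below are assumptions made for illustration only, not the values used in the analysis.

    import numpy as np
    from scipy.optimize import curve_fit

    def delta_ioi(d, ioi):
        """Eq. (5.5): how many Brains do not use exactly IOI distinct states."""
        return int(np.count_nonzero(np.asarray(d) != ioi))

    def softplus(x, a, b, c):
        # Parametrised softplus kernel; the thesis only names the kernel shape
        # f(x) = log(1 + e^x), so the scale/shift parameters are an assumption.
        return a * np.logaddexp(0.0, b * (x - c))   # numerically stable log(1 + e^z)

    def bic(rss, n, k):
        """Gaussian BIC (up to an additive constant), k = number of parameters."""
        return n * np.log(rss / n) + k * np.log(n)

    print(delta_ioi([10, 10, 12, 9, 10], ioi=10))          # 2

    # Hypothetical binned data: |CE| versus normalised delta_IOI.
    x = np.array([0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
    y = np.array([0.1, 0.1, 0.2, 0.3, 0.8, 1.6, 2.4, 3.3])
    params, _ = curve_fit(softplus, x, y, p0=[1.0, 5.0, 0.7], maxfev=10000)
    rss = float(np.sum((y - softplus(x, *params)) ** 2))
    print(params, bic(rss, len(x), k=3))

The same fit-and-compare procedure would be repeated for the quadratic and ramp kernels, with the model differences read off from the BIC values, as in Table 5.4.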
5.5.1.2 Training Markov Brains equally in all IOIs and standard tones has a minor effect on behavioural deviations in longer rhythms

In this experimental setup we used the same set of inter-onset-intervals, standard tones, and oddball tones as in the original experimental setup. The only difference is that the number of evaluation trials for each (IOI, standard tone) is no longer constant (in the original setup we evaluate Brains on 22 trials for each (IOI, standard tone)); in this modified setup it increases linearly with the IOI. Table 5.5 shows the number of evaluation trials as well as the IOI, the standard tone, and the total number of possible trials for each (IOI, standard tone). Note that we kept the total number of evaluation trials in this setup, 368 (37.1% of all possible trials), as close as possible to that of the original setup, 352 (35.5% of all possible trials). Note also that the number of evaluation trials for each (IOI, standard tone) is chosen in proportion to the number of possible oddball tones for that (IOI, standard tone).

Table 5.5: Complete set of all inter-onset-intervals, standard tones, and oddball durations used for the evolution of the duration judgement task. Oddballs can occur in either the 5th, 6th, 7th, or 8th position in the rhythmic sequence, and oddball durations are always either shorter or longer than the standard tone.

    IOI   Standard tone   Oddball tone durations           Possible trials   Evaluation trials
    10    5               {1, 2, 3, 4}, {6, 7, 8, 9}       32                8
    11    5               {1, ..., 4}, {6, ..., 10}        36                10
    12    6               {1, ..., 5}, {7, ..., 11}        40                12
    13    6               {1, ..., 5}, {7, ..., 12}        44                14
    14    7               {1, ..., 6}, {8, ..., 13}        48                16
    15    7               {1, ..., 6}, {8, ..., 14}        52                18
    16    8               {1, ..., 7}, {9, ..., 15}        56                20
    17    8               {1, ..., 7}, {9, ..., 16}        60                22
    18    9               {1, ..., 8}, {10, ..., 17}       64                24
    19    9               {1, ..., 8}, {10, ..., 18}       68                26
    20    10              {1, ..., 9}, {11, ..., 19}       72                28
    21    10              {1, ..., 9}, {11, ..., 20}       76                30
    22    11              {1, ..., 10}, {12, ..., 21}      80                32
    23    11              {1, ..., 10}, {12, ..., 22}      84                34
    24    12              {1, ..., 11}, {13, ..., 23}      88                36
    25    12              {1, ..., 11}, {13, ..., 24}      92                38

Fig. 5.17 shows the CE values for this experimental setup as a function of (IOI, standard tone) at different evolutionary time points. The same trend in the CE values observed in the original setup is evident in these experiments as well. In particular, after 2,000 generations the CEs for (IOI, standard tone) = {(23, 11), (24, 12), (25, 12)} are significantly different from 0 and, similarly, after 10,000 generations the CE for (25, 12) is significantly different from 0.
Fig. 5.18 shows the state-space sizes as a function of IOI at different evolutionary time points. Similar to the trends observed in the original setup, the state-space sizes plateau as the IOI increases; their distributions are slightly closer to the identity function (dashed line), but not significantly so. Thus, we conclude that having the same training-set size for all IOIs has little to do with the distorted behaviour in longer rhythms. Fig. 5.19 shows the binned CE values as a function of δ_IOI, together with the fitted softplus function. We performed the non-linear regression analysis described above for this experiment as well; the results are presented in Table 5.6. As in the previous experiment, the softplus function describes the relation between the CE values and δ_IOI better than the other two models.

Table 5.6: Non-linear regression analysis used to explain the correlation between the constant errors (CE) and δ_IOI, which is a function of the number of distinct states used to encode the stimuli. Shown are the residual sum of squares (RSS), the Bayesian information criterion (BIC), and BIC differences.

    function    RSS    BIC        dBIC vs quadratic   dBIC vs ramp   dBIC vs softplus
    quadratic   5.36   -66.42     0                   -              -
    ramp        1.59   -113.77    47.35               0              -
    softplus    1.40   -118.65    52.23               4.88           0

Figure 5.17: Constant errors and their 95% confidence intervals for the 50 best-performing Brains as a function of (inter-onset-interval, standard tone) at different evolutionary times. The dashed line shows zero constant error.

5.5.1.3 Constant errors in the longest rhythms are greater than zero regardless of trial-set size

To show that the deviations of the PSE (from the point of objective equality, i.e., the standard tone) for longer IOIs and standard tones are not specific to a particular value of IOI or standard tone, we used two further experimental setups: one with a smaller set of (IOI, standard tone) pairs, with shorter IOIs and standard-tone durations, and one with a larger set of (IOI, standard tone) pairs, with longer rhythms and standard tones. The first training set is similar to the original experimental setup, but we excluded trials with the following inter-onset-intervals and standard tones: {(23, 11), (24, 12), (25, 12)}. As in the original setup, oddball tones can vary from 1 to IOI - 1. In this experimental setup there are 728 possible trials, and all agents are evaluated on 20 trials from each (IOI, standard tone) pair (10 with longer and 10 with shorter oddball tones), which is 35.7% of all possible trials (in the original setup the evaluation set was 35.5% of all possible trials).
Figure 5.18: The distribution of the number of distinct states used to encode the rhythm and the standard tone duration, i.e., the number of distinct states in each loop, as a function of inter-onset-interval at different evolutionary times. The dashed line shows the identity function.

Fig. 5.20 shows the mean constant errors as a function of (inter-onset-interval, standard tone) at different evolutionary times for this experimental setup. The increase in CEs is again observed for longer IOIs. Noticeably, after 2,000 generations all 50 Brains perform the duration judgement task perfectly in trials with (IOI, standard tone) = {(10, 5), (11, 5)} (100% performance for all oddball tones in those rhythms), and Brains perform the task perfectly for more (IOI, standard tone) pairs in later generations; for example, after 10,000 generations Brains perform perfectly for (IOI, standard tone) = {(10, 5), (11, 5), (12, 6), (14, 7)}. Cognitive scientists and psychophysicists are not, in general, interested in "trivial" experiments in which all subjects answer 100% of the questions correctly; accordingly, we did not design our experimental setup such that Brains evolve to achieve 100% fitness either.

Figure 5.19: Absolute constant errors (CE), shown in grey, as a function of δ_IOI, together with the binned data and the fitted softplus curve.

Fig. 5.21 shows the state-space size distributions as a function of IOI at different evolutionary time points. It is again evident that the state-space sizes start to plateau for longer IOIs, though not as drastically as in the original setup. The CE values, as well as the binned means and their standard deviations, are shown as a function of δ_IOI in Fig. 5.22, in which the blue dashed line shows the fitted softplus function. The results of the non-linear regression analysis are shown in Table 5.7. We again observe that the softplus function describes the relation between the CE values and δ_IOI better than the other two functions.

The second experimental setup contains all the trials of the original setup, to which we added the following inter-onset-intervals and standard tones: {(26, 13), (27, 13), (28, 14), (29, 14)}. In this experimental setup there are 1,400 possible trials, and all agents are evaluated on 24 trials from each (IOI, standard tone) pair (12 with longer and 12 with shorter oddball tones), which is 34.3% of all possible trials, so as to maintain the same ratio of evaluation trials to all possible trials.
Fig. 5.23 shows the mean constant errors as a function of (inter-onset-interval, standard tone) at different evolutionary times for this experimental setup. These results show a similar pattern in the CE values; more importantly, the CEs for (inter-onset-interval, standard tone) = {(23, 11), (24, 12), (25, 12)} are not significantly different from 0, whereas in the original experiment the CEs for those same trials were significantly different from 0. Fig. 5.24 shows the state-space size distributions as a function of inter-onset-interval at different evolutionary time points. We again observe that the state-space sizes start to plateau for longer IOIs, though not as drastically as in the original setup. We performed the non-linear regression analysis on these data as well; the results are shown in Table 5.8. As in the previous results, the softplus function describes the relation between the CE values and δ_IOI better than the other two models. The CE values, the binned means and standard deviations, and the fitted softplus function are shown in Fig. 5.25.

Table 5.7: Non-linear regression analysis used to explain the correlation between the constant errors (CE) and δ_IOI, which is a function of the number of distinct states used to encode the stimuli. Shown are the residual sum of squares (RSS), the Bayesian information criterion (BIC), and BIC differences.

    function    RSS    BIC        dBIC vs quadratic   dBIC vs ramp   dBIC vs softplus
    quadratic   2.91   -76.33     0                   -              -
    ramp        1.48   -99.97     23.64               0              -
    softplus    0.98   -114.51    38.18               14.54          0

Table 5.8: Non-linear regression analysis used to explain the correlation between the constant errors (CE) and δ_IOI, which is a function of the number of distinct states used to encode the stimuli. Shown are the residual sum of squares (RSS), the Bayesian information criterion (BIC), and BIC differences.

    function    RSS    BIC        dBIC vs quadratic   dBIC vs ramp   dBIC vs softplus
    quadratic   7.03   -53.20     0                   -              -
    ramp        1.03   -126.34    73.14               0              -
    softplus    0.89   -131.92    78.72               5.58           0

These results reaffirm that entrainment and the duration judgement task become much more difficult for longer (IOI, standard tone) pairs and with a greater set of trials, and furthermore that Markov Brains do have the capacity to use larger regions of the state space and to perform more accurately for longer IOIs. However, historical contingencies in such fitness landscapes lead to less accurate duration-judgement strategies for longer IOIs, which results from using smaller regions of the state space.

Figure 5.20: Constant errors and their 95% confidence intervals for the 50 best-performing Brains as a function of (inter-onset-interval, standard tone) at different evolutionary times. There are some missing data points in these plots because in those trials all 50 Brains perform at 100%; as a result, the PSE is exactly equal to the standard tone and the slope of the psychometric function would be infinite. The dashed line shows zero constant error.
Figure 5.21: The distribution of the number of distinct states used to encode the rhythm and the standard tone duration, i.e., the number of distinct states in each loop, as a function of inter-onset-interval at different evolutionary times. The dashed line shows the identity function.

Figure 5.22: Absolute constant errors (CE), shown in grey, as a function of δ_IOI, together with the binned data and the fitted softplus curve.

Figure 5.23: Constant errors and their 95% confidence intervals for the 50 best-performing Brains as a function of (inter-onset-interval, standard tone) at different evolutionary times. There are some missing data points in these plots because in those trials all 50 Brains perform at 100%; as a result, the PSE is exactly equal to the standard tone and the slope of the psychometric function would be infinite. The dashed line shows zero constant error.
Figure 5.24: The distribution of the number of distinct states used to encode the rhythm and the standard tone duration, i.e., the number of distinct states in each loop, as a function of inter-onset-interval at different evolutionary times. The dashed line shows the identity function.

Figure 5.25: Absolute constant errors (CE), shown in grey, as a function of δ_IOI, together with the binned data and the fitted softplus curve.

CHAPTER 6

CONCLUSION

In this thesis, I used neuroevolution to study the evolution of some of the most fundamental neural circuits: 1) visual motion detection, 2) intraspecific collision avoidance using visual motion cues, 3) sound localization, and 4) event duration perception in rhythmic auditory stimuli. In particular, I used the Markov Brains platform, which uses in silico Darwinian evolution, via a genetic algorithm (GA), to train neural networks that consist of binary neurons connected via logic gates. As explained in depth earlier, the network's structure and its computations are both subject to evolution, which is an attempt to simulate how these neural circuits evolved in nature in the first place. This bottom-up approach contrasts with more common methods in computational neuroscience and artificial intelligence, in which researchers design the rule-based systems, the network structure, and its components by hand.
The evolutionary process and specific properties of the Markov Brains platform make it a more plausible model of neural circuits in many respects. The Markov Brains platform makes it possible to explore the structure, complexity, and functionality of evolved neural circuits. For example, in chapters 2 and 4 I used a gate-knockout analysis to investigate which types of logic gates are essential in evolved motion detection and sound localization circuits, and I characterised the distribution of the different types of logic gates that contribute to these neural circuits. In addition to analyzing the network structure and its components, it is also possible to test and analyze evolved agents in environments that are completely different from the environments in which they evolved. This approach is particularly useful for isolating environmental factors that could play a role in the evolved behavior. In chapter 3, for example, I used a behavioral analysis in which the environmental factor under investigation was the apparent motion of a moving object (robot) in the agent's visual field, namely regressive or progressive motion. Similarly, in chapter 5 I evolved brains that can judge the duration of an auditory stimulus in a rhythmic sequence and then tested these evolved brains when exposed to out-of-rhythm oddball tones. Last but not least, the algorithms and computations of Markov Brains can be described in terms of their state-space transitions. In chapter 5, for the first time, I implemented a new technique that records a Markov Brain's neural activity as a sequence of transitions from one discrete state to another. In this type of analysis, a Markov Brain is represented as a finite state machine (FSM), which allows us to explore its state space and analyze the brain's trajectories through it when experiencing different stimuli in the environment, in order to discover the algorithms and mechanisms behind its behavior. In summary, I was able to use this approach to address different questions and hypotheses regarding fundamental neural circuits, the so-called "widgets of intelligence". In what follows, I briefly recapitulate some of these findings, discuss the lessons I learned along the way in each project, and describe how those lessons helped me improve the design and conduct of subsequent research projects.

6.1 Visual Motion Detection

In chapter 2 I studied visual motion detection and found that evolution leads to a wide diversity of neuronal circuits, even though each has the same function. I also observed that most circuits are more complex than one of the standard motion detection circuit models, the Reichardt detector, and showed that this increase in complexity is due to redundancy in the evolved circuits' structure. Measurements of mutational sensitivity showed that the evolved circuits were subject to selective pressures beyond the basic functionality. But perhaps the most significant finding of this project was that the wide diversity I observed in the evolution of Markov Brains performing motion detection is in accordance with patterns previously shown in the evolution of genetic circuits [194] and of functional systems based on biochemistry [200], as well as with modeling and empirical studies of neuronal circuits with fixed wiring structure [152, 56]. This observation was the first stepping stone in establishing Markov Brains as a model system for the study of neural circuits evolved by Darwinian natural selection.
This study was also insightful for me in terms of experimental design decisions and research conduct. One example of such a design decision concerned how to read the output neurons of the motion detection circuits. Initially, I tried a few common implementations: 1) assign three output neurons, one to each class; 2) assign two neurons as outputs and read them as a two-bit binary value with possible outputs 00, 01, 10, and 11. In the course of running the experiments I found that these options have low evolvability, especially because one of the classes (stationary object) is more common than the others (preferred direction and null direction). So I came up with a solution in which I assigned two different output patterns (01 and 10) to stationary objects. The reasoning behind this decision was that I considered the sum of the output values to be the firing rate of the output neuron; indeed, in the biological motion detection circuits of fruit flies, the motion state is encoded in terms of a neuron's firing rate. The other design decision was how to evaluate Markov Brains on the visual input patterns. There were 16 possible input patterns, falling into 3 different categories, and their frequencies are not uniform: 10 of the patterns correspond to stationary objects, 3 to the preferred direction, and 3 to the null direction. I tried two different approaches. First, I evaluated Markov Brains with a fixed number of input patterns (for example 20) in which the probability distribution over the different classes is uniform. Obviously, in this approach there are many repeated evaluations of the preferred direction (PD) and null direction (ND) classes. I therefore came up with a second solution in which I eliminated all repetitions in the evaluations: I evaluate each agent with each of the 16 possible input patterns once, but assign different reward values to patterns that are more abundant. In other words, I constructed a non-uniform fitness function based on the non-uniform frequencies of the three output classes. These two approaches led to two different evolutionary outcomes, but their differences were not significant for the results presented in [184]. These were all valuable lessons that shaped the experimental design of the sound localization and time perception projects.
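As a toy illustration of the output-reading scheme described above: the sum of the two output bits is treated as a firing rate, and a rate of 1 (patterns 01 or 10) codes for a stationary object. Which of the remaining rates corresponds to the preferred versus the null direction is an assumption made here for concreteness; only the stationary mapping is stated above.

    def classify_motion(out_a, out_b):
        """Read two binary output neurons as a firing rate (their sum).
        Rate 1 (patterns 01 or 10) -> stationary; the mapping of rates 0 and 2
        to null vs. preferred direction is an illustrative assumption."""
        rate = out_a + out_b
        return {0: "null direction", 1: "stationary", 2: "preferred direction"}[rate]

    print(classify_motion(0, 1), classify_motion(1, 1))   # stationary, preferred direction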
6.2 Intraspecific Collision-Avoidance Strategy based on Apparent Motion Cues

In chapter 3, I studied the intraspecific collision avoidance strategy based on apparent motion cues that was observed in Drosophila melanogaster [207, 33]. High-throughput data along with mathematical analysis provided evidence for a strategy in which apparent back-to-front motion (regressive motion) on a fly's retina is a cue to avoid collisions. I investigated possible selective pressures and environmental conditions for the evolution of this strategy. I showed that even though it is possible to evolve collision avoidance behavior in Markov Brains that uses regressive motion as the cue, it is highly unlikely that collision avoidance was the selective pressure behind the evolution of the observed behavior. The results of my evolutionary experiments clearly showed that the described behavior only evolves in a narrow range of experimental setups. Furthermore, I performed a mathematical analysis in which I calculated the probability of collision in events that generate an apparent regressive motion on a fly's retina. This analysis showed that in the experimental setup used in [207], only 20% of such events end in collisions.

As discussed before, I managed to evolve Markov Brains that show a behavior similar to that observed in fruit flies. It is worth mentioning, however, that I tried a few different experimental setups and the described behavior did not evolve at first. First, I used a setup in which a group of flies were positioned in a two-dimensional arena, gained rewards for walking, and incurred a penalty for colliding with other flies (the fitness function is the same as the one described in chapter 3). I also tested agents with and without the ability to make turns to avoid collisions. The observed behavior did not evolve in any of these setups. In particular, in the setup where agents had the ability to turn, agents evolved to circle around individually in a small space, thereby avoiding collisions while walking constantly (which was not surprising in hindsight). In a different setup, I put two flies in an arena without the ability to turn. The desired behavior did not evolve in this setup either: the optimal strategy that evolved was for agents to stop once they sensed another fly, regardless of the direction of the apparent motion. As a result, I recreated an experimental setup very similar to that used by [207], with a moving object that created a progressive or regressive motion in the agent's visual field. This setup also led me to perform an analysis of what percentage of the events that create a regressive motion on the retina result in a collision. I believe the most valuable lesson I learned in this project was to analyze and benefit from negative results, and also to start with simpler building blocks and make sure they work before proceeding to a more complicated task. The latter lesson, i.e., building the simpler components of a bigger system first, was in fact the incentive behind the visual motion detection project.

6.3 Information Flow in Motion Detection and Sound Localization Circuits

In chapter 4, I studied whether transfer entropy (TE) measurements can accurately infer the flow of information in neural circuits, in particular in motion detection and sound localization circuits. I addressed the question using different approaches. First, I calculated the accuracy of TE measurements for different types of logic gates and used their frequencies in the neural circuits. Then, I used a different method in which TE serves as a proxy to infer information flow; in this approach, non-zero values of TE are taken as evidence of a causal relation between two neurons. I then generated receiver operating characteristic (ROC) curves for each circuit. I also created the receptive fields and influence maps of each network, using the connections and the logic of the network, in order to have a "ground-truth" model of information flow in the networks. These various approaches and analysis methods showed that the accuracy of TE measurements can be very sensitive to the type of circuit (the task it is performing), its connectivity structure, and its size. Furthermore, I showed that creating a ground-truth model can be very hard even if all the information about the network and its functionality is accessible. Finally, I showed that even in the absence of empirical limitations, inferring causation and information flow can be very challenging.
I then generated the receiver operating characteristic (ROC) curves for each circuit. I also created the receptive fields and influence maps of each network, using the connections and the logic of the network, in order to have a “ground-truth” model of information flow in the networks. These various approaches and analysis methods showed that the accuracy of TE measurements can be very sensitive to the type of circuit (the task it is performing), its connectivity structure, and its size. Furthermore, I showed that creating a ground-truth model can be a very hard task even if all the information about the network and its functionality is accessible. Finally, I showed that even in the absence of empirical limitations, inferring causation and information flow can be very challenging. For example, in our analysis we used neural recordings in the absence of any noise, and we were able to record from every neuron in the circuit. Furthermore, we had access to the recordings of the brain for all possible sensory patterns. This reminds us again that identifying causal relations in a system is a very hard problem, as acknowledged before (see for example [145]). The problem becomes even harder when the subject matter is the nervous system, which is arguably the most complex system known to us. It also emphasizes that perhaps one of the missing components in the study of the brain is information theory, and that one of the future breakthroughs in the field may follow the discovery of a new information-theoretic method that addresses causality.

6.4 Event Duration Perception in Rhythmic Auditory Stimuli

In chapter 5, I studied attentional entrainment as a model of event duration perception in rhythmic auditory stimuli. In particular, I tested two competing models of time perception in relation to attention, Scalar Expectancy Theory (SET) and Dynamic Attending Theory (DAT). In this project, I evolved Markov Brains that are able to perform a duration judgment task in a rhythmic sequence of tones. These evolved brains can be considered as participants in a psychophysical experiment. We also performed psychometric tests that showed that these evolved brains have the same perceptual characteristics as human subjects [183]. For example, the discrimination threshold of evolved Markov Brains complies with Weber's law, and their point of subjective equality reveals trends similar to those of human subjects. I then tested the evolved brains with out-of-rhythm tones that they had not experienced during evolution. The psychometric results of these tests showed duration misperceptions similar to those experienced by human subjects. In this project, I used a new method to analyze the computations and algorithms of Markov Brains: their state-space transitions, which reveal their computations as well as which parts of the sensory input these brains attend to and which parts they ignore.
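One way to make this notion of attention concrete is to ask, at every time step, whether flipping a given sensory bit would change the brain's next state: bits that cannot change the state are, at that moment, being ignored. The sketch below illustrates this idea; the brain_step function (mapping the current hidden state and the sensory input to the next hidden state) is a hypothetical interface, and the sketch is only an illustration of the principle, not the exact analysis pipeline used in chapter 5.

def attention_profile(brain_step, states, inputs):
    # For each time step, count how many sensory bits the brain is sensitive to,
    # i.e., bits whose flip would change the next hidden state.
    profile = []
    for state, stimulus in zip(states, inputs):
        baseline = brain_step(state, tuple(stimulus))
        sensitive = 0
        for i in range(len(stimulus)):
            perturbed = list(stimulus)
            perturbed[i] = 1 - perturbed[i]              # flip one sensory bit
            if brain_step(state, tuple(perturbed)) != baseline:
                sensitive += 1
        profile.append(sensitive)
    return profile                                        # higher values = more attention paid

Plotting such a profile over the course of a trial gives a time-resolved picture of when the brain is sensitive to its input.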
The results of this analysis showed that, unlike what SET posits, the attention distributions in Markov Brains are not uniform in time at all. Furthermore, I observed that the attention distributions during the trials were also not in accordance with DAT, which predicts that attention peaks at the beginning of the rhythmic tones. Rather, evolved Markov Brains paid less attention to the beginning of the tones, and their attention peaks coincided with the end point of the tone. These results suggest a new model of dynamic attending or attentional entrainment, in which attention reaches its highest point when the stimulus is potentially the most informative and drops when the stimulus is predictable. This new model can also be generalized to other sensory modalities, such as vision. The generalized model would suggest that, similar to auditory attention peaking at specific time points, visual attention is focused on those parts of the visual field that are predicted to be most informative. It would be interesting to design experiments that test this new model of attention with human subjects, using both auditory and visual stimuli.

Aside from the proposed model of attention, the more important conclusion is that we can test existing models of cognition using computational evolutionary methods, suggest modifications to them, and even propose new models. These new models make predictions that can be tested in biological brains. To wrap up, I would like to suggest a possible future project that is in line with the research in this thesis and, in particular, is inspired by the idea discussed above, namely that visual attention is focused on the most informative parts of an image. For this work, I propose to evolve Markov Brains that perform an image classification task via visual saccades driven by the information content (Shannon information [167]) of the image rather than by image saliency. The proposed project seems promising to me based on my experience in the field of computational cognitive science and with Markov Brains as the platform.

6.5 Information-Driven Image Classification via Saccadic Eye Movements

Here, I propose a project as a possible direction to pursue in the future, which involves “information-driven visual attention in image classification”. This proposal is in part inspired by the work of Olson et al. [142], who conducted a series of experiments to evolve Markov Brains that performed active image recognition of hand-written numerals in the MNIST dataset [103]. Unlike most widely used image recognition methods, in which the classifier network views the entire image and does not actively change the temporal or spatial structure of the data it receives, Olson et al. evolved classifiers that could view only a subset of the pixels in the image (a 3×3 sub-image) and could choose which part of the image to view. As a result, the agents viewed a temporal sequence of sub-images rather than seeing the entire image at once, and were evolved to perform the classification task by navigating and scanning the sub-images in a finite number of time steps. They ran 30 replicates of the evolutionary experiment, namely 30 populations, for nearly 250k generations, and used only 1,000 images from the MNIST training dataset (the original dataset consists of 60,000 images). They presented the results of their most successful run, in which the agent with the highest performance achieved only 76% accuracy on the testing dataset. The accuracy they achieved is significantly lower than that of other machine learning methods such as K-nearest neighbors [88], support vector machines (SVM) [42], ANNs [126], and CNNs [36], which can be attributed to multiple factors:
1. According to the data presented in their own paper [142], using a smaller portion of the dataset can result in lower accuracy. They presented results of training a decision tree model [147] on the smaller dataset, which only achieved 88.5% accuracy.
2. One of the configuration decisions they made in setting up the system was to place the agent at a random position in the image, which made the task much more difficult than it needed to be. The evidence for this speculation is that in the early stages of evolution the agent with the highest performance evolves to find the center point at the top edge of the image and uses it as a reference point from which to start its navigation through the image.
3. Another configuration factor that may have prevented the final performance from reaching higher levels is that the saccadic movements were limited to translations from a given position in the image to its neighboring points, whereas biological saccadic eye movements allow transitions from a point in the visual field to any other point.

Here, I propose modifications to the experimental setup for the evolution of image classification via saccadic eye movements that address some of the issues discussed in [142]. I also suggest a different approach that attempts to improve the classification performance on the MNIST dataset. As mentioned before, the idea behind this approach relies mainly on guiding visual attention, or saccades, by the information content of the images. This captures how, for example, humans navigate their attention via saccades through salient regions of an image to recognize faces [75]. First, I investigate the entropy content of each pixel of the images in the MNIST dataset. The entropy of each pixel is calculated as

H(X) = -\sum_i p(x_i) \log p(x_i),    (6.1)

where the x_i are the possible states a pixel can take on and the probabilities are estimated across the images of the dataset. Figure 6.1(A) shows the entropy content of pixels (in bits) for the images in the MNIST dataset. I used the same dataset as [142], in which images were converted from greyscale to black and white; therefore, the possible states of a pixel are either 0 or 1. Similarly, I can explore the entropy of the variable C, which denotes the class to which an image belongs. In the MNIST dataset, C can take on the values c_i = 0, 1, ..., 9. Each class in the MNIST dataset has an equal number of images, meaning there are 6,000 images for each digit (the dataset contains 60,000 images in total). As a consequence, the probability distribution of the c_i is uniform, and thus the entropy is H(C) = \log(10). Now we can link the entropy H(C) to the entropy content of images and their pixels. For example, we can calculate how much the entropy is reduced by looking at the value of a particular pixel. We first calculate the conditional entropy H(C|X), which quantifies the uncertainty in the class variable C given the value of a particular pixel X:

H(C|X) = -\sum_{i,j} p(c_i, x_j) \log p(c_i | x_j).    (6.2)

Then, we can calculate how much the entropy of C is reduced given the state of a particular pixel X:

I(C : X) = H(C) - H(C|X),    (6.3)

that is, the information shared between the class variable C and a particular pixel X. Figure 6.1(B) shows the information I(C : X) (in bits) shared between the image classes and each pixel. For a uniform probability distribution the entropy is maximal (the uncertainty is highest), but as the entropy decreases, for example by subtracting a conditional entropy, the probability distribution becomes non-uniform (see [4]). In other words, the I(C : X) values in figure 6.1(B) show how much the probability distribution of C can be distorted away from a uniform distribution given the value of a single pixel. Figure 6.1(C) shows the two probability distributions of the class variable C given that the value of the pixel in the center (the pixel with the highest information in figure 6.1(B)) is either 0 or 1. Similarly, the entropy of C can be reduced further given the values of other pixels in the image, viewing one pixel of the image at a time.
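Equations (6.1)-(6.3) translate directly into a short calculation. The sketch below computes H(X) and I(C : X) for every pixel of a binarized MNIST array; the assumed data layout (images as an N×784 array of 0/1 pixels, labels as an N-vector of integer digits) is an assumption made for the example, not a description of the original scripts.

import numpy as np

def entropy(p):
    # Shannon entropy (in bits) of a probability vector, ignoring zero entries.
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def pixel_entropy_and_information(images, labels):
    # images: (N, 784) array of binary pixels; labels: (N,) integer array of digits 0-9.
    # Returns per-pixel H(X) (Eq. 6.1) and I(C:X) = H(C) - H(C|X) (Eqs. 6.2-6.3).
    n, n_pixels = images.shape
    h_x = np.zeros(n_pixels)
    i_cx = np.zeros(n_pixels)
    h_c = entropy(np.bincount(labels, minlength=10) / n)       # H(C) = log2(10) for MNIST
    for j in range(n_pixels):
        pixel = images[:, j]
        p_on = pixel.mean()
        h_x[j] = entropy(np.array([p_on, 1.0 - p_on]))
        h_c_given_x = 0.0                                      # H(C|X_j) = sum_v p(v) H(C|X_j=v)
        for v in (0, 1):
            mask = pixel == v
            if mask.any():
                p_class = np.bincount(labels[mask], minlength=10) / mask.sum()
                h_c_given_x += mask.mean() * entropy(p_class)
        i_cx[j] = h_c - h_c_given_x
    return h_x, i_cx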
Furthermore, these results suggest that it would be more efficient to first scan the pixels that reduce the entropy the most, rather than scanning pixels in a random order. We can therefore design an experiment in which the agents saccade through the sequence of pixels with the highest information content. Furthermore, in the proposed design the agents will see a group of adjacent pixels, for example a 2×2 sub-image, at every time step instead of seeing one pixel at a time.

Figure 6.1: The images in the dataset are 28×28 pixels. (A) The entropy content (in bits) of MNIST dataset images per pixel, H(X). (B) The information shared between each pixel and the class of the image, I(C : X). (C) The probability distributions of the class variable C given that the pixel in the center is 0 or 1.

6.6 Experimental Setup

I propose to evolve Markov Brains that see parts of an image through a 2×2 or 3×3 window at each time step and classify the image at the end of the sequence of saccades. The experimental setup proposed here is very similar to the experiments of [142], except that the saccade positions are predetermined based on the information content of the images in the dataset. More specifically, the agents will sense the sub-images at the positions specified by the information content and deliver their classification decision at the end.

6.6.1 Proof of concept

I also investigated whether the proposed approach can potentially improve image classification performance on the MNIST dataset. To this end, I trained ANNs (multi-layered perceptrons, MLPs) to perform image classification on the MNIST dataset in which the images fed to the network are partially masked. The masking was performed based on two different criteria: 1) images are masked based on the entropy content of the pixels/sub-images across the dataset (see figure 6.1(A)), meaning the network only sees the sub-images with high entropy content, and 2) images are masked based on the entropy reduction in C given a sub-image (see figure 6.1(B)), meaning the network only sees sub-images with high I(C : X). Figures 6.2(A) and (B) show the accuracy of the trained ANNs on images masked based on the entropy of sub-images and on the information shared between the class C and the sub-images, respectively. In these plots, the x-axis shows the threshold by which the sub-images were masked; for example, the value 0.2 on the x-axis in figure 6.2(A) denotes runs in which all sub-images with H(X) less than 0.2 were masked. The y-axis on the left shows the accuracy of the network on the testing dataset, and the y-axis on the right shows what percentage of the image was visible to the network. When images are masked based on the entropy of sub-images, the network still achieves about 80% accuracy on the testing dataset with only around 20% of the image visible; when masking is based on I(C : X), it achieves around 80% accuracy with only 15% of the image visible. These results suggest that the proposed information-driven design has the potential to substantially improve both the experimental setup and the final classification results.
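A sketch of this proof-of-concept procedure is given below, using scikit-learn's MLPClassifier. The threshold value, the network size, and the pixel_score array (either H(X) or I(C : X) from the previous sketch) are illustrative choices, not the exact configuration used to produce figure 6.2.

import numpy as np
from sklearn.neural_network import MLPClassifier

def masked_accuracy(train_x, train_y, test_x, test_y, pixel_score, threshold):
    # Zero out every pixel whose score (H(X) or I(C:X)) falls below `threshold`,
    # train an MLP on the masked images, and report test accuracy together with
    # the fraction of the image left visible (the two y-axes of figure 6.2).
    visible = pixel_score >= threshold           # boolean mask over the 784 pixels
    clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=50)
    clf.fit(train_x * visible, train_y)          # broadcasting zeroes out masked pixels
    accuracy = clf.score(test_x * visible, test_y)
    return accuracy, float(visible.mean())

Sweeping the threshold over a range of values traces out an accuracy-versus-visibility trade-off of the kind summarized in figure 6.2.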
Figure 6.2: The performance of ANNs trained on masked images. Maskings were based on (A) the entropy content of sub-images in the dataset, and (B) the information shared between C and the sub-images.
BIBLIOGRAPHY
[1] Adami, C. Digital genetics: unravelling the genetic basis of evolution. Nature Reviews Genetics 7 (2006), 109. [2] Adami, C. What do robots dream of? Science 314 (2006), 1093–1094. [3] Adami, C. The use of information theory in evolutionary biology. Annals of the New York Academy of Sciences 1256, 1 (2012), 49–65. [4] Adami, C. What is information? Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374, 2063 (2016), 20150230. [5] Addyman, C., French, R. M., and Thomas, E. Computational models of interval timing. Current Opinion in Behavioral Sciences 8 (2016), 140–146. [6] Adelson, E. H., and Bergen, J. R. Spatiotemporal energy models for the perception of motion. J Opt Soc Am A 2 (1985), 284–299. [7] Ahn, Y. Y., Jeong, H., and Kim, B. J. Wiring cost in the organization of a biological neuronal network. Physica A 367 (2006), 531–537. [8] Albantakis, L., Hintze, A., Koch, C., Adami, C., and Tononi, G. Evolution of integrated causal structures in animats exposed to environments of increasing complexity. PLoS Comput Biol 10 (2014), e1003966. [9] Albantakis, L., Hintze, A., Koch, C., Adami, C., and Tononi, G. Evolution of integrated causal structures in animats exposed to environments of increasing complexity. PLoS Comput Biol 10 (2014), e1003966. [10] Albantakis, L., Marshall, W., Hoel, E., and Tononi, G. What caused what? a quantitative account of actual causation using dynamical causal networks. Entropy 21, 5 (2019), 459. [11] Ay, N., and Polani, D. Information flows in causal networks. Advances in complex systems 11, 01 (2008), 17–41. [12] Ayala, F. J., and Campbell, C. A. Frequency-dependent selection. Ann Rev Ecol System 5 (1974), 115–138. [13] Barlow, H., and Levick, W. R. The mechanism of directionally selective units in rabbit’s retina. J Physiol 178 (1965), 477–504. [14] Barnett, L., Barrett, A. B., and Seth, A. K. Granger causality and transfer entropy are equivalent for Gaussian variables. Phys Rev Lett 103 (2009), 238701. [15] Bates, D. M., and Watts, D. G. Nonlinear Regression Analysis and its Applications, vol. 2. Wiley New York, 1988. [16] Beer, R. The dynamics of active categorical perception in an evolved model agent. Adaptive Behavior 11 (2003), 209–243. [17] Blount, Z. D., Borland, C. Z., and Lenski, R. E. Historical contingency and the evolution of a key innovation in an experimental population of Escherichia coli. Proc Natl Acad Sci U S A 105 (2008), 7899–7906. [18] Bolhuis, J. J., Brown, G. R., Richardson, R. C., and Laland, K. N. Darwin in mind: New opportunities for evolutionary psychology. PLoS biology 9, 7 (2011), e1001109. [19] Bongard, J., Zykov, V., and Lipson, H.
Resilient machines through continuous self-modeling. Science 314 (2006), 1118–1121. [20] Borst, A., and Egelhaaf, M. Principles of visual motion detection. Trends Neurosci 12 (1989), 297–306. [21] Borst, A., and Egelhaaf, M. Principles of visual motion detection. Trends in neurosciences 12, 8 (1989), 297–306. [22] Borst, A., and Helmstaedter, M. Common circuit design in fly and mammalian motion vision. Nat Neurosci 18 (2015), 1067. [23] Bossomaier, T., Barnett, L., Harré, M., and Lizier, J. T. An Introduction to Transfer Entropy. Springer International, Cham, Switzerland, 2015. [24] Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010. Springer, 2010, pp. 177–186. [25] Branson, K., Robie, A. A., Bender, J., Perona, P., and Dickinson, M. H. High-throughput ethomics in large groups of drosophila. Nature Methods 6 (2009), 451–457. [26] Buhusi, C. V., and Meck, W. H. What makes us tick? Functional and neural mechanisms of interval timing. Nat Rev Neurosci 6 (2005), 755. [27] Bunge, M. A. Causality: The place of the causal principle in modern science. Harvard University Press, Cambridge, MA, 1959. [28] Buonomano, D. V., and Mauk, M. D. Neural network model of the cerebellum: temporal discrimination and the timing of motor responses. Neural Computation 6 (1994), 38–55. [29] Buzsáki, G. Rhythms of the Brain. Oxford University Press, New York, NY, 2006. [30] C G, N., LaBar, T., Hintze, A., and Adami, C. Origin of life in a digital microcosm. Phil Trans Roy Soc A 375 (2017), 20160350. [31] Carnevale, N. T., and Hines, M. L. The NEURON book. Cambridge University Press, 2006. [32] Casini, L., and Macar, F. Effects of attention manipulation on judgments of duration and of intensity in the visual modality. Mem Cognit 25 (1997), 812–8. 138 [33] Chalupka, K., Dickinson, M., and Perona, P. Generalized regressive motion: A visual cue to collision. arXiv preprint arXiv:1510.07573 (2015). [34] Chapman, S., Knoester, D., Hintze, A., and Adami, C. Evolution of an artificial visual cortex for image recognition. In Advances in Artificial Life, ECAL 12 (2013), pp. 1067–1074. [35] Chesson, P. Mechanisms of maintenance of species diversity. Ann Rev Ecol System 31 (2000), 343–366. [36] Ciregan, D., Meier, U., and Schmidhuber, J. Multi-column deep neural networks for image classification. In 2012 IEEE conference on computer vision and pattern recognition (2012), IEEE, pp. 3642–3649. [37] Coull, J. T., Vidal, F., Nazarian, B., and Macar, F. Functional anatomy of the attentional modulation of time estimation. Science 303 (2004), 1506–1508. [38] Cover, T. M., and Thomas, J. A. Elements of Information Theory. John Wiley, New York, NY, 1991. [39] Cross, F. R., Buchler, N. E., and Skotheim, J. M. Evolution of networks and sequences in eukaryotic cell cycle control. Phil Trans Roy Soc B 366 (2011), 3532–3544. [40] Darwin, C. The Descent of Man, and Selection in Relation to Sex. John Murray, London, 1871. [41] Darwin, C. On the Origin of Species By Means of Natural Selection. Murray, London, 1959. [42] Decoste, D., and Schölkopf, B. Training invariant support vector machines. Machine learning 46, 1-3 (2002), 161–190. [43] Duda, R. O., Hart, P. E., et al. Pattern classification and scene analysis, vol. 3. Wiley New York, 1973. [44] Durstewitz, D. Self-organizing neural integrator predicts interval times through climbing activity. Journal of Neuroscience 23 (2003), 5342–5353. [45] Edlund, J. A., Chaumont, N., Hintze, A., Koch, C., Tononi, G., and Adami, C. 
Integrated information increases with fitness in the evolution of animats. PLoS Comput Biol 7 (2011), e1002236. [46] Engstrom, L., Tran, B., Tsipras, D., Schmidt, L., and Madry, A. A rotation and a translation suffice: Fooling cnns with simple transformations. [47] Fechner, G. T. Elemente der Psychophysik, vol. 2. Breitkopf und Härtel, Leipzig, 1860. [48] Floreano, D., Dürr, P., and Mattiussi, C. Neuroevolution: from architectures to learning. Evolutionary intelligence 1, 1 (2008), 47–62. [49] Fogel, D. B., Fogel, L. J., and Porto, V. Evolving neural networks. Biological cybernetics 63, 6 (1990), 487–493. 139 [50] Fortuna, M. A., Zaman, L., Ofria, C., and Wagner, A. The genotype-phenotype map of an evolving digital organism. PLoS Comput Biol 13 (2017), e1005414. [51] Gauci, J., and Stanley, K. O. Autonomous evolution of topographic regularities in artificial neural networks. Neural computation 22, 7 (2010), 1860–1898. [52] Getty, D. J. Discrimination of short temporal intervals: A comparison of two models. Attention, Perception, & Psychophysics 18 (1975), 1–8. [53] Gibbon, J. Scalar expectancy theory and Weber’s law in animal timing. Psychol. Rev. 84 (1977), 279–325. [54] Gibbon, J., Church, R. M., and Meck, W. H. Scalar timing in memory. Ann NY Acad Sci 423 (1984), 52. [55] Gilbert, C. D., and Sigman, M. Brain states: Top-down influences in sensory processing. Neuron 54 (2007), 677–96. [56] Goaillard, J.-M., Taylor, A. L., Schulz, D. J., and Marder, E. Functional consequences of animal-to-animal variation in circuit parameters. Nat Neurosci 12 (2009), 1424. [57] González-González, A., Hug, S. M., Rodríguez-Verdugo, A., Patel, J. S., and Gaut, B. S. Adaptive mutations in RNA polymerase and the transcriptional terminator Rho have similar effects on escherichia coli gene expression. Mol Biol Evol 34 (2017), 2839–2855. [58] Good, B. H., McDonald, M. J., Barrick, J. E., Lenski, R. E., and Desai, M. M. The dynamics of molecular evolution over 60,000 generations. Nature 551 (2017), 45. [59] Gould, S. J. Wonderful Life: the Burgess Shale and the Nature of History. WW Norton & Company, 1990. [60] Granger, C. W. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society (1969), 424–438. [61] Grondin, S. From physical time to the first and second moments of psychological time. Psychological Bulletin 127 (2001), 22. [62] Habenschuss, S., Jonke, Z., and Maass, W. Stochastic computations in cortical microcircuit models. PLoS Comput Biol 9 (2013), e1003311. [63] Halpern, J. Y. Actual causality. MiT Press, 2016. [64] Hass, J., and Durstewitz, D. Neurocomputational models of time perception. In Neurobiology of Interval Timing, H. Merchant and V. de Lafuente, Eds. Springer, New York and Heidelberg, 2014, pp. 49–71. [65] Hassenstein, B., and Reichardt, W. Systemtheoretische Analyse der Zeit-, Reihenfolgen- und Vorzeichenauswertung bei der Bewegungsperzeption des Rüsselkäfers Chlorophanus. Z Naturforsch B 11 (1956), 513–524. 140 [66] Hawkins, J., and Ahmad, S. Why neurons have thousands of synapses, a theory of sequence memory in neocortex. Frontiers in neural circuits 10 (2016), 23. [67] Hawkins, J., and Blakeslee, S. On Intelligence. Henry Holt and Co., New York, NY, 2004. [68] Hilgetag, C. C., and Kaiser, M. Clustered organization of cortical connectivity. Neuroinfor- matics 2 (2004), 353–60. [69] Hintze, A., Edlund, J. A., Olson, R. S., Knoester, D. 
B., Schossau, J., Albantakis, L., Tehrani-Saleh, A., Kvam, P., Sheneman, L., Goldsby, H., et al. Markov brains: A technical introduction. arXiv:1709.05601 (2017). [70] Hintze, A., Kirkpatrick, D., and Adami, C. The structure of evolved representations across different substrates for artificial intelligence. arXiv preprint arXiv:1804.01660 (2018). [71] Hintze, A., and Miromeni, M. Evolution of autonomous hierarchy formation and mainte- nance. In ALIFE 14: The Fourteenth Conference on the Synthesis and Simulation of Living Systems (2014), pp. 366–367. [72] Hopcroft, J. E., and Ullman, J. D. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Longman, Boston, MA, 1979. [73] Hope, E. A., Amorosi, C. J., Miller, A. W., Dang, K., Heil, C. S., and Dunham, M. J. Experimental evolution reveals favored adaptive routes to cell aggregation in yeast. Genetics 206 (2017), 1153–1167. [74] Itti, L., and Koch, C. Computational modelling of visual attention. Nat Rev Neurosci 2 (2001), 194. [75] Itti, L., and Koch, C. Computational modelling of visual attention. Nature reviews neuro- science 2, 3 (2001), 194–203. [76] James, R. G., Barnett, N., and Crutchfield, J. P. Information flows? A critique of transfer entropies. Phys. Rev. Lett. 116 (2016), 238701. [77] Janzing, D., Balduzzi, D., Grosse-Wentrup, M., Schölkopf, B., et al. Quantifying causal influences. The Annals of Statistics 41, 5 (2013), 2324–2358. [78] Jaramillo, S., and Zador, A. M. The auditory cortex mediates the perceptual effects of acoustic temporal expectation. Nat Neurosci 14 (2011), 246–51. [79] Jeffress, L. A. A place theory of sound localization. Journal of comparative and physiological psychology 41, 1 (1948), 35. [80] Jo, J., and Bengio, Y. Measuring the tendency of cnns to learn surface statistical regularities. arXiv preprint arXiv:1711.11561 (2017). [81] Jones, M. R. Time, our lost dimension: Toward a new theory of perception, attention, and memory. Psychol Rev 83 (1976), 323–55. 141 [82] Jones, M. R., and Boltz, M. Dynamic attending and responses to time. Psychol Rev 96 (1989), 459. [83] Jones, M. R., Moynihan, H., MacKenzie, N., and Puente, J. Temporal aspects of stimulus- driven attending in dynamic arrays. Psychol Sci 13 (2002), 313–9. [84] Joshi, N. J., Tononi, G., and C., K. The minimal complexity of adapting agents increases with fitness. PLoS Comput Biol 9 (2013), e1003111. [85] Juel, B. E., Comolatti, R., Tononi, G., and Albantakis, L. When is an action caused from within? quantifying the causal chain leading to actions in simulated agents. arXiv:1904.02995, 2019. [86] Karmarkar, U. R., and Buonomano, D. V. Timing in the absence of clocks: encoding time in neural network states. Neuron 53 (2007), 427–438. [87] Kayser, C., Petkov, C. I., Lippert, M., and Logothetis, N. K. Mechanisms for allocating auditory attention: an auditory saliency map. Current Biology 15 (2005), 1943–1947. [88] Keysers, D., Deselaers, T., Gollan, C., and Ney, H. Deformation models for image recog- nition. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 8 (2007), 1422–1435. [89] Kimura, M., and Crow, J. F. The number of alleles that can be maintained in a finite population. Genetics 49 (1964), 725–738. [90] Kirkpatrick, D., and Hintze, A. Augmenting neuro-evolutionary adaptation with repre- sentations does not incur a speed accuracy trade-off. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (2019), pp. 177–178. [91] Kirkpatrick, D., and Hintze, A. 
The role of ambient noise in the evolution of robust mental representations in cognitive systems. In Artificial Life Conference Proceedings (2019), MIT Press, pp. 432–439. [92] Knill, D. C., and Richards, W. Perception as Bayesian Inference. Cambridge University Press, Cambridge, Mass., 1996. [93] Kriegeskorte, N., and Douglas, P. K. Cognitive computational neuroscience. Nat Neurosci 21, 9 (Sep 2018), 1148–1160. [94] Kriegeskorte, N., and Douglas, P. K. Cognitive computational neuroscience. Nat Neurosci 21 (2018), 1148. [95] Kriegeskorte, N., and Kievit, R. A. Representational geometry: integrating cognition, computation, and the brain. Trends in Cognitive Sciences 17 (2013), 401–412. [96] Kriegeskorte, N., Mur, M., and Bandettini, P. A. Representational similarity analysis- connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience 2 (2008), 4. 142 [97] Kvam, P., Cesario, J., Schossau, J., Eisthen, H., and Hintze, A. Computational evolution of decision-making strategies. In Proceedings 37th Annual Meeting of the Cognitive Science Society (Austin, TX, 2015), Noelle, D. C. et al., Ed., Cognitive Science Society. [98] Kvam, P., Cesario, J., Schossau, J., Eisthen, H., and Hintze, A. Computational evolution of decision-making strategies. In Proc. 37th Annual Conf. of the Cognitive Science Society, D. Noelle, R. Dale, A. Warlaumont, J. Yoshimi, T. Matlock, C. Jennings, and P. P. Maglio, Eds. Cognitive Science Society, Austin, TX, 2015, pp. 1225–1230. [99] LaBar, T., and Adami, C. Evolution of drift robustness in small populations. Nature Comm 8 (2017), 1012. [100] LaBar, T., Hintze, A., and Adami, C. Evolvability tradeoffs in emergent digital replicators. Artificial Life 22 (2016), 483–498. [101] Large, E. W., and Jones, M. R. The dynamics of attending: How people track time-varying events. Psychol. Rev. 106 (1999), 119 – 159. [102] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature 521, 7553 (2015), 436–444. [103] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324. [104] Lenski, R. E., Ofria, C., Pennock, R. T., and Adami, C. The evolutionary origin of complex features. Nature 423, 6936 (2003), 139–144. [105] Lewontin, R. C., and Hubby, J. L. A molecular approach to the study of genic heterozygosity in natural populations. II. Amount of variation and degree of heterozygosity in natural populations of Drosophila pseudoobscura . Genetics 54 (1966), 595–609. [106] Lind, P. A., Farr, A. D., and Rainey, P. B. Experimental evolution reveals hidden diversity in evolutionary pathways. Elife 4 (2015), e07074. [107] Lizier, J. T., and Prokopenko, M. Differentiating information transfer and causal effect. The European Physical Journal B 73, 4 (2010), 605–615. [108] Macar, F., Grondin, S., and Casini, L. Controlled attention sharing influences time estimation. Mem Cognit 22 (1994), 673–86. [109] Macmillan, N. A., and Creelman, C. D. Detection theory: A user’s guide. Psychology press, 2004. [110] Maloney, E. S. Chapman Piloting, Seamanship and Small Boat Handling. Hearst Marine Books, 1989. [111] Marder, E. Variability, compensation, and modulation in neurons and circuits. Proc Natl Acad Sci U S A 108, Suppl 3 (2011), 15542–15548. [112] Markram, H. The blue brain project. Nature Reviews Neuroscience 7, 2 (2006), 153–160. 143 [113] Marr, D., and Ullman, S. Directional selectivity and its use in early visual processing. Proc R Soc Lond B 211 (1981), 151–180. 
[114] Marstaller, L., Hintze, A., and Adami, C. The evolution of representation in simple cognitive networks. Neural computation 25, 8 (2013), 2079–2107. [115] Marstaller, L., Hintze, A., and Adami, C. The evolution of representation in simple cognitive networks. Neural Comput 25 (2013), 2079–2107. [116] Marstaller, L., Hintze, A., and Adami, C. The evolution of representation in simple cognitive networks. Neural Comput 25 (2013), 2079–2107. [117] Matell, M. S., and Meck, W. H. Cortico-striatal circuits and interval timing: coincidence detection of oscillatory processes. Cognitive Brain Research 21 (2004), 139–170. [118] Matthews, W. J., and Meck, W. H. Temporal cognition: Connecting subjective time to perception, attention, and memory. Psych Bull 142 (2016), 865. [119] Mazzucato, L., La Camera, G., and Fontanini, A. Expectation-induced modulation of metastable activity underlies faster coding of sensory stimuli. Nat Neurosci 22 (2019), 787–796. [120] McAuley, J. D. Perception of time as phase: Toward an adaptive-oscillator model of rhythmic pattern processing. PhD thesis, Indiana University, Indianapolis, IN, 1995. [121] McAuley, J. D., and Fromboluti, E. K. Attentional entrainment and perceived event duration. Philos Trans R Soc Lond B Biol Sci 369 (2014), 20130401. [122] McAuley, J. D., and Jones, M. R. Modeling effects of rhythmic context on perceived duration: A comparison of interval and entrainment approaches to short-interval timing. J Exp Psychol Hum Percept Perform 29 (2003), 1102–25. [123] McAuley, J. D., Jones, M. R., Holub, S., Johnston, H. M., and Miller, N. S. The time of our lives: Life span development of timing and event tracking. J Exp Psychol Gen 135 (2006), 348–67. [124] McAuley, J. D., and Kidd, G. R. Effect of deviations from temporal expectations on tempo discrimination of isochronous tone sequences. J Exp Psychol Hum Percept Perform 24 (1998), 1786–800. [125] McFarland, D. J. Decision making in animals. Nature 269 (1977), 15–21. [126] Meier, U., Ciresan, D. C., Gambardella, L. M., and Schmidhuber, J. Better digit recognition with a committee of simple neural nets. In 2011 International Conference on Document Analysis and Recognition (2011), IEEE, pp. 1250–1254. [127] Michalewicz, Z. Genetic Algorithms + Data Strucures = Evolution Programs. Springer Verlag, New York, 1996. 144 [128] Middlebrooks, J. C., and Green, D. M. Sound localization by human listeners. Annual review of psychology 42, 1 (1991), 135–159. [129] Miller, J. E., Carlson, L. A., and McAuley, J. D. When what you hear influences when you see: Listening to an auditory rhythm influences the temporal allocation of visual attention. Psychol Sci 24 (2013), 11–8. [130] Moore, B. C. An introduction to the psychology of hearing. Brill, 2012. [131] Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (2016), pp. 2574–2582. [132] Moray, N. Where is capacity limited? a survey and a model. Acta Psychologica 27 (1967), 84–92. [133] Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition (2015), pp. 427–436. [134] Nobre, A. C., Correa, A., and Coull, J. T. The hazards of time. Curr Opin Neurobiol 17 (2007), 465–70. [135] Nosil, P. Ecological Speciation. Oxford University Press, Oxford (UK), 2012. 
[136] Oizumi, M., Albantakis, L., and Tononi, G. From the phenomenology to the mechanisms of consciousness: Integrated information theory 3.0. PLoS Comput Biol 10 (2014), e1003588. [137] Olson, R. S., Haley, P. B., Dyer, F. C., and Adami, C. Exploring the evolution of a trade-off between vigilance and foraging in group-living organisms. Royal Society open science 2, 9 (2015), 150135. [138] Olson, R. S., Haley, P. B., Dyer, F. C., and Adami, C. Exploring the evolution of a trade-off between vigilance and foraging in group-living organisms. R Soc Open Sci 2 (2015), 150135. [139] Olson, R. S., Hintze, A., Dyer, F. C., Knoester, D. B., and Adami, C. Predator confusion is sufficient to evolve swarming behaviour. J R Soc Interface 10 (2013), 20130305. [140] Olson, R. S., Hintze, A., Dyer, F. C., Knoester, D. B., and Adami, C. Predator confusion is sufficient to evolve swarming behaviour. Journal of The Royal Society Interface 10 (2013), 20130305. [141] Olson, R. S., Knoester, D. B., and Adami, C. Critical interplay between density-dependent predation and evolution of the selfish herd. In Proceedings of the 15th annual conference on Genetic and evolutionary computation (2013), pp. 247–254. [142] Olson, R. S., Moore, J. H., and Adami, C. Evolution of active categorical image classification via saccadic eye movement. In International Conference on Parallel Problem Solving from Nature (2016), Springer, pp. 581–590. 145 [143] Palmer, S. E. Vision science: Photons to Phenomenology. MIT Press, Cambridge, MA, 1999. [144] Paul, L. A., Hall, N., and Hall, E. J. Causation: A user’s guide. Oxford University Press, 2013. [145] Pearl, J. Causality: models, reasoning and inference, vol. 29. Springer, 2000. [146] Pearl, J. Causality. Cambridge University Press, 2009. [147] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in python. the Journal of machine Learning research 12 (2011), 2825–2830. [148] Phillips, W. A., and Singer, W. In search of common foundations for cortical computation. Behavioral and Brain Sciences 20, 4 (1997), 657–683. [149] Pickles, J. An introduction to the physiology of hearing. Brill, 2013. [150] Poirazi, P., Brannon, T., and Mel, B. W. Pyramidal neuron as two-layer neural network. Neuron 37, 6 (2003), 989–999. [151] Polsky, A., Mel, B. W., and Schiller, J. Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7, 6 (2004), 621–627. [152] Prinz, A. A., Bucher, D., and Marder, E. Similar network activity from disparate circuit parameters. Nat Neurosci 7 (2004), 1345. [153] Proust, M. A la Recherche du Temp Perdu. Gallimard, Nouvelle Revue Française, Paris, 1919-1927. [154] Qian, J., Hintze, A., and Adami, C. Colored motifs reveal computational building blocks in the C. elegans brain. PLoS ONE 6 (2011), e17013. [155] Raftery, A. E. Bayesian model selection in social research. Sociological Methodology 25 (1995), 111–164. [156] Rainey, P. B., Buckling, A., Kassen, R., and Travisano, M. The emergence and maintenance of diversity: insights from experimental bacterial populations. Trends Ecol Evol 15 (2000), 243–247. [157] Raman, K., and Wagner, A. The evolvability of programmable hardware. J R Soc Interface 8 (2010), 269–281. [158] Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision (2016), Springer, pp. 525–542. 
[159] Rich, E. L., and Wallis, J. D. Decoding subjective decisions from orbitofrontal cortex. Nat Neurosci 19 (2016), 973–980. 146 [160] Richelle, M., Lejeune, H., Defays, D., Greenwood, P., Macar, F., and Mantanus, H. Time in Animal Behaviour. Pergamon Press, New York, NY, 2013. [161] Rivoire, O., and Leibler, S. The value of information for populations in varying environments. Journal of Statistical Physics 142, 6 (2011), 1124–1166. [162] Rosenzweig, M. L. Species Diversity in Space and Time. Cambridge University Press, Cambridge (UK), 1995. [163] Schiff, W., Caviness, J. A., and Gibson, J. J. Persistent fear responses in rhesus monkeys to the optical stimulus of “looming". Science 136 (1962), 982–3. [164] Schossau, J., Adami, C., and Hintze, A. Information-theoretic neuro-correlates boost evolu- tion of cognitive systems. Entropy 18, 1 (2015), 6. [165] Schreiber, T. Measuring information transfer. Physical Review Letters 85, 2 (2000), 461. [166] Seidemann, E., Meilijson, I., Abeles, M., Bergman, H., and Vaadia, E. Simultaneously recorded single units in the frontal cortex go through sequences of discrete and stable states in monkeys performing a delayed localization task. J Neurosci 16 (1996), 752–68. [167] Shannon, C. E. A mathematical theory of communication. The Bell system technical journal 27, 3 (1948), 379–423. [168] Shannon, C. E. Communication theory of secrecy systems. Bell system technical journal 28, 4 (1949), 656–715. [169] Sheneman, L., and Hintze, A. Evolving autonomous learning in cognitive networks. Scientific reports 7, 1 (2017), 1–11. [170] Sheneman, L., Schossau, J., and Hintze, A. The evolution of neuroplasticity and the effect on integrated information. Entropy 21, 5 (2019), 524. [171] Shomrat, T., Graindorge, N., Bellanger, C., Fiorito, G., Loewenstein, Y., and Hochner, B. Alternative sites of synaptic plasticity in two homologous “fan-out fan-in" learning and memory networks. Curr Biol 21 (2011), 1773–1782. [172] Siclari, F., Baird, B., Perogamvros, L., Bernardi, G., LaRocque, J. J., Riedner, B., Boly, M., Postle, B. R., and Tononi, G. The neural correlates of dreaming. Nature Neuroscience 20 (2017), 872. [173] Sloss, A. N., and Gustafson, S. 2019 evolutionary algorithms review. Genetic Programming Theory and Practice XVII (2020), 307. [174] Sorrells, T. R., Booth, L. N., Tuch, B. B., and Johnson, A. D. Intersecting transcription networks constrain gene regulatory evolution. Nature 523 (2015), 361. [175] Sporns, O. Networks of the Brain. MIT Press, Cambridge, MA, 2011. [176] Stanley, K. O., and Miikkulainen, R. Evolving neural networks through augmenting topolo- gies. Evolutionary computation 10, 2 (2002), 99–127. 147 [177] Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., and Bialek, W. Entropy and information in neural spike trains. Phys Rev Lett 80 (1998), 197. [178] Su, J., Vargas, D. V., and Sakurai, K. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation 23, 5 (2019), 828–841. [179] Sun, J., and Bollt, E. M. Causation entropy identifies indirect influences, dominance of neighbors and anticipatory couplings. Physica D: Nonlinear Phenomena 267 (2014), 49– 57. [180] Sved, J. A., Reed, T. E., and Bodmer, W. F. The number of balanced polymorphisms that can be maintained in a natural population. Genetics 55 (1967), 469–481. [181] Taylor, M. B., Phan, J., Lee, J. T., McCadden, M., and Ehrenreich, I. M. Diverse genetic architectures lead to the same cryptic phenotype in a yeast cross. 
Nature Comm 7 (2016), 11669. [182] Tehrani-Saleh, A., and Adami, C. Can transfer entropy infer information flow in neuronal circuits for cognitive processing? Entropy 22, 4 (2020), 385. [183] Tehrani-Saleh, A., and Adami, C. Psychophysical tests reveal that evolved artificial brains perceive time like humans. In ALIFE 2021: The 2021 Conference on Artificial Life (2021), MIT Press. [184] Tehrani-Saleh, A., LaBar, T., and Adami, C. Evolution leads to a diversity of motion- detection neuronal circuits. In Proceedings of Artificial Life 16 (Cambridge, MA, 2018), T. Ikegami, N. Virgo, O. Witkowski, M. Oka, R. Suzuki, and H. Iizuka, Eds., MIT Press, pp. 625–632. [185] Tehrani-Saleh, A., McAuley, J. D., and Adami, C. Mechanism of perceived duration in artificial brains suggests new model of attentional entrainment. bioRxiv (2019), 870535. [186] Tehrani-Saleh, A., Olson, R., and Adami, C. Flies as ship captains? Digital evolution unravels selective pressures to avoid collision in Drosophila. In Proc. Artificial Life 15 (Cambridge, MA, 2016), C. G. et al., Ed., MIT Press. [187] Tenaillon, O., Rodríguez-Verdugo, A., Gaut, R. L., McDonald, P., Bennett, A. F., Long, A. D., and Gaut, B. S. The molecular diversity of adaptive convergence. Science 335 (2012), 457–461. [188] Thomas, E., and Weaver, W. Cognitive processing and time perception. Atten. Oercept. Psychophys. 17 (1975), 363–367. [189] Tononi, G., Boly, M., Massimini, M., and Koch, C. Integrated information theory: From consciousness to its physical substrate. Nature Reviews Neuroscience 17, 7 (2016), 450. [190] Tononi, G., and Koch, C. Consciousness: here, there and everywhere? Phil. Trans. R. Soc. B 370 (2015), 20140167. 148 [191] Treisman, M. Temporal discrimination and the indifference interval. implications for a model of the “internal clock". Psychol Monogr 77 (1963), 1–31. [192] Treue, S., and Martínez Trujillo, J. C. Feature-based attention influences motion processing gain in macaque visual cortex. Nature 399 (1999), 575–9. [193] Tse, P. U., Intriligator, J., Rivest, J., and Cavanagh, P. Attention and the subjective expansion of time. Percept Psychophys 66 (2004), 1171–89. [194] Tsong, A. E., Miller, M. G., Raisner, R. M., and Johnson, A. D. Evolution of a combinatorial transcriptional circuit: a case study in yeasts. Cell 115 (2003), 389–399. [195] Tsong, A. E., Tuch, B. B., Li, H., and Johnson, A. D. Evolution of alternative transcriptional circuits with identical logic. Nature 443 (2006), 415. [196] Turing, A. M. Computing machinery and intelligence. Mind 59, 236 (1950), 433–460. [197] VanRullen, R., and Koch, C. Is perception discrete or continuous? Trends in Cognitive Sciences 7 (2003), 207–213. [198] VanRullen, R., Reddy, L., and Koch, C. Attention-driven discrete sampling of motion perception. Proc. Natl. Acad. Sci. U.S.A. 102 (2005), 5291–5296. [199] Vicente, R., Wibral, M., Lindner, M., and Pipa, G. Transfer entropy—a model-free measure of effective connectivity for the neurosciences. Journal of computational Neuroscience 30, 1 (2011), 45–67. [200] Wagner, A. The Origins of Evolutionary Innovations: A Theory of Transformative Change in Living Systems. Oxford University Press, Oxford (UK), 2011. [201] Wibral, M., Lizier, J. T., and Priesemann, V. Bits from brains for biologically inspired computing. Frontiers in Robotics and AI 2 (2015), 5. [202] Wibral, M., Vicente, R., and Lindner, M. Transfer entropy in neuroscience. In Directed Information Measures in Neuroscience. Springer, 2014, pp. 3–36. [203] Wilke, C. O., Wang, J. 
L., Ofria, C., Lenski, R. E., and Adami, C. Evolution of digital organisms at high mutation rates leads to survival of the flattest. Nature 412 (2001), 331. [204] Williams, P. L., and Beer, R. D. Nonnegative decomposition of multivariate information. arXiv preprint arXiv:1004.2515 (2010). [205] Yao, X., and Liu, Y. A new evolutionary system for evolving artificial neural networks. IEEE transactions on neural networks 8, 3 (1997), 694–713. [206] Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? arXiv preprint arXiv:1411.1792 (2014). [207] Zabala, F., Polidoro, P., Robie, A., Branson, K., Perona, P., and Dickinson, M. H. A simple strategy for detecting moving objects during locomotion revealed by animal-robot interactions. Current Biology 22 (2012), 1344–1350. 149 [208] Zador, A. M. A critique of pure learning and what artificial neural networks can learn from animal brains. Nature communications 10 (2019), 3770. [209] Zhang, C., Liao, Q., Rakhlin, A., Sridharan, K., Miranda, B., Golowich, N., and Pog- gio, T. Theory of Deep Learning IIb: Generalization properties of SGD. Tech. Rep. arXiv:1801.02254, Center for Brains, Minds, and Machines, 2018. 150