CORTEX-INSPIRED GOAL-DIRECTED RECURRENT NETWORKS FOR DEVELOPMENTAL VISUAL ATTENTION AND RECOGNITION WITH COMPLEX BACKGROUNDS

By

Matthew Luciw

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science

2010

ABSTRACT

CORTEX-INSPIRED GOAL-DIRECTED RECURRENT NETWORKS FOR DEVELOPMENTAL VISUAL ATTENTION AND RECOGNITION WITH COMPLEX BACKGROUNDS

By Matthew Luciw

It is unknown how the brain self-organizes its internal wiring without a holistically aware central controller. How does the brain develop internal object representations for a massive number of objects? How do such representations enable tightly intertwined attention and recognition in the presence of complex backgrounds? Most vision systems have not included top-down connectivity, or have treated bottom-up and top-down processing separately. Yet almost all excitatory pathways in the visual cortex are bidirectional, and evidence suggests the top-down connections are not fundamentally different from bottom-up connections. This dissertation presents and analyzes a hierarchical, self-organizing type of network with adaptive excitatory bottom-up and top-down connections. This two-way network takes advantage of grounding: both the sensory end (visual patches) and the motor end (action control) are input ports. Internally, local neural learning uses only the co-firing between pre-synaptic and post-synaptic activities. Such a representation automatically boosts action-relevant components in the sensory inputs (e.g., foreground vs. background) by increasing the chance that only action-related feature detectors win in competition. After learning, the bidirectional networks showed topographic semantic grouping and modular connectivity. It is shown how and why such modular networks can take advantage of recurrent excitation for recognition. In Where-What Network-3, top-down connections enabled type-based and location-based top-down attention, and synchronization of neurons over multiple levels to bind features into holistic representations.

COPYRIGHT BY
MATTHEW LUCIW
2010

DEDICATION

To my parents.

ACKNOWLEDGEMENTS

I truly have to thank my advisor, Professor Juyang (John) Weng. John has been a great advisor for me, and I've learned so much from him.
He’s been an endless source of inspiration and motivation. I’d like to thank professors Charles Ofria, Kelly Mix, and George Stockman for serving on my committee and providing crit- ical comments on this work. A special thanks goes to George Stockman, for being tremendously supportive throughout my time in graduate school and for teaching me about Computer Vision, and to Professor Wayne Dyksen, who has been a great help and an inspiring person to work with as a teaching assistant. I’ve been influ- enced by many professors through taking their courses at MSU. In particular, classes taught by Barbara Abbott, Betty Cheng, Richard Enbody, Abdol Esfahanian, John Forsyth, Anil Jain, Devin McAuley, Mark McCullen, Bill Punch, Charles Ofria, Charles Owen, Richard Reid, George Stockman, Mark Sullivan, Juyang Weng, and Anthony Wojcik were influential and inspiring. The Computer Science graduate director, professor Eric Torng, was extremely supportive. This dissertation represents the completion of my Ph.D. at Michigan State. There are many people that I’d like to thank. Some past and present members of the Embodied Intelligence (BI) Lab at MSU provided helpful discussions, assistance, and collaborations. A great thanks goes out to collaborators Zhengping Ji, Laura Grabowski, Shuqing Zeng, and Hanqing Zhao. I’d like to especially thank EI Lab members Paul Cornwell and Mojtaba Solgi for their discussions and support. Chuck Bardel gets a special mention. Others at MSU I’d like to mention for their assis- tance in various ways are Zubin Abraham, Alexis Ball, Tyler Baldwin, Matt Gerber, Faliang Chang, Xiao Huang, Xiaoan (Dustin) Li, Mayur Mudigonda, Brian McCul- loch, Kajal Miyan, Vikram Melapudi, David Mulder, Zahar Prasov, Yan Wang, Liu Yang, and Nan Zhang. Finally, I’d like to thank my sister Julia, my Dad, and my Mom for their encouragement and support throughout my life. TABLE OF CONTENTS List of Figures .................................. ix List of Tables .................................. xviii Introduction 1 Background 6 2.1 Visual Representation .......................... 6 2.1.1 Viewpoint Invariance ....................... 7 2.1.2 Object Representation ...................... 9 2.1.3 Local Appearance Hierarchies .................. 11 2.2 Visual Attention ............................. 13 2.2.1 Selection .............................. 13 2.2.2 Binding Problem ......................... 16 2.3 thctional Architecture of the Visual Cortex .............. 19 2.3.1 Distributed Hierarchical Processing ............... 20 2.3.2 Input Driven Organization .................... 23 2.3.3 Cortical Circuitry ......................... 24 2.3.4 Smoothness and Modularity ................... 26 Developing Neuronal Layers 28 3.1 Overview .................................. 28 3.2 Concepts and Theory ........................... 29 3.2.1 Belongingness ........................... 30 3.2.2 Spatial Optimality ........................ 30 3.2.3 Temporal Optimality ....................... 31 3.2.4 Solution .............................. 32 3.2.5 Incremental Solution ....................... 32 3.2.6 CCI Plasticity ........................... 33 3.3 Algorithm ................................. 34 3.4 Experiments ................................ 36 3.4.1 Natural Images .......................... 36 3.4.2 Hebbian Updating ........................ 38 vi 3.5 Bibliographical Notes ........................... 46 Top-Down Connections for Semantic Self-Organization 49 4.1 Motivation ................................. 
50 4.2 Concepts and Theory ........................... 51 4.2.1 Three Layer Network ....................... 51 4.2.2 Relevant Input is Correlated with Abstract Context ...... 54 4.2.3 Null Space ............................. 55 4.2.4 Weighted Semantic Similarity .................. 58 4.2.5 Smoothness ............................ 59 4.3 Algorithm ................................. 60 4.4 Topographic Class Grouping ....................... 63 4.5 Experiments ................................ 69 4.5.1 MNIST Handwritten Digits ................... 69 4.5.2 MSU-25 Objects ......................... 75 4.5.3 NORB Objects .......................... 80 4.6 Summary ................................. 83 4.7 Adaptive Lateral Excitation ....................... 84 4.7.1 Motivation ............................. 85 4.7.2 Smoothness vs. Precision ..................... 86 4.8 Setup .................................... 87 4.8.1 Initialization ............................ 88 4.8.2 Adaptation ............................ 89 4.8.3 Developmental Scheduling .................... 89 4.9 Experiments ................................ 90 4.9.1 Comparison ............................ 91 4.10 Summary ................................. 95 4.11 Bibliographical Notes ........................... 97 4.12 Appendix: Neuronal Entropy ...................... 98 Recurrent Dynamics and Modularity 99 5.1 Motivation ................................. 100 5.2 Concepts and Theory ........................... 103 5.2.1 Internal Expectation ....................... 103 5.2.2 Network Overview ........................ 105 5.2.3 Network Dynamics ........................ 107 5.2.4 Minimum-Entropy Networks ................... 109 5.2.5 Irreducible Networks ....................... 113 5.2.6 Modular Networks ........................ 114 vii 5.2.7 Nonlinear Mechanisms for Modular Networks ......... 115 5.3 Algorithm ................................. 118 5.4 Experiments ................................ 119 5.4.1 Object Recognition ........................ 119 5.4.2 Vehicle Detection ......................... 123 5.5 Summary ................................. 125 5.6 Bibliographical Notes ........................... 128 6 WWN-3: A Recurrent Network for Visual Attention and Recog- nition 130 6.1 Background ................................ 133 6.1.1 WWN Architecture ........................ 133 6.1.2 Attention Selection Mechanisms at a. High-Level ........ 135 6.1.3 Binding Problem ......................... 138 6.2 Concepts and Theory ........................... 141 6.2.1 Learning Attention ........................ 150 6.2.2 Entropy Reduction ........................ 152 6.2.3 Attention Selection Mechanisms ................. 154 6.3 Experiments ................................ 156 6.3.1 WWN—3 Learns .......................... 157 6.3.2 Two Object Scenes ........................ 159 6.3.3 Cross-Layer Connections ..................... 161 6.4 Summary ................................. 164 6.5 Bibliographical Notes ........................... 164 7 Concluding Remarks 165 BIBLIOGRAPHY 168 viii LIST OF FIGURES 2.1 As the first figure in this dissertation, I must note the following: Images in this dissertation are presented in color. This figure shows a partial model of a visual hierarchy, starting from sensor cells in the retina and progressing to inferotemporal cortex (IT), an area implicated in visual object recognition. Information progresses through these pathways and beyond and eventually to later motor areas (not shown). 
All connections shown are bidirectional. Each area is organized into six-layers of highly interconnected neurons. The different areas in cortex, all have this same general six-layer connectivity pattern [50]. A core idea is the hierarchical arrangement of neural areas on many levels, from sensors to motors, with each area’s suborganization being six-layered. Figure adapted and modified from [71]. ........ 21 2.2 Some pathways through different cortical areas for different sensory modali- ties. The numbers indicate the numerical classification of each cortical area originally by Brodmann in 1909 [50]. There are two different pathways by which visual information travels — it has been found that one pathway is for identity and categorization (“what”) and the other has more to do with location (e.g., how to reach for the object — “where”). The somatosensory, auditory and visual information progresses through different pathways, converging at premotor and prefrontal areas. The premotor area goes to the motor cortex, which handles movement and action representation. All paths are bidirectional. Figure courtesy of J uyang Weng. ......... 22 3.1 Example of a natural image used in the experiment ............. 36 3.2 Lobe components from natural images (with whitening). ......... 37 3.3 Lobe components from natural images (without whitening) ........ 38 3.4 Comparison of incremental neuronal updating methods. We compare in 25 and 100 dimensions. This figure shows 100-d results. Methods used were (i) “dot-product” SOM, (ii) Oja’s rule with fixed learning rate 10'3, (iii) Standard Hebbian updating with three functions for tuning the time- varying learning rates (TVLR): linear, power, and inverse, and (iv) CCI LCA. LCA, with its temporal optimality, outperforms all other methods. Consider this as a “race” from start (same initialization) to finish (0% error). Note how quickly it achieves short distance to the goal compared with other methods. CCI LCA beats the compared methods. E.g., after 28, 500 samples, when LCA has covered 56% distance, the next closest method has only covered 24% distance. .................. 41 ix 3.5 3.6 3.7 4.1 4.2 Comparison of incremental neuronal updating methods. This shows the 25-d results. At this lower dimension, after 5000 samples, LCA has covered 66% of the distance, while the next closest method has only covered 17% distance. .................................. Comparison of LCA with two other Hebbian learning variants for a time- varying distribution. This shows average error for all available components. There are 70 available until time 200,000, 80 until 400,000, 90 until 600,000 and 100 until 1,000,000. We expect a slight degradation in overall perfor- mance when new data is introduced due to the limited resource always available (100 neurons). The first jump of LCA at t = 200, 000 is a loss of 3.7% of the distance it had traveled to that point .............. Comparison of LCA with two other Hebbian learning variants for a time- varying distribution. This shows how well the neurons adapt to the 10 com- ponents added at time 200,000 (called newdatal), and then how well they remember them (they are observed in only the second and fifth phases). Initially, this new data is learned well. At time 400,000, newdata2 begins to be observed, and newdatal will not be observed until time 800,000. Note the “forgetting” of the non-LCA methods in comparison to the more grace- ful degradation of LCA. 
The plots focusing on newdata2 and newdata3 are similar .................................... In this model, neurons are placed on different layers in a hierarchy — ter- minating at sensors at the bottom and terminating at motors at the top. Each individual neuron has three types of input projections: bottom-up, lateral, and top-down. ........................... The three-layer network structure. The internal layer 1 takes three types of input: bottom-up input from layer 0, top-down input from layer 2 and lateral input from the neurons in the same layer. The top-down input is considered delayed as it is the layer 2 firing from the last time step. The “D” module represents this delay. A circle in a layer represents a neuron. For simplicity, this is a fully connected network: All the neurons in a layer takes input from every neuron in the later layer and the earlier layer. For every neuron (white) in layer 1, its faraway neurons (red) in the same layer are inhibitory (which feed inhibitory signals to the white neuron) and its nearby neurons (green) in the same layer are excitatory (which feed excitatory signals to the white neuron). Neurons connected with inhibitory lateral connections compete so that fewer neurons in layer 1 will win for firing (sparse coding [80]) which leads to sparse neuronal update 44 45 (only top-k neurons will fire and update). Figure courtesy of Juyang Weng. 52 4.3 How bottom-up and top-down connections coincide in a fully connected 4.4 4.5 4.6 network. Looking at one neuron, the fan-in weight vector deals with bottom-up sensitivity while the fan-out weight deals with top-down sen- sitivity. In the model presented here, the weights are shared among each two-way weight pairs. So, a neuron on layer 3' + 1 will have a bottom-up weight the same as the top-down weight of a neuron on layer j ....... Self-organization with motor-boosted distances leads to partitions that sep- arate the classes better. (a) There are two bottom-up dimensions 1:1 and :52. Samples falling in the blue area are from one class and those falling in the red area are another class (assume uniform densities). The “rele- vant” and “irrelevant” dimension are shown by the upper right axes, which are here linear. (b) The effect of self-organization using nine neurons in the bottom-up space only. Observe from the resulting partitions that the firing class entropy of the neurons will be high, meaning they are more class-mixed. (c) Boosting the data with motor information, which here is shown as a single extra dimension instead of two (for visualization) (d) The effect of self-organization in the boosted space, and embedding back into two dimensions. Note how the partition boundaries new line up with the class boundaries and how the data that falls into a given partition is mostly from the same class (low entrOpy). ................ A layer-one weight vector, around other neighbor weight vectors, viewed as 54 images, of a neuron exhibiting “harmful” interpolation through 3x3 updating. 59 Topographic class grouping with a 1D neuron array in 2D input space. The red area contains samples from class one, and the blue area contains samples from class two. The 10 neurons’ bottom-up vector are circles or squares. Their top-down membership is shown by the color or shape: Gray neurons are unassociated, black neurons are linked to class two and white neurons linked to class one. Square neurons are border neurons. To better understand how TCG emerges, we provide the following four cases. 
(a) After initialization, all neurons are unassociated. Only three neurons are drawn to show they are neighbors. Other neighbor connections are not shown yet, for clarity. (b) Neuron N1 has won for a nearby sample, becomes linked to class two. Its neighbors are pulled towards it and also link to class two. Note how N3 is actually pulled into the between-class “chasm”, and not onto the class distribution. ............... xi 66 4.7 4.8 4.9 4.10 4.11 4.12 4.13 4.14 4.15 Topographic class grouping with a 1D neuron array in 2D input space. This is a continuation of the last figure. (c) Over wins by N1, N2 and N5, self-organization occurred through neighbor pulling. N4 has become a border neuron, and N6 is starting to be pulled. (d) A final organization, uncovering the relevant dimension. For this data, the relevant dimension is not linear: it is through the center area of each class distribution, length- wise. The final neuron organization mirrors this, and the neurons have organized along the relevant dimension. .................. This shows why TCG cannot be guaranteed in 2D in general. (a) After much sampling of class one, and none of class two, the class one area has grown. (b) Further, after much sampling of class two and none more of class one, the class two area grows, and happens to cut through the class one area (leaving two buffer zones). Here, TCG did not emerge. ..... MN IST data represented in a network with 40x 40 neurons, trained without top-down connections. ........................... The handwritten digits “4” and “9” from the MNIST digit database [58] are very similar from the bottom-up. (a) Result after self-organizing us- ing no motor-boosting. The 100 weight vectors are viewed as images below and the two weight vectors for each motor shown above. White means a stronger weight. The organization is class-mixed. (b) After self- organization using motor-boosted distance (weight of 0.3). Each class is individually grouped in the feature layer, and the averaging of each feature will be within the same class. ....................... (a) Weights after development of a 40 x 40 grid when top-down connections were used. The map has organized so that each class is located within a specific area. ................................ This corresponds to the last figure. (b) Probability of each neuron to signify a certain class, from previous updates, when 6 = 0.3 is strong (corresponding to (a)). (0) Probability maps for a different test when B=0. ................................... Sample(s) from each of the 25 objects classes, also showing some rotation. In the experiments, the training images were 56 x 56 and grayscale. Result for 25 objects viewed from a full range of horizontal 360 degrees. Developed using top-down connections: the maximum class firing proba- bilities of the 40 x 40 neurons of layer 1. ................. Result for 25 objects viewed from a full range of horizontal 360 degrees. Developed without using top-down connections: the maximum class firing probabilities of the 40 x 40 neurons of layer 1. .............. 70 73 4.16 (a). Images presented to a trained network to measure the class-response scatter. (b) Bottom-up weight to the neuron representing this class (“tur- tle”). This network was top-down—disabled. (c) Top responding neuron positions for each of these samples for the unsupervised network (d) Layer- two weight for a top-down-enabled network (d) Top responding neuron positions for the supervised network. ................... 79 4.17 NORB objects. 
Each row is a different category. Examples of training samples‘are shown on the left, and testing (not seen in training) is shown on the right. Figure from [59]. ....................... 81 4.18 2D class maps for a 40 x 40 neural grid after training with the NORB data. At each neuron position, a color indicates the largest outgoing weight in terms of class output. There are five classes, so there are five neurons, and five colors. (a) B = 0 (b) [3 = 0.3. These experiments used wraparound neighborhoods ................................ 82 4.19 2D entropy maps for the Fig. 4.18 experiments. High-entropy means prototypes are between class manifolds, which can lead to error. Whiter color means a higher entropy. (a) ,6 = 0 (b) 3 = 0.3. Note the high entropy neurons shown here coincide with class group borders shown in Fig. 4.18. 83 4.20 Performance comparison of LCA using 3 x 3 updating to LCA with exci- tatory lateral connections with and without top-down connections. . . . . 92 4.21 Per-neuron class entropy for the four variants of LCA in the tests with the 25-Objects data. .............................. 93 4.22 Above is the bottom-up weights after development for a neural map (each weight can be viewed as an image) that utilized 3 x 3 updating without top- down. Below are the weights for a neural map that developed using explicit adaptive lateral connections (also without top-down). Note the smearing of the features above and the relatively higher precision of representation below, while still being somewhat topographically organized. ....... 94 4.23 Performance comparison for different updating methods, using lateral con- nections without top-down .......................... 96 5.1 Weakly separable data can become more strongly separable using temporal information. (a) Two classes of data drawn from overlapping Gaussian distributions in space. (b) Adding another dimension: perfect separation trivially occurs if one includes another dimension which depends on the label of each point, but of course the label is not available. (0) When the data has much temporal continuity, the previous guess of class can be used as the third dimension. z becomes an expectation by using the guessed label of the current point and the previous z. Temporal trajectories are shown here. The points in the middle are “transition points”, where expectation is not as strong since recent data was from the other class. . 104 xiii 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 A three—layer network structure. The internal layer 1 takes three types of input: bottom-up input from layer 0, top-down input from layer 2 and lateral input from the neurons in the same layer. The top-down input is considered delayed as it is the layer 2 firing from the last time step. The “D” module represents this delay. The connectivity of the three connection types is global, meaning each neuron gets input from every other neuron on the lower and higher layers. Lateral inhibition is handled in an approximate sense via k-winners take all. Figure courtesy of Juyang Weng ........ 106 How bottom-up and top-down connections coincide in a shared way. Look- ing at one neuron, the fan-in weight vector deals with bottom-up sensitivity while the fan-out weight deals with top-down sensitivity. In the model pre- sented here, the weights are shared among each two-way weight pairs. So, a neuron on layer j + 1 will have a bottom-up weight that is linked to the top-down weight of a neuron on layer j. .................. 107 Examples of the three types of connectivity. 
(a) Minimum entropy network. Top-down feedback has a sensible and reliable effect of biasing the features associated with the firing motors at a level appropriate to the motors’ firing. (b) High entropy network. Top-down feedback is not useful in this type of network since it spreads quickly. (c) Modular network. Here, the neurons are minimum entropy except for one “border” neuron. Unchecked, positive feedback spreads throughout, but this situation is manageable through lateral inhibition, in which the low activity connection is inhibited, and the network acts as a minimum entropy network. ............. 110 A section of an “unrolled in time” layer-two and layer-three in a network with four feature neurons and three motor neurons. Keys differences from the linear system are lateral inhibition and output normalization. Lateral inhibition, at both the feature layer and motor layer, stops the flow of low energy signals into the future and can control the spread of positive feedback. This figure shows two feature neurons inhibited at time t and one motor neuron inhibited at time t + 1. The output normalization keeps the top-down vector on the same scale as the bottom-up. It also allows long-term memory .............................. 116 The effect of different expectation parameter values on performance. The “X” in “FX” indicates the frame to start measuring performance after a sequence shift, to a rotating object sequence. Three frame smoothing means the latest three outputs “vote” to get a single output. ....... 120 How performance improves over training epochs through all object sequences. 121 Increased expectation parameter shifts errors from bottom-up misclassifi- cations to errors within a transition period (immediately after the class of viewed object changed). .......................... 122 Examples of data within the vehicle class. ................. 123 5.10 Performance of globally versus locally connected networks when data is 5.11 6.1 6.2 6.3 6.4 6.5 limited. The locally connected network performs better with less data for this dataset. This may be because vehicle images can be made up of several interchangeable parts. Once training is mature, the expectation mechanism can be enabled in testing, and performance shoots up to a nearly perfect level .............................. (a). Some features for the “vehicle” class, developed by the locally-connected network. This shows the response-weighted input, as a weighted sum of the data and the neuron responses. (b). Showing the respective locations of the features from (a). This figure shows response-weighted input, so there is nonzero sensory stimulus outside the receptive fields (the white box area), but the neurons will not respond to any stimuli outside this area. A high-level block diagram of WWN-3. The area are named after those found in our visual pathway but it is not claimed that the functions or representations are identical. ....................... WWN accomplishes different modes of attention by changing the directions of information flow. (a) For bottom-up stimulus driven attention, informa- tion flows forward from pixels to motors. (b) In order to disengage from the currently attended object, an internal suppression is applied at the motor end, and a diffuse top-down excitation will cause a new foreground to be attended to in the next bottom-up pass. .............. WWN accomplishes different modes of attention by changing the directions of information flow. This continues the last figure. 
(c) Top-down location- based attention (orientation) occurs due to imposed location motor and information flows back to V4 and forward to the type motor. ((1) Top- down type-based attention (object search) occurs due to imposed type 125 . 126 133 135 motor and information flows back to V4 and forward to the location motor. 136 Toy example illustrating the “coarse” binding problem. .......... WWN-3 system architecture. The V4 area has three layers (depths) of feature detectors at all image locations. Paired layers: each area has a bottom-up component, a top-down component, and a paired (integration) component. Within each component, lateral inhibitory competition occurs. Note that in this figure top-down connections point up and bottom-up connections point down. .......................... XV 139 142 6.6 6.7 6.8 6.9 6.10 6.11 Interaction between bottom-up and top-down in a toy attention problem. There are two foregrounds A or B. The foreground set is F. This system has A-detectors and B-detectors. Consider the figures on the left. The expectation of pre-response of an A-filters is shown to the left and expec- tation for B-filters on the right. Assume the corresponding pre-response distributions are Gaussian, which are shown on the right side. A key as- sumption is that the pre—response alone does not lead to the best detection, shown as overlap in the distributions. When both foregrounds are visible, the system is not biased to detect one or the other, unless a top—down goal is utilized (c). ............................... Benefit of paired layers in attention selection. The pre-response distribu- tions of A-filters vs. B-filters are shown, when the A foreground is the only one visible, but when top-down expectation E2: either biases A (correct) or B (incorrect). Top-down boosting in cooperation with the true state (on the left) leads to higher discriminability by moving the two means farther away, which is very useful. But if the expectation is wrong (on the right), it can be drastically more difficult to detect the true state. If paired layers are used, lateral inhibition is applied first, so that many of the B-detectors will have zero response before the top—down boost. Then, an incorrect boost can be managed. .......................... Measured entropy along the “What” WWN pathways. In the what path- way, there is a clear entropy reduction from V2 to IT. The top-down con- nections enabled this emergence of discriminating representation to occur. Measured entropy along both the “Where” WWN pathway. However, there is not much of an entropy reduction along the where pathway, hovering around 2.7 bits, which is about 6.5 different pixels of inaccuracy — a little smaller than a 3 x 3 pixel neighborhood. We guess that there is an accuracy ceiling for our current method that we ran into using 400 where classes (compared to 5 what classes) with the 400 neurons in PP and 1200 neurons in V2. ............................... Class representation that developed in the IT area. Observe there are five different areas; one per class. Neurons along the border will represent two (or more) classes. This is illustrated in the entropy histograms above by the small group of neurons with about 1 bit of entropy (two choices). Internal representation in the IT and PP areas in the developed network. (a) Response-weighted input (weighted sum of samples) for four neurons from IT. This shows the weighted average of the samples that these neurons fired for. 
These represent the "duck" class, and some fire for multiple locations. (b) Response-weighted input for some PP neurons. These represent multiple classes, but a single location.
6.12 The foregrounds used in the experiment. There are three training (left) and two testing (right) foregrounds from each of the five classes of toys: "cat", "pig", "dump-truck", "duck", and "car". 157
6.13 Sample image inputs. 157
6.14 Performance results on disjoint testing data over epochs. 158
6.15 WWNs for the joint attention-recognition problem in bias-free and goal-based modes for two object scenes. (a) Performance when input contains two learned objects: bias-free, two types of imposed goals (top-down type-context for search and location-context for orient and detect), and shifting attention from one object to the other. (b) A few examples of operation over different modes by a trained WWN-3. "Context" means a top-down goal is imposed. An octagon indicates the location and type action outputs. The octagon is the default receptive field. 160
6.16 WWN-3 trains V2 through pulvinar location supervision and bottom-up based LCA. We added a new direct connection from TM so that V2 develops heightened type-specificity (even though it was already fairly type-specific). To test the coupled specificity of V2 representations, an alternate classifier, based on the winning neuron's entropy, was developed. 162

LIST OF TABLES

4.1 Summary of results of training networks on MNIST data. Each result is averaged over 5 trials. 73
4.2 Error results for MSU objects, averaged over 5 trials. 76
4.3 Feature quality and grouping results for the experiments with 25 objects. 78
4.4 Error results using the normalized-centered NORB data. 81
4.5 Grouping results using the NORB 5-class dataset. 82
4.6 Scheduling of inhibitory scope. 91
6.1 Architecture 1: trained with top-down from TM to V2. 163
6.2 Architecture 2: trained without top-down from TM to V2. 163
6.3 Average entropy for both architectures, in bits. 163

Chapter 1

Introduction

This dissertation is concerned with recurrent networks for attention and recognition in developmental agents. It focuses on vision, but it is applicable to other modalities. The work here is on general-purpose attention and recognition. In this sense, the methods are not meant to directly apply to a single particular engineering problem, but are instead motivated by the issue of how to build a machine that could understand the visual world. Such machines could be taught to solve specific problems, but they are not designed with any one particular problem in mind.

It seems logical to take a reductionist view of the mind. In this light, the mind, mental states, attitudes, and so on are all encapsulated in the brain; they are all explainable by brain operation. Arguments about the existence of intuitive phenomena like mental states or qualia are attributable to confusion about what these terms are actually referring to. Such terms as "mental state", "thinking", or "feel" fundamentally seem to reflect a misunderstanding of what is actually happening in the brain. The introspective style of reasoning about the mind has been very useful.
It may be possible to formulate some higher-level theory of mind that explains intelligent behavior and can be used in building intelligent machines. But results from those working admirably at this scale over many years have illuminated some difficulties of this approach. As discussed elsewhere by Weng, symbolically programmed AIs have been brittle [117]. Attempts to model intelligent behavior suffered from such brittleness problems since it seems impossible for the programmer to predict all the rules for all the cases the machine should be able to handle. To deal with this, researchers turned their attention to methods of learning, inspired by the thought that if the machine learns, then providing it with enough experience should be all that is needed. Learning methods have led to very promising results and to the thriving field of machine learning. But learning methods have mostly focused on solving single problems, thus generating artificial intelligences that can perform single tasks. Progressing towards AIs that can learn multiple tasks, without the programmers or modelers knowing the tasks in advance, is difficult with many learning methods that work very well for a single task. For example, many learning methods suffer from the well-known long-term memory problem: since they are designed to utilize all their resources to greedily learn any task, they will forget already learned tasks. There is also the related scaffolding problem: most learning methods that have learned one task have not made it any easier for the machine to learn a different but related task. Yet we can use our past experience to make it easy for us to learn some new skills in only a few tries, or even in one shot. The developmental approach, also known as autonomous mental development (AMD) [123], has emerged and aims to take these issues into account. The goal of AMD is to build a developmental program so the machine can learn skills from experience in a way that does not suffer from the long-term memory problem or the scaffolding problem. The focus has shifted towards the study of general-purpose learning mechanisms.

Over the history of AI, many relatively successful approaches were formulated as inspired by another field or combination of fields, such as psychology, biology, chemistry, or neuroscience. It is an open question as to which scale is most appropriate for modeling a developmental learning program. Recently, technology has advanced to the point that the amount of data coming out of the neuroscience field has exploded, and much is now known about the visual pathway and about some mechanisms of computation in cortex. Much is still unknown, however. Theories and results from psychology have been incredibly influential and illuminating in understanding visual attention and recognition. Connectionist approaches have been inspired by both psychology and neuroscience. Background over multiple related fields will be discussed further in Chapter 2.

Putting together some key knowledge about the brain may be leading towards solving the puzzle of how a successful developmental program could be built. Cortical neurons can be considered "internal representation", as demonstrated by stimulus-response selectivity. Neurons represent sensory stimuli and also movements (shown via firing-movement correlations). Studies mapping the brain's function have shown that different areas in each animal are reliably correlated with certain functions, yet in no way are all the functions of the different areas known.
But there is recent evidence that what neurons represent does not depend on their physical location, but instead on the area's input [96]. This points not to a genetically encoded representational scheme based on brain location, but to a shared mechanism of learning that occurs in each of the areas, no matter the location. In this case, different functions emerge because there are different inputs to each area. What is known about neuronal learning is based on synaptic strength modification and so far follows Hebb's principle and, at a lower temporal scale, the principles of spike-timing dependent plasticity (STDP). Some principles may also be emerging based on the neuromodulation of learning. In Chapter 3, I'll discuss a mechanism of general-purpose learning for a layer of neurons inspired by these results (I do not claim that these artificial neurons are functionally isomorphic to actual neurons; they represent units of information integration and computation, inspired by actual neurons or neuron groups). This mechanism is called lobe component analysis (LCA) [122, 124]. LCA satisfies many of the constraints of a developmental program appropriate for AMD.

The circuitry of the mature cortex is very complicated. Again, the notion of input-driven self-organization is encouraging. Over multiple layers, the neuronal learning method discussed in Chapter 3 leads to selective wiring: automatic organization of the network connectivity. Input-driven selective wiring over multiple layers is the focus of Chapter 4, which examines the purpose and effects of the top-down connections. These connections cause recurrence in the networks; the chapter presents a method, and its analysis, for utilizing top-down connections for a biased compression of information, leading to internal representations that are more useful for successful behavior given a limited resource. The top-down connections do not implement supervised learning in the sense of gradient descent methods; they operate the same way as bottom-up connections. Top-down and lateral excitation together in the learning phase produce modular networks that show topographic class grouping: layers with representation spatially organized so that neurons that represent the same action but different sensations are grouped together. It is so far unknown how cortical areas could emerge that represent very different (in a physical sense) stimuli, like the parahippocampal place area.

Methods, analysis, and results for running the mature multilayer networks as recurrent dynamical systems are presented in Chapter 5. The modularity of a network with topographic class grouping leads to helpful recurrent effects that networks without modularity do not have.

Mechanisms and understanding of top-down connections, from Chapters 4 and 5, are integrated into the work in Chapter 6. It focuses on recurrent networks for general-purpose visual attention and recognition called Where-What Networks (WWN) [47, 48, 63]. WWNs treat attention and recognition in a unified way, as it seems the brain does not deal with these two problems independently. This chapter discusses the architecture, training, and operation of WWNs for bottom-up and top-down attention to any foreground over complex backgrounds. When there are multiple objects in the visual scene, the binding problem becomes an issue. In WWN, synchronization via recurrence and information flow control lets the networks deal with the binding problem without using combination neurons.
In Chapter 7, I summarize the contributions and results of the dissertation, and a few possible future directions are briefly explored.

Chapter 2

Background

The topic material of this dissertation is at an intersection of computer vision, pattern recognition (models of learning), computational neuroscience, artificial intelligence, and psychology (studies of visual representation and attention). This chapter discusses relevant background information from these fields. The purpose of this chapter is to introduce the problems that motivated this work and to provide context for the later chapters.

2.1 Visual Representation

One can view visual perception as a problem that our visual systems must solve. Study and analysis of this problem has uncovered the amazing capabilities of biological vision. A few will be discussed here. First, our ability to recognize an object that we know seems extremely robust over different viewing distances and angles. Second, we seem to have this capability for a huge number of objects. Underlying these two capabilities are questions of visual representation. Visual representation is the internal storage and the mechanisms for interpreting and explaining visual stimuli. What representations are appropriate for human-level capabilities in object vision?

2.1.1 Viewpoint Invariance

Our ability to recognize objects is very robust. We can reliably recognize single objects that we know over many possible variations. Consider the following simplified setting for a single object: a viewer is some fixed distance from the object, which is within the viewer's field of view. The object exists in 3D world space but is represented as a pattern of intensities or colors on the viewer's retinal image, which we can consider as 2D. Let's assume the object completely fits within the retinal image, is centered on it, and there is no background. The object has its own internal 3D coordinate frame, which we'll assume has its origin at the center of the object. "Complete" viewpoint invariance implies that if the viewer is able to identify (ID) the object for any one rotation, the viewer can ID it over all possible rotations. A softer definition allows the viewer to see several different views first, or constrains the testing set. Note that the viewer does not see the object rotate over time. Whether we have such a capability has been the subject of debate, as summarized by Edelman ([30], p. 178), due to seemingly conflicting data.

If the possible rotations are restricted so that the object's appearance rotates in an affine way on the retinal image, then all the information needed for recognition is contained in a single view. Could every point on the rotated image be mapped to another point on an internal 2D stored view, a canonical template? Recognition could be accomplished by "mentally" rotating the template until it matches the object in the image. But consider the more realistic case where the rotations may be over all three dimensions. There is no longer a mapping from a 2D template of one view to the visual perception showing another view. If we extend the 2D template matching concept of storage, then each object we've learned would have a massive number of views associated with it, some of which can be very different. Think of looking straight down into a cup compared to viewing it from the side. It does not seem feasible nor optimal to store 2D template representations of every view of every object that we can recognize.
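The in-plane case just described is easy to sketch in code. The fragment below matches an input against a stored canonical template by searching over sampled in-plane rotations; the rotation step, the normalized-correlation score, and the function names are illustrative assumptions of mine, not a mechanism proposed in this dissertation. The absence of any such one-parameter search for out-of-plane rotations is exactly why the 2D template idea breaks down for general 3D viewpoints.

```python
import numpy as np
from scipy.ndimage import rotate

def normalized_correlation(a, b):
    """Correlation coefficient between two equal-sized image patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
    return float((a * b).sum() / denom)

def match_by_mental_rotation(image, template, angle_step=10):
    """Search over in-plane rotations of a stored canonical template.

    Returns the best-matching angle and its correlation score. Both the
    angle step and the matching score are illustrative choices.
    """
    best_angle, best_score = 0, -np.inf
    for angle in range(0, 360, angle_step):
        rotated = rotate(template, angle, reshape=False, mode="nearest")
        score = normalized_correlation(image, rotated)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle, best_score
```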
Use of "3D templates" was the subject of much investigation. An internal model of the world-based 3D shape would seem to work well for any particular object. An experiment by Shepard and Metzler, 1971 [90], supports the idea that objects are stored via an internal 3D representation. They tested subjects' reaction time in a same/different mental rotation task using 3D objects, similar to Tetris blocks. Two views of an object were presented, which showed either the same object rotated about the vertical axis, or two mirror images of the object rotated about the vertical axis. The reaction time results showed a linear dependency on rotation angle, which suggests we can mentally rotate 3D internal representations. Intuitively, when given a view of a novel, but somehow familiar, 3D object, we do not have much trouble imagining how it would look from another angle. The idea of using 3D templates is object-based, and it faces the problem of reconstruction from views. David Marr [70] provided a fundamental theory of vision leading to 3D reconstruction, and the more recent technique of SLAM [28] can reconstruct surfaces from views. SLAM has been shown to work well for navigation.

There was quite a bit of debate about whether internal representation is (or should be) object-based or view-based. There is support for view-based representation from experiments showing poor performance of humans on viewpoint invariance for certain objects, such as complex bent "paperclip" objects [87]. Some objects (e.g., faces) are more similar over multiple views than others (e.g., paperclips). A core idea is that some of an object's parts change less over a set of views in which the entire view of the object changes significantly. A view-based system with viewpoint invariance over about 180 degrees was built for faces and cars [88], via decomposition of the image into parts.

2.1.2 Object Representation

Should objects be represented holistically or broken up into parts? David Marr ([70], 1982) explained how holistic 3D-like templates could be learned from a viewer's perspective. Inspired by neuroscience (he was one of the first to advocate both a functional and a computational study of the brain), Marr proposed that visual recognition has three stages. First is the primal sketch, a map of feature activation over the scene (e.g., edge locations provided by edge detection). The 2.5D sketch incorporates grouping, texture, and some depth information (depth information is available since two eyes give two different perspective views, which can be used to infer disparity and surface, i.e., local 3D shape, information). The third stage is the manipulation of the 2.5D sketches towards a 3D reconstruction. Marr's theory has been significant in the history of vision research. But many results about the top-down nature of visual processing have not supported a geometric reconstruction. When one considers a "visual task", such a data-rich and precise representation is probably not necessary. In particular, it ignores what the agent's purpose, or goal, is. The theory advocates that all shape information is important, yet how much of the surface and shape information is useful depends on the potential use the viewer can make of it. The challenge is how to represent all the available information so that behavior becomes successful.
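As a rough illustration of Marr's first stage, the sketch below computes a crude primal sketch as a thresholded gradient-magnitude map. The Sobel kernels and the relative threshold are my own illustrative choices; Marr's actual primal sketch is considerably richer (bars, blobs, and terminations, not just edge points).

```python
import numpy as np
from scipy.signal import convolve2d

def primal_sketch_edges(image, threshold=0.2):
    """A crude 'primal sketch': thresholded gradient magnitude of an image.

    Returns a binary map marking candidate edge locations. The kernels and
    the fixed relative threshold are illustrative assumptions.
    """
    sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    gx = convolve2d(image, sobel_x, mode="same", boundary="symm")
    gy = convolve2d(image, sobel_x.T, mode="same", boundary="symm")
    magnitude = np.hypot(gx, gy)
    return magnitude > threshold * magnitude.max()
```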
As discussed above, parts of an object can change little over a set of views while the entire object changes a lot over the same views. Viewpoint invariance could be accomplished through such local invariants. This was the inspiration behind Biederman's "Recognition by Components" (RBC) [9]. RBC object representations are composed of volumetric building blocks called geons (basic primitive 3D shapes, such as cylinders), along with inter-geon relationships. The geons themselves are stored internally and must be recognized with the same effort over any rotation (unless they are occluded). He gave three conditions [10] to enable viewpoint invariant recognition via RBC. If the geon structural description (GSD) is unique for an object, and the possible views are not nondistinctive, then viewpoint invariance can occur after seeing one view. Object recognition in this model boils down to computing the GSD of the current view and comparing it with the GSDs of known objects. When he added a single geon to a complex paperclip-like object, the response time of subjects to different views decreased dramatically. Tarr [99] claimed that this result can be explained as the viewer simply looking for the single added easy-to-detect feature. He replicated this experiment, but added more geons (three and five). He showed that as the number of added geons increased, the viewpoint invariance ability of the subjects decreased.

Edelman, Tarr and others helped move the field towards viewer-based methods of representation and recognition. The view-based models led to appearance-based methods in computer vision: representation and recognition from sets of digital images. An initial problem for this idea was that it seemed like a model would have to be automatically built from data to explain each pixel of an image, and this dimension is very large. One way to avoid the dimensionality issue is to map a set of images onto a lower-dimensional manifold. Turk and Pentland showed that, via principal component analysis (PCA) [113], a set of image views could be linearly mapped to a lower-dimensional space, and the new dimension depended on the complexity of the variations in the data. Due to the theory of PCA, the lower-dimensional eigenspace is optimally representative in a linear sense. Murase and Nayar [76] showed how to generate and parameterize complete eigenspaces for an object. By complete, it is meant that the eigenspace contains all possible views, even those not seen. However, the method was rather computationally expensive to update as new training views were added. Additionally, it could not tolerate occlusions.

2.1.3 Local Appearance Hierarchies

The occlusion problem with global appearance methods led to local appearance methods, where smaller image windows were used as the inputs of the algorithms [24]. Interestingly, higher-order principal components from smaller windows over multiple objects resembled the features detected by some known early-pathway visual neurons, while lower-order principal components did not seem to show useful structure. No matter which objects were used in the data, the resulting local features were nearly the same. Since PCA is optimal, this suggests that the local statistics over all views are not class dependent. We can assume evolution led to brains that can take advantage of these local statistics. As a simulated model of V1 orientation selectivity, a basis of Gabor filters [79] fits some V1 neurons fairly well. Could internal representation for local appearance be handled through such a basis?
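A minimal Gabor basis of the kind just mentioned can be written down directly. The sketch below builds a small bank of oriented Gabor kernels and applies it across an image; the kernel size, wavelength, and function names are illustrative assumptions, not parameters fit to V1 data.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(theta, size=11, wavelength=6.0, sigma=3.0):
    """A real-valued Gabor patch: a cosine grating under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)          # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

def gabor_responses(image, n_orientations=4):
    """Rectified responses of an oriented filter bank at every image location."""
    bank = [gabor_kernel(k * np.pi / n_orientations) for k in range(n_orientations)]
    return np.stack([np.abs(convolve2d(image, g, mode="same")) for g in bank])
```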
Even though such modeled filters (e.g., parameterized Gabor filters) can fit V1 orientation selectivity well, one basis (or layer) of local filters alone cannot explain object representation and recognition. An efficient coding of many-objects representation from parts requires a hierarchy of features. Feature hierarchies seem to be important for how the brain solves the problem of many-objects representation [86]. It is not known whether such a structure is a necessary condition for many-objects representation, however. In hierarchical systems, the receptive field (the area of the sensor detected by the filter) and the complexity of features increase towards higher layers. Fukushima (Neocognitron [36], 1983), Weng (Cresceptron [118, 119], 1992), and LeCun (LeNet [58], 1998) implemented notable vision systems using a local-to-global hierarchical approach. A hierarchical representation allows compositionality (shared parts between objects) and thus an efficient coding of many possible objects. Another advantage of hierarchies is a robust tolerance to many variations of an object, due to a slight tolerance to deviations of each local feature. Thus, invariance should emerge in a local-to-global way, but the exact mechanisms for this to happen remain controversial. "Max pooling", in which identical filters at different nearby locations become represented by the single filter with the maximum firing in the group [86], seems to aid networks in achieving invariance. In almost all implementations, feature hierarchies utilize only feedforward activation. In mature cortex, the speed of processing in object detection indicates that feedforward activity is probably sufficient for recognition [103]. However, another result indicates that feedback causes better performance [104]. There may be other roles for feedback, such as in learning the features in the first place.

Local feature hierarchies have not yet found the best method to set the features on each layer. The evidence of input-driven functional emergence in the brain [56] suggests a tantalizing prospect: perhaps the same learning mechanism is active in groups of neurons in all the different areas of the visual pathway. Appropriate diversity of function could emerge given this learning rule and appropriate input. If this learning rule can locally compress information efficiently and effectively, the global problem of setting features at all hierarchical levels could be solved by applying this same method to each of the levels concurrently. To test a learning algorithm's suitability for such a method, we can see if it can extract features similar to those in V1 from the same type of data V1 may be interacting with. Thus, any learning rule that develops orientation filters from natural images can be considered a candidate in general. But only an in-place [124] (also called local) learning mechanism is also biologically plausible. Weng proposed Lobe Component Analysis (LCA) as a candidate for this learning rule, which is nearly in-place. Since LCA develops orientation-selective neurons from natural input, it passes as a possible candidate. We showed the advantages of LCA over other Hebbian learning rules in [122], as summarized in Chapter 3. Chapter 4 describes how LCA can be included in a general framework with bottom-up, lateral, and top-down connections. It is shown how, when using top-down connections, an efficient compression (one which prioritizes relevant information) emerges.
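As a rough, hedged sketch of the kind of in-place, Hebbian, winner-take-all updating referred to here (the actual CCI LCA algorithm and its analysis are given in Chapter 3), the fragment below lets neurons compete for each input and nudges only the winner's weight vector along the co-firing direction, with a plasticity rate that decays with the neuron's firing age. The top-1 competition, the learning-rate schedule, and the function name are simplifications and assumptions of mine, not the full algorithm.

```python
import numpy as np

def hebbian_wta_update(weights, x, ages, t1=20.0, t2=200.0, c=2.0, r=2000.0):
    """One winner-take-all, Hebbian-style update for a layer of neurons.

    weights: (n_neurons, n_dims) synaptic vectors (nonzero initialization assumed).
    x: (n_dims,) current input sample.
    ages: (n_neurons,) per-neuron firing-age counters, updated in place.
    The age-dependent plasticity schedule (t1, t2, c, r) is an assumption
    used only to illustrate a learning rate that decays with firing age.
    """
    # Pre-responses: inner products between the input and normalized weights.
    norms = np.linalg.norm(weights, axis=1) + 1e-12
    responses = weights @ x / norms
    winner = int(np.argmax(responses))            # lateral competition (top-1)

    ages[winner] += 1
    age = ages[winner]
    if age < t1:                                   # plasticity schedule
        mu = 0.0
    elif age < t2:
        mu = c * (age - t1) / (t2 - t1)
    else:
        mu = c + (age - t2) / r
    w_keep = (age - 1 - mu) / age                  # retention rate
    w_learn = (1 + mu) / age                       # learning rate

    # Hebbian term: the winner's response times the input it fired for.
    weights[winner] = w_keep * weights[winner] + w_learn * responses[winner] * x
    return winner
```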
2.2 Visual Attention

Recognition operates together with attention to transform an image into a meaningful interpretation. A fundamental problem of recognition is the segmentation problem. Which part of the scene is the foreground, to be recognized? The rest of the scene is considered the background. Realistic backgrounds can have very complex visual structure and may or may not contain other objects. This problem reflects the chicken-and-egg nature of segmentation (attention) and recognition: it seemingly cannot be determined whether the edges, colors, etc. belong to the object (figure) or the background until the object is recognized, yet some grouping must occur before attention knows what to select. We also have some explicit internal control over what the foreground is. Consider the famous image in which we can see either two faces or a vase as the foreground (but not both), depending on what one tries to see.

2.2.1 Selection

Selective attention refers to mechanisms by which an agent recodes its sensory information into a simpler, more useful form. Simplified, relevant information is necessary for cognitive processes such as decision making. Attention is essential for artificial agents that learn intelligent behavior in complex unconstrained environments, especially those that utilize vision. Attention is called selective in that a subset of the sensed information is suppressed. What it means for information to be suppressed is not explicitly known. It may not be simply removed from being processed. Even in situations when a person cannot remember something sensed, that information can still influence that person's actions [67].

Selective attention is separated into bottom-up selection and top-down selection [27]. Bottom-up selection is not explicitly controlled: salient foregrounds will automatically "pop out" at the viewer. Sometimes, there is not a single salient object in the image but several objects, and the most salient is what is attended first. Top-down attention, a fundamental part of visual attention [27], is selective for important locations or features biased by the goal or task of the agent. Given the same scene with the same eye fixation, but two different top-down biases, the representation of the information that reaches the later stages can be very different. For example, imagine the differences between what a vehicle's driver tends to attend to compared to a passenger, even if they look in the same direction.

Treisman's Feature Integration Theory (FIT) [109] has been an extremely influential model of representation and attention, which described selection based on saliency. In FIT, objects are decomposed into features, which themselves can be recognized without attention, but a conjunction of features requires an attentional spotlight to "shine" on that location, which selects that location. FIT introduced the idea of separate feature maps concerned with different dimensions of stimuli (i.e., color, orientation, direction of movement, and disparity) feeding into a single master map of salient locations. Koch and Ullman [53] proposed a computational saliency model, implemented later by Itti and Koch [46], in which winner-take-all operation of neurons on the master map led to both binding and the selection of the attended location. Saliency methods have been coupled with recognition. An example is NAVIS (Neural Active Vision) by Backer et al. [2].
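A toy version of the master-map idea can be sketched as follows: a few contrast-based feature maps are normalized, summed into a master map, and a winner-take-all step picks the most salient location. This is only an illustration in the spirit of the Koch and Ullman / Itti and Koch models; the feature choices, scales, and function names are my own assumptions, not their published implementations.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_contrast(image, size):
    """Center-surround style contrast: |pixel - local mean| at one scale."""
    return np.abs(image - uniform_filter(image, size=size))

def saliency_winner_take_all(image, scales=(3, 7, 15)):
    """Sum normalized contrast maps into a master map; return the winner.

    Returns ((row, col), master_map). The scales and the per-map
    normalization are illustrative choices.
    """
    image = image.astype(float)
    master = np.zeros_like(image)
    for s in scales:
        fmap = local_contrast(image, s)
        rng = fmap.max() - fmap.min()
        if rng > 0:
            fmap = (fmap - fmap.min()) / rng       # normalize each feature map
        master += fmap
    winner = np.unravel_index(np.argmax(master), master.shape)
    return winner, master
```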
Olshausen, Anderson and Van Essen proposed a neural computing model of dynamic information routing to go along with a saliency map structure [78], so that the original information in the attended area could be sent to a recognition network. Each object is stored inside the network as an object-based reference frame, which is used for recognition via an associative memory similar to those demonstrated by Hopfield [41]. Control neurons set "shifter circuits" over multiple levels, which normalize a part of the image in scale and location for comparison with the object-based reference frame.

Non-biologically-inspired methods for top-down attention include the Viola and Jones approach [115], which is designed to find a particular class, such as faces. The histogram of oriented gradients approach, used for pedestrian detection [22], uses local histograms of image gradient orientations as features to predict a certain class's presence well. In both of these, the goal of attention selection is determined beforehand.

Other approaches have combined bottom-up and top-down attention. Desimone and Duncan [27] claimed that top-down selection occurs primarily by gating the different channels of object information; in terms of feature-based attention, it may be based on comparison with a feature template stored in short-term memory. CODAM [100] models attention as a control system, and uses a goal module to process bottom-up signals and hold internally generated goals, generating goal signals that bias lower-level competitive processing. The method in [75] modifies gains and selects the scale of processing to produce different saliency maps, specific to certain classes. A few researchers have proposed connectionist models for selecting and matching an object in the image to a canonical internal reference frame. Examples include Olshausen, Anderson and Van Essen [78] and Tsotsos [111]. Deco and Rolls, 2004 [25], created a biologically inspired network for attention and recognition where top-down connections controlled part of attention, but were not enabled in the training phase due to the instabilities they caused. Where-What Networks (WWN) are a biologically plausible developmental model that integrates both bottom-up and top-down modes of attention and recognition, without being limited to a specific task [47,48,63]. WWN does not use a master map or internally stored canonical objects [48,63].

2.2.2 Binding Problem

The binding problem is a fundamental problem of attention, and one that must be solved for local feature hierarchies. Once the scene has become represented by separate feature activations, how can a subset of those activations be recognized as an object? How could a network select the features that actually belong to the object and not select features that do not?

Much evidence does not support the idea that we represent objects holistically. Experimental results showed the existence of illusory conjunctions [51,108], which suggests there are separable features in object representation. Illusory conjunctions occur when there are fast visual changes, and feature dimensions of the actually observed items are mixed up upon being reported by the viewer. For example, if one is presented with a red B and a green S very quickly in sequence, upon reporting it back one may say there was a red S. The local-to-global feature hierarchies discussed above also represent the scene in a disintegrated way.
Separable features provide representation efficiency, but must eventually be reintegrated for understanding of any meaningful thing in an image. The binding problem concerns the mechanisms of integrating information into understandable wholes [107]. For example, if we analyze an image based on the locations of objects and the types of objects separately, how do we ensure that any single solution for location and type is in agreement? If a single image contains two different types in two different locations, and the network (correctly) outputs both positions and both types, it is not clear which type corresponds to which position.

FIT offered a theory of binding and an explanation for illusory conjunctions. The most famous supporting evidence for FIT is in feature-based search. Subjects were instructed to find an item among distractors, and this item differed from all the rest by a single simple feature, such as color. In this case, they were able to find it at the same speed (it "pops out") no matter how many distractors there were. This result suggests simple features are processed throughout the scene in parallel. However, when the object shared each of its features with some distractor, search time increased linearly as a function of the number of distractors, suggesting this conjunction-type search occurs serially. Treisman proposed that objects are internally represented as an "object file" containing feature information and information about feature relationships. Then, recognition cannot occur unless the file is applied to an actual location in the image. Attention can be placed at only a single location at any time. The spotlight binds features in the same location into an object file, which is compared with stored object files for recognition. Illusory conjunctions would be the result of the spotlight not remaining in a location long enough for binding to occur, leaving a feature "free-floating".

One proposed high-level solution to the binding problem is combination neurons (or neuron assemblies) that represent the entirety of the combination of features for an object, and exist somewhere in the brain. Each feature's detection is then a necessary condition for the combination neuron to fire. In Olshausen et al.'s work mentioned above, the associative memory network is in the spirit of the combination neuron approach: somewhere in the network is stored a model of each "whole" that can be recognized. There are problems with the combination neuron approach, as underlined in [116]. Returning to the position/type example, we could implement another winner-take-all layer of position/type combination neurons, having afferent input from the separated feature layers. But note that this scheme runs into the combinatorial explosion problem that we have been trying to avoid by using a hierarchical architecture of separable features in the first place! It is uncertain how all possible combination neurons could be learned without experiencing all combinations. And combination neurons do not support generalization. For example, if a network can recognize both a red car and a blue hat, it should be able to also recognize a blue car even if it has never seen one before. How is a network that sees a set of parts in some combinations able to generalize across other combinations that have not been seen? Thus, for a general-purpose vision system, any network using the combination neuron approach will eventually run into unexpected ambiguities.
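As a back-of-the-envelope illustration of that explosion, the snippet below compares the resource cost of separable detectors with that of dedicated combination neurons; the counts are invented for the example and do not come from any cited experiment.

```python
# Hypothetical resource counts: detectors for P positions and T types.
P, T = 100, 25           # illustrative values only

separable = P + T        # one detector per position plus one per type
combination = P * T      # one neuron per (position, type) pair

print(separable)         # 125
print(combination)       # 2500
# Every new feature dimension multiplies the combination count,
# while the separable representation only grows additively.
```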
However, synchrony of firing over multiple levels can alleviate the binding problem without explicit combination cells [116]. Much of the related work has investigated temporal synchrony (feature activations correlate in time at some scale). In this work, synchrony results from bidirectional connectivity: bottom-up and top-down connections. In the position/type network, given an image with two objects a and b, let the position detector layer, through feedforward activity and winner-take-all, select position a, while the type detector layer selects type b. The network now has two different pieces of information, both of which are correct, but possibly not synchronized. It is not necessary to try all possible combinations; instead the network can set one piece and use it to find the other. (For intuition, consider a quadratic equation in two variables x and y, which has two possible solutions: {(a, b), (c, d)}. If we know, e.g., x = a, we do not have to plug in all four combinations to find a single solution; instead, we can fix x = a and solve for y.) Following Treisman's idea of a spotlight, neurons in the position map project back to bias neurons at the appropriate positions in the earlier layer. Type-sensitive neurons related to both type a and type b are biased at that position; however, only object a is actually in that position. The biased feedforward activation to the type layer then leads to type a being selected. This was implemented in our Where-What Networks [63], discussed in Chapter 6. Binding through location selection is often effective, since location information is often enough. But it cannot handle transparency or occlusion. For these cases, binding within the selected location is also needed, which can be realized by top-down connections within a feature hierarchy, so that, for example, higher-layer form can bias the related lower-layer edges.

The crucial importance of top-down guidance for attention was recognized by Tsotsos. He proved that selecting the features that best represent an object, based only on the activations of the separate feature maps, has exponential complexity (it is NP-complete [112]). However, guided by knowledge of the object, the search becomes linear in the size of the image. One interpretation of that result is that if we know how the features activate when the object is in each of all possible positions, we can search through all possible positions (a linear-time scan). In other words, without some guidance, attention is intractable. Tsotsos' selective tuning model [111] is a multilayer pyramid-like network that uses a complete feedforward pass to find the best candidate location and type, and a complete feedback pass to focus attention by inhibiting non-selected features and locations. Gating units perform selection from the top down. Some differences between the selective tuning (ST) model and the Where-What Networks are that ST uses top-down selection (gating) and top-down inhibition through winner-take-all, while WWN uses top-down excitation through weighted connections; additionally, WWN uses multiple motor areas for controlling and sensing an agent's actions, while ST uses a single area of interpretive output nodes. With respect to the binding problem, by enforcing WTA on the output layer and only using top-down signals originating from the output layer, ST effectively uses combination neurons, which run into the combinatorial problem.
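To close this section, the toy sketch below walks through the position/type interaction described above: a winner is picked on the "where" map, its top-down bias is fed back to that location, and the biased feedforward activity then selects the matching type. All array sizes, values, and the form of the bias are invented for illustration; this is not the Where-What Network implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scene: 4 locations, 3 possible types; two objects are present.
n_loc, n_type = 4, 3
scene = np.zeros((n_loc, n_type))
scene[1, 0] = 1.0   # object of type 0 at location 1
scene[3, 2] = 1.0   # object of type 2 at location 3
scene += 0.05 * rng.random(scene.shape)       # background clutter

# Feedforward "where" and "what" responses, each collapsing one axis.
where_resp = scene.sum(axis=1)                # activity per location
what_resp = scene.sum(axis=0)                 # activity per type

loc_winner = int(np.argmax(where_resp))       # winner-take-all on "where"
type_winner = int(np.argmax(what_resp))       # winner-take-all on "what"
# The two winners need not describe the same object (no binding yet).

# Top-down bias from the location winner: boost features at that location,
# then let the biased feedforward activity drive the type competition.
topdown = np.zeros(n_loc)
topdown[loc_winner] = 1.0
biased = scene * (1.0 + topdown[:, None])     # excitatory top-down bias
bound_type = int(np.argmax(biased.sum(axis=0)))

print("unsynchronized:", loc_winner, type_winner)
print("after top-down bias:", loc_winner, bound_type)
```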
2.3 Functional Architecture of the Visual Cortex

It has so far been very difficult to map the mind onto the brain. Yet some evidence and principles are emerging that are very promising. Studies of cortex can lead to intuition about how to design and implement networks.

2.3.1 Distributed Hierarchical Processing

One can think of the cerebral cortex as a vertical hierarchy. In such a hierarchy, sensations from outside stimuli enter at the bottom (such as light interacting with the photoreceptors of the retina) and influence the system, traveling upwards towards motor areas, which control movement and behavior. A study and model of the hierarchical organization of the areas of the primate visual system can be found in [71], and later [32] (see [14] for an overview). Each area in cortex exhibits the same type of six-layered organization. Figure 2.1 illustrates this idea by displaying a partial hierarchy of visual areas.

The visual pathway is composed of two separate pathways. Experiments by Mishkin [74] showed that the lower pathway, called the ventral pathway and leading to IT, specializes in "what" information (recognition). The other path is called the dorsal, or "where", pathway. It is implicated in visuomotor reaching, thus encoding object location. Why should there be two different pathways for what and where? For a particular object, the location in the visual field has very little to do with the identity: where the object is seen does not raise or lower the likelihood of its category much. So object class is, at least somewhat, invariant to visual field location. And the opposite seems true as well: object location is invariant to identity. The two diverging pathways come together later at the prefrontal areas, where information is integrated from multiple modalities. The prefrontal areas project to motor areas, which control movement and behaviors like speech. Figure 2.2 shows some of the sensorimotor pathways through human cortex. For each pathway, information flows in both directions.

Neurons fire all along the visual pathway in object recognition. Objects seem to be represented in a hierarchical and distributed way along this pathway. At lower levels (V1 and V2), neurons' firing is selective for features such as oriented edges [42], disparity [6,21], color, and motion [92]. V4 seems to be selective for local shape features [83].

Figure 2.1: As the first figure in this dissertation, I must note the following: images in this dissertation are presented in color. This figure shows a partial model of a visual hierarchy, starting from sensor cells in the retina and progressing to inferotemporal cortex (IT), an area implicated in visual object recognition. Information progresses through these pathways and beyond, eventually reaching later motor areas (not shown). All connections shown are bidirectional. Each area is organized into six layers of highly interconnected neurons. The different areas in cortex all have this same general six-layer connectivity pattern [50]. A core idea is the hierarchical arrangement of neural areas on many levels, from sensors to motors, with each area's suborganization being six-layered. Figure adapted and modified from [71].

Figure 2.2: Some pathways through different cortical areas for different sensory modalities (occipital, somatosensory, auditory, and motor areas). The numbers indicate the numerical classification of each cortical area originally by Brodmann in 1909 [50].
There are two different pathways by which visual information travels: it has been found that one pathway is for identity and categorization ("what") and the other has more to do with location (e.g., how to reach for the object; "where"). The somatosensory, auditory and visual information progresses through different pathways, converging at premotor and prefrontal areas. The premotor area projects to the motor cortex, which handles movement and action representation. All paths are bidirectional. Figure courtesy of Juyang Weng.

In the later IT area, neurons are found to fire in response to abstract things like faces or places (interestingly, such neurons are grouped together in these areas). Hubel and Wiesel's discovery of the orientation selectivity of V1 neurons inspired the study of the "functional architecture" of the visual cortex, which has led to an immense research effort. Yet mapping the functional architecture is difficult.

In 1991, Felleman and Van Essen [33] provided a hierarchical diagram of the different cortical areas in vision; additionally, they mapped out the connectivity between the visual cortical areas (in the macaque monkey). They organized 32 different visual areas into 14 levels of cortical processing, and many areas were connected (there is a higher probability that closer areas are connected), almost always in a bidirectional way. The complexity of the architecture is not encouraging. The functional purpose (e.g., the type of information that is selected for) of only a subset of the neurons in only a few of these 32 areas is known. It is not likely this mapping could be used as an explicit blueprint for modeling, due to the lack of information about function and its high complexity. But directly modeling mature structure is probably not necessary, due to evidence supporting a developmental approach. There are general principles of organization behind the complex architecture and multi-area selective wiring.

2.3.2 Input Driven Organization

The concept of a developmental program in an artificial agent is analogous to the genome in a biological agent. In biological agents, development starts within a single cell, called the zygote, containing the genome of the organism. All physical and mental development thereafter is dependent on the genome's coding, meaning that the emergence of behaviors and skills is a process driven by the genes. The human genome contains fewer than 30,000 genes [95]. Compare that to the estimated 10^11 neurons and 10^14 synapses in the central nervous system of a mature adult. The discrepancy between the two estimates implies that all the different cortical functions, such as edge detection, shape detection, face detection, object recognition, motor control, etc., cannot be described in the genome.

Many neurons in the first area of the visual cortex (known as V1) develop sensitivity to edges at preferred orientations [43]. Are these low-level feature detectors in human vision pre-defined, that is, hardcoded within the genome? Evidence suggests the answer is no. Blakemore and Cooper's 1970 experiment [11] and Sur's work on "rewiring" a ferret's auditory cortex using visual information showed that feature detectors are developed. In [56], an experiment was done on a newborn ferret in which input from the visual sensors was redirected to the auditory cortex, and input from the auditory sensors was disconnected.
It was found, after some experience, that what would have been the auditory cortex in a normal ferret had altered its representations to function as the visual cortex for the "rewired" ferret. Thus, it is thought that the function of the different cortical areas develops as a result of the same mechanism throughout. The evidence suggests that cortical function and organization are input-driven. Therefore, the only reason developed neurons respond to different stimuli is that they adapt to different types of input. That is, the operations done by each neuron are the same, but the input is different, so they adapt to detect different features. This suggests the same learning mechanisms are active in different areas of the cortex. This notion inspired the local learning rules for self-organization (feature extraction and automatic selective wiring) described in Chapters 3 and 4.

2.3.3 Cortical Circuitry

Cortical neurons in any layer derive their representations via input connections from other neurons at three locations: from earlier layers and areas (ascending or bottom-up), from the same layer (lateral), and from later layers and areas (descending or top-down). There are both excitatory and inhibitory connections. About 85% of the connections are excitatory [29].

Excitatory top-down (feedback) connections are more numerous and generally more diffuse than the bottom-up connections. They have been assumed to play a modulatory role, while the bottom-up connections were considered the directed information carriers [14]. Driving connections move information, while modulatory connections modify the activity of the driving connections. But the specific computational roles played by the top-down excitatory connections, especially in development, are not yet known. There is evidence that these connections are quite important for visual classification ability [104]. It is known that they play a crucial role in perception over time, which seems to require some recurrence in the circuits. There is much evidence of the top-down connections' impact on visual awareness, for functions such as enabling segmentation between an object and its background [102], perceptual filling-in [57], and awareness of visual motion [82].

The role of the top-down connections in brain area organization and functional development is very much unknown. In mammalian cortex, later visual cortical areas include functionally specific regions in which neurons that respond to many class variations are spatially localized. Neurons in the inferotemporal (IT) area, along the ventral pathway, have been implicated in object and class recognition [98]. Further, there are areas that seem to have developed to represent a specific category of stimuli. These stimuli in IT include faces [26], localized within the fusiform face area (FFA), and places, within the parahippocampal place area (PPA). Stimuli such as places do not typically have much physical similarity. Could top-down and lateral excitatory connections lead to these abstractly grouped areas of cortex? The theory and results in Chapter 4 describe a method by which top-down connections can cause a biased compression of information, so that important (relevant) information has a higher priority than irrelevant information; additionally, top-down and lateral excitation is shown to lead to such grouped areas. Chapter 5 describes a method in which top-down connections can cause temporal processing.
2.3.4 Smoothness and Modularity

Many cortices, such as the somatosensory, motor, and visual cortices, have been observed to be organized topographically. The smooth topographic organization of orientation selectivity in neurons in primary visual cortex (V1) is classically well known. At higher levels, this smooth organization gives way to a modular organization [15]. How does the cortex develop local topography, which seemingly requires smoothness, yet develop two adjacent areas as modular, which requires a partition that seemingly violates principles of smooth topography?

Tootell and colleagues [105] measured the responses of each of the neighboring FFA and PPA areas (in humans, using MRI) to stimuli at different morphs between faces and houses. They found that the two areas could be considered functionally different modules, since the peaks of the averaged morphed-stimuli responses were in one area or the other; there were no areas that responded optimally to the morphed features. However, they also found that some response to the morphs could be found in either area. But there was not a smooth interpolation between the two areas. Looking deeper, smooth V1 organization is not completely smooth at a lower scale: "the projection of the world into V1 is smooth and continuous on the macroscopic level, but jittery and occasionally discontinuous on the microscopic scale" (Koch, 2004 [52], p. 78). Maldonado et al. [68] measured the features detected in the pinwheel centers of V1 and found a larger variance of feature types selected for, but not significantly larger bandwidths. This implies that selection of features with low correlated firing can coexist nearby, so averaging of these unrelated nearby features did not necessarily occur.

It is generally assumed that lateral excitation is the impetus for topography to emerge. Supporting results have been found using computational self-organizing maps (SOM) [55]. In cerebral cortex, many lateral connections are clustered close to the neuron from which they originate, projecting to other nearby (i.e., neighboring) neurons. There are also strong long-range connections to neurons that detect similar features (e.g., similar or identical orientations). In self-organizing maps, an approximation of lateral excitation is isotropic biasing. That is, a firing neuron will excite all the neurons around it equally, as a function of radial distance. But in actual cortex, the close connections are not isotropic, but instead "patchy". Thus an isotropic updating function emanating from the winner neuron(s) is not biologically accurate. In [73], it was shown that an orientation map with such patchy connectivity can develop by incorporating adaptive, limited-range lateral connections into an SOM, and using oriented Gaussians or natural image patches as stimuli. Lateral excitatory connectivity was also shown to be the cause of the "smoothness" of the map, meaning that the features represented in a small area (containing a few neurons) tend to be similar. Yet in their mature simulated cortex, excitatory connections between neurons that represent dissimilar stimuli (not very statistically correlated) were not typically present, especially for long-range connections but even for nearby neurons. Results in Chapter 4 suggest that adaptive lateral connections play a crucial role in developing modularity, and coupled with top-down connections can cause modular and abstract representation areas.
Since top-down connections originate from the more abstract association cortices and from the motor areas, they carry information that can be used to bias which features are currently important (internal attention) to the task at hand. This idea inspired the approach described in Chapter 5.

Chapter 3

Developing Neuronal Layers

This chapter presents the Lobe Component Analysis (LCA) technique for general-purpose neuronal computation. LCA is a candidate technique for developing local features at any hierarchical layer. The algorithm allows nearly in-place development of each neuronal layer, anywhere in a network.

3.1 Overview

The biologically inspired Lobe Component Analysis incrementally develops a set of optimal pattern detectors from a high-dimensional sensory stream. Each feature detector, which is a vector, is called a "lobe component". The lobe components are optimal in the sense that each is the best representation of the observations that influenced its development. In other words, each feature is selected so that it maximizes the likelihood of its input history. The optimality is implemented via a biologically inspired Hebbian learning rule. A major advantage of the optimality is that there is no need to manually select (tune) the learning rate for the neurons; each neuron tunes its own learning rate in the best way.

Over learning, each neuron seeks to latch onto a component (also called a source, cause, feature, etc.) that "caused" the data (such as a vertical edge) and can be used to "explain" the data. It does this incrementally over a series of observations. To keep other neurons from following the same path, it laterally inhibits the others, so neurons must find different components. This nonlinear search is accomplished in a nearly optimal way (allowing for some inappropriate observations to fall into a neuron's history during the initial organization phase, before the components are sorted out among the neurons). The neurons are ensured to learn different features through competition (lateral inhibition). LCA uses k-winners-take-all to approximate the neural competition mechanism of lateral inhibition. Via the winner-take-all approach, only the lobe component direction most similar to the input will shift to become nearer to the input vector direction. The other neurons do not change, unlike, e.g., gradient methods, which change all neurons for each sample; therefore LCA has long-term memory. The technique is purely incremental, and additionally requires no extra storage besides the lobe components themselves. The learning is mostly local, except for the approximation of lateral inhibition via k-winners-take-all. The development is almost in-place.

A neuron's history and the data that fall into its current Voronoi partition may not coincide. We used a mechanism called CCI plasticity, which allows each neuron to "forget" its earlier observations, which may no longer be in its Voronoi partition. This also provides the network with the ability to deal with nonstationary distributions: the lobe components can eventually adapt to any changes in the environment (such as lighting changes).

3.2 Concepts and Theory

Consider a simple computational model of a neuron (indexed i) having n synaptic inputs. Its firing rate is modeled by

y_i = g(v_{i,1} x_1 + v_{i,2} x_2 + ... + v_{i,n} x_n) = g(v_i · x),   (3.1)

where x = (x_1, x_2, ..., x_n) is the vector of firing rates of the n input lines, and the synaptic strength (weight) associated with each input line x_j is v_{i,j}, j = 1, 2, ..., n.
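As a minimal illustration of Eq. (3.1), the following sketch computes the response of one model neuron, leaving the nonlinearity g as a pluggable argument; the example weights and inputs are arbitrary.

```python
import numpy as np

def neuron_response(v_i, x, g=lambda a: a):
    """Firing rate of neuron i: y_i = g(v_i . x), Eq. (3.1).

    v_i : synaptic weight vector of the neuron
    x   : firing rates on the n input lines
    g   : pointwise nonlinearity (identity by default)
    """
    return g(np.dot(v_i, x))

v_i = np.array([0.2, 0.5, -0.1])   # arbitrary example weights
x = np.array([1.0, 0.3, 0.8])      # arbitrary example input firing rates
print(neuron_response(v_i, x))                          # linear response
print(neuron_response(v_i, x, g=lambda a: np.tanh(a)))  # with a squashing g
```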
Traditionally, g has been a sigmoid function. For the analysis and for the experiments provided here, g is not necessary. A vector of activity on the input lines, x, is in the sample space: x ∈ X. From the above, note that a neuron's weight vector is also in this space: v_i ∈ X. A weight vector is called a lobe component.

3.2.1 Belongingness

Belongingness defines the assignment of samples to lobe components. Given a limited resource of c neurons, divide the sample space X into c mutually nonoverlapping regions, called lobe regions:

X = R_1 ∪ R_2 ∪ ... ∪ R_c.   (3.2)

Each lobe region is represented by a single lobe component. Given any input vector x, it exclusively belongs to a region R_i based on some criterion, e.g., similarity under some distance metric: x belongs to R_i if it is more similar to the representative lobe component v_i than to the other lobe components. Belongingness defines a partition, or tessellation, of the input space.

3.2.2 Spatial Optimality

Given an assignment of samples to a lobe component i, what is the best way to set v_i? Spatial optimality is defined in the following sense. For a random input vector x and the matrix of all lobe components V, use the notation x̂(V) to denote the lobe component of the region the input belongs to. Then, we wish to find the set of lobe components that minimizes the expected square approximation error E ||x̂(V) − x||²:

V* = (v*_1, v*_2, ..., v*_c) = arg min_V E ||x̂(V) − x||².   (3.3)

This spatial optimality requires that the spatial resource distribution in the layer is optimal in minimizing the representational error. This distortion is minimized by finding the V* that minimizes the above expression. For a given partition, LCA provides an optimal solution [122]. But if the partition must also be determined, the problem becomes NP-hard [12,81,128]. LCA is an approximate solution: it assumes that the set of samples falling in a lobe region (based on, e.g., minimum inner-product difference) is similar to the recent observation history of the representative lobe component. CCI plasticity, defined below, allows old observations to be forgotten.

3.2.3 Temporal Optimality

Incremental (Hebbian) learning algorithms using a single learning rate may find the correct direction of vector change, but will not always take the best "step" towards the goal at each update. Let u(t) be the neuronal internal observation (NIO), which for LCA is defined as the response-weighted input:

u(t) = (x(t) · v(t−1) / ||v(t−1)||) x(t).   (3.4)

The synaptic weight vector v(t) is estimated from a series of observations U(t) = {u(1), u(2), ..., u(t)} drawn from a probability density p(u) for this source. Suppose the learning rate η_t is used for the NIO u(t) at time t. How can all the learning rates η_1, η_2, ..., η_t be set so that the estimated lobe component v(t) at every time t has the minimum error while the search proceeds along its nonlinear trajectory toward its intended target weight vector v*?

Let S(t) be the set of all possible estimators for v from the set of observations U(t). A temporally optimal estimator means that at every update at time t, v is spatially optimal over U(t):

minimum-error(t) = min_{v(U(t)) ∈ S(t)} E ||v(U(t)) − v*||²,   (3.5)

for all t = 1, 2, 3, 4, ...
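For intuition about how such per-neuron learning rates enter an incremental update, here is a hedged sketch driven by the response-weighted input of Eq. (3.4). The learning-rate schedule shown (a plain running average, η_t = 1/t) is only a stand-in; the actual CCI plasticity schedule of LCA, which allows forgetting, is not reproduced here.

```python
import numpy as np

def update_lobe_component(v, x, age):
    """One incremental update of a lobe component from input x.

    The neuronal internal observation (Eq. 3.4) is the input weighted by
    the neuron's current response; the component is then moved toward it.
    """
    response = np.dot(x, v) / np.linalg.norm(v)   # x . v / ||v||
    u = response * x                              # response-weighted input
    age += 1
    eta = 1.0 / age          # stand-in schedule; CCI plasticity adjusts this
    v = (1.0 - eta) * v + eta * u
    return v, age

# Toy usage with arbitrary 3-dimensional inputs.
rng = np.random.default_rng(1)
v, age = rng.random(3), 0
for _ in range(100):
    x = np.array([1.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(3)
    v, age = update_lobe_component(v, x, age)
print(v / np.linalg.norm(v))   # direction settles near the dominant input direction
```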
3.2.4 Solution

According to the theory of Principal Component Analysis (PCA) (e.g., see [49]), the principal component of the conditional covariance matrix Σ_{x|i} (conditioned on x belonging to R_i, for lobe component i) is spatially optimal for lobe component i. Now we need to compute this solution from the data. First, note that v_i satisfies λ_{i,1} v_i = Σ_{x|i} v_i. Replacing Σ_{x|i} by the estimated sample covariance matrix of the column vector x, we have

λ_{i,1} v_i ≈ (1/t) Σ_{t'=1}^{t} x(t') x(t')^T v_i.

Figure 3.5: Comparison of incremental neuronal updating methods. This shows the 25-d results. At this lower dimension, after 5000 samples, LCA has covered 66% of the distance, while the next closest method has covered only 17% of the distance.

LCA has achieved 30% error while the best compared method is still at 70% error. The results for LCA will not be perfect due to the nonstationarity that occurs because of self-organization.

Time-varying distributions: plasticity

It is important for an agent to have the capability to adapt to new environments without catastrophic forgetting of what was already learned. We performed a comparison of how well the best-performing of the algorithms we tested before adapt to a time-varying distribution.

We set up a changing environment as follows. This is motivated by how a teacher will emphasize new material to the class, and only more briefly review old material. There are five phases. In the first phase, until time 200,000, the data is drawn from 70 orthogonal Laplacian components that span a 70-dimensional space. In the second phase, from time 200,000 to 400,000, the data is drawn from one of 10 new components in the 70-d space (rotated so as not to lie along axis directions) with 50% chance, or from one of the original 70 (using the original rotations) with 50% chance. In the third phase, from time 400,000 to 600,000, the data is drawn from either 10 previously unseen components or the original 70 (50% chance of either). The fourth phase, until time 800,000, is similar: 10 more previously unseen components are introduced. In the fifth phase, until T = 1,000,000, we draw from all 100 possible components (each with a 1% probability). We use 100 neurons over all phases (the number never increases or decreases). There are finally 100 neurons for 100 components, but in early phases we have extra resource (e.g., in phase one, we have 100 neurons for 70 components).

Results, averaged over 50 runs with different rotation matrices for each run, are shown in Fig. 3.6 and Fig. 3.7. LCA outperforms the other two variants: it is better at adaptation, and suffers a more graceful forgetting of data that is not currently observed.

Figure 3.6: Comparison of CCI LCA with two other Hebbian learning variants (the Oja update and Hebbian updating with a time-varying learning rate) for a time-varying distribution. This shows the average error (in radians) over all available components. There are 70 components available until time 200,000, 80 until 400,000, 90 until 600,000, and 100 until 1,000,000. We expect a slight degradation in overall performance when new data is introduced, due to the limited resource always available (100 neurons). The first jump of LCA, at t = 200,000, is a loss of 3.7% of the distance it had traveled to that point.

Figure 3.7: Comparison of CCI LCA with two other Hebbian learning variants for a time-varying distribution.
This shows how well the neurons adapt to the 10 components added at time 200,000 (called newdata1), and then how well they remember them (they are observed only in the second and fifth phases). Initially, this new data is learned well. At time 400,000, newdata2 begins to be observed, and newdata1 will not be observed again until time 800,000. Note the "forgetting" of the non-LCA methods in comparison to the more graceful degradation of LCA. The plots focusing on newdata2 and newdata3 are similar.

We note that the "re-learning" in the last phase does not match the previously observed performance. This is due to two reasons: the lessening of plasticity at larger neuron ages, and the growth of the manifold of the data while only a fixed representation resource is retained.

3.5 Bibliographical Notes

The in-place learning concept and the LCA algorithm were introduced in Weng & Zhang, 2006 [124], and used as each layer in our Multi-layer In-place Learning Networks [64,120]. The MILN-based model of the six-layer cerebral cortex [121], which was informed by the work of Felleman & Van Essen [32], Callaway and coworkers [18], and Grossberg [85], used LCA on both its supervised (L2/3) and unsupervised (L4) functional layers. The journal version of LCA [122] was more comprehensive and presented the comparisons with Hebbian learning rules included here. The multilayer models use LCA on each layer.

LCA was inspired by principal component analysis (PCA) and independent component analysis (ICA). Principal components are linearly the most expressive features of a dataset, in terms of the least mean square error of the projections [7]. Due to orthogonality constraints, most principal components do not match the causes of the image patch. In ICA, the constraints are not those of orthogonality but of independence. With respect to image patches, each patch can be thought of as a combination of independent sources (such as edges at particular orientations). An ICA approach was applied to natural images instead of task-specific views (e.g., from a single object) and developed orientation-selective features similar to those found in V1 cortex and to Gabor filters [8]. A similar result was found for the non-negative matrix factorization (NNMF) method [60], which models each image patch as a linear combination of sources, where no component's contribution can be negative. But as discussed in [66], a problem with the ICA and NNMF methods is that they treat each patch as a linear superposition (weighted sum) of the independent sources, which is not true. So they do not always extract the independent components [66]. It is more appropriate to use max instead of sum, since real data suggest that only one source (i.e., cause) explains each pixel. A formal treatment of the idea is found in Lucke, 2008 [66], who derived an approximate Hebbian learning rule for non-linear component extraction. Earlier, Weng [124] had shown how non-linear extraction could be done with LCA.

Such models are fundamentally trying to set their parameters so that they explain the observed data in the best way. Given a preselected model, the theory of maximum likelihood estimation gives an optimality framework for how to set the model's parameters in the best way to explain the data. However, actual maximum likelihood learning is difficult for visual feature extraction, due to the model selection problem, the large number of parameters, and local minima, which make gradient-based methods difficult and initialization-dependent.
From maximum likelihood theory, Hinton derived an approximate rule for following the gradient called contrastive divergence, which shows good performance [39]. Contrastive divergence derives V1-like filters when an additional sparse-firing criterion is introduced [69], yet the sparseness used has the same problem as ICA, since it interprets image patches as linear combinations of independent causes. LCA and Lucke's maximal-cause technique [66] both have maximum likelihood interpretations. In LCA, each neuron is meant to "latch on" to a single independent cause early in learning and, via winner-take-all, build a history of observations containing that cause. The resulting weight explains that history of observations in the best possible way, which likely reflects the core cause itself (the target). The other network instantiations do not consider this spatiotemporal optimality and thus require learning rates to be small so that neurons do not overshoot their targets. But this dramatically slows down the convergence, as shown by the comparison experiments between LCA and the other Hebbian learning networks here.

Chapter 4

Top-Down Connections for Semantic Self-Organization

The representation learning problem boils down to two criteria: 1. finding the features (neurons) that explain (i.e., represent) experience (data) in the best way, and 2. representing what is more important to the agent more fully than what is less important, given a limited resource. The first criterion was the subject of Chapter 3. This chapter presents a method for dealing with the second criterion.

Networks here utilize three layers, and perform a task of recognition from vision. The third layer is the motor layer, where the firing of neurons controls action. The visual sample is labeled by an action module that takes the motor layer's firing, selects the top-firing neuron, and produces the label associated with it. The motor layer itself is not necessarily WTA. These results apply generally to any feature layer with bottom-up and top-down excitatory inputs.

This chapter presents a method for utilizing the top-down connections to develop a biased compression (criterion 2 above). Coupled with lateral excitation, the method develops networks with modular connectivity. Modules contain a motor neuron and multiple feature neurons. The motor neuron and feature neurons project in an excitatory way almost exclusively within the module, while a few "hub" feature neurons have connections to multiple motor neurons (motor neurons can be considered as within-module hubs [15]). Using hardcoded isotropic lateral excitation, modules additionally become grouped, in that all feature neurons of a module are in a single connected location on the feature map. This was called topographic class grouping [64]. Using adaptive lateral excitation [62], modules again emerged, but the grouping was less prevalent and there were no "hub" feature neurons.

4.1 Motivation

It is a core challenge of an autonomous developmental system to automatically generate an efficient and effective internal representation from a raw data stream, given a limited resource. This means learning to compress the input so that the important input variation is given higher priority than the unimportant variation. Examples of variation in object recognition include translation, rotation, scale, lighting, etc. For a classification problem, within-class variations (perhaps lighting) are not as important as between-class variations (perhaps shape).
What is important to the agent is determined by its behavior, as encapsulated in the motor areas of the network. Agents with no experience are to be taught what to do in each case, through supervision and behavior imposed by a teacher.

The cortex does not compress in an unbiased way (e.g., purely reduce redundancy [5]), nor does it derive a compact coding where, for example, each object is represented by its own neuron. This type of specialization is too expensive (Ito and Gilbert, 1999 [45], p. 21). Instead, the coding and compression of information seems to be selective. Some information may be lost, but the important information for completing the task is likely to be retained. As stated by Barlow, "The best way to code information depends on the use that is to be made of it" (Barlow, 2001 [5]). Behavior and actions must bias the organization and coding of earlier sensory cortical areas. Feedback connections from later levels to earlier levels seem likely causes of such bias. There are just as many, if not more, feedback connections than feedforward connections (Ito and Gilbert, 1999 [45], p. 18).

4.2 Concepts and Theory

4.2.1 Three Layer Network

Figure 4.1: In this model, neurons are placed on different layers in a hierarchy, terminating at sensors at the bottom and at motors at the top. Each individual neuron has three types of input projections: bottom-up, lateral, and top-down.

The three-layer network architecture used for theory and experiments in this chapter is shown in Fig. 4.2. There are d pixels ("sensory neurons"), n feature neurons and m motor neurons. Let the number of classes in the input be equal to the number of motor neurons. Via lateral connections ...

Figure 4.2: The three-layer network structure. The internal layer 1 takes three types of input: bottom-up input from layer 0, top-down input from layer 2 and lateral input from the neurons in the same layer. The top-down input is considered delayed, as it is the layer 2 firing from the last time step. The "D" module represents this delay. A circle in a layer represents a neuron. For simplicity, this is a fully connected network: all the neurons in a layer take input from every neuron in the later layer and the earlier layer. For every neuron (white) in layer 1, its faraway neurons (red) in the same layer are inhibitory (they feed inhibitory signals to the white neuron) and its nearby neurons (green) in the same layer are excitatory (they feed excitatory signals to the white neuron). Neurons connected with inhibitory lateral connections compete, so that fewer neurons in layer 1 will win for firing (sparse coding [80]), which leads to sparse neuronal update (only the top-k neurons will fire and update). Figure courtesy of Juyang Weng.

Input. Although the networks operate incrementally and in an open-ended fashion, for simplicity consider a set of data. Let S be the supervised (training) set of stimuli, which contains input/output data, where the correct output is provided by a teacher.
Any input vector x_i is, for example, the raw pixel values of a digital image, and its corresponding output z_i is the motor vector selecting the correct label. For classification, any x_i is a member of one of m classes. The corresponding output vector z_i from the stimuli set is a vector of zeros except for a single one that denotes its class membership: (z_i(c_j) = 1) → (x_i ∈ class c_j). Each dimension of z is considered a distinct action.

Learning. A network uses the data in S to learn so that it is able to provide the correct action for a given test sample x ∉ S which is similar to the training data. In other words, given a case that was not taught by the teacher, the network should act in the correct way, where correctness is understood by the teacher.

Action. We utilize a hardcoded action-production module, which simply uses the index of the motor neuron with the largest firing rate: arg max_{1≤i≤m} z_i. It maps the label to a word that was used in training to teach the class.

Connection types. There are four connection types for a general network layer: bottom-up excitation, top-down excitation, lateral excitation, and lateral inhibition, as shown in Fig. 4.1. In general, neurons on any layer l are connected from the lower and higher layers through excitatory input connections (dendrites and synapses) and excitatory output connections (axons). These grow and are adjusted through learning. Neurons are connected to neurons on the same layer through inhibitory connections. In the model here, the inhibitory connections are neither weighted nor learned. They are handled through approximate methods, such as winner-take-all, which are computationally more efficient and easier to deal with. In this section, the lateral excitation is hardcoded and isotropic.

Output and Input Spaces. The firing rate vector is in the output space of our layer l, denoted Y. Let the layer l−1 closer to the sensors have an output space X and the layer l+1 closer to the motors have an output space Z. Given the diagram in Fig. 4.2, we can see that in our case the output spaces of the lowest and highest layers are the same as the sensory input and motor output spaces, respectively. If there are no top-down connections, the input space of layer l is X. On the other hand, if there are top-down connections to layer l, the input space contains all paired vectors from X × Z, containing both bottom-up and top-down input: X × Z = {(x, z) | x ∈ X, z ∈ Z}.

Figure 4.3: How bottom-up and top-down connections coincide in a fully connected network. Looking at one neuron, the fan-in weight vector deals with bottom-up sensitivity while the fan-out weight deals with top-down sensitivity. In the model presented here, the weights are shared within each two-way weight pair. So, a neuron on layer j+1 will have a bottom-up weight that is the same as the top-down weight of a neuron on layer j.

Weights. The bottom-up weights of the feature neurons are column vectors in the weight matrix V, which has n columns. A key aspect of this model is that the bottom-up weights to layer l+1 and the top-down weights to layer l are shared (see Fig. 4.3). Let the bottom-up weight matrix to layer l+1 be W; then the top-down weight matrix to layer l is M = W^T.
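As a purely illustrative rendering of this structure, the sketch below sets up the three layers and the shared weight pair W and M = W^T; the sizes and random initialization are arbitrary choices, not the ones used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, m = 784, 400, 8          # arbitrary sizes: pixels, feature neurons, motor neurons

# Bottom-up weights into the feature layer: one column per feature neuron.
V = rng.random((d, n))
V /= np.linalg.norm(V, axis=0)          # keep each weight vector unit length

# Bottom-up weights from the feature layer to the motor layer.
W = rng.random((n, m))
W /= np.linalg.norm(W, axis=0)

# Top-down weights to the feature layer are shared with W (M = W^T),
# so each feature neuron's top-down weight vector is a row of W.
M = W.T

x = rng.random(d)              # a bottom-up input (e.g., an image)
z = np.zeros(m); z[3] = 1.0    # an imposed motor (class) vector

bottom_up = V.T @ x            # bottom-up pre-responses of the n feature neurons
top_down = M.T @ z             # top-down pre-responses from the imposed motor
print(bottom_up.shape, top_down.shape)   # both (n,)
```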
4.2.2 Relevant Input is Correlated with Abstract Context

An extraordinary amount of the information we experience can be considered irrelevant. Knowledge of what is relevant and what is irrelevant is especially important during learning. This model proposes that relevance in the input can be found by observing what is correlated with imposed action. For example, if a child sees the shape of the letter "A" on a flashcard, on television, on a sign, etc., and a teacher gets the child to speak "A" in each case, then the relevant information (the shape of the letter) does not change much over the different inputs, but other information (the surrounding visual scene) changes a lot.

For the network described here, let the image input space X be made up of a relevant subspace R and an irrelevant subspace I, where relevance is in terms of the data class labels: X = I × R (see Fig. 4.4). Projecting samples onto the irrelevant subspace will not aid in class discrimination, but projecting onto the relevant subspace will. Including the higher-layer (more abstract) context space, the relevant subspace is defined as R × Z:

X × Z = (I × R) × Z = I × (R × Z).

Let an abstract-boosted input vector p to layer l be a combination of the outputs of layers l−1 and l+1: p = (x, z). We wish to use the higher-layer part to uncover the relevant subspace R. Self-organization of neurons in the space R × Z will lead to discriminant features (neurons), through the use of this abstract-boosted input.

Figure 4.4: Self-organization with motor-boosted distances leads to partitions that separate the classes better. (a) There are two bottom-up dimensions x_1 and x_2. Samples falling in the blue area are from one class and those falling in the red area are from another class (assume uniform densities). The "relevant" and "irrelevant" dimensions are shown by the upper-right axes, which are here linear. (b) The effect of self-organization using nine neurons in the bottom-up space only. Observe from the resulting partitions that the firing class entropy of the neurons will be high, meaning they are more class-mixed. (c) Boosting the data with motor information, which here is shown as a single extra dimension instead of two (for visualization). (d) The effect of self-organization in the boosted space, embedded back into two dimensions. Note how the partition boundaries now line up with the class boundaries and how the data that falls into a given partition is mostly from the same class (low entropy).

4.2.3 Null Space

Linear Discriminant Analysis (LDA) defines relevant information as the projection of the data onto the linear subspace that can be used to most accurately classify the data, compared with any other linear subspace of the same dimension.

For the random input vector p, the within-class scatter defines the average covariance of each class's samples:

S_W = Σ_{i=1}^{m} Σ_{p ∈ C_i} (p − μ_i)^T (p − μ_i),   (4.1)

where μ_i = E[p | p ∈ C_i] indicates the within-class mean for class i. The between-class scatter defines the covariance of the class means:

S_B = Σ_{i=1}^{m} (μ_i − μ)^T (μ_i − μ),   (4.2)

where μ = E[p] is the mean of p.

LDA theory states that the best subspace for linear classification is spanned by the eigenvectors associated with the largest eigenvalues of S_W^{-1} S_B. The most discriminant feature is the first eigenvector, which maximizes (v^T S_B v)/(v^T S_W v). However, there are many practical difficulties involved in computing this given high-dimensional data [97]. Instead of dealing with the above expression directly, some researchers advocated projecting the between-class scatter information into the null space of S_W and deriving features in this space [19].
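For concreteness, here is a small sketch that computes the two scatter matrices of Eqs. (4.1) and (4.2) from a labeled sample set; the data are random stand-ins, and the matrices are formed with outer products, the usual matrix convention for scatter.

```python
import numpy as np

def scatter_matrices(P, labels):
    """Within-class scatter S_W and between-class scatter S_B (Eqs. 4.1-4.2)."""
    mu = P.mean(axis=0)
    dim = P.shape[1]
    S_W = np.zeros((dim, dim))
    S_B = np.zeros((dim, dim))
    for c in np.unique(labels):
        Pc = P[labels == c]
        mu_c = Pc.mean(axis=0)
        diff = Pc - mu_c
        S_W += diff.T @ diff                      # scatter around the class mean
        S_B += np.outer(mu_c - mu, mu_c - mu)     # scatter of the class means
    return S_W, S_B

# Toy data: 2-D samples from two classes (values are arbitrary).
rng = np.random.default_rng(0)
P = np.vstack([rng.normal([0, 0], 0.1, (20, 2)),
               rng.normal([1, 1], 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
S_W, S_B = scatter_matrices(P, labels)
print(np.trace(S_W), np.trace(S_B))   # small within-class, large between-class scatter
```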
The null space of the within-class scatter is very powerful. Intuitively, if a direction v exists where S_W v = 0 and S_B v ≠ 0, perfect classification can be attained using this v. The null space may not exist non-trivially for the original data. As formulated above, the top-down part of the abstract-boosted data is in the null space of the within-class scatter matrix. If we let the matrix Z be an orthogonal basis of layer l+1's output space Z, it can be seen that S_W Z = 0. Additionally, if layer l+1 is a classification motor layer, where each dimension represents a different class and each sample has only one class, then S_B Z ≠ 0. The space Z is orthogonal to the subspace defined by the within-class scatter, and can trivially be used for perfect classification of the training data.

There are two powerful properties associated with X × Z for a classification motor, which should be apparent:

Property 4.2.1. Class Separateness: Since each motor dimension is associated with a different class, the distributions of any two classes are guaranteed to be separated in X × Z.

Property 4.2.2. Similarity Bias: Any two samples from different classes have a greater distance between one another in X × Z, but any two samples from the same class have the same distance in X × Z as in X.

We will exploit these properties for a biased self-organization (here I focus on classification; Solgi has derived results for the general regression case [94]). After learning, a non-imposed sample is not in X × Z, but is in X. We wish to achieve a good self-organization using X × Z so that performance holds up when Z is no longer available.

4.2.4 Weighted Semantic Similarity

It is useful to be able to control the relative influence of bottom-up and top-down input. When a layer uses the paired input p = (x, z), the influence of the motors may not be sufficient, since the dimension of the input is typically large (e.g., a 40 row by 40 column digital image gives 1600 dimensions) and the dimension of the output is typically not (e.g., 10 classes). Instead, normalize each input source and control the relative influence by

p = (α x/||x||, β z/||z||),

where it is expected that α + β = 1. Setting α = β = 0.5 gives the bottom-up and top-down spaces equal influence. Raising β will increase the between-class scatter by increasing the distance between classes in X × Z, as seen in Fig. 4.4. The normalization places each class distribution on its own unit sphere (when the dimension of layer l−1 is more than two), separate from other classes in X × Z. Normalization constrains the maximum distance between classes in X × Z. It also reduces the effect of high-varying dimensions within each class.

Bottom-up and lateral information-based similarity is purely physical and is non-semantic. Top-down-based similarity is based on more abstract information. If a network is shown images of a bench and a desk chair, and each time the teacher says "chair, chair", then by the above, α and β can be set so that the inner-product angle difference between the two images is very low, even zero (if α = 0 and β = 1).
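The sketch below shows one way to form the normalized, α/β-weighted match between an input pair (x, z) and one neuron's bottom-up and top-down weight vectors, mirroring the weighting idea above (and the pre-competitive response used later in Eq. (4.11)); all vectors and sizes are invented for the example.

```python
import numpy as np

def boosted_response(x, z, v, w, alpha=0.5, beta=0.5):
    """Weighted bottom-up plus top-down match for one feature neuron.

    x, v : bottom-up input and the neuron's bottom-up weight vector
    z, w : top-down (motor) input and the neuron's top-down weight vector
    alpha, beta : relative influence of the two sources (alpha + beta = 1)
    """
    unit = lambda a: a / (np.linalg.norm(a) + 1e-12)
    return alpha * np.dot(unit(x), unit(v)) + beta * np.dot(unit(z), unit(w))

rng = np.random.default_rng(0)
x = rng.random(1600)               # e.g., a 40 x 40 image, flattened
z = np.zeros(10); z[2] = 1.0       # imposed motor vector for class 2
v = rng.random(1600)               # example bottom-up weight vector
w = np.zeros(10); w[2] = 1.0       # this neuron is linked to class 2

print(boosted_response(x, z, v, w))                       # equal influence
print(boosted_response(x, z, v, w, alpha=0.3, beta=0.7))  # stronger top-down bias
```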
4.2.5 Smoothness

Figure 4.5: A layer-one weight vector, shown as an image among its neighbors' weight vectors, for a neuron exhibiting "harmful" interpolation through 3 x 3 updating.

Some sort of regularization is necessary. Without it, the placement of neuron feature vectors may not correspond well to the stimulus density. One method is to use the idea of neighborhoods and neighborhood updating, from the self-organizing maps [55]. This smoothness approximates lateral excitatory connections, which are denser closer to the neuron of origin. We use 3 x 3 updating, meaning that the winner neuron will update its weights and so will the neurons adjacent to it. It is beneficial for most of the neighborhood pulling to be within classes. Smoothness is useful for generalization, but this isotropic type could be harmful. Consider a neuron with two neighbors that fire often. This neuron is pulled in X × Z by both of its neighbors, without regard to what it actually represents. Pulling places it in between what the neighbor neurons represent. This could be useful if they represent a single class, as it would average into a variation that might generalize well for that class. But if they represent different classes, the averaging effect may lead to a representation of something that would not actually ever be experienced (see Fig. 4.5). Topographic class grouping, discussed later, will emerge via 3 x 3 updating along with abstract-boosted distance. That leads to helpful interpolation within the class groups, while neurons in different class groups are protected from one another by border neurons. This will be discussed in more detail in Section 4.4.

4.3 Algorithm

Here is presented the algorithm for incremental motor-boosted self-organization for a three-layer network. This version handles each sample separately. It is linear time complexity in the number of neurons. LCA is used on each layer for self-organization, as follows:

(y, v, o, a(1)) ← f_LCA(...)

...

• If ... > 1 − ε and ... > 1 − ε (for some small ε), neuron i is linked to class j. A neuron can only be linked to one class. Linked neurons became so since they mostly won for only a single class.
• A neuron with an updating age of zero (it has never won) is unassociated.
• Otherwise, neuron i is a class-mixed border neuron: border neurons have updated for samples from multiple classes, but not enough to link to any class.

Now define the TCG property:

Property 4.4.1. Topographic Class Grouping. A network has this property if there is a path from every neuron linked to a class to any other neuron linked to the same class, and there is at least one neuron linked to every class. A path is a sequence of neurons in which consecutive neurons are adjacent.

Lemma 4.4.2. If there is at least one linked neuron for the current sample's class, β can be set so that no unassociated neuron can win.

The pre-competitive response computation for any neuron i is

ŷ_i = α (x · v_i)/(||x|| ||v_i||) + β (z · w_i)/(||z|| ||w_i||).   (4.11)

The top-down part y_{t,i} = β (z · w_i)/(||z|| ||w_i||) will be close to β for a neuron linked to the current training class. For unassociated neurons and neurons linked to other classes, y_{t,i} = 0. If β > α + c, then a neuron that is linked to the currently imposed class will always have a higher pre-competitive potential than neurons that are linked to other classes.

We can establish TCG initially (in a somewhat restrictive manner) by the following:

Base Case: Given the first n samples from n different classes, if the first n winners are unassociated, the network has TCG. After a neuron wins for the first time, it and its neighbors are linked to the current class (see the algorithm); they also update their bottom-up weights. A simple way to ensure the first n winners are different is to train from the initialization set, using a single sample from each class. Then, all the winners will be different if the first n samples are different in X.

Here is more detail about what happens for the first sample from any class c_i:

1. The first sample from a class c_i is imposed and the appropriate motor neuron i is imposed for the first time.
2. The motor neuron $c_i$ updates the appropriate column of $W$ with a learning rate of one, thus becoming sensitive to the feature layer's firing pattern $\mathbf{y}$ exactly. This firing pattern has $k_1$ neurons and their neighbor neurons firing (nonzero). Let $k_1 = 1$, so there will be only one 3 x 3 group of neurons.

3. The top-down matrix $M$ is updated based on $W$. Then the neurons that fired in the feature layer become the only ones with a nonzero top-down weight from motor neuron i. This establishes the initial group for $C_i$.

Hypothesis: Assume we have a network with TCG.

Step: Now, a new sample is input from an arbitrary class. TCG will not be violated after the update if the winner is a neuron linked to the current class, as long as any growing group does not bisect an existing group (see Fig. 4.8). This is so since the linked winner cannot convert any neighbor into a neuron linked to a different class. How can we ensure the winner is always a linked neuron? By Lemma 4.4.2, we can set $\beta$ so that no unassociated neurons will win. As for the border neurons, it is not possible to guarantee this over all data distributions. From LCA theory, we know a border neuron is an average of its response-weighted input conditioned on its firing. For sensible data, a border neuron represents a mixture of different classes and should lie in a very low-density area, thus not getting much bottom-up support.

Groups will spread and grow since neighbor updating does not pull the neurons all the way to the winner. The pulled neighbors end up somewhere else in the density. Figures 4.6 and 4.7 illustrate the incremental grouping and growing process with simple class areas in 2D and a 1D neuronal array.

Figure 4.6: Topographic class grouping with a 1D neuron array in 2D input space. The red area contains samples from class one, and the blue area contains samples from class two. The 10 neurons' bottom-up vectors are drawn as circles or squares. Their top-down membership is shown by color or shape: gray neurons are unassociated, black neurons are linked to class two, and white neurons are linked to class one. Square neurons are border neurons. To better understand how TCG emerges, we provide the following four cases. (a) After initialization, all neurons are unassociated. Only three neurons are drawn, to show they are neighbors. Other neighbor connections are not shown yet, for clarity. (b) Neuron N1 has won for a nearby sample and becomes linked to class two. Its neighbors are pulled towards it and also link to class two. Note how N3 is actually pulled into the between-class "chasm", and not onto the class distribution.

Figure 4.7: Topographic class grouping with a 1D neuron array in 2D input space. This is a continuation of the last figure. (c) Over wins by N1, N2 and N5, self-organization occurred through neighbor pulling. N4 has become a border neuron, and N6 is starting to be pulled. (d) A final organization, uncovering the relevant dimension. For this data, the relevant dimension is not linear: it runs through the center area of each class distribution, lengthwise. The final neuron organization mirrors this, and the neurons have organized along the relevant dimension.

TCG is not guaranteed to emerge, even under the conditions above. A bisection is seen in Fig. 4.8. It should be apparent that the extra degree of freedom allows a growing class group to break an already existing one in two. But this might only happen if one class is trained at a time.
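The TCG property itself (Property 4.4.1) is easy to check on a finished map. The sketch below is an illustrative assumption of how such a check could be coded: neurons linked to the same class must form a single connected group on the 2D neuronal plane, with 8-connectivity standing in for the 3 x 3 neighborhoods used in the text. None of this is code from the dissertation.

import numpy as np

def has_tcg(link_map):
    # link_map[r, c] holds the linked class index of the neuron at (r, c),
    # or -1 for unassociated and border neurons.
    link_map = np.asarray(link_map)
    classes = [int(c) for c in np.unique(link_map) if c >= 0]
    if not classes:
        return False                      # at least one class must be linked
    for c in classes:
        cells = set(zip(*np.nonzero(link_map == c)))
        start = next(iter(cells))
        seen, stack = {start}, [start]
        while stack:                      # flood fill within this class group
            r, q = stack.pop()
            for dr in (-1, 0, 1):
                for dq in (-1, 0, 1):
                    nb = (r + dr, q + dq)
                    if nb in cells and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        if len(seen) != len(cells):       # the class group is split: no TCG
            return False
    return True

grid = np.array([[0,  0, -1, 1],
                 [0,  0, -1, 1],
                 [-1, -1, -1, 1]])
print(has_tcg(grid))                      # True: each class forms one group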
Figure 4.8: This shows why TCG cannot be guaranteed in 2D in general. (a) After much sampling of class one, and none of class two, the class-one area has grown. (b) Further, after much sampling of class two and none more of class one, the class-two area grows and happens to cut through the class-one area (leaving two buffer zones). Here, TCG did not emerge.

4.5 Experiments

It is proposed that top-down connections from a motor layer to a feature layer cause grouped class areas and more class-specific features on the feature layer, leading to higher recognition rates overall compared with a network not using top-down connections. To test this, we developed two types of networks. In the first type, the feature layer developed and utilized both bottom-up and top-down connections. The second network type utilized only bottom-up connections. For a fair comparison, top-down connections were disabled in the testing phase for both network types tested. Additionally, the weights were frozen in testing.

4.5.1 MNIST Handwritten Digits

The network is meant to take natural images and other real-world modalities as input. But to study and show the effects of the discussed properties more clearly, the MNIST database of handwritten digits was used (available at http://yann.lecun.com/exdb/mnist/). This well-known dataset of 70,000 total images (60,000 training, 10,000 testing) contains 10 classes of handwritten digits, from 0 to 9. Each image is composed of 28 x 28 = 784 pixel intensity values. The pixels that correspond to the digit have nonzero intensities; the background pixels equal zero (black). All images have already been translation-normalized, so that each digit resides in the center of the image. No other preprocessing was done.

The greedy, "permutation invariant" [39] method used here is not expected to top the best performance on this data. In its current form, the network evaluates each digit globally: similarity is based on the positions of the non-black pixels. This makes the network very susceptible to variations in translation, rotation, and scale. A localized, windowed method, and/or a method that trains via error-gradient following, is probably necessary. Other methods (e.g., [58], with local analysis and deformations) are better suited to the digit recognition problem.

Results

Figure 4.9: MNIST data represented in a network with 40 x 40 neurons, trained without top-down connections. (Upper panel: initial connectivity; lower panel: connectivity after development.)

Fig. 4.9 shows the layer-1 neurons in a 40 x 40 grid trained with all training samples, but not using top-down input. There are areas which can be seen to correspond to a particular class, but there is no guarantee that the area of a single class is connected in the topographic map. The lower part of Fig. 4.9 shows the development of the bottom-up weights of the motor neuron corresponding to the digit-1 class. The darker the intensity, the larger the weight. Due to the sparseness parameter k = 1, the inter-level pathways have become sparse and refined (most connections have been pruned). As shown, invariance is achieved by the positive weights from the corresponding digit "1" neurons to the corresponding output neuron. Therefore, the within-class invariance shown at the output layer can be learned from multiple regions in the previous layer.
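This pooling effect can be illustrated with a toy computation: a motor neuron's bottom-up weight, maintained as an incremental average of sparse (k = 1) layer-1 firing patterns, ends up positive on every layer-1 neuron that ever won for that class. The plain 1/t learning-rate schedule below is a simplification of the CCI plasticity actually used, and all names and index values are illustrative assumptions.

import numpy as np

n_features = 100                    # layer-1 neurons (a 10 x 10 grid, say)
motor_w = np.zeros(n_features)      # bottom-up weight of one motor neuron

# With k = 1 sparsity, each training sample activates a single layer-1
# winner; different "1" digits can win at different neurons.
winners_for_class_1 = [12, 13, 22, 12, 23, 13, 12, 22]   # illustrative indices
for t, win in enumerate(winners_for_class_1, start=1):
    y = np.zeros(n_features)
    y[win] = 1.0
    lr = 1.0 / t                    # simplified incremental-average learning rate
    motor_w = (1 - lr) * motor_w + lr * y

# The motor weight is positive on every layer-1 neuron that ever won for
# this class: invariance by pooling multiple regions of the feature layer.
print(np.nonzero(motor_w)[0])       # -> [12 13 22 23]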
The TCG algorithm is quite powerful in its class separation ability. Single groups per class result even for very similar bottom-up inputs, such as the handwritten digits "4" and "9", as seen in Fig. 4.10. As summarized in Table 4.1, when top-down projections are used the error rate is significantly lower (21.3% down to 7.7%) for this limited set of 10 x 10 neurons. In the large-scale tests, all ten classes are used, and the error rate consistently decreased when top-down projections were used, at grid sizes of 10 x 10, 20 x 20 and 40 x 40.

Figure 4.11(a) shows the result for all 10 classes. Each class takes enough grid space within which to explore its own within-class variations with less interference from other classes. Neurons are more "pure" when top-down supervision is used. Purity is measured by entropy (see Appendix). Figure 4.12 part (b) shows the corresponding per-class probability maps for the result of part (a), and part (c) shows the probability maps for a result when $\beta = 0$.

Table 4.1 summarizes results for all tests. Top-down connections led to better performance for the same-size map compared with when no top-down was used. The best overall performance on the data was for a 100 x 100 map with $\beta = 0.3$, which reached 2.97% error. This is on the same level as Hinton's 2.49%, for the contrastive divergence technique [39]. It is a fair comparison since both techniques are greedy and permutation invariant. Using the gradient to fine-tune the representation later is possible (as Hinton did).

Figure 4.10: The handwritten digits "4" and "9" from the MNIST digit database [58] are very similar from the bottom-up. (a) Result after self-organizing using no motor-boosting. The 100 weight vectors are viewed as images below, and the layer-2 weight vectors for the two motor neurons are shown above. White means a stronger weight. The organization is class-mixed. (b) After self-organization using motor-boosted distance (weight of 0.3). Each class is individually grouped in the feature layer, and the averaging of each feature will be within the same class.

Figure 4.11: (a) Weights after development of a 40 x 40 grid when top-down connections were used. The map has organized so that each class is located within a specific area.

Table 4.1: Summary of results of training networks on MNIST data. Each result is averaged over 5 trials.

Figure 4.12: This corresponds to the last figure. (b) Probability of each neuron to signify a certain class, from previous updates, when top-down is strong ($\beta = 0.3$, corresponding to (a)). (c) Probability maps for a different test when $\beta = 0$. Panels are labeled by digit class, "0" through "9".

4.5.2 MSU-25 Objects

For the dataset, 25 toy objects were selected (see Fig. 4.13). To collect the images, each object was placed on a rotatable surface a few feet in front of a camera. The surface was rotated at a slow rate, and the camera captured images sequentially and automatically, while the operator ensured that the object appeared roughly in the center of the image. 200 images of 56 rows and 56 columns were taken in sequence for each object. At the operator's rate of rotation, the 200 images covered about two complete rotations of 360 degrees for each object. The capture process was intentionally not too controlled, so an object varies slightly in position and size throughout its sequence. The images were taken indoors, under fluorescent lighting, and an umbrella was used to reduce specularity on the objects' surfaces. The background was controlled by placing a sheet of gray fabric of uniform color behind the objects. (Due to automatic adjustment of the overall image intensity by the camera's capture software, background color normalization had to be done later; a program was written to do this.)

Figure 4.13: Sample(s) from each of the 25 object classes, also showing some rotation.

In the experiments, the training images were 56 x 56 and grayscale.
Including an additional "empty" (no object) class, there were 200 x 25 + 1 = 5001 images in total. Every fifth image in each sequence was set aside for testing, so the other 80% were used to train the networks. Grayscale was used to avoid relying on color, which would otherwise be very useful for discriminating these classes. Five top-down-enabled and five top-down-disabled networks were trained for each of the sizes 20 x 20, 30 x 30, and 40 x 40. (For the larger (40 x 40 and greater) supervised networks, we scaled up in resolution to avoid corner neurons that would not get updated otherwise.) Training involved random sample selection over 50,000 training samples. The first network type used excitatory top-down connections ($\beta = 0.3$), while the second type did not ($\beta = 0$).

Results

Table 4.2: Error results for MSU objects, averaged over 5 trials.

Neural plane size | Avg. error (no top-down) | Avg. error (with top-down) | Error reduction
20 x 20 | 8.13% | 3.03% | 62.7%
30 x 30 | 2.62% | 0.83% | 68.3%
40 x 40 | 0.63% | 0.33% | 47.6%

A neuronal population's TCG can be observed via visualization. However, for reporting purposes, TCG should be measured in some other way. The within-class scatter of neuron responses on the 2D neuronal plane can provide such a measurement. Once a network was developed, a set of testing stimuli was used as input, and the position of the top-one neuron was recorded for each stimulus. The within-class scatter of firing positions for each stimulus class, averaged over all stimulus classes, measures how condensed the neuron responses were upon the neuronal plane. See Fig. 4.16. The class-response scatter $w$ is the trace of the within-class scatter matrix, normalized for map size: $w = \mathrm{Tr}(S_W)/\sqrt{n}$. Using $\mathrm{Tr}(A) = \mathrm{Tr}(BAB^{-1})$, we can see this measure is invariant to rotation of the 2D map.

Figure 4.14: 25 objects. Developed using top-down connections.

Figure 4.15: Result for 25 objects viewed from a full range of horizontal 360 degrees. Developed without using top-down connections: the maximum class firing probabilities of the 40 x 40 neurons of layer 1.

Table 4.3: Feature quality and grouping results for the experiments with 25 objects.

Top-down | Neural grid size | Developmental entropy [0-1] | Class-response scatter [0-1]

Figure 4.16: (a) Images presented to a trained network to measure the class-response scatter. (b) Bottom-up weight to the neuron representing this class ("turtle"); this network was top-down-disabled. (c) Top responding neuron positions for each of these samples for the unsupervised network. (d) Layer-two weight for a top-down-enabled network. (e) Top responding neuron positions for the supervised network.

Error results are presented in Table 4.2 and grouping results in Table 4.3. We tried different values of $k$ from 1 to 10 for testing and reported the best results. The class and view angle could ideally be represented by the maps so that each neuron is responsible for a single class over a set of angles from about 5° to 25°, from the largest (n = 1600) to the smallest (n = 400) map sizes.
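The class-response scatter measure just defined can be computed in a few lines. The sketch below is an illustrative assumption of how it could be done from recorded top-1 neuron positions; the exact normalization used in the dissertation may differ, and the $\sqrt{n}$ factor follows the text above.

import numpy as np

def class_response_scatter(winner_positions, labels, grid_side):
    # Within-class scatter of top-1 neuron positions on the 2D neuronal
    # plane, averaged over classes and normalized for map size.
    winner_positions = np.asarray(winner_positions, dtype=float)  # shape (N, 2)
    labels = np.asarray(labels)
    scatters = []
    for c in np.unique(labels):
        pts = winner_positions[labels == c]
        mean = pts.mean(axis=0)
        S_w = (pts - mean).T @ (pts - mean) / len(pts)   # 2x2 within-class scatter
        scatters.append(np.trace(S_w))
    n = grid_side * grid_side
    return np.mean(scatters) / np.sqrt(n)

# Example: responses tightly grouped for class 0, spread out for class 1.
pos = [(3, 3), (3, 4), (4, 3), (0, 0), (10, 19), (19, 2)]
lab = [0, 0, 0, 1, 1, 1]
print(class_response_scatter(pos, lab, grid_side=20))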
The results show that the supervised networks develop to utilize the same amount of available resource better, as shown by the lower error rates. Especially notable differences are seen when the number of neurons is smaller. The per-neuron entropy is also lower, meaning the neurons are more "pure", representing samples within a single class. The class-response scatter was significantly lower in the networks that used top-down connections. All examined top-down-enabled networks showed TCG. Figures 4.14 and 4.15 show that topographic class grouping occurred in the network developed with top-down connections, but not in the network without.

As the major within-class variation in this experiment, viewing angle is disregarded by the discriminant features. That is, the cortical region of each object is invariant to its viewing-angle variation. It is important to note that disregarding viewing angle is not a mechanism built into the network through programming; such internal invariance is an emergent property of the network. Thus, other variations, such as lighting, vertical viewing angle, and deformation, should be applicable as well, but further, more extensive experiments are needed to quantitatively study the effects.

4.5.3 NORB Objects

The normalized-centered NORB dataset [59] is one of several publicly available 3D object recognition datasets (http://cs.nyu.edu/~ylclab/data/norb-v1.0/). It contains binocular image pairs of five classes of objects (four-legged animal, human figure, airplane, truck, car), with 10 different actual objects belonging to each class.

Figure 4.17: NORB objects. Each row is a different category. Examples of training samples are shown on the left, and testing samples (not seen in training) are shown on the right. Figure from [59].

The 5 training objects per class and the 5 testing objects per class are disjoint. The small set (used here) has 24,000 training images and 24,000 testing images, over a uniform background. The dimension of each training sample is 96 x 96 x 2 = 18,432. The images vary in terms of rotation (0° to 340° in 18° increments), elevation (30° to 70° in 5° increments), and lighting (6 different conditions). Recognition must be done based on shape, since all objects have roughly the same texture and color, and as the objects rotate, many of the appearances change significantly. The NORB classes are more general than those used in the 25-objects case, so they are tougher to distinguish, but there are fewer classes.

We tried top-down-disabled and top-down-enabled networks of size 40 x 40 and 60 x 60. We also tried using wraparound grids on supervised networks.

Results

Table 4.4: Error results using the normalized-centered NORB data.

Method | Resource | Disjoint test error
K-NN + L2 [59] | 24000 | 18.4%
Proposed net. | 1600 | 26.5% (no TCG), 17.68% (TCG), difference 8.82%
Proposed net. | 3600 | 26.2% (no TCG), 15.7% (TCG), difference 10.5%

Table 4.5: Grouping results using the NORB 5-class dataset.

Top-down | Edge wrap | Neural plane size | Entropy | Scatter
N | N | 40 x 40 | 0.5 | 0.41
Y | N | 40 x 40 | 0.13 | 0.25
Y | Y | 40 x 40 | 0.07 | 0.33
N | N | 60 x 60 | 0.49 | 0.42
Y | N | 60 x 60 | 0.11 | 0.22
Y | Y | 60 x 60 | 0.09 | 0.33

Results are presented in Table 4.4, which gives the best results at the different sizes, and Table 4.5, which summarizes the grouping metrics. Figures 4.18 and 4.19 show some visualization results. A very significant error difference, of up to 10.5%, is observed when comparing networks that used top-down connections to those that did not.
The top-down-enabled networks exhibit purer neuron representations and lower within-class scatter measures, and they show TCG.

Figure 4.18: 2D class maps for a 40 x 40 neural grid after training with the NORB data. At each neuron position, a color indicates the largest outgoing weight in terms of class output. There are five classes, so there are five neurons and five colors. (a) $\beta = 0$. (b) $\beta = 0.3$. These experiments used wraparound neighborhoods.

Using the NORB dataset allows comparison. Our method compares favorably with other methods that deal with the input monolithically; see [59] for more details. With top-down connections, our method outperforms K-nearest neighbor. This shows the power of within-class regularization. Nearest neighbor requires all 24,000 training samples to be stored, while the top-down-enabled networks only used 1,600 and 3,600 neurons.

Figure 4.19: 2D entropy maps for the Fig. 4.18 experiments. High entropy means prototypes lie between class manifolds, which can lead to error. A whiter color means a higher entropy. (a) $\beta = 0$. (b) $\beta = 0.3$. Note that the high-entropy neurons shown here coincide with the class group borders shown in Fig. 4.18.

However, the network trained with top-down connections disabled showed significantly worse performance. SVM had to use significantly subsampled data (it was too slow to train with the original high dimensionality), with which it performs slightly better. The convolutional networks (6.5%) achieve a significantly better error rate, but recall that convolutional networks utilize local analysis, and a fair comparison here is only with other monolithic methods. Our method is potentially more scalable than any of the other methods, due to its linear complexity, and new classes can potentially be added on the fly. Each could be learned quickly, due to resource having been used for function approximation instead of for discrimination only.

4.6 Summary

The work reported here described a method in which top-down connections lead to Topographic Class Grouping (TCG). Further, the work explains why TCG leads to significantly lower error rates. Samples from different classes are far apart in the top-down-connection-boosted input space, allowing class groups to grow out of smaller initial groups. The lateral excitatory pulling occurs upon class manifolds, instead of between them, reducing errors.

On the MSU 25-object data set, these networks self-organized to represent the objects over many different views using a small resource. It was shown that top-down connections led to networks that classified with lower error and whose responses for each stimulus class were grouped. With the NORB data set, the method achieved a better result than K-NN using only 7% of the resource. The computational cost is linear in both the number of neurons and the input and output dimensionality, in both training and testing. This method using top-down connections succeeds in object recognition and is potentially scalable to more complex real-world problems. It may contribute to a better understanding of how developmental systems can autonomously develop internal representation that is useful for desired behavior.

4.7 Adaptive Lateral Excitation

Earlier, the SOM-inspired idea of isotropic updating simulated lateral excitatory connectivity. The updating was done in a 3 x 3 region around each winner neuron.
The section investigates the performance effects of adaptive excitatory connections, and describes how they can lead to both smoothness and precision, in conjunction with top-down connections. The performance effects of adaptive lateral connections coupled with top-down connections have not been studied, especially in comparison to the other types of lateral excitation. It is pr0posed that a major use in terms of performance of adaptive excitatory lateral connectivity and top-down connectivity is to develop abstract representative modules — statistically correlated firing groups. Modularity emerges due to the adaptivity of the local connections — as the correlations between groups decrease, the connection strengths also decrease. This serves to decrease the interference between different firing groups. 84 4.7 .1 Motivation Due to the pressure of evolution, the brains of organisms need to self-organize at different scales during different developmental stages. In early stages, the brain must organize globally (e.g., large cortical areas) to form “smooth” representation that is critical for superior generalization with its limited connections. At later stages, the brain must fine tune its organization for precision and modularity. Yet, lateral connectivity cannot just reduce over time (to be eventually cut off). Certain spatial structure is correlated and this correlation information should be retained. It is assumed that illusory contour detection occurs due to lateral connections, for example. But “smoothness” and “precision” are two conflicting criteria, and it seems two different organizing mechanisms are needed. Lateral connectivity is also direction that has yet to be fully explored in terms of performance and in conjunction with these top-down connections. It alone has been handled in several different ways in various models. The self-organizing maps [55] utilize an isotropic updating function with a scheduled scope. Based on SOM, the important work of LISSOM [73] used explicitly modeled lateral connections (as weight vectors) of both excitatory and inhibitory types. The scope of the excitatory weights was smaller than the inhibitory, and the excitatory scope and learning rates adapted throughout learning. The excitatory lateral connections helped lead to organization (nearby neurons represent similar or identical features), while major effects of lateral inhibition was to encourage development of different features and to decorrelate the output - leading to a sparse response8. But LISSOM did not utilize a “motor” layer, and thus was not tested for performance with real-world 8The “winner-takeall” method used in SOM can also be considered as a form of lateral inhibition and it has the same effect - different features are developed and the output becomes sparse (an extreme case — taking the winner neuron alone to completely represent the input). Taking the “top-k” winners (neurons with the k-largest responses to the stimulus) has a similar, but more relaxed, effect. 85 engineering problems. We investigated [62] the use of adaptive lateral connections of excitatory type, such as those used in LISSOM. 4.7.2 Smoothness vs. Precision Using the idea of neighborhood updating from SOM leads to a topographic LCA, where a winning neuron’s neighbors are updated, with their response weighted by dis- tance. This achieves some amount of topographic organization and cortical smooth- ness, depending on how the learning rate and scope of the neighborhood function are tuned. 
For high-dimensional real-world data, this method possibly introduces a new problem. In this method, the updating equation for neighbor neurons is changed to

$\mathbf{v}_j(t) = w_1\,\mathbf{v}_j(t-1) + w_2\,h(n_{i,j}, t)\,\mathbf{x}(t-1), \qquad (4.12)$

where neuron $i$ is a winner and $n_{i,j}$ is the distance from neuron $i$'s 2D position to the position of neuron $j$. The kernel function $h$ defines the neighbor updating strength.

The possible issue with Eq. 4.12 is that a non-winner neuron's response $y_j$ does not depend on its combined bottom-up and top-down weight vector $\mathbf{v}_j$. Instead, it is simply a function of distance from the winners. This means neurons can fire and update for stimuli that they do not represent well themselves. This purely neighbor-based updating leads to problems for efficiently representing real data.

Real-world, high-dimensional data (e.g., raw pixels from a digital camera) is typically sparsely distributed: there will be large areas in the input space where a stimulus is extremely unlikely. (Consider averaging two images of two different objects; this type of "ghosting" effect, seen in Fig. 4.5, is not typically seen in reality.) And, at least with vision, the input space tends to have multiple disconnected areas where stimuli are probable. Using the SOM-style method of updating, with any type of tuned learning rate, leads to neurons representing areas of low probability between multiple high-probability areas, since they are "pulled" by their neighbors closer to each of the separate high-probability areas. This phenomenon is well documented in [20].

This phenomenon is problematic for several reasons. First, the approximation of the probability density by response is poorer, since less resource is used to represent regions where the data actually lies. Second, the neurons in low-probability areas do not send meaningful messages to the next layer when they fire. Since they are "between" several different high-probability areas, each presumably with different meaning, their firing does not send the next level an unambiguous message. Their firing is interference between different tasks, classes, etc. This interference can lead to performance errors (depending on the data). It is also interesting to note that in biological cortex, at least in V1, these types of between-feature averaged representations have not been observed [68].

4.8 Setup

With explicit connections, neuron $i$'s synaptic weight vector $\mathbf{v}_i$ has the standard bottom-up ($b$) and top-down ($e$) components, but also a lateral ($l$) component:

$\mathbf{v}_i = \mathbf{v}_{b,i} \cup \mathbf{v}_{e,i} \cup \mathbf{v}_{l,i}, \qquad (4.13)$

where the $\cup$ symbol denotes vertical vector concatenation. In the proposed LCA with lateral excitatory connections, these connections affect the pre-competitive potential response. They take their effect before the winners are chosen, instead of after, as in the SOM-inspired method. The pre-response for a neuron is a function of its three sources, the bottom-up $\mathbf{x}(t-1)$, top-down $\mathbf{e}(t-1)$ and lateral $\mathbf{y}(t-1)$, as follows:

$\hat{y}_i(t) = \alpha\,\dfrac{\mathbf{x}(t-1)\cdot\mathbf{v}_{b,i}(t-1)}{\|\mathbf{x}(t-1)\|\,\|\mathbf{v}_{b,i}(t-1)\|} + \beta\,\dfrac{\mathbf{e}(t-1)\cdot\mathbf{v}_{e,i}(t-1)}{\|\mathbf{e}(t-1)\|\,\|\mathbf{v}_{e,i}(t-1)\|} + \gamma\,\dfrac{\mathbf{y}(t-1)\cdot\mathbf{v}_{l,i}(t-1)}{\|\mathbf{y}(t-1)\|\,\|\mathbf{v}_{l,i}(t-1)\|}, \qquad (4.14)$

where $\alpha$, $\beta$, and $\gamma$ must sum to one. They control the relative contributions of the three sources to this neuronal layer. We update the neurons with the $k$ largest $\hat{y}_i$, considered the winners, using Eq. 3.13. Due to this competitive process, the developing weight vectors will only update for stimuli that they themselves represent well (unless, e.g., $\alpha$ is too small).
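The following is a minimal numpy sketch of the three-source pre-response of Eq. 4.14. The function name, toy dimensions, and the small epsilon added for numerical safety are illustrative assumptions, not parts of the dissertation's implementation.

import numpy as np

def pre_response(x, e, y_prev, v_b, v_e, v_l, alpha, beta, gamma):
    # Pre-competitive potential of one neuron from its bottom-up (x),
    # top-down (e), and lateral (y_prev) inputs, following Eq. 4.14.
    # alpha + beta + gamma is expected to equal 1.
    def ncc(a, w):  # normalized inner product of an input and a weight vector
        return float(a @ w) / (np.linalg.norm(a) * np.linalg.norm(w) + 1e-12)
    return alpha * ncc(x, v_b) + beta * ncc(e, v_e) + gamma * ncc(y_prev, v_l)

# Toy check: a neuron whose lateral weights favor currently active neighbors
# gets a boost even if its bottom-up match is mediocre.
x  = np.random.rand(784)
vb = np.random.rand(784)
e  = np.array([0.0, 1.0, 0.0])          # top-down firing from 3 motor neurons
ve = np.array([0.0, 0.9, 0.1])
yp = np.array([0.8, 0.0, 0.1, 0.0])     # lateral firing of 4 nearby neurons
vl = np.array([0.9, 0.0, 0.1, 0.0])
print(pre_response(x, e, yp, vb, ve, vl, alpha=0.34, beta=0.33, gamma=0.33))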
With this method, damaging interpolation of the kind seen in the 3 x 3 updating case is less likely. The interpolation provided by lateral connectivity should now be useful (within high-probability areas).

4.8.1 Initialization

How are the lateral excitatory weights to be initialized? Similarly to LISSOM, and based on observations that most lateral excitatory connections are short-range [17], the scope of connectivity is restricted. For example, a neuron can only excite another neuron up to 5 neural positions away. The actual values of the weights within the scope of connectivity are determined by an isotropic function such as a Gaussian. These initial weights organize the cortical representation to help pull similar features together in the physical neuron map. However, we also want to adapt the weights so that features with low correlation can exist nearby without interfering with one another: the lateral weight between them will diminish and be cut.

4.8.2 Adaptation

Biological lateral connections are strong between functionally similar neurons [72, 110]. While the purpose of the lateral excitatory weights early in development is to drive topographic organization, the purpose of later adaptation of the lateral excitatory weights is to develop weights between nearby neurons that reflect the correlation of firing of those neurons. We can use LCA's updating equation exactly for the lateral weights, developing a weight between neuron $i$ (which is a winner and has a nonzero firing rate) and neuron $j$ (which is within the connection scope but may or may not be a winner) equal to

$E\!\left[\,y_i(t)\,y_j(t-1) \mid y_i(t) > 0\,\right], \qquad (4.15)$

which is the expectation of the firing rate of neuron $j$ when neuron $i$ has fired, weighted by neuron $i$'s own firing rate. This is simply Hebb's principle applied to the lateral weight.

4.8.3 Developmental Scheduling

As is necessary for in-place learning, each neuron has a self-stored age and an age-dependent updating schedule that defines $w_1$ and $w_2$. However, adaptation of the lateral weights must be scheduled differently from the bottom-up and top-down weights. This is due to a simultaneous dependency in development: the bottom-up and top-down weights depend on the lateral connections early on for their development and topographic organization, and the lateral connections cannot reliably begin to reflect Eq. 4.15 until this organization has settled, meaning the adaptation of the bottom-up and top-down weights has settled down. Therefore, the lateral connections must retain more plasticity, later, than the bottom-up and top-down connections.

How can a single neuron have two separate updating schedules when it does not know the origin of its synapses? We can consider the lateral connections to be representative of a different cell type which we do not directly model: interneurons dealing with one-to-one connectivity within the same layer. This cell type could then have a different schedule of plasticity.

The performance effect of Eq. 3.13 is to cut off connections between areas of the stimulus space that do not correlate, thereby avoiding the problems that come from neuronal neighbor pulling. This lateral (and top-down) updating leads to more modular collections of neurons: functionally related neuron groups with lower between-group interference compared to the 3 x 3 method. And, due to the lower interference, each feature will tend to represent the higher-probability areas (where the data is actually observed).
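A short sketch of the initialization and adaptation just described follows. The restricted scope, the Gaussian falloff, the sigma value, and the simple retention/learning split are illustrative assumptions; the dissertation's actual scheduling of plasticity differs.

import numpy as np

def init_lateral_weights(grid_side, scope=5, sigma=2.0):
    # Initial lateral excitatory weights: restricted scope and an isotropic
    # Gaussian falloff with distance on the 2D neuronal plane.
    n = grid_side * grid_side
    coords = np.array([(i // grid_side, i % grid_side) for i in range(n)], float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    w = np.exp(-d**2 / (2.0 * sigma**2))
    w[d > scope] = 0.0          # connections beyond the scope are cut
    np.fill_diagonal(w, 0.0)    # no self-excitation
    return w

def hebbian_lateral_update(w_ij, y_i, y_j_prev, lr=0.01):
    # Move one lateral weight toward the firing correlation of Eq. 4.15,
    # applied only when the post-synaptic neuron i is a winner (y_i > 0).
    if y_i <= 0:
        return w_ij
    return (1 - lr) * w_ij + lr * y_i * y_j_prev

W = init_lateral_weights(grid_side=20)
print(W.shape, hebbian_lateral_update(W[0, 1], y_i=0.7, y_j_prev=0.4))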
4.9 Experiments

We compared the laterally connected LCA algorithm with an LCA method without lateral connections ($\gamma = 0$) that instead used 3 x 3 updating. The data was the MSU-25 Objects dataset. All tests utilized a neuronal plane of 20 x 20 neurons, connected to a motor layer of 26 neurons, which had top-down projections back to the feature layer. In testing, the highest-responding motor neuron was taken as the guessed class.

We used the schedule of inhibition strength, or connection scope (changing $k$), shown in Table 4.6, and did not allow the lateral connections (if they were enabled) to adapt until $t = 500$. Each sample input was repeated for five iterations, and we did not update synapses until after the last iteration per sample. This allowed the lateral activity to settle. In training, the correct motor output was imposed. The response was reset for each new training or testing sample, since such temporally discontinuous experience (jumping directly to a new image with a new object class, with no transition) is not experienced in reality.

Table 4.6: Scheduling of inhibitory scope.

Sample number | Number of winners (k)
0 | 20
1000 | 15
2000 | 5
3000 | 3
4000 | 1

Results are shown in Figures 4.20 and 4.21. The version with adaptive lateral connections and adaptive top-down connections ($\alpha = \beta = \gamma = 0.33$) is the best. It is interesting that the laterally connected version without top-down ($\alpha = \gamma = 0.5$) matches the 3 x 3 version with top-down. Why is this the case? Fig. 4.21 shows the mean per-neuron class entropy (see Appendix). If this is high, the neurons are firing for more than one class. We claimed this type of interference leads to errors. Indeed, the two versions using 3 x 3 updating have significantly greater entropy on average than the ones using lateral connections. The effects of the two methods can be visualized by viewing the bottom-up weights as images, as in Fig. 4.22.

4.9.1 Comparison

We also compared with an LCA algorithm with lateral connections but instead using the "dot-product" SOM updating equation [55],

$\mathbf{v}_i(t) = \dfrac{\mathbf{v}_i(t-1) + \eta(t)\,\mathbf{x}(t-1)}{\|\mathbf{v}_i(t-1) + \eta(t)\,\mathbf{x}(t-1)\|}, \qquad (4.16)$

and the LISSOM updating equation,

$\mathbf{v}_i(t) = \dfrac{\mathbf{v}_i(t-1) + \eta(t)\,\mathbf{x}(t-1)\,y_i(t)}{\|\mathbf{v}_i(t-1) + \eta(t)\,\mathbf{x}(t-1)\,y_i(t)\|_1}. \qquad (4.17)$

The tuning of $\eta(t)$ for both of these is in general not simple. (We used the "power" equation with an initial learning rate of 0.1 for the SOM method [114], and based our tuning of the LISSOM equation on the appendix in [73].)

Figure 4.21: Per-neuron class entropy for the four variants of LCA in the tests with the 25-Objects data.

Figure 4.22: Above are the bottom-up weights after development for a neural map (each weight can be viewed as an image) that utilized 3 x 3 updating without top-down. Below are the weights for a neural map that developed using explicit adaptive lateral connections (also without top-down). Note the smearing of the features above, and the relatively higher precision of representation below, while still being somewhat topographically organized.

The two updating methods above use only a single learning-rate parameter to adapt the neuron weights to each new input, together with a method to bound the strengths of the synaptic efficacies (e.g., vector normalization), while CCI LCA uses the time-varying retention rate $w_1(t)$ and learning rate $w_2(t)$, where $w_1(t) + w_2(t) = 1$, in order to optimally maintain the energy estimate (as formalized in Section ??) and in order to achieve optimal representation.
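The contrast just described can be made concrete with a side-by-side sketch of the three update rules. The 1/t schedule standing in for CCI LCA is a simplification of the actual amnesic-average plasticity; all function names and parameter values here are illustrative assumptions.

import numpy as np

def lca_style_update(v, x, t):
    # Retention w1 and learning rate w2 with w1 + w2 = 1 (simplified 1/t schedule).
    w2 = 1.0 / t
    return (1.0 - w2) * v + w2 * x

def som_dot_product_update(v, x, eta):
    # Kohonen dot-product update (Eq. 4.16): add, then re-normalize (l2 norm).
    u = v + eta * x
    return u / np.linalg.norm(u)

def lissom_update(v, x, y, eta):
    # LISSOM-style update (Eq. 4.17): response-weighted, l1-normalized.
    u = v + eta * x * y
    return u / np.linalg.norm(u, 1)

v = np.random.rand(5); x = np.random.rand(5)
print(lca_style_update(v, x, t=10))
print(som_dot_product_update(v, x, eta=0.1))
print(lissom_update(v, x, y=0.8, eta=0.1))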
With the energy estimate gone in the SOM and LISSOM updating schemes above, there is no way to adjust the learning rate $\eta(t)$ to be equivalent to the optimal LCA scheduling. The result in Fig. 4.23 supports this, as the non-LCA updating methods led to much worse performance. Interestingly, the entropy and organization of the feature layers were not significantly different between the three methods. The problem arose for the motor neurons' bottom-up weights, which were not optimal and were unstable for the SOM and LISSOM updating methods.

Figure 4.23: Performance comparison for different updating methods, using lateral connections without top-down.

4.10 Summary

Efficient (not wasting the available resource) and effective (leading to good performance) emergent internal representation is crucial for development. In published computational cortical maps, self-organization, i.e., topographic "smoothness", is often achieved at the cost of high precision. The work reported here showed that adaptive lateral excitatory connections, with developmental scheduling, can both self-organize a cortical map and develop feature subgroups without the cross-group interference that traditional (e.g., 3 x 3 updating) methods exhibit. The performance improvements are effects of several cortex-inspired mechanisms, including top-down connections, dually optimal LCA updating, and local sorting that simulated lateral inhibition without requiring many iterations. Under these mechanisms, it was shown that global-to-local scope scheduling and adaptive lateral connections lead to effective and efficient self-organization.

4.11 Bibliographical Notes

Neural networks have traditionally operated using bottom-up (feed-forward) connections as the primary connections, with the top-down (feedback) connections not used at all, or used approximately in a constrained weight-tuning mode. For example, backpropagation-based networks [58, 126] use the top-down error signal to train the bottom-up weights, but they do not use explicit top-down connections.

A few networks have used top-down information as part of the input. The use of an expanded input including both input and output vectors for the Self-Organizing Maps (SOM) was briefly mentioned as a possibility by Kohonen 1997 [54], and such an input was also used for a Hierarchical Discriminant Regressor (HDR) [129]. In the laterally connected self-organizing LISSOM [93] and the Multi-layer In-place Learning Network (MILN) [120], neurons take input from bottom-up, lateral and top-down connections. For LISSOM, Sit and Miikkulainen 2006 [93] explained how a neuron responding to an edge, for example, can develop to receive top-down feedback from neurons in the next layer that detect corners including that edge. LISSOM did not utilize an output layer, however.
4.12 Appendix: Neuronal Entropy

Each neuron's "purity" with respect to class can be measured using entropy. Developmental entropy takes into account the updating history. It is generally correlated with the firing entropy of a mature neuron. Neuron $i$ developed in a pure way if the entropy of its stimulus history is small. A neuron with low developmental entropy was updated with most samples from the same class; a neuron with high entropy was updated by samples from many different classes. From Eq. (4.8), it can be seen that the bottom-up weight $\mathbf{w}_{b,i}$ of neuron $i$ is a weighted sum of the input samples which have been used in updating it:

$\mathbf{w}_{b,i} = \sum_{t=1}^{m_i} a_{2,i}(t)\,\mathbf{b}_i(t), \qquad (4.18)$

where $\mathbf{b}_i(t)$ are the stimuli (samples) used to update neuron $i$'s weight $\mathbf{w}_{b,i}$, and $a_{2,i}(t)$ are the corresponding update weights. First, measure the per-neuron probability with respect to each class. To measure the purity of each neuron, we define the empirical "probability" that its updating samples arose from class $j$ as

$p_{i,j} = \dfrac{\sum_{t:\, \mathbf{b}_i(t) \in \text{class } j}\, a_{2,i}(t)}{\sum_{t=1}^{m_i} a_{2,i}(t)}. \qquad (4.19)$

Let the matrix $P = \{\mathbf{p}_1, ..., \mathbf{p}_n\}$ be the matrix of probabilities for the $n$ neurons, where there is a distribution $\mathbf{p}_i = \{p_{i,1}, p_{i,2}, ..., p_{i,C}\}$ for each neuron. To quantify the "purity" of the probability distribution for the $i$-th neuron, we have

$s_i = -\sum_{d=1}^{C} p_{i,d}\,\log_C(p_{i,d}). \qquad (4.20)$

If $s_i = 0$, then neuron $i$ was updated using inputs arising from a single class.

Chapter 5

Recurrent Dynamics and Modularity

About 85% of cortical neurons are excitatory, and about 85% of their synapses are to other excitatory neurons [29]. This suggests that excitatory feedback circuits are very prevalent throughout the cortex, yet computational models and neural networks have not typically utilized excitatory feedback.

Many actions are not reflexive responses to stimuli, but are instead the result of deliberation over some time. Such deliberation may be purely internal, but it could also utilize a constant stimulus stream from the environment. This chapter considers the second case. It presents theory, experiments and results pertaining to computational multilayer Hebbian neural networks that use both bottom-up and top-down connections, treated as dynamic systems in time. The top-down connections provide temporal context: a "bias" or "expectation" of the next sensation based on previous sensation. Using this temporal context, the performance in a challenging object recognition task is shown to become nearly perfect on each view when the data is sequential (video). If we assume the agent is not required to answer regarding the identity of the object within the first few frames after a transition from looking at one object to another, the performance becomes 100%.

In previous work, top-down connections were shown to cause a motor-initiated biasing effect during development. This led to a biased compression of internal representation so that the relevant information about the input was prioritized over the irrelevant information, leading to better performance as compared to networks that did not use top-down connections [64].

Here, I will examine the effect of the meaning-carrying top-down connections on the lower layer of neurons over time. We assume the network has already undergone a significant amount of learning. The class-conditional entropy of the feature neurons, as reflected in the top-down connection strengths, has a major impact. Three network types are analyzed, in a linear approximation of the actual network, which is nonlinear due to the lateral inhibitory mechanism used.
A minimum entropy network can also be interpreted as modular as every feature neuron is associated with only one motor neuron. In such a case, the recurrent activity effect spreads activation appropriately to all the associated features of each active motor neuron. A high entropy network has widespread connectivity. There is a single steady-state distribution of activity, which contains nonzero activation for all neurons. Given any initial input, this type of network will always converge to the same pattern of activation. A low entropy network is nearly modular, but has a few shared between- module connections. It has the same properties as the high-entropy network, but lateral inhibition is proposed as a mechanism to functionally convert a low-entropy network into a minimum entropy network, and thus control the effect of top-down excitation. 5.1 Motivation Intuitively, stability of perception is a core difference for an artificial network op- erating in the real world as compared to one operating on stored data. In realistic environment, an object tends not to blink in and out of locations, or change how it 100 looks all of a sudden. The changes in location and appearance tend to be gradual. However, there can be sudden changes. This spatial and temporal locality is taken advantage of in biological networks, and it should also be taken advantage of in artificial networks. Humans use temporal context to guide visual recognition. Normal human adults do not experience the world as a disjointed set of moments. Instead, each moment contributes to the context by which the next is evaluated. Is our ability to utilize temporal context in decision making innate or developed? There is evidence that the ability may be developed from a child’s genetic programming so that it emerges several months after birth. From much experimental evidence, Piaget [84] stated that before they are around 10 months old, children do not visually experience objects in sequences, but instead as disassociated images. Many of Piaget’s tests measured overt behavior and could not measure internal decisions, however. Baillargeon [3] later found evidence of object permanence, an awareness of object existence even when the objects are hidden from view, in children as young as 3.5 months. It illustrated an aspect of the covert (internal) process behind recognition. Evidence of this awareness in significantly younger infants is not supported. How does this ability to predict emerge? Does it emerge from interactions with the environment (i.e., would a child in an environment obeying different laws of physics learn to predict differently?) or is it genetically programmed to emerge at a specific time? It seems that after sufficient developmental experience, children will generate an internal expectation from recent experience that is used for biasing recognition. The network basis for generating this expectation is not clear. Object permanence, specifically the drawbridge experiment, may be a special case that can be solved through a set of “occlusion detectors”, such as those found in the superior central sulcus in the ventral visual pathway [4]. How the brain creates prediction signals in general relates to the fundamental question of how the brain represents time. 
101 Buonomano [16] discussed the two prevalent views of how this may be — “labeled lines” , in which each neuron’s firing is explicitly representative of a certain amount of time, or “population clocks”, where the temporal information is represented by the overall population dynamics of local neural circuits. In the latter, each individual neuron carries no timing information. Such local circuits would be highly recurrent, and have connections from other areas, where external events arise and perturb the Circuit’s state. There are many bottom-up connections from inferior temporal cortex (ITC) to prefrontal cortex (PFC) and many top-down connections from PFC to ITC. It is hypothesized that neurons in PFC perform behaviorally-releth category binding, while neurons in ITC are responsive to high-level visual features [35]. Therefore it is thought the bottom-up connections provide information about detected high-level features to PFC, which binds them together for categorization. But all the compu- tational roles of the top-down connections are currently not known, specifically in development and learning. We seek to verify two general ideas about top-down connections via simulation: that they could act as the impetus of category-specific self-organization, e.g., seen in the fusiform face area (FFA) and parahippocampal place area (PPA), and that they can act as a “bias” (memory store) for biasing the ITC features [104]. We built networks with three interconnected neuronal layers: a sensory area (layer one), a fea- ture representation area (layer-two: ITC), and a category-behavior area (layer-three: PFC). In the network presented here, each layer-two neuron receives excitatory in- puts from the bottom-up, laterally, and top-down and each layer-three neuron has bottom-up inputs. Neurons compete with others on the same layer through lateral inhibition. Neurons that are not firing-inhibited learn through a Hebbian learning algorithm, in which the strength of synaptic learning is based on presynaptic and postsynaptic potentials. In contrast to slow feature analysis methods [34], our net- 102 work’s development does not depend on slowly changing inputs, but instead the correlations between the category information from PFC and the true class of the stimulus. 5.2 Concepts and Theory The same mechanism of motor-boosted distance we discussed in the last chapter can be used to make use of this temporal context. Here, the network generates the top-down context on its own, over a set of input frames. If we use a nonzero top-down parameter in the testing phase, we create a temporally sensitive network for use over realistic video streams. Once a network has developed through supervision, it can run without super- vision to classify stimuli. Top-down connections can play a significant role after development. When the input is temporally continuous (e.g., video sequences), the top—down connections can provide temporal context. The classification at any time step will feed back to bias feature neurons and will affect the sensation at the next time step. 5.2.1 Internal Expectation For realistic video data, a spatiotemporal decision is expected to be more accurate than a merely spatial decision. We are given a sample x(t) that has not been seen before from one class in C = (cl, 02, ..., cd). Assume we have accurate a posteriori estimates of class p(c,-]x(t)), then taking the maximum over 0’ as the result will give the optimal Bayesian choice. 
But depending on the spatial distribution, the error rate of this spatially optimal choice could still be unsuitably high (see Fig. 5.1). If there is temporal locality in the data (e.g., there is a 90% chance the class of a sample at $t+1$ is the same as the class of the sample at $t$), using a posteriori probabilities over a spatiotemporal window of $k$ frames, $p(c_i \mid \mathbf{x}(t), \mathbf{x}(t-1), ..., \mathbf{x}(t-k))$, leads to a much lower error rate.

Figure 5.1: Weakly separable data can become more strongly separable using temporal information. (a) Two classes of data drawn from overlapping Gaussian distributions in space. (b) Adding another dimension: perfect separation trivially occurs if one includes another dimension which depends on the label of each point, but of course the label is not available. (c) When the data has much temporal continuity, the previous guess of the class can be used as the third dimension. z becomes an expectation by using the guessed label of the current point and the previous z. Temporal trajectories are shown here. The points in the middle are "transition points", where the expectation is not as strong since recent data was from the other class.

First, it is unknown what $k$ to select. If an event important to the current decision occurred more than $k$ frames in the past, it will be forgotten. Practically, due to the high dimensionality of the raw input vectors (i.e., high-dimensional images), this is very tough to estimate for even a moderately large $k$.

If instead we keep a single vector parameter $\mathbf{z}$, which is updated incrementally by some function $f$, $\mathbf{z}(t) = f(\mathbf{x}(t-1), \mathbf{z}(t-1))$, then the problem becomes the estimation of $p(c_i \mid \mathbf{x}(t), \mathbf{z}(t))$. A first advantage over the last form is that $\mathbf{z}$ can potentially store information from as far back in time as needed. A second major advantage is the potential of $\mathbf{z}$ to be compressed into a more abstract form than $\mathbf{x}$. For example, if the recent image contained a cat, it can be compressed as small as a single neuron activation (representing the abstract "cat"). This removes all the irrelevant information from the past and stores only the relevant information. Through $f$, old data gets integrated into the current state, thereby "hashing" temporal information into spatial information.

In the network presented here, the motor output $\mathbf{z}$ acts as an abstract memory, which is updated after each sample. It feeds back to the feature layer as a "prior" that biases the next decision.

5.2.2 Network Overview

The three-layer network to consider is shown in Fig. 5.2. The sensory (input) layer's activation is represented by $\mathbf{x}$, the hidden (feature) layer's activation by $\mathbf{y}$, and the motor (output) layer's activation by $\mathbf{z}$. The sensitivities (weights) of the feature neurons to the input are the column vectors of matrix $V$. The top-down weights of feature neurons from motor neurons are the column vectors of matrix $M$. The bottom-up weights of motor neurons from feature neurons are the column vectors of matrix $W$. There are $n$ feature neurons and $c$ motor neurons.

Figure 5.2: A three-layer network structure (Layer 0: input; Layer 1: features; Layer 2: motor). The internal layer 1 takes three types of input: bottom-up input from layer 0, top-down input from layer 2, and lateral input from the neurons in the same layer. The top-down input is considered delayed, as it is the layer-2 firing from the last time step; the "D" module represents this delay. The connectivity of the three connection types is global, meaning each neuron gets input from every neuron on the lower and higher layers. Lateral inhibition is handled in an approximate sense via k-winners-take-all. Figure courtesy of Juyang Weng.

The motor neurons are called such since they control action. We utilize a hardcoded action production module, which simply uses the index of the motor neuron with the largest firing rate, $\arg\max_{1 \le i \le c}\{z_i\}$. It maps this label to the word that was used in training to teach the class.

Compose $W$ and $M$ so that the bottom-up and top-down weights are linked, but not exactly shared (see Fig. 5.3). We also want the columns of $W$ and $M$ to have unit $l_1$-norm. So, let $\tilde{W} = M^T$, where $\tilde{W}$ holds the non-normalized bottom-up motor weights. Then the columns of $W$ and $M$ are $l_1$-normalized from $\tilde{W}$ and $M$, respectively. By using the $l_1$-norm, the feature-motor weights have a probabilistic interpretation. For a motor neuron $i$, its bottom-up weight from feature neuron $j$ represents $p(y^m_i(t) > 0 \mid y^f_j(t-1) > 0)$: the probability that the motor neuron fires if the feature neuron fired last. For a feature neuron $i$, its top-down weight from motor neuron $j$ represents $p(y^f_i(t) > 0 \mid y^m_j(t-1) > 0)$.
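A small numpy sketch of this shared, column-wise l1-normalization is given below. The matrix shapes and random raw weights are illustrative assumptions; only the normalization scheme follows the description above.

import numpy as np

rng = np.random.default_rng(1)
n_features, n_motors = 8, 3

# Raw (non-normalized) shared weights: entry (i, j) couples motor i and feature j.
raw = rng.random((n_motors, n_features))

# Bottom-up motor weights W (one column per motor neuron) and top-down feature
# weights M (one column per feature neuron), each column l1-normalized so it
# can be read as a conditional firing probability distribution.
W = raw.T / raw.T.sum(axis=0, keepdims=True)   # shape (n_features, n_motors)
M = raw / raw.sum(axis=0, keepdims=True)       # shape (n_motors, n_features)

print(np.allclose(W.sum(axis=0), 1.0), np.allclose(M.sum(axis=0), 1.0))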
The connectivity of the three connection types is global, meaning each neuron gets input from every other neuron on the lower and higher layers. Lateral inhibition is handled in an approximate sense via k-winners take all. Figure courtesy of Juyang Weng. are called such since they control action. We utilize a hardcoded action production module. which simply uses the index of the motor neuron with the largest firing rate: arg maxlSiSC{zZ-}. It maps the label to a word, that was used in training to teach the class. Compose W and M so that bottom—up and top-down weights are linked, but not exactly shared (see Fig. 5.3). We also want columns of W and M to have unit ll-norm. So, let W = MT, where W are the non-normalized bottom-up motor weights. Then columns of W and M are ll-normed from W or M. By using ll—norm, the feature-motor weights have a probabilistic interpretation. For a motor neuron i, its bottom-up weight from feature neuron j represents p(y'lm(t) > OlyjfU — 1) > 0) — the probability the motor fires if the motor neuron fired last. For a feature neuron 1', its top-down weight from motor neuron j represents p(yz-f (t) > 0|y5-"(t — 1) > 0). The feature layer’s firing rate vector y is a function of bottom-up activity x and top-down firing rates 2: 106 Layer: j-I j j+1 Figure 5.3: How bottom-up and top-down connections coincide in a shared way. Looking at one neuron, the fan-in weight vector deals with bottom-up sensitivity while the fan- out weight deals with top-down sensitivity. In the model presented here, the weights are shared among each two-way weight pairs. So, a neuron on layer 3' + 1 will have a bottom-up weight that is linked to the top-down weight of a neuron on layer j. y(t) = ch(X(t),Z(t)|V,M)- (5-1) where f LC denotes the per-layer computation algorithm. The motor layer uses the firing of the feature layer as its input: z(t + 1) = ch(y(t), 01w, 0). (5.2) The recurrence occurs since the firing of the motor layer is fed back to the feature layer. For purposes of analysis, let all network weights be frozen (no learning). 5.2.3 Network Dynamics For purpose of analysis, we’ll first examine a simpler linear system. Then, we’ll discuss the effect of adding in the nonlinear lateral inhibition and other effects. For the linear system, let the layer computation function e = f LC(a,b]A, B) combine the bottom-up and top-down activity as follows: c = fLC(a,b, |A,B) = (1 — a)a A + ab B . (5.3) Then, 107 y(t) = (1 — a) x(t) V + az(t) M (5.4) z(t + 1) v(t) W (5.5) and by using z(t) = y(t — 1) W, and substituting for z, we get y(t) = (1 - a)x(t) V + ay(t — 1) W M (5.6) The matrix A = W M contains all the recurrence in the system. Since both W and M are column stochastic matrices, A is column stochastic. Any element 14,-, j represents the indirect flow of excitation from neuron i to j 1: iii, j = flow(i, j), where C k=1 For the formulation of this system, f low(i, j ) = flow(j, i), and therefore matrix A is also symmetric and row stochastic. A system with this conservation of flow property is inherently stable. The purpose of this work is to examine the distribution of excitation over time without concern for issues of stability. A stochastic matrix defines a Markov chain. A models transition probabilities between feature neurons if only one feature neuron is active at any time; it also can describe the percent of excitation routed from any active feature neuron ft: to the neurons it connect to. These transitions are not direct, but instead go through the motor neurons. 
Similarly, the matrix WᵀMᵀ is an excitation routing matrix for the motor neurons.

To put the above into the standard form for a linear time-invariant system, let A = α MᵀWᵀ, B = (1 − α) I, and u(t) = Vᵀxᵀ(t):

yᵀ(t + 1) = A yᵀ(t) + B u(t). (5.8)

(For further analysis, assume y means yᵀ.) The closed-form solution to the above equation [1] is:

y(k) = A^k y(0) + Σ_{j=0}^{k−1} A^{k−j−1} B u(j). (5.9)

We use Eq. 5.9 to analyze the network's behavior over time for different types of connectivity described by the excitation routing matrix. Due to positive feedback, this system is generally characterized by spread of activity. We wish to show the usefulness of top-down connections when the connectivity described by the recurrence matrix is modular. Otherwise, problems from the unchecked positive feedback occur when the features are not highly selective (non-minimal entropy) and the connectivity is widespread and nonmodular. Encouragingly, the spread of positive feedback in low-entropy systems is potentially manageable, while it is not in high-entropy systems.

5.2.4 Minimum-Entropy Networks

Let a network in which every neuron has zero entropy be called strictly modular. In a strictly modular network, each feature neuron is associated with only a single motor neuron, as its connections to the other motors equal zero (see Fig. 5.4(a)). Each motor and its associated features is called a module. Viewed as a graph, the network would consist of m disconnected components.

Figure 5.4: Examples of the three types of connectivity. (a) Minimum-entropy network. Top-down feedback has a sensible and reliable effect of biasing the features associated with the firing motors at a level appropriate to the motors' firing. (b) High-entropy network. Top-down feedback is not useful in this type of network since it spreads quickly. (c) Modular network. Here, the neurons are minimum-entropy except for one "border" neuron. Unchecked, positive feedback spreads throughout, but this situation is manageable through lateral inhibition, in which the low-activity connection is inhibited, and the network acts as a minimum-entropy network.

Internal Activity Only

If the external (sensory) input is removed, what is this network's behavior when there is some internal firing? We will show here that each firing motor potentiates only its associated features evenly, and all activity eventually dies out. Let u(t) = 0, but y(0) ≠ 0, meaning there is some existing activity stored as current context, but the sensory input has turned off (i.e., the eyes are closed). Eq. 5.9 becomes y(k) = α^k A^k y(0), where A = MᵀWᵀ. Each of the matrices W and M is column stochastic, so A is also column stochastic. It describes a Markov chain over the feature neurons, which are linked to each other indirectly through the motor neurons. A strictly modular network with multiple motors is actually a reducible Markov chain, with m sets of closed states (the modules) and no transition states. Each module corresponds to a mode of operation. Without external input, depending on the existing internal context y(0), the network can "enter" any of the modules, but activity cannot spread between modules. The feature excitation routing matrix of this type of Markov chain has the block-diagonal form

A = [ C_1  0   ...  0
      0    C_2 ...  0
      ...
      0    0   ...  C_m ].

Each internal submatrix C_i is square, column stochastic and strictly positive. The submatrices describe the routing of activity within the different firing modules.
Lemma 5.2.1. For a strictly modular network, A^k = A.

Proof. Any module's internal routing matrix C_i can be decomposed into pairs of eigenvalues and associated rank-one matrices [37] as C_i^k = λ_1^k C_i^(1) + λ_2^k C_i^(2) + ... + λ_r^k C_i^(r). Each submatrix in A is rank one, and thus has only one eigenvalue, which equals one. Therefore, for any module i, C_i^k = C_i. It follows that A^k = A. □

Case 1: α = 1. Since A^k = A, the distribution of excitation at any future frame is the same as the distribution at the next frame. Thus, each firing motor neuron will distribute its firing equally (since its nonzero top-down weights are all equal due to the l1-norm) among its associated features. Those features' firing feeds back into the same motor neuron. There will be no spread of excitation among different modules.

Case 2: 0 ≤ α < 1. The network operates as described above, but all internal activity decreases exponentially over time due to the factor α^k.

The number of steady-state distributions of A is the number of modules, since the eigenvalues of a disconnected graph are the eigenvalues of its connected components. If the features initially firing are all linked to the same motor (the single-mode case), the network will stay within that motor's mode of operation. Activity cannot spread to another module. When operating in one mode, the features associated with that motor are biased and the features not associated with that motor are not biased. When multiple motors fire, the excitation is distributed according to A, but there is no sharing between modules. Let any motor i have firing rate z_i. Since each motor distributes firing potential evenly among its features, the potentiation of each of its n_i features is z_i / n_i.

Internal Activity and External Stimulation

Even with such controlled feedback, the usefulness of the top-down connections in the above case in general seems to require

E[z_i(t) | x(t − 1) ∈ C_i] > E[z_j(t) | x(t − 1) ∈ C_i],  for all j ≠ i, (5.10)

where C_i is the class represented by motor i. The truth of the above conjecture depends on 1. the spatial probability distributions of each class, 2. the amount of spatial overlap of these distributions, 3. the features used in V to represent the input space X, and 4. the amount of temporal locality of class labels in the data. If conditions are such that this expectation difference exists for data that is not perfectly classifiable without time and the samples are drawn spatially i.i.d., then the added z dimensions should make the optimal spatiotemporal decision boundary better than the optimal spatial boundary (refer to Fig. 5.1).
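The block-diagonal structure and Lemma 5.2.1 can be checked on a tiny hand-built example (a sketch; the module sizes and α are arbitrary assumptions). Each module's block is rank one, so A^k = A, and with α < 1 internal activity decays without ever leaving its module:

    import numpy as np

    def module(n_i):
        # Routing within one module: the motor spreads excitation evenly over
        # its n_i features, so the block is rank one and column stochastic.
        return np.full((n_i, n_i), 1.0 / n_i)

    def block_diag(*blocks):
        n = sum(b.shape[0] for b in blocks)
        A = np.zeros((n, n)); i = 0
        for b in blocks:
            k = b.shape[0]; A[i:i + k, i:i + k] = b; i += k
        return A

    A = block_diag(module(3), module(2), module(4))        # strictly modular, m = 3
    print(np.allclose(np.linalg.matrix_power(A, 5), A))    # Lemma 5.2.1: A^k = A

    alpha, y0 = 0.8, np.zeros(9)
    y0[:3] = 1.0                        # internal context only inside module 1
    for k in range(1, 4):
        yk = (alpha ** k) * (A @ y0)    # y(k) = alpha^k A^k y(0) = alpha^k A y(0)
        print(k, yk.round(3))           # activity decays, never reaches modules 2, 3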
Class Transitions: Avoiding Hallucination

A key issue is that of transitions from experiencing stimuli of one input class to another. The internal context will be inaccurate at these points. As shown here, a strictly modular network will recover. Recall Eq. 5.9. We set time 0 to the transition point. Assume that the internal context y(0) strongly biases one class, but all the following input frames are of another class. For simplicity, let them be the same image, so that u(j) = u(0) for all j.

In a strictly modular network with an arbitrary internal context y(0), given the same input x over time, the motor activity z(k) approaches WᵀVᵀx.

Proof. Using Eq. 5.9 and Lemma 5.2.1:

y(k) = α^k A y(0) + α^{k−1}(1 − α) A u(0) + α^{k−2}(1 − α) A u(0) + ... + α(1 − α) A u(0) + (1 − α) u(0)
     = α^k A y(0) − α^k A u(0) + α (A u(0) − u(0)) + u(0). (5.11)

As k increases, with 0 ≤ α < 1, this approaches α (A u(0) − u(0)) + u(0). To get the motor output z(k), we project onto W. For a strictly modular network, the motor transition matrix WᵀMᵀ = I. Thus,

Wᵀ(α A u(0) − α u(0) + u(0)) = α Wᵀu(0) − α Wᵀu(0) + Wᵀu(0) = Wᵀu(0). □

Therefore, after a transition, the motor activity will converge to the motor activity produced when the bottom-up input is presented without any top-down input. In other words, the influence of top-down will die out over time, and transitions to other classes are possible even with an incorrect initial internal context. But there may be a "transition period" during which the output action is wrong before the recovery.

5.2.5 Irreducible Networks

In the general case, networks are irreducible, meaning there is a path over nonzero weights between any two neurons. Strictly, this means the feature-transition Markov chain is irreducible. As a Markov chain, A is aperiodic, since every state links to itself with nonzero probability; it is positive recurrent for the same reason. Thus there exists a single stationary distribution described by the stationary probability vector π, which satisfies

π Aᵀ = π. (5.12)

Since A is doubly stochastic, it is easy to see that π_i = 1/n for all i. Given this result, for any irreducible A, the limit of A^k as k → ∞ approaches a uniform equilibrium distribution. This means that, for any initial activity distribution, even if the features initially firing are all mostly associated with the same motor, the steady-state excitation distribution will include all feature neurons. Additionally, excitation eventually becomes distributed evenly, no matter what the initial excitation. Top-down connections thus eventually spread activity evenly throughout the entire network. Unfortunately, an irreducible Markov chain can arise from even just one high-entropy feature, i.e., a feature neuron with nonzero connections to all motor neurons. But we cannot require strictly modular networks. It is problematic for efficiency if all features have zero entropy: it is incredibly inefficient to develop feature hierarchies for each class separately. It is probable that our visual systems can recognize so many different things through effective use of shared features, which would be associated with multiple classes.

5.2.6 Modular Networks

A network like the one in Fig. 5.4(c) can be considered modular, but not strictly in the sense we discussed earlier. True modularity is characterized as described by Bullmore and Sporns [15], where nodes within each module (or "community") have high intra-module connectivity but very low connectivity to neurons in other modules. Some nodes are called "hubs", having high centrality, meaning they have short paths to many other nodes. "Provincial hubs" connect to many nodes within the same module, while "connector hubs" connect to many nodes in different modules. There are very few hubs compared to the number of non-hub nodes. A modular structure is therefore characterized by communities and a few hubs of an intra-community or inter-community nature. Bullmore and Sporns reviewed real brain connectivity data over many species and concluded that the archetypal brain network seems to have such a modular structure. In the work presented here, the motor neurons are provincial hubs. Shared features might be connector hubs. In the network in Fig. 5.4(c), the middle feature neuron is a connector hub. In our three-layer structure, a modular network is called a low-entropy network, since most feature neurons are associated with a single motor.
A non-modular network then has widespread connectivity, and is called high-entropy. For this network type, A^k approaches the uniform stationary distribution quickly (at small k), meaning excitation spreads rapidly and nondiscriminately. There does not seem to be any good way to use top-down connections in a high-entropy network. In the next subsection, several mechanisms are proposed for controlling top-down connections in a modular network.

5.2.7 Nonlinear Mechanisms for Modular Networks

Modular networks are irreducible and have a single uniform stationary distribution, but their A^k approaches the stationary distribution very slowly. So, A^k ≈ A at small k. We wish to prevent excitation spread that might lead to decision errors in modular networks; however, we do not wish to prevent excitation spread that might fill in missing information appropriately. If each module is a group of variations of same-type features, then the connector hub neurons can "prime" related feature types. If we see two eyes and a mouth in the right configuration, it is easy to "imagine" a nose. But we do not need to imagine unrelated things.

Figure 5.5: A section of an "unrolled in time" layer two and layer three in a network with four feature neurons and three motor neurons. Key differences from the linear system are lateral inhibition and output normalization. Lateral inhibition, at both the feature layer and the motor layer, stops the flow of low-energy signals into the future and can control the spread of positive feedback. This figure shows two feature neurons inhibited at time t and one motor neuron inhibited at time t + 1. The output normalization keeps the top-down vector on the same scale as the bottom-up. It also allows long-term memory.

Feature support in modular networks comes from three sources: from the sensed environment, from other neurons in the same module (strong), and from connector hubs in other modules (typically weak). The influence of neurons with low support should not extend too far into the future, as it could lead to errors. But we don't want to cut off neurons with moderate support, since their firing could lead to interesting and useful perception. Several mechanisms might provide the desired effects. 1. Sigmoid activation functions. Sigmoidals are biologically supported and can suppress firing unless excitation is high enough. 2. Neuronal discharge [29]. If neuron firing loses some excitation at every time step, neurons with sparse and/or weak temporal support will eventually stop firing. 3. Lateral inhibition. Lateral inhibition is a competitive method for cutting off weaker responses.

We have implemented a k-winners-take-all approximate version of lateral inhibition for the experiments here. It depends on the two parameters k1 and k2, which are the number of neurons that will fire on the feature layer and the motor layer, respectively. Both should be greater than one, to avoid hallucination; in general, k1 can be set to the expected number of feature neurons that represent a class (e.g., k1 = 15 with n = 400 and c = 26). Let f_LI1 be the layer-one lateral inhibition function depending on k1, and f_LI2 be the layer-two lateral inhibition function depending on k2. Now, we set f_LC = f_LCA, i.e., the layer computation function uses LCA. The corresponding discrete-time system is:
y(t + 1) = (1 − α) (x(t) / ||x(t)||) V + α (f_LI2(f_LI1(y(t)) W) / ||f_LI2(f_LI1(y(t)) W)||) M. (5.14)

The difference between this and the linear system is twofold: 1. lateral inhibition and 2. normalization of x and z. Lateral inhibition is used for the reasons discussed above. Without the normalization of z, the scale of the response y(t) is not controlled. Consider the use of the l1-norm (so that z has a probabilistic interpretation): z then acts as internally updated prior probabilities of the potential classes in the upcoming stimulus. Scale control lets the top-down influence match the bottom-up influence, which may otherwise be a problem if we have many more pixels than motors. An interesting aspect of normalization is that it allows long-term memory, by preventing decay. A byproduct of this is that there have to be explicit "null" states, as the network will not settle into zero activity. This interesting direction has not yet been explored fully.

It may also be useful to cut off weak feature responses immediately, without letting them receive any top-down boost from the motors. This strategy has somewhat of an interpretation in the laminar organization of cortical areas [94]. This is called using "paired layers", but it was not done in the experiments here. Paired layers can eliminate the transition period of errors, since features with no bottom-up support will be cut off immediately.

5.3 Algorithm

This is the algorithm for running the network without learning (frozen weights). In general development, such freezing does not occur, but we use it for the experiments in this chapter. We start with a mature network and reset t to one. For t = 1, 2, ...:

1. Sense. Set x(t) as before. Let z(1) = 0.

2. Pre-response. Compute the pre-response ŷ for all neurons on layer one:

ŷ(t) = (1 − α) (x(t) / ||x(t)||) V + α (z(t) / ||z(t)||) M. (5.15)

3. Lateral inhibition of layer one. Compute the post-competition firing vector y(t) for the feature layer using approximate lateral inhibition: set all neurons' firing to zero except the highest k1 > 1 pre-responses. Let s_k be the k1-th highest value of ŷ(t). Then set a neuron's response as:

y_i(t) = ŷ_i(t), if ŷ_i(t) ≥ s_k; 0, otherwise. (5.16)

4. Pre-response of layer two. Let z(t + 1) = Wᵀ y(t).

5. Lateral inhibition of layer two. Here, the motor inhibition parameter k2 is used, which must be larger than one. Let s_k be the k2-th highest value of z(t + 1). For motor neuron i, where 1 ≤ i ≤ m, set its response as:

z_i(t + 1) = z_i(t + 1), if z_i(t + 1) ≥ s_k; 0, otherwise. (5.17)

6. System output. Take the highest-firing motor to indicate the class label of the current input x(t).
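A minimal sketch of the frozen-weight run loop of Section 5.3 (hypothetical NumPy code; the default k1, k2 and α values follow the experiments below, and everything else is an illustrative assumption):

    import numpy as np

    def k_winners(v, k):
        # Approximate lateral inhibition: keep the k strongest responses.
        out = np.zeros_like(v)
        idx = np.argsort(v)[-k:]
        out[idx] = v[idx]
        return out

    def run_network(frames, V, M, W, alpha=0.3, k1=15, k2=8):
        z = np.zeros(W.shape[1])                           # step 1: z(1) = 0
        labels = []
        for x in frames:                                   # step 1: sense
            top_down = z / np.linalg.norm(z) if z.any() else z
            y_pre = (1 - alpha) * (x / np.linalg.norm(x)) @ V + alpha * top_down @ M
            y = k_winners(y_pre, k1)                       # steps 2-3: layer one
            z = k_winners(y @ W, k2)                       # steps 4-5: layer two
            labels.append(int(np.argmax(z)))               # step 6: output
        return labels

Here V is d x n, M is c x n and W is n x c, matching the row-vector conventions of Eqs. 5.4-5.5; the guard on the norm of z simply handles the all-zero initial context.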
5.4 Experiments

The networks developed in [64] and in the last chapter, which showed topographic class grouping, have modular connectivity. Most layer-one neurons are connected to only a single motor neuron, but the border neurons between groups have significant connections to two or three motor neurons. For these modular networks, top-down connections should provide a recognition-rate benefit when the data is temporally continuous. We compared low-entropy networks sized 20 x 20 in two cases: top-down enabled in testing and top-down disabled in testing (i.e., α = 0). These networks have global connectivity, i.e., each feature neuron is sensitive to all pixels.

5.4.1 Object Recognition

The MSU 25-Object dataset was used, which has views of 25 objects rotating in depth. We trained networks with 20 x 20 neurons over ten epochs using α = 0.3 in the training phase. The images were trained in sequences, with a few empty (no object) frames between sequences to mimic an object being placed and taken away. The disjoint images were tested after each epoch, also presented in sequences of 40 per class with a few background frames between sequences. The parameters were k1 = k2 = 1 in training, but were increased in testing; they were held constant at k1 = 15, k2 = 8 for the tests.

Figure 5.6(a) shows the effect of different settings of the expectation parameter in testing after 5 epochs. It can be seen that expectation leads to perfect performance after the transition periods. Figure 5.6(b) measures how training the same sequences over and over again helps performance. It helps a lot to see the same sequence at least twice. Figure 5.8 shows how the transition period is affected by α. Increasing expectation eventually leads to no errors except in the transition periods, but higher α leads to longer transitions. A brief period of confusion on transitions would be allowable for autonomous agents (perhaps such an effect exists in some biological agents). Requiring a decision after each frame seems too strict: if we are asked "what object is that?", we will not answer without at least some delay, until we are "certain".

Figure 5.6: The effect of different expectation parameter values on performance ((a) recognition rate versus the expectation parameter setting, with curves for F1 (all frames), F2 (after the first frame), F7, and F7 with three-frame smoothing). The "X" in "FX" indicates the frame at which performance starts being measured after a sequence shift to a rotating object sequence. Three-frame smoothing means the latest three outputs "vote" to produce a single output.

Figure 5.7: How performance improves over training epochs through all object sequences (recognition rate versus epochs of training, without top-down context and with top-down context at α = 0.68 measured from F1, F3, and F7).

Figure 5.8: Increased expectation parameter shifts errors from bottom-up misclassifications to errors within a transition period (immediately after the class of the viewed object changed).
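For reference, the "FX" measurement used in these figures can be sketched as follows (a hypothetical evaluation helper; the function name, the majority-vote smoothing, and the handling of the first frame are assumptions based on the captions above):

    import numpy as np

    def transition_aware_rate(pred, truth, first_frame=7, smooth=0):
        # Recognition rate counted only from the `first_frame`-th frame after each
        # class change (the "FX" curves); optional voting over the last `smooth`
        # outputs approximates the three-frame smoothing.
        pred, truth = np.asarray(pred), np.asarray(truth)
        if smooth > 1:
            voted = pred.copy()
            for t in range(len(pred)):
                window = pred[max(0, t - smooth + 1):t + 1]
                voted[t] = np.bincount(window).argmax()
            pred = voted
        correct = total = 0
        since_change = 0
        for t in range(len(truth)):
            since_change = 1 if t == 0 or truth[t] != truth[t - 1] else since_change + 1
            if since_change >= first_frame:
                correct += int(pred[t] == truth[t])
                total += 1
        return correct / max(total, 1)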
Since expectation takes the recent class history into account the most, performance will be best when the probability that the next image contains the same class is high. This is called temporal continuity. Under high temporal continuity, higher expectation can be effective by reducing the effect of outlying points across the wrong decision boundary, pulling them back towards the class center and leading to more correct classifications. But it will lead to longer transition periods.

5.4.2 Vehicle Detection

Here, the above method is shown to work with local features. Additionally, global features are compared to local features on a vehicle/non-vehicle discrimination problem. Vehicles are generally decomposable into visible parts (e.g., the headlight and license plate on a car). Local features should become tuned to parts, which improves the generalization power of the network, meaning the performance is better with less data. The data was collected at the GM facility using a camera mounted on the side of a test vehicle. Some examples of the vehicle data used are shown in Fig. 5.9.

Figure 5.9: Examples of data within the vehicle class.

There were 225 vehicle images and 225 non-vehicle images, each sized 32 x 32. To improve generalization, the network was extended to develop local (parts-based) features directly from the vehicle and false-return data. Using the developed local features, unfamiliar objects can be recognized and classified if any familiar object parts are present (e.g., a car is recognized by a tail light); there is no need to recognize the entire object. This is critical for recognizing objects that the system has not seen before but that share some common properties with the general class they belong to. It should also lead to a network that can handle occlusion.
A global network (10 x 10 neurons) is compared to a local network (window size of 11 x 11, with a local competitive area of 5 x 5 for each neuron), and to a local network using expectation after it matures. In a global-feature network, each neuron on layer one is sensitive to every element within the sensor matrix, which has d elements (pixels). In the local network used here, the approach given in [48] is followed, where each neuron on layer one has an r x r receptive field, where r² is less than d. The number of neurons is equal to the number of pixels, and each neuron's receptive field is centered on a particular pixel (the image boundaries are extended with black pixels). The competition step is also local: a neuron competes with its local region of l x l neurons, and the locally top-k1 responding neurons are called winners. The local network used here did not utilize smoothing, and β was gradually increased from 0 to 0.3 during training, after 40 training samples.

The networks are initialized and trained with only 5 samples from each class. Then all samples are tested, in alternating car and background sequences. Next, the next 5 samples are trained, and so on, until all samples are trained. This was done 10 times for each network type, using a different training order each time. In the expectation-enabled network, β = 0.3 during the testing phase, while in the other networks β = 0 in testing. Results are summarized in Fig. 5.10. The local network does indeed do better with less data; however, it eventually does only as well as the global network. If expectation is enabled, the performance becomes nearly perfect. Fig. 5.11 shows some distinct features developed by the local network, along with their respective spatial locations.

Figure 5.10: Performance of globally versus locally connected networks when data is limited (recognition rate versus the percent of data used for learning, for a network with local features plus late expectation, a network with local features, and a network with global features). The locally connected network performs better with less data for this dataset. This may be because vehicle images can be made up of several interchangeable parts. Once training is mature, the expectation mechanism can be enabled in testing, and performance rises to a nearly perfect level.

Figure 5.11: (a) Some features for the "vehicle" class, developed by the locally-connected network. This shows the response-weighted input, as a weighted sum of the data and the neuron responses. (b) The respective locations of the features from (a). Because the figure shows response-weighted input, there is nonzero sensory stimulus outside the receptive fields (the white box area), but the neurons will not respond to any stimuli outside this area.
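A rough sketch of the local-feature variant described above (hypothetical NumPy code; the black-pixel padding follows the text, but the normalization and neighborhood handling are assumptions, not the exact procedure of [48]):

    import numpy as np

    def local_preresponse(image, filters, r):
        # One neuron per pixel; each sees an r x r patch centered on its pixel,
        # with the boundaries padded by black (zero) pixels.
        h, w = image.shape
        pad = r // 2
        padded = np.pad(image, pad, mode="constant")
        resp = np.zeros((h, w))
        for i in range(h):
            for j in range(w):
                patch = padded[i:i + r, j:j + r].ravel()
                resp[i, j] = filters[i, j] @ patch / (np.linalg.norm(patch) + 1e-9)
        return resp

    def local_competition(resp, l=5, k=1):
        # A neuron survives only if it is among the top-k in its l x l neighborhood.
        h, w = resp.shape
        out = np.zeros_like(resp)
        half = l // 2
        for i in range(h):
            for j in range(w):
                region = resp[max(0, i - half):i + half + 1,
                              max(0, j - half):j + half + 1]
                if resp[i, j] >= np.sort(region.ravel())[-k]:
                    out[i, j] = resp[i, j]
        return out

Here filters is an (h, w, r*r) array holding one bottom-up weight vector per pixel-centered neuron; both names are hypothetical.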
5.5 Summary

This chapter describes enabling a network to keep and update an internal state z that biases the next activity of the feature layer. It is considered as temporal context: an internal "prior" that is used to bias the next decision. Ideally, this prior will bias a small subset of features highly correlated with the current state and will not bias others. The algorithm presented is quite efficient, since the network does not have to learn to make decisions from several raw frames in a row; the dimensionality of such a space makes it difficult to learn such a policy. Instead, it learns to make decisions from a single temporally updated parameter z.

Since this is a positive feedback system, it is crucial to keep the spread of activity under control. For strictly modular networks, it was formally shown how positive feedback can be controlled and made useful. For high-entropy networks, positive feedback potentially spreads to all neurons evenly within a short time. For modular networks (characterized by highly connected communities, where different communities are connected by hubs), several nonlinear methods were discussed as ways to control positive feedback without disabling it. The network examples here control positive feedback in two key ways: 1. by developing in such a way that they are modular, and 2. by lateral inhibition, which keeps low-responding neurons from spreading activity. The first point is often overlooked: a network's connectivity seems to be extremely important.

Although temporal context has been used in many task-specific models and in artificial neural networks, this is the first time that context is used as motor-initiated expectation through top-down connections in a biologically-inspired, general-purpose developmental network. If the motor action represents abstract meaning (e.g., a recognized class), these mechanisms enable meaningful abstract expectation by such networks. We showed the effect of improving the recognition rate to nearly perfect performance after a transition period when the class has just changed. Expectation's effectiveness in improving performance is crucially linked to the capacity to develop discriminating features: top-down connections may not be useful when the features are not as good.

From the perspective of developmental robotics and autonomous mental development, the mechanism described in this chapter can lead to more realistic learning ability for mentally developing machines. Supervised learning methods that can be applied to visual recognition of objects are formulated at a per-frame level, where each training sample (frame) must be accompanied by a label. This does not take advantage of the temporal context, or the temporal continuity, present in real-world video streams (which are subject to real laws, e.g., realistic physics). When a child is taught, say, to recognize numerals and letters in the classroom, there is not a continual stream of repetitive speech from the teacher speaking the names of the characters over and over. The teacher will 1. direct the child's visual attention and 2. speak the name of the character. The direction of attention hopefully ensures that the child is looking at the character continuously. For AMD, a semi-supervised learning mode can be utilized based on this chapter's method, which should significantly reduce the training load of the teacher. In that mode, expectation will be enabled and the weights not frozen. It should be useful since the common distinction between "training phase" and "testing phase" cannot be made so easily in a real-time, developmental system. The teacher can then sparsely provide labels.

5.6 Bibliographical Notes

Artificial neural networks can be roughly classified as spatial and/or temporal (recurrent). A recurrent network is explicitly designed to deal with time by using feedback.
A spatial network typically treats the world frame-by-frame, with each frame considered independent. Spatial networks can be made to act temporally, e.g., by blurring the output over time, but they are not recurrent. Recurrent networks can handle time-series prediction, where the past influences the present decision, and can be considered as dynamic systems. Most networks are spatial, such as those that perform PCA [125] or ICA [8,44]. Self-Organizing Maps [55] are spatial, but their LISSOM extension, which uses explicit lateral excitatory and inhibitory connections [73], is recurrent due to that lateral feedback. Traditional backpropagation nets are also spatial, but backpropagation-through-time [127] converts back-prop networks into recurrent networks. Classic recurrent neural nets include Jordan nets and Elman nets [31], which utilize context units that can "remember" the last network activations. They are trained by propagating an approximation of the error gradient back through time and modifying the weights based on that. Those networks suffer from long-term memory problems, as an input's influence over time either takes over the network activity or decays rapidly. Long short-term memory networks [40] address this problem by ensuring that an error signal propagated back through time does not blow up or vanish, using "memory cells" and "gate units". The work we present here is in the tradition of SOM and vector quantization, traditionally spatial approaches, but our approach is also a recurrent network, similarly to LISSOM. Unlike LISSOM, however, we focus on recurrence between one layer and the next instead of within the same layer. Douglas et al. described a model of neurons as electronic circuits, which distributed current proportionally using excitatory feedback. They showed that inhibitory neurons (or "neuronal discharge") were necessary for stability [29], which is similar to the conclusions here. Sejnowski [89] described how excitatory feedback could be used for prediction of temporal sequences, by using a temporally asymmetric learning rule.

Chapter 6

WWN-3: A Recurrent Network for Visual Attention and Recognition

The work presented in this chapter concerns a type of multilayer, multipath, recurrent, developmental network called the Where-What Network (WWN). WWNs are designed for concurrent attention and recognition, via complementary pathways leading to complementary outputs (a type motor and a location motor). WWNs are biologically-inspired grounded networks that learn attention and recognition from supervision. By grounded, we mean such a network is internal to an autonomous agent, which senses and acts on an external environment. WWN models consist of two pathways for identity (what) and action-related location (where and/or how), so there are separate "motor" areas for identity and location. The motor areas connect to another, later module that controls the actions of this agent. In our supervised paradigm, the agent is taught to attend by being coerced to act appropriately over many cases. For example, a teacher leads it to "say car" while the agent is looking at a scene containing a car, and the teacher points out the location of the car. Before learning, the agent does not understand the meaning of "car", but it was coerced to act appropriately. Such action causes activation at the Type-Motor and Location-Motor in WWN-3.
Top-down excitatory activity from these motor areas, which are concerned with semantic information, synchronizes with bottom-up excitatory activity from the earlier areas, which are concerned with "physical" image information. Bidirectional co-firing is the cause of learning meaning within the network. Neurons on a particular layer learn their representations via input connections from other neurons at three locations: from earlier areas (ascending or bottom-up), from the same area (lateral), and from later areas (descending or top-down). Learning occurs in a biologically-inspired, cell-centered (local) way, using an optimal local learning algorithm called Lobe Component Analysis [122].

WWN-3 utilized the following four attention mechanisms: (1) bottom-up free-viewing, (2) attention shift, (3) top-down object-based search, and (4) top-down location-based binding. Top-down excitation is the impetus of the latter three mechanisms. In WWN-3, top-down excitation serves a modulatory role, while the bottom-up connections are directed information carriers, matching the suspected roles of these connections in the brain [14]. We will show how an architectural feature called paired layers is important so that the top-down excitation, which we think of as internal expectation, can have appropriate influence without leading to hallucination or corrupting bottom-up (physical) information with top-down semantic bias. In addition to modulation, in WWN-3 the top-down connections (along with the bottom-up and lateral connections) allow the network's internal activity to synchronize. When there are multiple objects in the scene, there will be multiple internally valid solutions for type and location, but some of these solutions will actually be incorrect (mixed up). This is part of the well-known binding problem. But in WWN-3, after a bottom-up pass to select candidate locations and candidate types, top-down location bias effectively selects a particular location to analyze, and it then synchronizes with the appropriate type, similar to Treisman's idea of a spotlight [109]. Top-down bias can also be introduced at the Type-Motor, causing "object search": the network picks the best candidate location given the particular type bias. After the network has settled on a particular type and location, it can disengage from the current location to try another location, or it can disengage from both location and type.

The architecture of the WWNs diverges into two distinctive pathways for foreground identity (what) and location (where). Through development, firing of neurons further along the "What" pathway becomes more invariant to object position, while becoming specific to object type. The opposite is true for the "Where" pathway. Neurons earlier in each pathway represent both location and type in a mixed way.

WWN is a recurrent network; it utilizes both bottom-up and top-down connections at all areas and layers (intermediate layers included), in both the learning phase and in operation. During learning, the top-down connections are essential for the network to distinguish foreground from background in the earlier areas, and therefore are needed to learn appropriately. For the network to attend on its own, activity at one (goal-setting) or no (bias-free) motor area is imposed. In these cases, the interaction between bottom-up and top-down is essential for attention. The early area selects candidate locations and features from the input image in a bottom-up way.
When one of the motor areas is imposed, information flows via top-down connections down that pathway and up the other pathway to set the other motor area. The top-down connections serve a modulatory role, picking out the biased candidates from the early area and ignoring the rest. Yet top-down connections can lead to hallucination if there is no associated bottom-up support. To avoid this, a paired-layer architecture is used, which is analyzed here in terms of signal processing theory.

6.1 Background

6.1.1 WWN Architecture

Figure 6.1: A high-level block diagram of WWN-3. The areas are named after those found in the visual pathway, but it is not claimed that the functions or representations are identical.

It is known that our visual system has two major pathways: ventral ("what") for object identification, and dorsal ("where"), which deals more with visuomotor aspects (i.e., where to reach for an object) and presumably codes an object's location. These pathways separate from the early visual areas and converge at prefrontal cortex, which is known to be active in top-down attention. Prefrontal cortex connects to motor areas. WWN was built inspired by the idea of these two separating and converging pathways. Meaningful foregrounds in the scene will compete for selection in the ventral stream, and locations in the scene will compete for processing in the dorsal stream.

There are five areas of computation in WWN-3. The input image is considered as retinal activation. Instead of a multi-area feature hierarchy (ventral pathway), we use a shape-sensitive area we call V4, but we don't claim the representation is identical to V4. From this area, one path goes through IT (inferotemporal) and TM (Type-Motor), possibly analogous to the inferior frontal gyrus [101]. TM is concerned with object type. The other path goes through the PP (posterior parietal) area and LM (Location-Motor), possibly analogous to the frontal eye fields (FEF). LM is concerned with object location. Each of these five areas contains a 3D grid of neurons, where the first two dimensions are relative to image height and width and the third is "depth", for having multiple features centered at the same location. These neurons compute their firing rates at each time t. WWN is a discrete-time, rate-coding model, and each firing rate is constrained between zero and one. The pattern of firing rates for a single depth at any time t can be thought of as an image. Computing the inputs to a neuron in an area is equivalent to sampling the image of firing rates from the input area. There are two types of input sampling methods for an area, local or global (a small sketch follows the list):

• Local input field: V4 neurons have local input fields from the bottom-up. This means they sample the retinal image locally, depending on their position in the 2D major neural axes (ignoring depth). A neuron at location (i, j) with receptive field size w will take its input vector from a square with sides of length w, centered at location (i + ⌊w/2⌋, j + ⌊w/2⌋).

• Global input field: Neurons with global input fields sample the entire input area as a single vector.
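A small sketch of the two sampling rules just listed (array shapes, border clipping, and the example coordinates are illustrative assumptions):

    import numpy as np

    def local_input(prev_area, i, j, w):
        # Local input field: a w x w square whose top-left corner is (i, j),
        # i.e. centered at (i + w//2, j + w//2); clipped at the image border.
        H, W = prev_area.shape[:2]
        return prev_area[i:min(i + w, H), j:min(j + w, W)].ravel()

    def global_input(prev_area):
        # Global input field: the entire input area flattened into one vector.
        return prev_area.ravel()

    retina = np.random.default_rng(0).random((38, 38))
    x_v4 = local_input(retina, 10, 4, 19)        # one V4 neuron's bottom-up input
    x_it = global_input(np.zeros((20, 20, 3)))   # one IT neuron's bottom-up input
    print(x_v4.shape, x_it.shape)                # (361,) and (1200,)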
An architecture figure for WWN-3 is shown in Fig. 6.5. We initialized WWN-3 to use retinal images of total size 38 x 38, with foregrounds sized roughly 19 x 19 placed on them, where the foreground contours are based on the object's contours. V4 had 20 x 20 x 3 neurons, with bottom-up local input fields (of 19 x 19) at different locations on the retina (based on the neurons' 2D locations), and top-down global input fields. PP and IT also had 20 x 20 neurons and had bottom-up and top-down input fields that were global. LM had 20 x 20 neurons with global bottom-up input fields, and TM had 5 x 1 neurons (since there were 5 classes) with global bottom-up input fields.

6.1.2 Attention Selection Mechanisms at a High Level

Figure 6.2: WWN accomplishes different modes of attention by changing the directions of information flow. (a) For bottom-up, stimulus-driven attention (free-viewing), information flows forward from pixels to motors. (b) In order to disengage from the currently attended object, an internal suppression is applied at the motor end, and a diffuse top-down excitation causes a new foreground to be attended in the next bottom-up pass.

Selective attention is not a single process; instead, it has several components. These mechanisms can be broken down into orienting, filtering and searching. These are not completely independent, but the distinctions are convenient for the following discussion. The Where-What Network makes predictions about how each of these mechanisms could work, and, in the following, we discuss how they work in WWN-3.

Figure 6.3: WWN accomplishes different modes of attention by changing the directions of information flow (continuing the last figure). (c) Top-down location-based attention (orienting) occurs due to an imposed location motor, and information flows back to V4 and forward to the type motor. (d) Top-down type-based attention (object search) occurs due to an imposed type motor, and information flows back to V4 and forward to the location motor.

Orienting

Orienting is the placement of attention's location. In covert orienting, one places one's focus at a particular location in the visual field without moving the eyes; this is different from overt orienting, which requires the eyes to move. Covert orienting is realized in WWN based on sparse firing (e.g., winner-take-all, or WTA) in the LM area. LM neurons' firing is correlated with different attention locations on the retina, a correspondence which emerged through supervised learning. Changes in attended location can occur in three ways: (1) an attended area emerges through feedforward activity, (2) an attended location is imposed ("location-based" top-down attention, which could be done by a teacher or internally by the network), or (3) a currently attended location is suppressed, by boosting LM neurons in other areas. The effect of the last is to disengage attention and shift to some other foreground.

Filtering

Attention was classically discussed in terms of filtering [13,67,106]. In order to focus on a certain item in the environment, the "strength" of the other information seems to be diminished. In WWN-3, there are multiple passes of filtering. 1. Early filtering. This is done without any top-down control. As WWN-3 is developmental, the result of early filtering depends entirely on the filters that were developed in the supervised learning phase. The responses of these developed V4 neurons in WWN are then an indicator of what interesting foregrounds there are in the scene.
If there is no top-down control, internal dynamics will cause the network to converge to a steady state with a single neuron active in each of TM and LM, representing the type (class) and location of the foreground. A single feedforward pass is thought to be enough in many recognition tasks. But if there are multiple objects, attention seems to focus on each individually [109]. Another filtering process, based on top-down excitation, operates on the result of the first-pass filtering. 2. Biased filtering. This "second-pass" filtering is in the service of some goal, such as searching for a particular type of foreground. The first-pass filtering has coded the visual scene into firing patterns representing potentially meaningful foregrounds. Binding after the first pass would be done in a purely feedforward fashion, and an incorrect result could then emerge at the motor layers. The second-pass filtering is due to top-down expectation. It re-codes the result of the first-pass filtering into biased firing patterns, which then influence the motor areas. The second pass allows the network to synchronize its state among multiple areas. For example, feedback activity from the Location-Motor causes attention at a particular location, which causes the appropriate type, at that location, to emerge at the Type-Motor.

Searching

Searching for a foreground type is realized in WWN-3 based on competitive firing (e.g., WTA) of the Type-Motor (TM) neurons. Similar to the link between retinal locations and LM neurons, correlations between foreground type and TM neurons are established in the training phase. Along the ventral pathway, the location information is gradually discarded. Top-down type-based attention will cause type-specific activation to feed back from TM to V4, biasing first-stage filtered foregrounds that match the type being searched for. Afterwards, the new V4 activation feeds forward along the Where pathway and causes a single location to become attended, based on firing in the Location-Motor area.

6.1.3 Binding Problem

When multiple objects are present in the scene, the binding problem is an issue for WWN. Since location and type are dealt with separately, there are multiple internally valid solutions that are incorrect. Through top-down connections, we can handle the "coarse" binding problem through synchronization. See Fig. 6.4.

Figure 6.4: Toy example illustrating the "coarse" binding problem. (a) Feedforward with separable features: the internally valid solutions are {(T1, UL), (T1, LR), (T3, UL), (T3, LR)}. (b) Combination features: the internally valid solutions are {(T1, UL), (T3, LR)}. (c) Synchronization using top-down connections: the internally valid solutions are {(T1, UL), (T3, LR)}.

In (a), separable features avoid a combinatorial explosion of representation but introduce the problem of how to reintegrate the information. The two types of information here are location and type. When there are two types in two different locations in the input, a feedforward network using winner-take-all at the two output layers can give four possible internally valid solutions. But two of these are wrong. (b) A possible solution: combination neurons. Another winner-take-all layer of units that represent each combination of features can solve this problem, but this leads to the very combinatorial explosion we were trying to avoid by using separable feature layers. Additionally, every single combination would have to be individually learned.
(c) A solution avoiding the combinatorial explosion: synchronization using top-down connections. The initial feedforward activity is an initial guess of location and type, and this is allowed to be inconsistent. Top-down connections from the location area bias a particular area of the image so that the incorrect type is suppressed and the correct type chosen. This is a basic way to implement a "spotlight" of attention. There may still be binding required within the spotlight, if there are occlusions or transparency. Top-down connections within a visual hierarchy might solve the full binding problem (e.g., form biases lower-layer edges).

In WWN's bias-free mode, the binding problem has to be dealt with. However, combination neurons are problematic since they introduce the combinatorial explosion problem [107]; additionally, they do not allow proper generalization [116]. In WWN, the coarse binding problem (not considering occlusions or transparency) is handled by having a feedforward pass, then letting information flow from the location motor back towards the early feature layer and up the other pathway to the type motor. In this way, internal consistency is verified.

6.2 Concepts and Theory

Here, we discuss how bottom-up and top-down information are integrated in the Hebbian network (see Fig. 6.5).

Paired Input

For some neuron on some layer, denote its bottom-up excitatory vector by x and its top-down excitatory vector by z. We used the paired input

p = ( ρ x/||x||, (1 − ρ) z/||z|| ), (6.1)

where 0 ≤ ρ ≤ 1. The parameter ρ allows the network to control the influence of bottom-up vs. top-down activation, since the vector normalization fundamentally places bottom-up and top-down on equal ground. Setting ρ = 0.5 gives the bottom-up and top-down equal influence.

Paired Layers

In Fig. 6.5, we show three internal components within each major area of V4, IT, and PP. This internal organization is called paired layers. Paired layers handle and store bottom-up information separately from the top-down-boosted bottom-up information. The paired-layer organization is inspired by the six-layer laminar organization found in all cortical areas [18], but it has been simplified to its current form. Further discussion of this architectural feature is found in [94]. Paired layers allow a network to retain a copy of its unbiased responses internally. Without paired layers, top-down modulatory effects can corrupt the bottom-up information. Such "corruption" might be useful when the internal expectation (we consider top-down excitation as an internal expectation) relates to something in the visual scene. However, there are times when such expectations are violated. In such instances, the internal expectation is in opposition to reality. Then, the incorrect feedback can potentially lead to false alarms, or hallucinations.

Figure 6.5: WWN-3 system architecture. The retina (38 x 38 x 1) provides local input fields to V4 (20 x 20 x 3), which connects to IT and PP, and then to TM and LM, which in turn connect to the control of actions (e.g., speak a word, point, reach). The V4 area has three layers (depths) of feature detectors at all image locations. Paired layers: each area has a bottom-up component, a top-down component, and a paired (integration) component. Within each component, lateral inhibitory competition occurs. Note that in this figure top-down connections point up and bottom-up connections point down.
To explain further, we present the following formalism for how paired layers are implemented. Given a neuronal layer l, let V be the matrix containing the bottom-up column weight vectors of the neurons in this layer, and M be the matrix containing the top-down column weight vectors, where each of the column vectors in these matrices is normalized. X is a bottom-up input matrix: column i contains the input activations on neuron i's bottom-up input lines. Z is the top-down input matrix. For the following, X and Z are also column normalized. We use diag(A) to mean the vector consisting of the diagonal elements of the matrix A.

Non-paired layers. First compute layer l's pre-competitive response ŷ:

ŷ = ρ diag(VᵀX) + (1 − ρ) diag(MᵀZ), (6.2)

where ρ and 1 − ρ are positive weights that control the relative bottom-up and top-down influence. The post-competitive firing rate vector y of layer l is computed after a lateral inhibition function f, controlled by the layer's sparsity parameter k. The top k firing neurons will fire and the others have their firing rates set to zero, giving a sparse response:

y = g(f(ŷ, k)). (6.3)

The lateral inhibition and sparse coding method f is achieved by sorting the components of the pre-response vector ŷ. Let ŷ_1 be the highest value and ŷ_k be the k-th highest. Then set a neuron's response as:

y_i = ŷ_i, if ŷ_i ≥ ŷ_k; 0, otherwise, (6.4)

where g is a threshold function that prevents low responses after competition.
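Eqs. 6.2-6.4 can be sketched compactly as follows (hypothetical NumPy code; the threshold used in g is an arbitrary assumption):

    import numpy as np

    def top_k(v, k):
        # f in Eqs. 6.3-6.4: keep the k highest pre-responses, zero the rest.
        out = np.zeros_like(v)
        keep = np.argsort(v)[-k:]
        out[keep] = v[keep]
        return out

    def g(v, floor=1e-3):
        # Threshold function g: suppress very low responses after competition.
        return np.where(v > floor, v, 0.0)

    def nonpaired_layer(V, M, X, Z, rho, k):
        y_pre = rho * np.diag(V.T @ X) + (1 - rho) * np.diag(M.T @ Z)   # Eq. 6.2
        return g(top_k(y_pre, k))                                       # Eqs. 6.3-6.4

Column i of X (and of Z) holds neuron i's normalized input lines, so diag(VᵀX) gives each neuron's bottom-up match with its own weight vector, exactly as in the formalism above.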
Paired layers. With paired layers, the firing rate vector y of layer l neurons is computed after lateral inhibition of both the bottom-up part ŷ_b and the top-down part ŷ_t separately, with additional lateral inhibition applied to the integrated response:

y_b = g(f(diag(VᵀX), k_b)),
y_t = g(f(diag(MᵀZ), k_t)),
y = g(f(ρ y_b + (1 − ρ) y_t, k_p)). (6.5)

A key idea is that the lateral inhibition causes sparse firing within the bottom-up layer and even sparser firing in the paired layer (e.g., k_b = 20 and k_p = 10 where there are 1200 neurons). The top-down layer is generally not as sparse (e.g., k_t = 200), so that it might reach as many potentially biased neurons as possible. In a non-paired layer, such diffuse top-down biasing can match up with many filters that respond relatively weakly (when there is no top-down influence) and boost them above the stronger filters that have more bottom-up support. But the intermediate competition step in the paired-layer method ensures that the diffuse top-down biasing will not significantly boost the relatively weak filters, since they were already eliminated in the bottom-up competition. Filters with support from both bottom-up and top-down thus receive the most benefit. Both bottom-up and top-down are highly local in both space and time; they are highly spatiotemporally sensitive. The input image (bottom-up) could change quickly, or the "intent" of the network (top-down) could change quickly. In either case, a paired layer adapts quickly.
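The paired-layer computation of Eq. 6.5 differs from the non-paired version only in where the competition is applied. The sketch below reuses top_k and g from the previous sketch; the default sparsity values follow the example figures quoted above and are otherwise assumptions:

    import numpy as np  # top_k and g as defined in the previous sketch

    def paired_layer(V, M, X, Z, rho, k_b=20, k_t=200, k_p=10):
        # Bottom-up and top-down responses compete separately first ...
        y_b = g(top_k(np.diag(V.T @ X), k_b))     # sparse bottom-up layer
        y_t = g(top_k(np.diag(M.T @ Z), k_t))     # diffuse top-down layer
        # ... so a diffuse top-down expectation cannot revive filters that
        # already lost the bottom-up competition; the integrated response
        # then competes again, even more sparsely.
        return g(top_k(rho * y_b + (1 - rho) * y_t, k_p))               # Eq. 6.5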
Figure 6.8: Measured entropy along the "What" WWN pathway (histograms of the number of neurons versus per-neuron type entropy, in bits, for V2 and IT). In the what pathway, there is a clear entropy reduction from V2 to IT. The top-down connections enabled this emergence of discriminating representation to occur.

Figure 6.9: Measured entropy along the "Where" WWN pathway (histograms of the number of neurons versus per-neuron location entropy, in bits, for V2 and PP). There is not much of an entropy reduction along the where pathway; it hovers around 2.7 bits, which is about 6.5 different pixels of inaccuracy, a little smaller than a 3 x 3 pixel neighborhood. We suspect there is an accuracy ceiling for the current method, reached with 400 where classes (compared to 5 what classes) and only 400 neurons in PP and 1200 neurons in V2.

6.2.1 Learning Attention

Each bottom-up weight for a V4 neuron was initialized to a randomly selected 19 x 19 training foreground, as seen in Fig. 6.12. This greatly quickens training by placing the weights close to the locations in weight space to which we expect them to converge. Initial bottom-up weights for IT, PP, TM, and LM were set randomly. Top-down weights to V4, IT, and PP were set to ones, to avoid an initial top-down bias. (Setting the initial values to be positive rather than zero mimics the initial overgrowth of connections in early brain areas, which is later pruned by development.)

WWN learns through supervised learning externally and local learning internally. The Type-Motor and Location-Motor areas were firing-imposed per sample in order to train the network. There are c neurons in TM, one for each class, and we used 20 x 20 neurons in LM, one for each attention location. Each area's neurons self-organize in a way similar to self-organizing maps (SOM) [55], but with combined bottom-up and top-down input (weighted by p), and using the LCA algorithm for optimal weight updating. WWN's top-down connections have useful roles in network development (training): they lead to discriminant features and a motor-biased organization of lower layers. Explaining these developmental effects is out of the scope of this chapter, but they have been written about elsewhere [64]. Further focus on learning and development in WWN is presented in [47].

For a single area to learn, it requires bottom-up input X, top-down input Z, bottom-up and top-down weights V and M, and the parameters p (controlling the influence of bottom-up versus top-down), k_b (the number of neurons to fire and update after competition in the bottom-up layer), k_t (the same for the top-down layer), and k_p (the same for the paired layer). The area outputs the neuronal firing rates y and updates the neuronal weights V and M. The non-inhibited neurons update their weights using the Hebbian-learning LCA update [122]:
    v_i ← w(n_i) v_i + (1 − w(n_i)) x_i y_i    (6.7)

where the plasticity parameters w(n_i) and 1 − w(n_i) are determined automatically and optimally based on the neuron's updating age n_i. A neuron increments its age when it wins in competition, so the learning rate of each neuron is a function of its firing age. (The equation above is for bottom-up weights; for top-down weights, substitute z_i for x_i and m_i for v_i.) This learning is Hebbian, as the strength of the update depends on both the presynaptic potentials (e.g., x_i) and the postsynaptic potentials (e.g., y_i).

Denote the per-area learning algorithm we have discussed as LCA. To train the whole WWN, the following algorithm ran over three iterations per sample. Let θ = (k_b, k_t, k_p, p).

1. (y^V4, V^V4, M^V4) ← LCA(X^V4, Z^V4, V^V4, M^V4, θ^V4)
2. (y^IT, V^IT, M^IT) ← LCA(X^IT, Z^IT, V^IT, M^IT, θ^IT)
3. (y^P, V^P, M^P) ← LCA(X^P, Z^P, V^P, M^P, θ^P)
4. (y^TM, V^TM, ∅) ← LCA(X^TM, ∅, V^TM, ∅, θ^TM)
5. (y^LM, V^LM, ∅) ← LCA(X^LM, ∅, V^LM, ∅, θ^LM)

A few more items to note on training: (1) Each area's output firing rates y are sampled to become the next area's bottom-up input X and the previous area's top-down input Z. (2) For V4, each top-down source from the What or Where path is weighted equally in setting Z^V4. (3) We used a supervised training mechanism ("pulvinar"-based training [47]) to bias V4 to learn foreground patterns: we set its Z based on the firing of the LM area, so that only neurons with receptive fields on the foreground received a top-down boost. This is not quite a "skull-closed" training method, but it is used since we have a limited-size network. (4) For the PP (denoted "P" above) and IT areas, we used 3 x 3 neighborhood updating in the vein of self-organizing maps, in order to spread representation throughout the layer. This was done for the first two epochs. (5) The above algorithm is forward-biased, in the sense that it takes less time for information to travel from sensors to motors (one iteration) than from motors to V4 (two iterations). Therefore, weight updating in V4 only occurred on iterations two and three for each image. (6) Parameters: p^V4 = p^IT = p^P was set to 0.75 (bottom-up activity contributed 75% of the input, and top-down 1 − p = 0.25). In training, all values of k were set to one. This ensured that a sparse representation developed.
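A rough sketch of one LCA step for a single area follows, combining the paired-layer response from the earlier sketch with the Hebbian update of Eq. (6.7). The learning-rate schedule below is a simple stand-in: the dissertation's CCI plasticity [122] determines w(n_i) automatically, and its exact form is not reproduced here.

```python
import numpy as np

def retention_rate(age):
    """Placeholder for w(n): rises toward 1 with firing age (an assumption, not CCI plasticity)."""
    return age / (age + 1.0)

def lca_area(X, Z, V, M, theta, ages):
    """One LCA step for one area: compute paired-layer firing rates, then let the
    non-inhibited (winning) neurons update their weights by Eq. (6.7)."""
    k_b, k_t, k_p, p = theta
    y = paired_response(V, M, X, Z, p, k_b, k_t, k_p)    # from the earlier sketch
    for i in np.flatnonzero(y):                          # only firing neurons learn
        ages[i] += 1
        w = retention_rate(ages[i])
        V[:, i] = w * V[:, i] + (1.0 - w) * X[:, i] * y[i]   # bottom-up weights (Eq. 6.7)
        M[:, i] = w * M[:, i] + (1.0 - w) * Z[:, i] * y[i]   # top-down weights (z_i, m_i)
    return y, V, M
```

Training the whole network would then call such a function for V4, IT, PP, TM, and LM in the order listed above, three iterations per input sample, re-sampling each area's X and Z from its neighbors' outputs between iterations.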
Figure 6.10: Class representation that developed in the IT area (legend: cat, dog, truck, duck, car). Observe that there are five different regions, one per class. Neurons along the borders represent two (or more) classes. This is reflected in the entropy histograms above by the small group of neurons with about 1 bit of entropy (two choices).

Figure 6.11: Internal representation in the IT and PP areas in the developed network. (a) Response-weighted input (weighted sum of samples) for four neurons from IT; this shows the weighted average of the samples these neurons fired for. They represent the "duck" class, and some fire for multiple locations. (b) Response-weighted input for some PP neurons. These represent multiple classes, but a single location.

6.2.2 Entropy Reduction

Neurons further along the What pathway become more invariant to object position while becoming specific to object type. The opposite is true for the Where pathway. Neurons earlier in each pathway represent both location and type in a mixed way. For more information on learning, see [47]. Tests in feedforward mode with a single foreground in the scene showed that the recognition and orientation performance improved after epochs of learning (an epoch is an entire round of training all possible samples).

The specificity of a layer can be measured by its firing entropy relative to a set of classes, which are either types or locations. As an example, consider measuring the response of a single neuron in a "what" layer over a set of stimuli from different classes. If this neuron only fired when one of the classes was present in the stimulus and not for the others, it is considered type-specific. If this neuron only fired for one of the classes, and additionally that class could be placed in multiple locations, that neuron is also considered location-invariant. We measure the entropy of a neuron as

    − Σ_i Pr(i) log2 Pr(i),

where Pr(i) is the probability that the neuron fired for class i (there are c classes). We can then characterize the entropy of an entire layer or area by measuring it for each neuron and taking the average. (Normalizing by using a base-c logarithm instead would bound a neuron's entropy by one, reached when it fires equally often for each class.) Probability is measured approximately: we tested the network over a set of stimuli, each containing one of the c classes, and suppose the neuron fired for n of them. The probability that it fires for class i, Pr(i), is then measured as the number of stimuli containing class i for which this neuron fired, divided by n.

After learning, we measured the entropy along both pathways with respect to their particular information types. Entropy reduction for type was very apparent in the what pathway (see Fig. 6.8), but entropy reduction for location was not seen in the where pathway. This is probably due to the sheer number of where classes (400) compared to the what classes (5), and due to the high location specificity of the V4 neurons (a consequence of the "pulvinar" supervision method used). Fig. 6.10 shows how IT organized a class-grouped representation (due to top-down connections and 3 x 3 updating), allowing near-zero entropy to occur in IT. Here, only the border neurons fire for more than one class, as reflected in Fig. 6.8. Fig. 6.11 shows some internal representation for a few IT and PP neurons.
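For concreteness, a small sketch of the per-neuron entropy measure described above; the firing-count bookkeeping (one count per class per neuron) is an assumed data layout, and entropy is reported in bits as in Figs. 6.8 and 6.9.

```python
import numpy as np

def neuron_entropy(class_fire_counts):
    """Entropy (bits) of one neuron's firing across c classes.
    class_fire_counts[i] = number of test stimuli of class i the neuron fired for."""
    counts = np.asarray(class_fire_counts, dtype=float)
    n = counts.sum()
    if n == 0.0:
        return 0.0                      # a silent neuron contributes zero entropy
    pr = counts / n                     # Pr(i), estimated as described in the text
    pr = pr[pr > 0.0]                   # treat 0 * log2(0) as 0
    return float(-np.sum(pr * np.log2(pr)))

def layer_entropy(fire_counts):
    """Average per-neuron entropy for a layer (one row of class counts per neuron)."""
    return float(np.mean([neuron_entropy(row) for row in fire_counts]))
```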
6.2.3 Attention Selection Mechanisms

Through training, WWN-3 becomes sparsely and selectively wired for attention. Afterwards, manipulation of parameters allows information to flow in different ways. Changing the direction of information flow is key to its ability to perform different attention tasks. Specifically, it involves manipulating the p parameters (bottom-up vs. top-down weighting within an area) and γ (the fraction of top-down input to V4 that comes from IT vs. PP). In WWN-3, we examined four different attention modes. Free-viewing mode is completely feedforward. It quickly generates a set of candidate hypotheses about the image based on its learned filters, but there can be no internal verification that the type and location that emerge at the motors match (the binding problem). Free-viewing mode is necessary to reduce the complexity of an under-constrained search problem, but it cannot solve the problem itself. Another mode, top-down location-based binding, acts as a spotlight: it constrains firing at LM to a winner location neuron and selects for TM appropriately, through top-down bias from LM to V4 and bottom-up flow from V4 to TM. The top-down object-based attention mode allows the network to act in search mode: it constrains firing at TM to a winner neuron and selects for LM appropriately, through top-down bias from TM to V4 and bottom-up flow from V4 to LM. The fourth mode involves disengaging from the currently attended location, or from both the currently attended location and type.

Multiple objects introduce the binding problem for WWN-3. Another problem is the hallucination problem, which can occur when the image changes (containing different foregrounds). The following rules for attention allow WWN-3 to deal with multiple objects and image changes. Whenever a motor area's top neuron changes suddenly, switch to the corresponding top-down mode if it is a strong response, or switch to free-viewing mode if it is a weak response. Whenever both motor areas' top neurons change suddenly to strong responses, go into the top-down location-based mode. If the network focuses on a single location and type for too long, disengage from the current location. The following parameter settings specify how this was done in WWN-3. In all modes we set k_b^V4 = 8 and k_p^V4 = 4.

1) Bottom-up free-viewing: In forward mode, there is no top-down influence, since the network does not yet have any useful internal information to constrain its search. WWN-3 sets p^V4 = p^P = p^IT = 1.

2) Top-down searching (object-based): In this mode, information must travel from TM down to V4 and back up to LM. If γ = 1, all top-down influence to V4 is from the what path. Set γ = 1, p^IT = 1, p^P = 0, and p^V4 = 0.5. k_t^V4 is large (50% of the neurons) to allow widespread top-down bias. For IT and PP, k_b is set small (up to 10% of the neurons) for sparse coding, while k_t and k_p must be large enough to contain all neurons that may carry a bias. For example, k_t^IT = n/c, where c = 5 (classes) and n = 400 neurons, if we assume an equal number of neurons per class.

3) Top-down location-based: In this mode, information must travel from LM to V4 and back up to TM. Thus, the network sets γ = 0, p^IT = 0, p^P = 1, and p^V4 = 0.5. The k parameters are set the same as for object-based attention.

4) Location and type attention shift: In this mode, the network must disengage from its currently attended foreground to try to attend to another foreground. To do so, the currently firing motor neurons are inhibited (for the location motor, an inhibition mask of size 15 x 15 was used) while all other neurons are slightly excited, until the information can reach V4 (two iterations). The information flows top-down from motors to V4 (p^IT = p^PP = 1). The top-down activation parameters k_t must be set much larger, since all neurons except those of the current class should be boosted. After two iterations, the network re-enters free-viewing mode.

Is it plausible to have multiple different attention modes? Computationally, attention and recognition pose a "chicken-and-egg" problem: for attention, it seems recognition must be done first, and for recognition, it seems attention (segmentation) must be done first. The brain might deal with this problem by using complementary pathways and different internal modes to enforce internal validity and synchronization. For example, Treisman famously showed [109] that there is an initial parallel search followed by a serial search (the spotlight), which binds features into object representations at each location. It seems that after feedforward activity, a top-down location bias emerges to focus on a certain spot.

6.3 Experiments

Each input sample to WWN-3 contains one or more foregrounds superimposed over a natural background. The background patches were 38 x 38 in size, and selected from 13 natural images (available from http://www.cis.hut.fi/projects/ica/imageica/). The foregrounds were selected from the MSU 25-Objects
The foregrounds were selected from the MSU 25-Objects 3Available from http://www.cis.hut.fi/projects/ica/imageica/ 156 dataset [64] of objects rotating in depth. The foregrounds were normalized to 19 x 19 size square images, but were placed in the background so that the gray square contour was eliminated (masked). Three training views and two testing views were selected per each of the five classes. The classes and within-class variations of the foregrounds can be seen in Fig. 6.15(b). Five input samples, with a single foreground placed over different backgrounds, can be seen in Fig. 6.15(a). 6.3.1 WWN-3 Learns Figure 6.12: The foregrounds used in the experiment. There are three training (left) and two testing (right) foregrounds from each of the five classes of toys: “cat”, “pig”, “dump-truck”, “duck”, and “car”. Figure 6.13: Sample image inputs. 157 "Where" and "What" motor performance through epochs 1.7 , I . ,‘-I~ i’.-'~.-.-I—._..-’-‘Y"~.’io.96 "I. j ' i - «0.94 E -r 92 ‘0.92 a) “515 . , a t “0.9 E e .9 a) 0 g 1.3— .92 C] H I i ‘ .8 0 5 10 15 28 Epochs Figure 6.14: Performance results on disjoint testing data over epochs. The training set consisted of composite foreground and background images, with one foreground per image. We trained every possible training foreground at every possible location (pixel-specific), for each epoch, and we trained over many epochs. So, there are 5 classes x 3 training instances X 20 x 20 locations = 6000 different training foreground configurations. After every epoch, we tested every possible testing foreground at every possible location. There are 5 x 2 X 20 x 20 = 4000 different testing foreground configurations. To simulate a shortage of neuronal resource relative to the input variability, we used a small network, five classes of objects, with images of a single size, and many different natural backgrounds. Both the training and testing sets used the same 5 object classes, but different background images. As there are only 3 V4 neurons at each location but 15 training object views, the WWN is 4/5 = 80% short of resource to memorize all the foreground objects. Each V4 neuron must deal with various misalignment between an object and its receptive field, simulating a more realistic resource situation. Location was tested in all 20 x 20 = 400 locations. 158 To see how a network does as it learns, we tested a single foreground in free- viewing mode throughout the learning time. As reported in Fig. 6.15(b), a network gave respectable performance after only the first round (epoch) of practice. After 5 epochs of practice, this network reached an average location error around 1.2 pixels and a correct disjoint classification rate over 95%. 6.3.2 Two Object Scenes After training had progressed enough so the bottom-up performance with a single foreground was sufficient, we wished to investigate WWN-3’s ability with two ob- jects, and in top-down attention. We tested the above trained WWN-3 with two competing objects in each image, placed at four possible quadrants to avoid over- lapping. We placed two difierent foregrounds in two of the four corners. There were 5 classes x 4 corners x 3 other corners (for the second foreground) = 60 combinations. WWN-3 first operated starting in free-viewing mode (no imposed motors), until it converged. If the type and location (within 5 pixels) matched one of the foregrounds, it was considered a success. 
Next, the type of the foreground that wasn't located was imposed at TM as an external goal, and WWN-3 operated in top-down searching mode to locate the other foreground. WWN-3 would then shift its attention back to the first object. Finally, the location of the foreground that wasn't identified in the first phase was imposed at LM as an external goal, and WWN-3 operated in top-down location-based mode to find the other foreground.

As shown in Fig. 6.15, the success rate for this network was 95% for the free-viewing test. The success rates were 96% when given type context to predict location and 90% when given location context to predict type. It successfully attended to the other object via an attention shift 83% of the time.

Figure 6.15: WWNs for the joint attention-recognition problem in bias-free and goal-based modes for two-object scenes. (a) Performance when the input contains two learned objects ("Two competing objects in the scene"): bias-free, two types of imposed goals (top-down type-context for search and location-context for orient-and-detect), and shifting attention from one object to the other. (b) A few examples of operation over the different modes by a trained WWN-3 (columns: input image, free-viewing, "cat" context, "truck" context). "Context" means a top-down goal is imposed. An octagon indicates the location and type action outputs; the octagon is the default receptive field.

6.3.3 Cross-Layer Connections

Callaway [18] discussed how the cortex might use alternate pathways through the pulvinar and thalamus so that an earlier area can send information in a more direct way to later areas. These alternate pathways would allow the higher-level areas to have higher-resolution versions of the input signals, which may have been significantly transformed in the cortico-cortical pathways passing through many areas. This experiment tests an alternate pathway in WWN, but in the opposite direction.

The purpose of this section is to investigate the performance effects of direct top-down connections from the type-motor to V2. The direct-connection network maintains nearly "pure" type specificity in the earliest layer, leading to higher recognition accuracy for the limited set of foregrounds tested. The following experiment is designed to test the predictions that (i) the entropy of V2 will be greatly decreased through top-down connections from that motor and (ii) such low entropy can lead to higher recognition rates, given sufficiently large resources.

Two network architecture types were trained, as seen in Fig. 6.16. The first used top-down connections directly from TM to V2. The second architecture did not use the TM-to-V2 connections. V2 contained 20 x 20 x 3 = 1200 neurons, PP and IT each contained 20 x 20 neurons, and there were 5 type classes and 20 x 20 = 400 location classes. Training occurred in the same way as before. Every neuron had its type and location winning probabilities recorded through training. A simple way to measure how well the architecture would do with direct connections to TM and LM and a large amount of training experience is to take the winning V2 neuron's highest type probability as the output type, as indicated in Fig. 6.16 as the "V2 entropy classifier" block.
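A minimal sketch of the "V2 entropy classifier" just described; the per-neuron type-probability table recorded during training is assumed to be available as an array (one row per V2 neuron, one column per class):

```python
import numpy as np

def v2_entropy_classify(v2_response, type_prob):
    """Predict the object type from the winning V2 neuron's recorded probabilities.

    v2_response: firing rates of all V2 neurons for the current input.
    type_prob:   array of shape (num_v2_neurons, c) holding each neuron's
                 type-winning probabilities accumulated during training.
    """
    winner = int(np.argmax(v2_response))        # winning V2 neuron
    return int(np.argmax(type_prob[winner]))    # its most probable class
```

Location could be read out analogously from a per-neuron location-probability table; the WWN-path classification reported in the tables below instead passes the V2 response up through IT and PP to the motor areas.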
Comparisons between V2 entropy-based classification and the classification through IT and PP are shown in Tables 6.1 and 6.2 (for the disjoint test data), with the first table showing the result for the architecture trained with direct top-down connections from TM to V2.

Figure 6.16: WWN-3 trains V2 through pulvinar location supervision and bottom-up based LCA. We added a new direct connection from TM so that V2 develops heightened type-specificity (even though it was already fairly type-specific). To test the coupled specificity of V2 representations, an alternate classifier, based on the winning neuron's entropy, was developed.

The recognition rates and location errors (measured in pixels) show that the architecture that used direct top-down connections is better overall. As expected, using the classification paths through IT and PP slightly decreased the performance. Average entropy is shown in Table 6.3. Architecture 1 led to a greatly reduced type-entropy in V2 (close to zero), as compared to architecture 2.

Table 6.1: Architecture 1: trained with top-down from TM to V2
                          V2 entropy-based classification    WWN network classification
Recognition rate          95%                                92.4%
Distance error (pixels)   1.1                                1.9

Table 6.2: Architecture 2: trained without top-down from TM to V2
                          V2 entropy-based classification    WWN network classification
Recognition rate          91.4%                              89.4%
Distance error (pixels)   1.5                                2.1

Table 6.3: Average entropy for both architectures, in bits
              Architecture 1    Architecture 2
V2 (what)     0.05              0.28
IT            0.16              0.18
V2 (where)    2.7               2.6
PP            2.4               2.2

This experiment shines light on a question of architecture selection for developmental recurrent networks. Should earlier areas have direct (cross-layer) connectivity to and from much later areas? Enabling entropy reduction from sensors to motors is probably a principle of development. But if the data is too complex, it takes too much resource to reduce entropy enough within a single layer. It would take a large amount of representative resource to recognize a large number of objects without any shared features. Using direct top-down connections can increase the class-selectivity of the neurons, as shown above, but probably at the expense of efficient resource utilization.

6.4 Summary

The work here demonstrated that WWN-3 is capable of bottom-up and top-down attention when two objects are in the scene. It uses internal synchronization through top-down connections to avoid the binding problem for this data. Either "where" or "what" can be imposed to provide top-down context bias, as a goal or preference. Experiments using the disjoint foreground object subimages with general object contours reached an encouraging performance level with a limited-size WWN-3.

6.5 Bibliographical Notes

The first version, WWN-1 [48], operated on single foregrounds over random natural image backgrounds: type recognition given a location (top-down location-based) and location finding given a type (top-down object-based), where 5 locations were tested. The second version, WWN-2 [47], additionally realized attention and recognition for single objects in natural backgrounds without supplying either position or type (free-viewing), and also used a more complex architecture. Further, all pixel locations were tested. The work reported here corresponds to the third version, WWN-3 [63].
It extends the prior versions WWN-1 and WWN-2 to deal with multiple objects in natural backgrounds, multiple views per class, non-square object contours, and disjoint testing foregrounds.

Chapter 7

Concluding Remarks

The main contribution of this work is the methods and theory for utilizing excitatory top-down connections in multilayer developmental networks. Top-down connections distribute positive bias, which can be exploited and controlled using this work's methods. There are two types of effects: spatial and temporal. Spatial: the topographic class grouping framework addresses the dual problems of feature extraction and automatic network wiring by illustrating how and why top-down connections can cause biased (reduced-entropy) features and modular networks to emerge from incremental experience. Temporal: the firing of higher-layer units represents an abstract bias distribution. Top-down connections propagate this bias to lower-layer units, and lower-layer units feed back into the higher-layer units. It was shown that a clear interpretation of top-down connections in modular networks is as appropriate boosting of sub-features. Lateral inhibition, paired layers, and firing thresholds are effective at avoiding incremental corruption of that temporal context over time. But in networks without modularity, this interpretation does not apply.

The theme of this work is cortex-inspired developmental vision. The design of our visual cortex evolved to deal with the problems of many-object representation, attention, and binding that must be solved by a developmental vision system. Therefore, study of the cortex should lead to important insights about how to determine parameters and evaluate the artificial networks. Many of the cortical inspirations for these mechanisms and architectures are explained elsewhere in this work. To point out a few: (1) top-down connections are as numerous as bottom-up connections, and nearly all layers that are connected are connected bidirectionally; and (2) modularity is an important organizing principle. The role of recurrence at all layers of deep feature hierarchies and attention systems has not been sufficiently explored. This work is a step towards this goal.

Where-what networks are recurrent networks for attention and recognition. They use separate feature maps to represent location and type. It was shown how recurrent connectivity can enforce multilayer synchronization. This handles the binding problem (in the absence of occlusions and transparency) without using combination neurons. When there are multiple objects in the scene, an unconstrained search for a single object's location and type has multiple solutions that can be correct with respect to the network but incorrect with respect to the world. From an unattended image, the network generates a plausible hypothesis along each path. By letting one of the unknowns become a constraint, the other is synchronized via top-down bias to the earlier layer, followed by updated forward activity along the other path.

Experimental results are also key contributions. Some results are summarized here. In the LCA framework, it was shown how dual optimality greatly assists LCA's speed and precision of feature extraction, as compared to other Hebbian learning rules that use a manually tuned learning rate. And LCA's "CCI plasticity" showed that long-term plasticity is possible even with such fast convergence.
Topographic class grouping emerged for handwritten digit recognition and for recognition of objects from different 3D viewpoints, and the networks that learned with top-down connections enabled developed more effective compression than those without them. They also developed modularity. Combining top-down connections with adaptive lateral excitation led to better compression compared to isotropic updating, but at the expense of smoothness and connector hubs. The modular networks that developed were shown to use recurrence for significant performance boosts in object recognition and vehicle recognition (using local features) when the data was temporally ordered. Where-what networks successfully performed bottom-up and top-down attention when there were multiple objects in the scene.

The future of this work involves extending WWNs via a larger ventral pathway, i.e., a feature hierarchy. It would also be interesting to look into a motion pathway for sequence or trajectory learning in vision using WWNs. The associative information filling-in effects of recurrent excitation will be examined. Finally, networks should be embedded into active agents, which will interact with and learn about the world.

BIBLIOGRAPHY

[1] P.J. Antsaklis and A.N. Michel. Linear Systems. 2006.
[2] G. Backer, B. Mertsching, and M. Bollmann. Data- and model-driven gaze control for an active-vision system. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(12):1415-1429, 2001.
[3] R. Baillargeon. Object permanence in 3.5- and 4.5-month-old infants. Developmental Psychology, 23:655-664, 1987.
[4] C.I. Baker, C. Keysers, J. Jellema, B. Wicker, and D.I. Perrett. Neuronal representation of disappearing and hidden objects in temporal cortex of the macaque. Exp. Brain Res., 140:375-381, 2001.
[5] H. Barlow. Redundancy reduction revisited. Network: Computation in Neural Systems, 12:241-253, 2001.
[6] H.B. Barlow, C. Blakemore, and J.D. Pettigrew. The neural mechanism of binocular depth discrimination. The Journal of Physiology, 193(2):327, 1967.
[7] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. 19(7):711-720, July 1997.
[8] A.J. Bell and T.J. Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327-3338, 1997.
[9] I. Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94(2):115-147, 1987.
[10] I. Biederman and P.C. Gerhardstein. Recognizing depth-rotated objects: Evidence and conditions for three-dimensional viewpoint invariance. Journal of Experimental Psychology: Human Perception and Performance, 19(6):1162-1182, 1993.
[11] C. Blakemore and G.F. Cooper. Development of the brain depends on the visual environment. Nature, 228:477-478, Oct. 1970.
[12] J.-P. Braquelaire and L. Brun. Comparison and optimization of methods of color image quantization. IEEE Transactions on Image Processing, 6(7):1048-1051, 1997.
[13] D.E. Broadbent. Perception and Communication. 1958.
[14] J. Bullier. Hierarchies of cortical areas. In J.H. Kaas and C.E. Collins, editors, The Primate Visual System, pages 181-204. CRC Press, New York, 2004.
[15] E. Bullmore and O. Sporns. Complex brain networks: graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience, 10(3):186-198, 2009.
[16] D.V. Buonomano and U.R. Karmarkar. How do we tell time? Neuroscientist, 8:42-51, 2002.
[17] P. Buzas, K. Kovacs, A.S. Ferecsko, J.M.L.
Budd, U.T. Eysel, and Z.F. Kisvarday. Model-based analysis of excitatory lateral connections in the visual cortex. Journal of Comparative Neurology, 499:861-881, 2006.
[18] E.M. Callaway. Local circuits in primary visual cortex of the macaque monkey. Annu. Rev. Neurosci., 21:47-74, 1998.
[19] L.F. Chen, H.Y.M. Liao, M.T. Ko, J.C. Lin, and G.J. Yu. A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition, 33(10):1713-1726, 2000.
[20] Y.-M. Cheung and L. Law. Rival-model penalized self-organizing map. IEEE Transactions on Neural Networks, 18:289-295, 2007.
[21] B.G. Cumming and A.J. Parker. Binocular neurons in V1 of awake monkeys are selective for absolute, not relative, disparity. Journal of Neuroscience, 19(13):5602, 1999.
[22] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. pages 428-441. Springer.
[23] P. Dayan and L.F. Abbott. Theoretical Neuroscience. Citeseer, 2001.
[24] V.C. de Verdiere and J.L. Crowley. Visual recognition using local appearance. Computer Vision (ECCV'98), page 640.
[25] G. Deco and E.T. Rolls. A neurodynamical cortical model of visual attention and invariant object recognition. Vision Research, 40:2845-2859, 2004.
[26] R. Desimone, T.D. Albright, C.G. Gross, and C. Bruce. Stimulus-selective properties of inferior temporal neurons in the macaque. Journal of Neuroscience, 4(8):2051-2062, 1984.
[27] R. Desimone and J. Duncan. Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18:193-222, 1995.
[28] M. Dissanayake, P. Newman, S. Clark, H.F. Durrant-Whyte, and M. Csorba. A solution to the simultaneous localization and map building (SLAM) problem. IEEE Transactions on Robotics and Automation, 17(3):229-241, 2001.
[29] R.J. Douglas, C. Koch, M. Mahowald, K.A. Martin, and H.H. Suarez. Recurrent excitation in neocortical circuits. Science, 269(5226):981, 1995.
[30] S. Edelman. Representation and Recognition in Vision. The MIT Press, 1999.
[31] J.L. Elman. Finding structure in time. Cognitive Science: A Multidisciplinary Journal, 14(2):179-211, 1990.
[32] D.J. Felleman and D.C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1:1-47, 1991.
[33] D.J. Felleman and D.C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex, 1(1):1-47, 1991.
[34] M. Franzius, N. Wilbert, and L. Wiskott. Invariant object recognition with slow feature analysis. Artificial Neural Networks - ICANN 2008, pages 961-970.
[35] D.J. Freedman, M. Riesenhuber, T. Poggio, and E.K. Miller. A comparison of primate prefrontal and inferior temporal cortices during visual categorization. Journal of Neuroscience, 23(12):5235, 2003.
[36] K. Fukushima, S. Miyake, and T. Ito. Neocognitron: A neural network model for a mechanism of visual pattern recognition. 13(5):826-834, 1983.
[37] F. Gebali. Reducible Markov chains. Analysis of Computer and Communication Networks, pages 1-32.
[38] J. Hertz, A. Krogh, and R.G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, New York, 1991.
[39] G.E. Hinton, S. Osindero, and Y.W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
[40] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[41] J.J. Hopfield. Neural networks and physical systems with emergent collective computational abilities.
Proceedings of the National Academy of Sciences, 79(8):2554-2558, 1982.
[42] D.H. Hubel and T.N. Wiesel. Receptive fields of single neurons in the cat's striate cortex. Journal of Physiology, 148:574-591, 1959.
[43] D.H. Hubel and T.N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160(1):107-155, 1962.
[44] A. Hyvarinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411-430, 2000.
[45] M. Ito and C.D. Gilbert. Attention modulates contextual influences in the primary visual cortex of alert monkeys. Neuron, 22:593-604, 1999.
[46] L. Itti and C. Koch. Computational modelling of visual attention. Nat. Rev. Neurosci., 2:194-203, 2001.
[47] Z. Ji and J. Weng. Where what network-2: A biologically inspired neural network for concurrent visual attention and recognition. In IEEE World Congress on Computational Intelligence, Spain, 2010.
[48] Z. Ji, J. Weng, and D. Prokhorov. Where-what network 1: "Where" and "what" assist each other through top-down connections. In Proc. 7th Int'l Conf. on Development and Learning, Monterey, CA, August 9-12, 2008.
[49] I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[50] E.R. Kandel, J.H. Schwartz, and T.M. Jessell, editors. Principles of Neural Science. McGraw-Hill, New York, 4th edition, 2000.
[51] N. Kanwisher. Repetition blindness and illusory conjunctions: Errors in binding visual types with visual tokens. Journal of Experimental Psychology: Human Perception and Performance, 17(2):404-421, 1991.
[52] C. Koch. The Quest for Consciousness: A Neurobiological Approach. Roberts and Company Publishers, Englewood, Colorado, 2004.
[53] C. Koch and S. Ullman. Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 4:219-227, 1985.
[54] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 2nd edition, 1997.
[55] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 3rd edition, 2001.
[56] L. von Melchner, S.L. Pallas, and M. Sur. Visual behavior mediated by retinal projections directed to the auditory pathway. Nature, 404:871-876, 2000.
[57] V.A.F. Lamme. Blindsight: the role of feedforward and feedback corticocortical connections. Acta Psychol., 107:209-228, 2001.
[58] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[59] Y. LeCun, F.J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2004.
[60] D. Lee and S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.
[61] E.L. Lehmann. Theory of Point Estimation. John Wiley and Sons, Inc., New York, 1983.
[62] M. Luciw and J. Weng. Laterally connected lobe component analysis: precision and topography. In Proc. 9th International Conf. on Development and Learning (ICDL '09), Shanghai, China, June 5-7, 2009.
[63] M. Luciw and J. Weng. Where what network-3: Developmental top-down attention with multiple meaningful foregrounds. In Proc. IEEE World Congress on Computational Intelligence, Barcelona, Spain, 2010.
[64] M.D. Luciw and J. Weng. Topographic class grouping with applications to 3D object recognition. In Proc. International Joint Conference on Neural Networks, Hong Kong, June 1-6, 2008.
[65] Matthew D.
Luciw. Multilayer in-place learning for autonomous mental development. Master's thesis, Michigan State University, 2006.
[66] J. Lucke and M. Sahani. Maximal causes for non-linear component extraction. The Journal of Machine Learning Research, 9:1227-1267, 2008.
[67] D.G. MacKay. Aspects of the theory of comprehension, memory and attention. The Quarterly Journal of Experimental Psychology, 25(1):22-40, 1973.
[68] P.E. Maldonado, I. Godecke, C.M. Gray, and T. Bonhoeffer. Orientation selectivity in pinwheel centers in cat striate cortex. Science, 276:1551-1555, 1997.
[69] M. Ranzato, Y.-L. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. Advances in Neural Information Processing Systems, 20:1185-1192.
[70] D. Marr. Vision. Freeman, New York, 1982.
[71] J.H. Maunsell and D.C. Van Essen. The connections of the middle temporal visual area (MT) and their relationship to a cortical hierarchy in the macaque monkey. J. Neuroscience, 3(12):2563-2586, 1983.
[72] B.A. McGuire, C.D. Gilbert, P.K. Rivlin, and T.N. Wiesel. Targets of horizontal connections in macaque primary visual cortex. J. Comp. Neurol., 305:370-392, 1991.
[73] R. Miikkulainen, J.A. Bednar, Y. Choe, and J. Sirosh. Computational Maps in the Visual Cortex. Springer, Berlin, 2005.
[74] M. Mishkin, L.G. Ungerleider, and K.A. Macko. Object vision and spatial vision: Two cortical pathways. Philosophy and the Neurosciences: A Reader, page 199, 2001.
[75] M.C. Mozer, M.H. Wilder, and D. Baldwin. A Unified Theory of Exogenous and Endogenous Attentional Control. Department of Computer Science and Institute of Cognitive Science, University of Colorado, Boulder, CO 80309, 430, 2007.
[76] H. Murase and S.K. Nayar. Visual learning and recognition of 3-D objects from appearance. International Journal of Computer Vision, 14(1):5-24, 1995.
[77] E. Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267-273, 1982.
[78] B.A. Olshausen, C.H. Anderson, and D.C. Van Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13(11):4700, 1993.
[79] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311-3325, 1997.
[80] B.A. Olshausen and D.J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609, June 13, 1996.
[81] S.M. Pan and K.S. Cheng. An evolution-based tabu search approach to codebook design. Pattern Recognition, 40(2):476-491, 2007.
[82] A. Pascual-Leone and V. Walsh. Fast back projections from the motion to the primary visual area necessary for visual awareness. Science, 292:510-512, 2001.
[83] A. Pasupathy and C.E. Connor. Shape representation in area V4: position-specific tuning for boundary conformation. Journal of Neurophysiology, 86(5):2505, 2001.
[84] J. Piaget. The Construction of Reality in the Child. Basic Books, New York, 1954.
[85] R.D. Raizada and S. Grossberg. Towards a theory of the laminar architecture of cerebral cortex: computational clues from the visual system. Cereb. Cortex, 13:100-113, 2003.
[86] M. Riesenhuber and T. Poggio. Models of object recognition. Nature Neuroscience, 3:1199-1204, 2000.
[87] I. Rock and J. DiVita. A case of viewer-centered object perception. Cognitive Psychology, 19(2):280-293, 1987.
[88] H. Schneiderman and T. Kanade.
A statistical method for 3D object detection applied to faces and cars. page 1746, 2000.
[89] T.J. Sejnowski. Predictive sequence learning in recurrent neocortical circuits. Advances in Neural Information Processing Systems 12, page 164, 2000.
[90] R.N. Shepard and J. Metzler. Mental rotation of three-dimensional objects. Science, 171(3972):701, 1971.
[91] E. Simoncelli and B. Olshausen. Natural image statistics and neural representation. Annu. Rev. Neurosci., 24:1193-1216, 2001.
[92] L.C. Sincich and J.C. Horton. The circuitry of V1 and V2: integration of color, form, and motion. 2005.
[93] Y.F. Sit and R. Miikkulainen. Self-organization of hierarchical visual maps with feedback connections. Neurocomputing, 69:1309-1312, 2006.
[94] M. Solgi and J. Weng. Developmental stereo: Emergence of disparity preference in models of the visual cortex. IEEE Trans. on Autonomous Mental Development, 1(4):238-252, 2009.
[95] C. Southan. Has the yo-yo stopped? An assessment of human protein-coding gene number. Proteomics, 4(6):1712-1726, 2004.
[96] M. Sur, A. Angelucci, and J. Sharma. Rewiring cortex: The role of patterned activity in development and plasticity of neocortical circuits. Journal of Neurobiology, 41:33-43, 1999.
[97] D.L. Swets and J. Weng. Using discriminant eigenfeatures for image retrieval. 18(8):831-836, 1996.
[98] K. Tanaka. Inferotemporal cortex and object vision. Annual Review of Neuroscience, 19:109-139, 1996.
[99] M.J. Tarr and H.H. Bulthoff. Is human object recognition better described by geon structural descriptions or by multiple views? Comment on Biederman and Gerhardstein (1993). Journal of Experimental Psychology: Human Perception and Performance, 21(6):1494-1505, 1995.
[100] J.G. Taylor. The CODAM model of attention and consciousness. pages 292-297, 2003.
[101] J.G. Taylor. CODAM: A neural network model of consciousness. Neural Networks, 20(9):983-992, 2007.
[102] J.M. Hupe et al. Cortical feedback improves discrimination between figure and background by V1, V2 and V3 neurons. Nature, 394:784-787, 1998.
[103] S. Thorpe, D. Fize, and C. Marlot. Speed of processing in the human visual system. Nature, 381(6582):520-522, 1996.
[104] H. Tomita, M. Ohbayashi, K. Nakahara, I. Hasegawa, and Y. Miyashita. Top-down signal from prefrontal cortex in executive control of memory retrieval. Nature, 401(6754):699-703, 1999.
[105] R.B.H. Tootell, K.J. Devaney, J.C. Young, G. Postelnicu, R. Rajimehr, and L.G. Ungerleider. fMRI mapping of a morphed continuum of 3D shapes within inferior temporal cortex. Proc. Natl. Acad. Sci. USA, 105:3605-3609, 2008.
[106] A. Treisman. Monitoring and storage of irrelevant messages in selective attention. Journal of Verbal Learning and Verbal Behavior, 3(6):449-459, 1964.
[107] A. Treisman. Solutions to the binding problem: progress through controversy and convergence. Neuron, 24:105-110, 1999.
[108] A. Treisman and H. Schmidt. Illusory conjunctions in the perception of objects. Cognitive Psychology, 14(1):107-141, 1982.
[109] A.M. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 12(1):97-136, 1980.
[110] D.Y. Ts'o, C.D. Gilbert, and T.N. Wiesel. Relationships between horizontal interactions and functional architecture in cat striate cortex as revealed by cross-correlation analysis. J. Neurosci., 6:1160-1170, 1986.
[111] J.K. Tsotsos, S.M. Culhane, W.Y.K. Wai, Y. Lai, N. Davis, and F. Nuflo. Modeling visual attention via selective tuning. Artificial Intelligence, 78:507-545, 1995.
[112] J.K. Tsotsos. The complexity of perceptual search tasks. University of Toronto, Dept. of Computer Science, 1989.
[113] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86, 1991.
[114] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas. SOM Toolbox for Matlab 5. Technical Report A57, Helsinki University of Technology, Finland, 2000.
[115] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. volume 1, page 511, 2001.
[116] C. von der Malsburg. The what and why of binding: the modeler's perspective. Neuron, 24:95-104, 1999.
[117] J. Weng. Task muddiness, intelligence metrics, and the necessity of autonomous mental development. Minds and Machines, 19(1):93-115, 2009.
[118] J. Weng, N. Ahuja, and T.S. Huang. Cresceptron: a self-organizing neural network which grows adaptively. In Proc. Int'l Joint Conference on Neural Networks, volume 1, pages 576-581, Baltimore, Maryland, 1992.
[119] J. Weng, N. Ahuja, and T.S. Huang. Learning recognition and segmentation using the Cresceptron. 25(2):109-143, Nov. 1997.
[120] J. Weng, H. Lu, T. Luwang, and X. Xue. A multilayer in-place learning network for development of general invariances. International Journal of Humanoid Robotics, 4(2), 2007.
[121] J. Weng, H. Lu, T. Luwang, and X. Xue. Multilayer in-place learning networks for modeling functional layers in the laminar cortex. Neural Networks, 2008.
[122] J. Weng and M. Luciw. Dually-optimal neuronal layers: Lobe component analysis. IEEE Transactions on Autonomous Mental Development, 1(1), 2009.
[123] J. Weng, J. McClelland, A. Pentland, O. Sporns, I. Stockman, M. Sur, and E. Thelen. Autonomous mental development by robots and animals. Science, 291(5504):599-600, 2001.
[124] J. Weng and N. Zhang. Optimal in-place learning and the lobe component analysis. In Proc. World Congress on Computational Intelligence, Vancouver, Canada, July 16-21, 2006.
[125] J. Weng, Y. Zhang, and W. Hwang. Candid covariance-free incremental principal component analysis. 25(8):1034-1040, 2003.
[126] P.J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550-1560, 1990.
[127] P.J. Werbos. The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. Wiley-Interscience, 1994.
[128] X. Wu and K. Zhang. A better tree-structured vector quantizer. In Data Compression Conference (DCC '91), pages 392-401, 1991.
[129] Y. Zhang, J. Weng, and W. Hwang. Auditory learning: A developmental method. IEEE Transactions on Neural Networks, 16(3):601-616, 2005.