USING DUALLY OPTIMAL LCA FEATURES IN SENSORY AND ACTION SPACES FOR CLASSIFICATION

By

Nikita Nitin Wagle

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Computer Science and Engineering

2012

ABSTRACT

USING DUALLY OPTIMAL LCA FEATURES IN SENSORY AND ACTION SPACES FOR CLASSIFICATION

By Nikita Nitin Wagle

Appearance-based methods have utilized a variety of techniques to extract training-data dependent features, such as Linear Discriminant Analysis (LDA), Support Vector Machines (SVM), k-means clustering, and sparse auto-encoders. Developmental Networks (DN) use Lobe Component Analysis (LCA) features developed not only from the image space X but also from the effector (action) space Z. Since the effector space Z can be taught to represent a set of trainer-specified meanings (e.g., type and location in two ports), a DN treats these meanings in a unified way for both detection and recognition of objects in dynamic cluttered backgrounds. However, the DN method has not been applied to publicly available data sets and compared with well-known major techniques. In this work, we fill this void. We describe how the Z information enables the features to be more sensitive to trainer-specified output meanings (e.g., type and location). The reported experiments fall into two extensively studied categories — global template based object recognition and local template based scene classification. For the data sets used, the performance of the DN method is better than or comparable to some global template based methods and comparable to some major local template based methods, while the DNs also provide statistics-based location information about the object in a cluttered scene.

To my family and friends.

ACKNOWLEDGMENT

The author would like to thank Matthew Luciw and Yuekai Wang for providing some of their programs for the work reported here.

TABLE OF CONTENTS

List of Tables
List of Figures

1 Introduction
  1.1 Existing Major Methods
  1.2 Characteristics of DN
  1.3 Novelty

2 Previous Work
  2.1 Global Template Based Methods
    2.1.1 LDA Algorithm
    2.1.2 SVM Algorithm
  2.2 Local Template Based Methods

3 The Developmental Network Model
  3.1 Area Function
  3.2 Area Computation
  3.3 Dual Optimality of Lobe Component Analysis
  3.4 Where-What Network
  3.5 Receptive Fields Are Selective and Dynamic
  3.6 Properties of DN

4 Experimental Results
  4.1 Global Template Based Methods
  4.2 Local Template Based Methods

5 Conclusions

Bibliography
LIST OF TABLES

1.1 Comparison of characteristics of global and local template based methods. Sparse methods include sparse auto-encoders, sparse RBMs, k-means clustering, etc. in this work.
4.1 Disjoint test recognition rate from C and γ parameter variation on the FERET dataset
4.2 Global template based method comparison on the Weizmann and FERET datasets
4.3 Recognition rate gained from variation of top-k firing neurons and patch size on the NORB dataset
4.4 Recognition rate gained from variation of top-k firing neurons and patch size on the CIFAR-10 dataset
4.5 Local template based method comparison on the NORB dataset [3]
4.6 Local template based method comparison on the CIFAR-10 dataset [3]

LIST OF FIGURES

2.1 The two-class clustering of data points when subject to a linear separator in MDF space in the LDA algorithm, and RBF kernel based non-linear SVM classification, versus the separation of data points into Voronoi regions in the DN algorithm (for interpretations of the references to color in this and all other figures, the reader is referred to the electronic version of this thesis).
3.1 An illustration of the LCA network model.
3.2 The meaning of dual optimality in the LCA network.
3.3 A simple WWN.
3.4 The square-like tiling property of the self-organization in a cortical area.
3.5 The top-down representational effect.
4.1 The number of features is varied in the MDF subspace to gain the maximum recognition rate when trained on the LDA classifier; the Weizmann dataset needed 12 features, whereas FERET needed 14 features.
4.2 (a) The variation of the number of Y-area neurons to gain the maximum attainable recognition rate in the LCA algorithm on the Weizmann and FERET datasets. (b) The variation of the number of epochs in the training phase to gain the maximum attainable recognition rate in the LCA algorithm on the Weizmann and FERET datasets.
4.3 Visualization of weights in the Y area and Z area of the Weizmann dataset: (a) bottom-up weights for Y (20 × 20 Y neurons in which each cell has dimension 88 × 64), (b) top-down weights (TM) for Y (20 × 20 Y (TM) neurons in which each cell has dimension 10 × 10), (c) bottom-up weights of two type neurons in the Z area (1 × 28 Z (TM) neurons in which each cell has dimension 20 × 20).
4.4 The variation of the number of epochs in the training phase to gain the maximum attainable recognition rate on the NORB and CIFAR-10 datasets.
4.5 Visualization of weights in one depth (of 30 depths) in the Y area of CIFAR-10: (a) bottom-up weights (23 × 23 Y neurons in which each cell has dimension 10 × 10), (b) top-down weights (TM) (23 × 23 TM neurons in which each cell has dimension 10 × 10), (c) top-down weights (LM) (23 × 23 LM neurons in which each cell has dimension 23 × 23).
4.6 Visualization of the bottom-up weights for the Z area: two type neurons in one depth (of 30 depths) in TM (1 × 10 TM neurons in which each cell has dimension 23 × 23) and two location neurons in LM (23 × 23 LM neurons in which each cell has dimension 23 × 23).
4.7 The number of epochs in the training phase versus the location error (in pixels) on the CIFAR-10 dataset.

Chapter 1

Introduction

Techniques for appearance-based pattern recognition can be categorized into two types — global template based methods and local template based methods. A global template based method typically assumes that the object of interest has already been detected and cropped, and then shifted, scaled, and rotated in a standard way. A local template based method does not use such assumptions. It uses multiple local templates at arbitrary image locations on scene images, where features of the scene of interest (or objects of interest) can arise anywhere in each input image. Problems of this latter type have been called scene classification (i.e., features indicate a scene type) and sometimes also object recognition (i.e., features indicate an object).

1.1 Existing Major Methods

The LDA and SVM algorithms have been widely applied to global template based image matching. The LDA algorithm [22, 23, 24, 4] finds a linear combination of features that separates two or more classes of objects; the resulting combination is used for dimensionality reduction before further classification. The SVM algorithm [25, 26, 5] constructs a hyperplane in a high-dimensional space to gain a good separation, or a functional margin, that has the largest distance to the nearest data points of any class, since the larger the distance, the lower the classification error.

The local template based methods use features derived from local patches of images rather than from the entire images. These local feature representations can be obtained by applying well-known unsupervised feature learning algorithms such as sparse auto-encoders [28, 29], sparse RBMs [30], k-means clustering, etc. [10, 31, 3]. We will discuss the global and local template based techniques in more detail later.

1.2 Characteristics of DN

The DN (Developmental Network) framework [8] has used global templates, as in [1], derived from image datasets, as well as local templates derived from image patches of cluttered scenes, as in [2]. The novel characteristics of a DN over some well-known pattern recognition methods are discussed below, and a summary is provided in Table 1.1.

Characteristics             LDA       SVM       Sparse Methods   DN
Environmental openness      Possible  No        No               Yes
High dimensional sensors    Yes       Possible  Yes              Yes
Completeness                Yes       No        No               Yes
Real-valued actions         No        No        No               Yes
Real-time training          No        No        No               Yes
Incremental learning        No        Yes       No               Yes
Perform while learning      No        No        No               Yes
Input having background     No        No        Yes              Yes

Table 1.1: Comparison of characteristics of global and local template based methods. Sparse methods include sparse auto-encoders, sparse RBMs, k-means clustering, etc. in this work.

1) Environmental openness: The DN is meant to learn and improve from realistic scenes and its (human taught with effector-supervised) experience of actions through such scenes, even though its performance is imperfect due to its limited computational resources and limited learning experience. It is not assumed that only certain objects are of interest (e.g., faces), so the environment is open.
2) Low and high dimensional sensors: The DN is applicable to low- and high-resolution video. Lower resolution video speeds up computation and saves storage space, but the DN should work reasonably well regardless of the resolution, since the internal operations of the DN are based on normalized inner products.

3) Completeness in representation for different amounts of teaching: All the features in a DN emerge as optimal representations learned from experience. The DN does not use, and is not limited by, handcrafted features (e.g., SIFT, or oriented edges, which may fail if edges are sparse or absent in an input).

4) Instead of learning features from the image space X only, the DN finds optimal clusters in the space X × Z, which are sensorimotor features, where Z is the space of effectors (including labels represented as vectors). By optimal, we mean maximum likelihood (ML) in the representation of the space X × Z based on the limited computational resources in the DN (e.g., the number of neurons) and the limited amount of teaching. The analytical results in this work show the advantages of using both X and Z for learning features (i.e., discriminants).

5) Real-valued actions: The DN always accepts and processes real-valued sensory and effector (motor) information, instead of human-supplied discrete class labels. Discrete class labels are special cases of real-valued actions (e.g., each neuron in Z represents a class); the set of all possible robotic actions is more general than a set of all class labels, since actions can be taught or created for the physical world without a human predefined class.

6) Real-time training: During training, the sensory and memory refreshing rate must be high enough so that each physical event (e.g., the motion of a person or the motion of the head of self) can be temporally sampled and processed in real time (e.g., hopefully about 15 Hz to 30 Hz when the computers are fast). Although we used class labels in our experiments here for comparisons, class labels of objects appearing in a real scene require a human to supply them, which restricts the possibility of real-time training. The DN can take actions directly, regardless of whether they represent a real-valued action or a class label.

7) Incremental learning: Acquired skills must be used to assist in the acquisition of new skills, as a form of "scaffolding." This requires incremental processing; batch processing is not used by the DN. By incremental processing, we mean that each new observation x ∈ X and z ∈ Z must be used to update the current DN, and the current (x, z) must be discarded before the next (x, z) can be acquired. Incremental learning is necessary for learning in real time, as the amount of data in the sensory and motor streams is virtually unbounded.

8) Perform while learning: At any time, the human teacher can teach the DN by imposing an action on its effector port Z (motor-supervised learning). As soon as the human lets Z free, the DN generates z ∈ Z as its best prediction or best action at this time.

9) Input having background: A typical scene has objects of interest in a cluttered background. The method must be able to deal with objects in any position in unknown cluttered backgrounds. A global template method and a local template method are different in the sense that the former attempts to match the entire scene while the latter attempts only to match some local templates. A local template method can deal with cluttered backgrounds. Local patches in the DN add "votes" to a single supervised location in the Location Motor (LM) of Z, although different local patches are at different locations in the input image.
1.3 Novelty

This work has two major novelties. First, the DN algorithm is originally meant for the more general problem of spatiotemporal event detection and recognition in complex backgrounds, and is not restricted to the global and local template based image matching problems here. However, it is desirable that we apply it to the space-only problems here so that it can be compared with major pattern recognition techniques. The ways to use the effector space Z are very different between the DN and the other compared methods. Second, this work is the first to use the DN for problems where multiple local features contribute to the classification (type) and localization (location). Multiple firing feature detectors (Y neurons) vote for the corresponding single neuron in the Type Motor (TM) area and the corresponding single neuron in the Location Motor (LM) area. The area Z consists of the two subareas TM and LM.

A DN can be used for shallow learning (a single level of features), deep learning (multiple levels of features), and mixed shallow and deep learning. Here, we concentrate on shallow learning. Experimentally, we show that shallow learning (DN) is at least comparable to the deep learning methods (sparse auto-encoders, k-means clustering, etc.).

In a local feature based DN, the "where" information from LM is applied as input to the DN in a supervised manner during the training phase, along with the "what" information, or type of object, from TM applied as input to the DN. While Z is free during a test session, through multiple updates of the DN, the location information and type information feed from Y to Z and then from Z back to Y, reinforcing each other and suppressing inconsistent features, like relaxation. The object location must be consistent with the type, and the type must be consistent with the location. Such a relaxation between Y and Z is faster than in deep learning methods.

The remainder of the thesis is organized as follows. In Chapter 2, we review previous work. In Chapter 3, we explain the DN. The experimental results are presented in Chapter 4, and Chapter 5 gives concluding remarks.

Chapter 2

Previous Work

In this chapter, we briefly discuss the most widely applied algorithms for global and local template based image matching.

2.1 Global Template Based Methods

There are many methods for global template matching, such as wavelets, neural networks, correlation, PCA, LDA, SVM, etc. Of these, we discuss two types of methods in this work — LDA and SVM.

2.1.1 LDA Algorithm

The LDA algorithm [22, 23, 24] finds a linear combination of features which characterizes or separates two or more classes of data instances, so that the dimension of the data instances is reduced prior to being projected onto the linear space. The LDA algorithm, previously devised by Daniel Swets and Juyang Weng [4], is based on optimal subspace generation; the subspace is generated using two projections — a Karhunen-Loève projection to produce a set of MEF (most expressive features) features, followed by a discriminant analysis projection to produce a set of MDF (most discriminant features) features; together they are called the DKL (discriminant Karhunen-Loève) projection. The MEF projection discriminates a dataset based on overall variations, such as lighting, between two classes and chooses a reduced set of most expressive features to be projected onto the MDF space, whereas the MDF projection defines an optimal discrimination between two classes. A set of MEF and MDF features is generated for each image in the training dataset; an image from the testing dataset is projected into the same subspace. In the subspace, the Euclidean distance is computed between the two feature vectors to find a set of k-nearest neighbors and recognize the class to which the image in the testing set belongs.

The MEF features forming the MEF vector Y are the unit eigenvectors associated with the m largest eigenvalues of the covariance matrix of the vector X of training image instances, where m is chosen such that the sum of the unused eigenvalues is less than some fixed percentage P (P = 5%) of the total over the entire training dataset, and m < n, n being the original number of features. The MDF generation finds a projection matrix W that maximizes the ratio $\frac{\det S_b}{\det S_w}$, i.e., maximizes the between-class scatter while minimizing the within-class scatter. The within-class scatter matrix is defined as

$$S_w = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (Y_j - M_i)(Y_j - M_i)^t$$

for i = 1, 2, ..., c classes with class mean $M_i$ for the $n_i$ samples from class i, and the between-class scatter matrix is defined as

$$S_b = \sum_{i=1}^{c} (M_i - M)(M_i - M)^t$$

for the mean vector M of all data instances from all classes.
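To make the two projections concrete, the following is a minimal NumPy sketch of the DKL pipeline; it is not code from [4], and the function names, the SVD-based eigendecomposition, and the handling of the eigenvalue cutoff are our own illustrative assumptions.

    import numpy as np

    def dkl_projection(X, labels, P=0.05):
        """Sketch of the DKL projection: MEF (Karhunen-Loeve) features
        followed by MDF (discriminant) features.
        X: s x n matrix, one flattened training image per row;
        labels: length-s array of class labels."""
        mean = X.mean(axis=0)
        Xc = X - mean
        # MEF: unit eigenvectors of the covariance matrix; keep m components
        # so that the discarded eigenvalues sum to less than a fraction P.
        _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        eigvals = S ** 2 / (len(X) - 1)
        kept = np.cumsum(eigvals) / eigvals.sum()
        m = int(np.searchsorted(kept, 1.0 - P)) + 1
        mef = Vt[:m]                        # m x n MEF basis
        Y = Xc @ mef.T                      # s x m MEF features
        # MDF: maximize det(Sb)/det(Sw) in the MEF space, where Sw is invertible.
        classes = np.unique(labels)
        M = Y.mean(axis=0)
        Sw = np.zeros((m, m))
        Sb = np.zeros((m, m))
        for cls in classes:
            Yc = Y[labels == cls]
            Mi = Yc.mean(axis=0)
            Sw += (Yc - Mi).T @ (Yc - Mi)   # within-class scatter
            Sb += np.outer(Mi - M, Mi - M)  # between-class scatter
        evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
        order = np.argsort(-evals.real)[:len(classes) - 1]  # at most c-1 MDFs
        mdf = evecs[:, order].real          # m x (c-1) MDF basis
        return mean, mef, mdf

A testing image x would then be projected as ((x - mean) @ mef.T) @ mdf and labeled by its k-nearest training neighbors under the Euclidean distance in the MDF space.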
2.1.2 SVM Algorithm

The SVM algorithm [24, 25, 26], on the other hand, maps a set of n-dimensional data points from a finite-dimensional space to a high-dimensional space, constructing an (n − 1)-dimensional maximum-margin hyperplane or set of hyperplanes in the high-dimensional space, so that a good gap, called the functional margin, that has the largest distance to the nearest training data points of any class, is gained. The testing data points are then mapped to the same space and are predicted to belong to a class based on which side of the gap they fall on; a non-linear kernel function is fit to the maximum-margin hyperplane when the separation is non-linear in the finite-dimensional space. The support vectors are the data points that lie closest to the decision boundary; these points have a direct bearing on the position of the decision boundary.

If $(x_i, y_i)$, i = 1, 2, ..., l, are a set of data instance-class label pairs, where $x_i \in R^n$ and $y_i \in \{1, -1\}$, the SVM algorithm finds an optimum solution to the following problem:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$$

subject to

$$y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$

Here, w is a normal vector to the hyperplane, and the quantity $b/\|w\|$ determines the offset of the hyperplane from the origin along the normal vector. The data instances $x_i$ are mapped from the finite-dimensional space to a high-dimensional space by the transformation function φ. If no hyperplane can separate the data into 'yes' or 'no' classes, the hyperplane that splits the data instances in as clean a manner as possible, while maximizing the distance to the nearest well-split data instances, is chosen; the slack variable $\xi_i$ measures the degree of misclassification of the data instance $x_i$, and C > 0 is the penalty parameter on the $\xi_i$. The kernel function of the transform is expressed in terms of the transformation function $\phi(x_i)$ as $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$. The three kernel functions that are extensively used are the linear kernel, the Gaussian RBF (radial basis function) kernel, and the polynomial kernel [26]:

linear kernel: $K(x_i, x_j) = x_i^T x_j$

RBF kernel: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \; \gamma > 0$

polynomial kernel: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d, \; \gamma > 0$

The most obvious choice of kernel function is the Gaussian RBF kernel, since, unlike the linear kernel, it can separate data points into classes when the relation between data points and class labels is non-linear. Also, unlike the polynomial kernel, the RBF kernel has fewer hyperparameters that influence the complexity of kernel model selection. However, if the number of features of the data instances is extremely large, then a linear kernel provides a better separation compared to an RBF kernel. The RBF kernel function depends on the Euclidean distance of $x_i$ (a training data instance or support vector) from $x_j$ (a testing data point). The support vector is placed at the center of the RBF kernel, and σ determines the area of influence the support vector has over the data space. The efficiency of an RBF kernel depends on a good choice of the two parameters C and γ, where $\gamma = \frac{1}{2\sigma^2}$. The best combination of C and γ is gained through a grid search with exponentially growing sequences of C and γ (C ∈ {$2^{-5}$, $2^{-3}$, ..., $2^{13}$, $2^{15}$} and γ ∈ {$2^{-15}$, $2^{-13}$, ..., $2^{1}$, $2^{3}$}). Each combination of the C and γ parameters is tested using cross validation, and the parameters with the best cross-validation accuracy are chosen. The SVM training model, which is used for testing new data instances, is trained on the complete training dataset using the selected values of C and γ [5].
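As an illustration of this model selection procedure, here is a hedged sketch using scikit-learn rather than the LIBSVM tools behind [5]; the fold count is our choice, while the grid spacing mirrors the text above.

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Exponentially growing grids: C in {2^-5, 2^-3, ..., 2^15},
    # gamma in {2^-15, 2^-13, ..., 2^3}, as described above.
    param_grid = {
        "C": [2.0 ** e for e in range(-5, 16, 2)],
        "gamma": [2.0 ** e for e in range(-15, 4, 2)],
    }

    def train_rbf_svm(X_train, y_train, folds=5):
        """Select (C, gamma) by cross-validated grid search, then refit
        the chosen model on the complete training set."""
        search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=folds)
        search.fit(X_train, y_train)
        return search.best_estimator_, search.best_params_

Note that GridSearchCV refits the best model on the full training data by default, matching the final step described above.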
Figure 2.1 shows the two-class clustering of data points by the LDA, SVM, and DN methods. The LDA boundary is linear, whereas the SVM boundary is characterized by the non-linear kernel function chosen. The DN algorithm separates data points into Voronoi regions, such that the boundary separating two regions is at an equal distance from both points.

Figure 2.1: The two-class clustering of data points when subject to a linear separator in MDF space in the LDA algorithm, and RBF kernel based non-linear SVM classification, versus the separation of data points into Voronoi regions in the DN algorithm (for interpretations of the references to color in this and all other figures, the reader is referred to the electronic version of this thesis).

2.2 Local Template Based Methods

A simple feature learning framework that incorporates a local feature-learning algorithm, like sparse auto-encoders [28, 29], sparse RBMs [30], k-means clustering, and Gaussian mixture models [10, 31], as a "black box" module [3] has been studied to discover local feature representations from unlabeled data instances. A set of random patches is extracted from the unlabeled training data instances, such that each patch has dimension w × w and has d channels, where w is the size of the receptive field; each w × w patch can be represented as a vector in $R^N$ of pixel intensity values, with N = w · w · d. A dataset of randomly sampled patches $X = \{x^{(1)}, ..., x^{(m)}\}$ is then constructed, where $x^{(i)} \in R^N$. Then, each patch $x^{(i)}$ is normalized by subtracting the mean and dividing by the standard deviation of its elements, after which the dataset is optionally whitened. Once the data is pre-processed, the unsupervised learning algorithm, viewed as a "black box" module, takes the dataset X and outputs a function $f : R^N \to R^K$ that maps an input vector $x^{(i)}$ to a new feature vector $y = f(x) \in R^K$ of K features, where K is a parameter of the unsupervised learning algorithm used and the k-th feature is $f_k$. An (n − w + 1) × (n − w + 1) representation with K channels is defined for each w × w "subpatch" of the training data; $y^{(i,j)}$ is referred to as the K-dimensional feature representation extracted from location (i, j) of the training data. The feature representation $y^{(i,j)}$ is split into four equally sized quadrants, and the sum in each quadrant is computed to yield a reduced K-dimensional representation of each quadrant, for a total of 4K features, on which a linear classification algorithm is applied to identify new data instances [3].
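The following is a minimal NumPy sketch of this pipeline: random patch sampling with per-patch normalization, and the four-quadrant sum pooling of the subpatch features. The learned extractor f is left abstract, and the names and the omission of whitening are our own simplifications of [3].

    import numpy as np

    def extract_patches(images, w, num_patches, seed=0):
        """Sample random w x w patches (d channels) and normalize each patch
        by its own mean and standard deviation.
        images: array of shape (num_images, height, width, d)."""
        rng = np.random.default_rng(seed)
        n, h, wid, d = images.shape
        P = np.empty((num_patches, w * w * d))
        for i in range(num_patches):
            img = images[rng.integers(n)]
            r, c = rng.integers(h - w + 1), rng.integers(wid - w + 1)
            p = img[r:r + w, c:c + w].reshape(-1).astype(float)
            P[i] = (p - p.mean()) / (p.std() + 1e-8)
        return P  # optionally whiten P before unsupervised learning

    def pooled_features(image, f, w, K):
        """Map every w x w subpatch through the learned extractor f, then
        sum-pool the (n-w+1) x (n-w+1) x K feature maps over four quadrants,
        giving the 4K-dimensional representation used for linear classification."""
        n = image.shape[0]
        m = n - w + 1
        Y = np.empty((m, m, K))
        for i in range(m):
            for j in range(m):
                p = image[i:i + w, j:j + w].reshape(-1).astype(float)
                p = (p - p.mean()) / (p.std() + 1e-8)
                Y[i, j] = f(p)
        h = m // 2  # quadrants are equal up to a one-row remainder when m is odd
        quads = [Y[:h, :h], Y[:h, h:], Y[h:, :h], Y[h:, h:]]
        return np.concatenate([q.sum(axis=(0, 1)) for q in quads])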
Chapter 3

The Developmental Network Model

In this work, we consider the DN model to have a general purpose brain area Y, which is connected with the sensory area X and the motor area Z, as illustrated in Figure 3.1, in which the order of areas from low to high is X, Y, Z. Much of the material in this chapter is extracted from Weng and Luciw 2011 [9].

3.1 Area Function

At the first c time instances, during the "prenatal" learning phase, the first c neurons of each area A in {X, Y, Z} initialize their synaptic vectors $V = (v_1, v_2, ..., v_c)$, where each synaptic vector $v_i$ is initialized using the input pair $p_i = (b_i, t_i)$, in which the bottom-up input is $b_i$ and the top-down input is $t_i$, i = 1, 2, ..., c, and initialize the firing ages $A = (a_1, a_2, ..., a_c)$, such that each firing age $a_i$ is initialized to zero, i = 1, 2, ..., c.

After the "birth" phase, at each time instant, each area A computes its response r' from its input pair p = (b, t) based on its adaptive part N = (V, A) and its current response r. The current response r is regulated by the attention supervision vector $r_a$. Each area A updates its adaptive part N to N' as follows:

$$(r', N') = f(b, r, t, r_a, N)$$

where f is the area function. The attention supervision vector $r_a$ is used to softly prevent the area A from excessively learning the background. The attention supervision vector $r_a$ has the same dimension as r. In this work, $r_a$ suppresses all the A neurons to zero except the 3 × 3 = 9 ones centered at the correct object location. Since the vector $r_a$ is biologically driven by other connected brain areas, it is not very accurate during early developmental phases, which may result in learning some irrelevant (background) information. Thus, the soft internal attention vector $r_a$ will need to be removed in the future for the construction of a more powerful model of brain-like development.

Each area A, whether X, Y, or Z, computes and updates in the unified way described above, except that the X area does not have a bottom-up input and the Z area does not have a top-down input. The X area and Z area are nerve terminals.
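The unified area update can be summarized by the interface sketch below. The class layout and the simple winner-take-all adaptation inside update are our own assumptions, with the response computation elaborated in the next section and the dually optimal learning rate deferred to Section 3.3.

    import numpy as np

    class Area:
        """One DN area A with adaptive part N = (V, A):
        synaptic vectors V and firing ages A."""
        def __init__(self, c, dim_b, dim_t):
            self.V_b = np.zeros((c, dim_b))  # bottom-up synaptic vectors
            self.V_t = np.zeros((c, dim_t))  # top-down synaptic vectors
            self.age = np.zeros(c)           # firing ages, initialized to zero

        def prenatal_init(self, pairs):
            """Initialize the first c synaptic vectors from input pairs (b_i, t_i)."""
            for i, (b, t) in enumerate(pairs):
                self.V_b[i], self.V_t[i] = b, t

        def update(self, b, t, r_a=None):
            """(r', N') = f(b, r, t, r_a, N): respond to the input pair (b, t),
            optionally gated by the attention supervision vector r_a,
            and adapt the winner neuron."""
            p = np.concatenate([b / np.linalg.norm(b), t / np.linalg.norm(t)])
            V = np.concatenate([self.V_b, self.V_t], axis=1)
            norms = np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
            r = (V / norms) @ p              # pre-action: normalized inner products
            if r_a is not None:
                r = r * r_a                  # softly suppress unattended neurons
            j = int(np.argmax(r))            # top-1 winner (top-k in general)
            self.age[j] += 1
            w = 1.0 / self.age[j]            # plain averaging; see Section 3.3
            self.V_b[j] += w * (b - self.V_b[j])
            self.V_t[j] += w * (t - self.V_t[j])
            y = np.zeros(len(r))
            y[j] = 1.0
            return y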
3.2 Area Computation

We introduce that the receptive field (RF) of a neuron should contain three parts — the sensory RF (SRF), motor RF (MRF), and lateral RF (LRF). The effective field (EF) of each neuron should also include three parts: the sensory EF (SEF), motor EF (MEF), and lateral EF (LEF). The SRF, MRF, LRF, SEF, MEF, and LEF together form the hextuple fields of a neuron, as shown in Figure 3.3(a).

The lateral connections — connections with neurons in the same area that are strongly correlated or anti-correlated, whether inhibitory or excitatory — are used by each cortical area (X, Y, Z) so that the neurons in each area can find their roles. Highly correlated cells are connected by excitatory connections to generate a smooth map that globally covers a rough "terrain" but gradually becomes selective and local to fit the details of the "terrain." Highly anti-correlated cells are connected by inhibitory connections, so that the neurons in the same cortical area respond to different features and only the top ranked neurons are allowed to fire and update; this ensures that weakly responding neurons do not fire, so they do not learn irrelevant information and keep their long-term memory intact.

The lateral connections within the Y area of the DN model are, in this work, simulated by top-k competition among neurons for fast identification of the top-k winners within each network update, instead of by actual inhibitory connections. The top-k mechanism is used for quick sorting of the winner neurons in real time (every 30 ms), when the software or hardware cannot run fast enough, i.e., cannot update the entire network at 1 kHz or above. In our work, we hand-pick the value of k, assuming that the value is largely gene pre-dispositioned. According to the sparse coding theory explained in [1, 10], if only the top-k neurons in each area fire and update, then those that do not update act as long-term memory for the current context.

Consider an area A in {X, Y, Z}. If both a bottom-up input b and a top-down input t are associated with the area, each neuron in A has a weight vector $v = (v_b, v_t)$ corresponding to the two area inputs (b, t); otherwise, the part of the input that is not associated with the area is not included in the notation. The sum of two normalized inner products gives the pre-action of the neuron:

$$r(v_b, b, v_t, t) = \dot{v} \cdot \dot{p},$$

where $\dot{v}$ is the unit vector of the normalized synaptic vector, $\dot{v} = (\dot{v}_b, \dot{v}_t)$, and $\dot{p}$ is the unit vector of the normalized input vector, $\dot{p} = (\dot{b}, \dot{t})$. The inner product measures the degree of match between the two directions $\dot{v}$ and $\dot{p}$, because $r(v_b, b, v_t, t) = \cos(\theta)$, where θ is the angle between the two unit vectors $\dot{v}$ and $\dot{p}$.

Let the area A under consideration be Y, connecting with the bottom-up input area X and the top-down input area Z. The area Y with many neurons has a set of cluster centers $\{(v_x, v_z) \mid v_x \in X, v_z \in Z\}$. Each center $(v_x, v_z)$ is the center of the corresponding Voronoi tile in the area's input space X × Z, and is an instance of co-occurrence, which is linguistically impure in human language, i.e., does not have an exact description as a concept of the external environment in natural language. Thus, we assume that this linguistic impurity must be true for all neurons inside the brain which are not under direct supervision of the external environment.

In our DN based LCA algorithm [1], we use the top-k mechanism to show that lateral connections within neurons in each area enable them to sort the top winner neurons within each time step $t_n$, n = 1, 2, 3, .... Consider the weight vector of neuron i to be $v_i = (v_{b_i}, v_{t_i})$, i = 1, 2, ..., c. If k = 1, the single winner neuron j is identified as follows:

$$j = \arg\max_{1 \le i \le c} r(v_{b_i}, b, v_{t_i}, t).$$

If c is sufficiently large and the set of c synaptic vectors distributes well, i.e., the density of the c points well approximates the observed probability density in the parallel input space, then there is a high probability that both parts of the winner neuron j match well: $v_{b_j} \approx b$ and $v_{t_j} \approx t$, not counting the lengths of the vectors, because of the length normalization in $r(v_b, b, v_t, t)$.

Consider the area A to be the Y area; we want the response value $y_j$ to approximate the probability for (x, z) to have $v_j = (v_{x_j}, v_{z_j})$ as the nearest neighbor. If k = 1, only the winner neuron fires with a response value $y_j = 1$, and the rest of the neurons in the area do not fire, so $y_i = 0$ for $i \ne j$. In general, for k > 1, a dynamic scaling function shifts and scales the pre-action potential of each neuron $r_i$, so that the winner has a response value of 1 and the (k + 1)-th and weaker neurons have a response value of zero. Without a need to explicitly solve c simultaneous equations, the dynamic function g depends on the values in y at time $t_n$ to give the updated responses $y_i$ for the next time instant $t_{n+1}$. Consider $r_1 \ge r_2 \ge ... \ge r_{k+1}$ and $r_{k+1} \ge r_i$ for all i > k + 1; then $y_i = g(r_i)$, i = 1, 2, ..., c, where

$$g(r_i) = \frac{r_i - r_{k+1}}{r_1 - r_{k+1}} \;\text{ if } i \le k, \qquad g(r_i) = 0 \;\text{ otherwise}.$$

The response value $r_{k+1}$ is subtracted from $r_i$ and the difference is divided by $r_1 - r_{k+1}$, so that the response vector y of the area has k positive components with a maximum value of at most 1; the remaining c − k neurons in y have response value zero.
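The dynamic scaling function g can be implemented directly; this short sketch is ours and assumes the pre-action potentials have already been computed.

    import numpy as np

    def topk_response(r, k):
        """y_i = g(r_i): the top-k neurons respond in (0, 1] with the winner
        at 1; the (k+1)-th and weaker neurons respond with zero."""
        r = np.asarray(r, dtype=float)
        y = np.zeros_like(r)
        order = np.argsort(-r)               # ranked, strongest first
        r1, rk1 = r[order[0]], r[order[k]]   # r_1 and r_{k+1}
        top = order[:k]
        y[top] = (r[top] - rk1) / (r1 - rk1 + 1e-12)
        return y

For k = 1 this reduces to a single firing winner with response value 1, matching the winner selection above.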
3.3 Dual Optimality of Lobe Component Analysis

The Lobe Component Analysis (LCA) [1] is a model for long-term memory retention, and uses a dually optimal (optimal in space and time) framework that casts long-term memory and short-term memory together through the optimal distribution of the limited number of neurons of each area in the input space X × Z — optimal Hebbian learning, spatially and temporally, as shown in Figure 3.2.

Figure 3.1: An illustration of the LCA network model (the Y area acts as a bridge between the sensory area X and the motor area Z).

In the LCA framework, the connection pattern of each neuron in the Y area is shown in Figure 3.1. The Y area acts as a bridge between the sensory area X and the motor area Z. The two-way local connections in green represent neuronal inputs and those in pink represent neuronal outputs. In the same area, near neurons are connected by excitatory connections for smoothness of representation, and far neurons are connected by inhibitory connections; hence, competition between neurons results in the detection of different features by different neurons.

The meaning of the dual (spatiotemporal) optimality of LCA is shown in Figure 3.2.

Figure 3.2: The meaning of dual optimality in the LCA network (upper area: the neuronal layer in 2-D, with the top-3 firing neurons for the current input and the best matched neuron c; lower area: the input space X × Z of the neuronal layer, tiled along the manifold using optimal directions and optimal step sizes).

The upper area is a 2-D representation of the positions of the neurons in several stacked layers within the Y area. The top-3 firing neurons in the Y area, shown in different shades and patterns of blue (the top-1 neuron in a dark blue shade at the center, the top-2 in a light blue shade, and the top-3 in vertical blue and white stripes), are the context-dependent working memory for the current context, and the ones that do not fire are the context-dependent long-term memory for the current context. A 2-D representation of the very high dimensional input space P = X × Z (a manifold of the distribution of the input data, which is very sparse in P and of a much lower dimension than the apparent dimension of P) of the cortical area Y is shown by the lower area, in which each neuron in the Y plane is linked with its synaptic weight vector by a curved arc; the synaptic weight vectors of Y, represented in P as small dots, define a Voronoi diagram in P.

LCA is dually optimal. Its spatial optimality means that the target tiling by the Voronoi diagram in the manifold is optimal in minimizing the representation error for P = X × Z, whereas its temporal optimality means that the neuronal weights of the firing neurons move toward their unknown best targets the quickest throughout the development process. Here, the updating trajectory of every neuron is a highly nonlinear trajectory. The statistical efficiency theory for the neuronal weight update (amnesic average) ensures a minimum error in each age-dependent update. This means that not only the direction but also every step size of each neuronal update is nearly optimal, and is autonomously determined.
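A sketch of the temporally optimal update is given below. The piecewise amnesic schedule and its constants are typical choices reported for LCA [1] and should be read as assumptions, not the exact settings used in this work.

    def amnesic_weight(n, t1=20.0, t2=200.0, c=2.0, r=2000.0):
        """Learning rate w(n) = (1 + mu(n)) / n for a neuron of firing age n,
        where mu(n) rises in three stages so that recent inputs keep a small,
        non-vanishing influence (the amnesic average). The parameters
        (t1, t2, c, r) are assumed typical values."""
        if n <= t1:
            mu = 0.0
        elif n <= t2:
            mu = c * (n - t1) / (t2 - t1)
        else:
            mu = c + (n - t2) / r
        return (1.0 + mu) / n

    def lca_update(v, age, y, p):
        """Move a firing neuron's synaptic vector toward the response-weighted
        input: v <- (1 - w) v + w (y p), equivalent to v <- v + w (y p - v)."""
        age += 1.0
        w = amnesic_weight(age)
        return (1.0 - w) * v + w * (y * p), age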
3.4 Where-What Network

The "Where-What" Network (WWN), a DN embodiment, has a motor zone (the type or "what" motor area TM), which is used to teach a type concept, and another motor zone (the location or "where" motor area LM), which is used to teach a location concept. Figure 3.3(b) shows the three areas of a simple WWN — the retina X, the simple brain Y, and the motor area Z. The Z area has the two concept zones LM and TM.

Figure 3.3: A simple WWN: (a) the hextuple fields (SRF, MRF, LRF and SEF, MEF, LEF) of an internal Y neuron, with its bottom-up, top-down, lateral, excitatory, and inhibitory connections; (b) the retina X, the internal area Y where neurons compete to fire, and the motor area Z, with LM giving the location output and TM giving the type output (human, bus, airplane, animal, car) as top-down context.

The connecting wires indicate that the pre-synaptic and post-synaptic neurons have co-fired; a two-way arrow indicates two one-way connections whose two synapses are generally not the same. The weight is the frequency of pre-synaptic co-firing when the post-synaptic neuron fires. Within each cortical area, each neuron connects with highly correlated neurons using excitatory connections but connects with highly anti-correlated neurons using inhibitory connections. Every Y neuron is location-specific and type-specific, corresponding to an object type (marked by its color pattern) and to a location block (2 × 2 each). Each LM neuron is location-specific and type-invariant, and each TM neuron is type-specific and location-invariant. Each Z neuron pulls all applicable cases from the Y area neurons and also boosts all applicable cases in Y as top-down context. A WWN does not treat the features in Y as a "bag of features" to reduce the number of training samples, because of the inner-product-based neuronal response for Z: the location of each element in a vector x affects the outcome of the inner product.

An object contour from a foreground is, by default, approximated by an octagon; however, refinement of the contour needs synapse maintenance, discussed in [2], which automatically cuts off synapses that have bad matches or whose pre-synaptic input is from the background. In our experiments, we do not use synapse maintenance. In case of a shortage of Y neurons, the Y area optimally uses the limited resource by implicitly balancing the trade-off between the type or "what" motor concept representation and the location or "where" concept representation. Hence, each Y neuron must deal with the misalignment between an object and its receptive field, simulating a more realistic resource situation.

This shows that a WWN does not require the human programmer to model each concept but instead enables un-modeled concepts, in this case location and type, to be learned interactively and incrementally as actions; it enables such concepts to serve as goals, whether supervised or self-generated, but in general to serve as attended spatiotemporal equivalent top-down contexts; and it enables such goals to direct perception, recognition, and behavior emergence [9].
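The voting from firing Y neurons into the two motor zones can be sketched as below; the array shapes and names are illustrative assumptions, and the top-down half of the loop is noted in the comments.

    import numpy as np

    def z_votes(y, W_tm, W_lm):
        """Accumulate votes from the firing Y neurons into TM and LM.
        y: Y-area response vector (only the top-k entries are nonzero);
        W_tm: (num_types, num_y) bottom-up weights of the TM neurons;
        W_lm: (num_locations, num_y) bottom-up weights of the LM neurons."""
        type_votes = W_tm @ y   # each firing Y neuron adds its weighted vote
        loc_votes = W_lm @ y
        return int(np.argmax(type_votes)), int(np.argmax(loc_votes))

    # During a free-viewing test, the winning TM and LM neurons feed back to
    # Y as the top-down input t, boosting consistent Y features on the next
    # network update (the relaxation described in Section 1.3).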
3.5 Receptive Fields Are Selective and Dynamic

The hextuple field concept means that the receptive field of a neuron is attention-selective and temporally dynamic, i.e., a different subpart is active at a different time [11], depending on top-down attention and the competition in early areas. Conventionally, a sensory neuron has a receptive field, but a motor neuron does not. Here, the SRF of each motor neuron is global, but selective and dynamic, since only the winner Y neurons fire at a time instant. The dynamic and selective property of the SRF gives a clear explanation of why each TM neuron is locationally invariant and type-specific, and why each LM neuron is type-invariant and location-specific. Similarly, an MRF is also selective and dynamic in the sense that different motor actions boost a V1 neuron in different contexts. Each Y neuron connects to one neuron in LM and one in TM, respectively; thus, an MRF is typically disconnected.

We, hence, state that the motor area Z can be taught to represent any human communicable concept that is produced by muscle contractions. Verbal concepts such as written, spoken, and sign languages can be communicated through muscle languages, whereas non-verbal concepts such as reaching, grasping, and manipulation can be produced through muscle procedures.

3.6 Properties of DN

Here, we discuss two important properties of the DN relevant to our work — the distance-sensitive property and the top-down representational property.

First, the expression for neuronal learning can be rewritten as $v_j \leftarrow v_j + w(n_j)(y\,p - v_j)$. Thus, the amount of vector change $w(n_j)(y\,p - v_j)$ is proportional to the vector difference $y\,p - v_j = p - v_j$ when y = 1. We call this the distance-sensitive property. With this property, we have the square-like tiling theorem:

Theorem 1 (Square-like tiling) Suppose that the learning rule in a self-organization scheme has the distance-sensitive property. Then the neurons in the area move toward a uniform distribution (tiling) in the space of the areal input p if its probability density is uniform.

The proof is available in [9].

Figure 3.4: The square-like tiling property of the self-organization in a cortical area.

The square-like tiling property of the DN is illustrated in Figure 3.4. In a uniform input space, neurons in a layer self-organize until their Voronoi regions are nearly isotropic (square-like to nearly hexagonal in 2-D). The Voronoi region of neuron c in Figure 3.4(a) is very anisotropic (elongated horizontally), so the resulting horizontal pulling is statistically stronger. A horizontal perturbation leads to continued expected pulling in the same direction (rightward in this case), as shown in Figure 3.4(b). Through many updates, the Voronoi regions become nearly isotropic, ideally regular hexagons but generally square-like, as illustrated in Figure 3.4(c). Note that "move toward" does not state how fast; the speed of self-organization depends on the optimality of the step sizes, and the temporal optimality of the DN deals with the speed.
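The tiling behavior is easy to reproduce in a toy simulation; everything below (the layer size, the nearest-neighbor winner rule, plain 1/n averaging) is an illustrative assumption consistent with the distance-sensitive property.

    import numpy as np

    rng = np.random.default_rng(0)
    c = 16
    V = rng.random((c, 2))    # synaptic vectors in a uniform 2-D input space
    age = np.zeros(c)

    for _ in range(20000):
        p = rng.random(2)                               # uniform areal input
        j = int(np.argmin(((V - p) ** 2).sum(axis=1)))  # winner neuron
        age[j] += 1
        V[j] += (p - V[j]) / age[j]   # distance-sensitive pull, step ~ (p - v_j)

    # After many updates the Voronoi regions of the rows of V become nearly
    # isotropic, square-like tiles of the unit square, as Theorem 1 states.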
Second, as shown in Figure 3.5, learning using top-down inputs sensitizes neurons to action-relevant bottom-up input components (e.g., foreground pixels) and desensitizes them to irrelevant components (e.g., leaked-in background pixels). This remains true during operation, when top-down input is unavailable during free viewing. This is called the top-down representational effect. We have the top-down effect theorem:

Theorem 2 (Top-down effect) Given a fixed number of neurons in a self-organization scheme that satisfies the distance-sensitive property, adding top-down input from the motor Z in addition to the bottom-up input X enables the quantization errors for the action-relevant subspace $X_r$ to be smaller than those for the action-irrelevant subspace $X_i$, where $X = X_r \times X_i$.

The proof is available in [9].

Figure 3.5: The top-down representational effect: (a) square tiles with synaptic centers in the bottom-up space, with equal quantization widths $\delta_r = \delta_i$ along the relevant $X_r$ and irrelevant $X_i$ directions; (b) with top-down Z, square tiles cover the observed manifold relating $X_r$ and Z; (c) during free viewing, $\delta'_r < \delta'_i$.

The top-down inputs sensitize the response for relevant bottom-up components, although which components are relevant is unknown. Without top-down input, square Voronoi tiles in the bottom-up space give the same quantization width for the irrelevant component $X_i$ and the relevant component $X_r$: $\delta_i = \delta_r$. All samples in each tile are quantized as the point (synaptic vector) at the center, as illustrated in Figure 3.5(a). With top-down inputs during learning, square tiles cover the observed "pink" manifolds, indicating the local relationships between $X_r$ and Z, as shown in Figure 3.5(b). When top-down Z is not available during free viewing, each tile is narrower along the direction $X_r$ than along $X_i$: $\delta_r < \delta_i$, meaning that the average quantization error for the relevant $X_r$ is smaller than that for the irrelevant $X_i$, as shown in Figure 3.5(c).

The above theorem gives two consequences (due to $\delta_i > \delta_r$). First, action-relevant bottom-up inputs are salient (e.g., toys and other Gestalt effects); thus, we need to reconsider the conventional thinking that bottom-up saliency is static and probably totally innate. Second, the relatively higher variation through a synapse gives information for cellular synaptic pruning in all neurons, to delete their links to irrelevant components. This was used in synapse maintenance [2], but this capability has been turned off here because the image data available from the public datasets that we use do not present the variation of background pixels that a dynamic physical world would.

Chapter 4

Experimental Results

In this chapter, we conduct two experiments in which the DN algorithm is applied to: (a) a global template, in which the patterns to be recognized have been shifted, rotated, and scaled so that the entire input image contains mainly the pattern of interest; and (b) a set of local templates, in which each input image contains a cluttered background, which means that the detection and pre-normalization in (a) of the object of interest has not been done. The experimental results so obtained are compared to other widely used major algorithms (LDA, SVM, sparse coding or k-means clustering, etc.) in the pattern recognition community.
4.1 Global Template Based Methods

In the first experiment, a set of well-framed "feature images" (global templates) from two datasets — Weizmann and FERET — is trained on the LDA, SVM, and LCA algorithms (a dataset is termed 'well-framed' if only a small variation in the size, position, and orientation of the objects within the images is allowed). The Weizmann face dataset (courtesy of the Weizmann Institute) is a set of images, of size 88 × 64, from 28 humans, each having 30 images with all possible combinations of two different expressions under three different lighting conditions with five different orientations, of which 812 images are used as the training set and 28 images are used as the testing set (one from each class). The FERET face dataset [27] is a set of 1762 images, of size 88 × 64, from 1010 humans, of which 1624 images are used as the training set and 138 images are used as the testing set.

C\γ      0.002   0.0078  0.0313  0.125   0.5
0.0313   4.3%    4.3%    4.3%    4.3%    4.3%
0.125    4.3%    4.3%    7.2%    4.3%    4.3%
0.5      4.3%    7.2%    7.2%    79.7%   79.7%
2        7.2%    79.7%   84.7%   79.7%   84.7%
8        80.4%   80.4%   84.7%   84.7%   84.7%

Table 4.1: Disjoint test recognition rate from C and γ parameter variation on the FERET dataset

The Weizmann as well as the FERET data instances fall in the case where the number of feature dimensions is larger than the number of data instances, causing the within-class scatter matrix $S_w$ to degenerate. In such a case, if the MEF and MDF projections were done on the original data instances, $S_w$ would be non-invertible due to the high-dimensional input data relative to the number of data instances. The DKL projection on the datasets finds a set of MEF and MDF features by using the LDA algorithm [22, 23, 24, 4] to deal with the $S_w$ degeneracy issue, since the within-class scatter matrix non-invertibility issue does not arise in the MEF space, followed by a k-nearest neighbor search to label the class to which the image from the testing data belongs. The n-dimensional original image space is projected onto an m-dimensional MEF space; m is chosen such that for s data samples from c classes, m + c ≤ s. Thus, m is constrained to be less than the rank of $S_w$ to keep $S_w$ non-degenerate. Also, since $S_w^{-1} S_b$ can have at most c − 1 non-zero eigenvalues, the k-dimensional MDF space is constrained to k ≤ c − 1, so that k + 1 ≤ c ≤ m ≤ s − c.

The number of MDF features used to linearly classify the testing data was varied and recorded for the Weizmann and FERET datasets, as shown in Figure 4.1.

Figure 4.1: The number of features is varied in the MDF subspace to gain the maximum recognition rate when trained on the LDA classifier; the Weizmann dataset needed 12 features, whereas FERET needed 14 features.

The LDA classifier gains a 0.6% error on the Weizmann dataset after 30 cross-validation tests (a different image from each class is used for testing in each test) and a 17.4% error on the FERET data, as shown in Table 4.2.
On the other hand, prior to the application of the SVM algorithm [25, 26, 5], the images in both the training and testing data were linearly scaled to the range [0, 1], to keep features in greater numeric ranges from dominating those in smaller ones, as well as to avoid numerical complications during the calculation of inner products of feature vectors. The support vectors for the FERET dataset needed an exhaustive parameter search to find the best (C, γ), which is not known a priori for a dataset, so that the testing data can be predicted with higher accuracy. The grid search for the best (C, γ) parameters is done using cross-validation for C within the range $2^{-5}$ to $2^{15}$ and for γ within the range $2^{-15}$ to $2^{3}$, and the prediction with the best cross-validation accuracy is chosen, as shown in Table 4.1. The coarse grid search (C = $2^{-5}$, $2^{-3}$, ..., $2^{15}$; γ = $2^{-15}$, $2^{-13}$, ..., $2^{3}$) finds the best values of the parameters C and γ to be 2 and 0.0313, respectively, which give a 3.6% error on the training data and a 15.3% error on the testing data. The Weizmann dataset did not gain a good recognition rate for any combination of C and γ; a 15.9% error on the training set and a 3.6% error on the testing set occurred by applying a linear kernel, as shown in Table 4.2.

In the training phase of the LCA algorithm [1], a sequence of images is presented to X by inserting a background image between every two consecutive images from the training set — background, image one, background, image two, background, image three, and so on — such that each stimulus lasts 2 virtual time units in the sensory area. For each dataset, the number of Y-area neurons is varied as n = 10, 15, ..., 25 (an n × n grid of neurons), the training phase is run with the number of epochs varied as 5, 10, 15, ..., 30, and the recognition rate on the testing data with the best combination of the parameters is recorded, as shown in Figure 4.2.

Figure 4.2: (a) The variation of the number of Y-area neurons to gain the maximum attainable recognition rate in the LCA algorithm on the Weizmann and FERET datasets. (b) The variation of the number of epochs in the training phase to gain the maximum attainable recognition rate in the LCA algorithm on the Weizmann and FERET datasets.

The response values of the top-k neurons are ranked so that they replace the repeated iterations that would otherwise take place among a large number of two-way connected neurons in the same layer; in this experiment, the k-value is chosen as 1. The internal representations of the Y area and Z area after the training phase are shown in Figure 4.3. Each Y neuron detects a type, as in Figure 4.3(b); 28 different RGB color intensities represent the 28 different types in TM. Due to the limited neuronal resources in the Y area, some neurons deal with multiple object types at multiple pixel locations. The bottom-up weights (TM) of two of the Z neurons, as in Figure 4.3(c), are normalized to the range 0 to 255, such that the pixel value indicates the strength of the connection between the corresponding Y neuron (at the same (row, col) location as the pixel) and the Z neuron. The Weizmann and FERET datasets had to be trained on 625 Y-area neurons (n = 25), after which a 0% error was gained on both the Weizmann and FERET datasets in the resubstitution test (the data in the training phase are the same as those in the testing phase), and a 0% error was gained on the Weizmann dataset and an 8.7% error on the FERET dataset in the disjoint test (the data in the training phase are not the same as those in the testing phase), both in 20 epochs, as shown in Table 4.2.

Dataset    Test                      LDA     SVM     LCA
Weizmann   resubstitution            100%    100%    100%
           disjoint                  99.4%   90.3%   100%
           per image training time   17.7s   3.9s    4.1s
           per image testing time    4.3s    1.2s    1.4s
FERET      resubstitution            100%    96.4%   100%
           disjoint                  82.6%   84.7%   91.3%
           per image training time   12.2s   7.0s    4.6s
           per image testing time    5.7s    4.1s    1.7s

Table 4.2: Global template based method comparison on the Weizmann and FERET datasets
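The construction of the LCA training sequence can be sketched as follows; the generator form and the names are ours, while the interleaving and the two-time-unit duration mirror the description above.

    def training_sequence(images, background, units=2):
        """Interleave a background frame between every two consecutive
        training images (background, image one, background, image two, ...),
        each stimulus lasting `units` virtual time units in the sensory area."""
        seq = []
        for img in images:
            seq.extend([background] * units)
            seq.extend([img] * units)
        return seq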
4.2 Local Template Based Methods

In the second experiment, a set of local templates is derived from two object recognition datasets, NORB and CIFAR-10, using the idea of local patch extraction from the foreground in [2], and the recognition rate so gained is compared to some major local-feature learning algorithms [28, 29, 30, 31, 3].

The NORB dataset with elimination of complex backgrounds (the small-NORB dataset) [6] is a set of images of 50 toys, of size 96 × 96, belonging to 5 classes — four-legged animals, human figures, airplanes, trucks, and cars — imaged by two cameras under 6 lighting conditions, 9 elevations, and 18 azimuths, of which 24300 images were used for training and 24300 for testing. The CIFAR-10 dataset [7] is a set of 60000 color images, of size 32 × 32, belonging to 10 classes, with 6000 images per class, of which 50000 images are used for training and 10000 images are used for testing.

Patch size\k   1       2       3       4       5
11 × 11        35.7%   42.3%   70.0%   51.5%   42.3%
12 × 12        39.4%   42.3%   73.1%   54.7%   43.5%
13 × 13        39.4%   45.9%   79.3%   59.9%   43.5%
14 × 14        39.4%   53.2%   81.8%   59.5%   49.7%
15 × 15        48.0%   53.2%   84.2%   61.3%   54.1%
16 × 16        48.0%   59.8%   85.1%   68.0%   54.1%
17 × 17        53.1%   60.1%   93.8%   72.9%   63.5%
18 × 18        53.1%   63.5%   93.8%   76.0%   63.5%
19 × 19        53.1%   63.5%   94.0%   81.6%   71.9%
20 × 20        63.2%   72.6%   94.2%   83.2%   71.9%
21 × 21        63.2%   72.9%   94.2%   83.7%   74.5%
22 × 22        69.9%   74.5%   95.0%   85.2%   74.5%
23 × 23        67.1%   71.3%   94.2%   81.6%   74.2%
24 × 24        61.3%   69.8%   94.2%   81.6%   66.3%
25 × 25        56.2%   68.8%   94.0%   79.2%   45.0%
26 × 26        43.7%   65.2%   93.5%   77.5%   32.1%
27 × 27        26.4%   65.2%   81.2%   51.3%   24.7%
28 × 28        -       49.1%   74.9%   34.7%   -
29 × 29        -       49.1%   74.9%   27.8%   -

Table 4.3: Recognition rate gained from variation of top-k firing neurons and patch size on the NORB dataset

Patch size\k   1       2       3       4       5
5 × 5          25.2%   22.2%   25.2%   23.2%   38.3%
6 × 6          36.3%   30.3%   39.4%   52.5%   29.3%
7 × 7          36.3%   30.3%   51.5%   63.6%   43.4%
8 × 8          35.3%   44.4%   64.6%   61.6%   42.4%
9 × 9          36.3%   44.4%   72.7%   59.6%   43.4%
10 × 10        36.3%   42.4%   80.8%   56.5%   43.4%
11 × 11        32.3%   44.4%   79.8%   53.5%   40.4%
12 × 12        32.3%   44.4%   76.7%   51.5%   40.4%
13 × 13        29.3%   43.4%   71.7%   49.5%   40.4%
14 × 14        29.3%   43.4%   68.6%   48.4%   40.4%
15 × 15        29.3%   40.4%   67.6%   45.4%   40.4%
16 × 16        29.3%   40.4%   64.6%   44.4%   39.4%
17 × 17        25.2%   40.4%   62.6%   42.4%   38.3%
18 × 18        25.2%   40.4%   59.6%   42.4%   38.3%
19 × 19        25.2%   38.3%   57.5%   40.4%   38.3%
20 × 20        -       38.3%   56.5%   -       -
21 × 21        -       38.3%   54.5%   -       -
22 × 22        -       -       51.5%   -       -

Table 4.4: Recognition rate gained from variation of top-k firing neurons and patch size on the CIFAR-10 dataset

Each image in the training set of NORB images is rotated by an angle of 20°, and the process is looped until the original image is obtained, so that 18 rotated instances of each image in the training dataset are obtained. Each 20°-rotated image instance is added to the original training set of NORB images so that a test image belonging to the same class is placed into the correct class in spite of the elevation angle to which it is raised.
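A minimal sketch of this rotation-based augmentation, assuming SciPy for the in-plane rotation:

    from scipy.ndimage import rotate

    def rotation_augment(images, step_deg=20):
        """Add every 20-degree in-plane rotation of each training image;
        looping through 360 degrees yields 18 orientations per image,
        the original included."""
        augmented = list(images)
        for img in images:
            for k in range(1, 360 // step_deg):
                augmented.append(rotate(img, angle=k * step_deg, reshape=False))
        return augmented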
In the CIFAR-10 dataset, however, each patch extracted from the training image is trained at each location within the training image (each image is circularly shifted row-wise and then column-wise, and the process is looped until the original image is regained); for instance, a patch of size 10 is trained at (32 − 10 + 1) × (32 − 10 + 1), i.e., 23 × 23, different locations within the image. The circular shift of the training set of images makes sure that the object to be recognized is trained at each possible location within the image, so that a test image containing an object from the same class is placed into the correct class in spite of the position at which the object is located within the image. A minimal sketch of this circular-shift augmentation appears later in this section.

Algorithm                            Recognition rate
Conv. Neural Network                 93.4%
Deep Boltzmann Machine               92.8%
Deep Belief Network                  95.0%
Deep Neural Network                  97.1%
Sparse Auto-encoder                  96.9%
Sparse RBM                           96.2%
K-means (Hard)                       96.9%
K-means (Triangle)                   97.0%
K-means (Triangle, 4000 features)    97.2%
WWN                                  95.0%
WWN per frame training time          1.3s
WWN per image testing time           0.8s

Table 4.5: Local template based method comparison on the NORB dataset [3]

Algorithm                            Recognition rate
Raw pixels                           37.3%
3-Way Factored RBM (3 layers)        65.3%
Mean-covariance RBM (3 layers)       71.0%
Improved Local Coord. Coding         74.5%
Conv. Deep Belief Net (2 layers)     78.9%
Sparse Auto-encoder                  73.4%
Sparse RBM                           72.4%
K-means (Hard)                       68.6%
K-means (Triangle)                   77.9%
K-means (Triangle, 4000 features)    79.6%
WWN                                  80.8%
WWN per frame training time          1.5s
WWN per image testing time           1.2s
WWN location error (in pixels)       1.7

Table 4.6: Local template based method comparison on the CIFAR-10 dataset [3]

The local feature templates derived from the NORB and CIFAR-10 images are trained with the WWN algorithm [2], varying training parameters such as the k-value of the top-k firing neurons, the size of the input patch of local features, the thickness (depth) of the Y-area neurons, etc. A recognition rate of 95.0% was obtained on the NORB dataset for a patch of size 22 × 22 (the size of the original image being 96 × 96), the thickness of the network being 10, when the top-3 neurons fire, as shown in Table 4.3; a recognition rate of 80.8% was obtained on the CIFAR-10 images for a patch of size 10 × 10 (the size of the original image being 32 × 32), the thickness of the network being 30, when the top-3 neurons fire, as shown in Table 4.4. The number of training epochs is varied from 0 to 15, and the recognition rate at each epoch is plotted as shown in Figure 4.4. Each epoch performs 3 iterations for reinforcement of the LM and TM input learned by the WWN.

The internal representations of the Y area and Z area after the training phase of the WWN algorithm are visualized in Figure 4.5 and Figure 4.6, respectively. Each Y neuron (in all depths) detects a type, as in Figure 4.5(b), at a specific location, as in Figure 4.5(c). Due to the limited neuronal resources in the Y area, some neurons deal with multiple object types at multiple pixel locations. The bottom-up weights (TM and LM) of two of the Z neurons, as in Figure 4.6, are normalized to the range 0 to 255, such that the pixel value indicates the strength of the connection between the corresponding Y neuron (at the same (row, col) location as the pixel) and the Z neuron. Figure 4.7 shows the location error (in pixels) over 20 epochs for CIFAR-10; the location error remains constant after 10 epochs. The recognition rates thus obtained are compared to the recognition rates obtained by some major local-feature learning algorithms, as shown in Table 4.5 and Table 4.6, showing that a WWN performs comparably to them.
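The circular-shift augmentation referenced earlier in this section can be sketched as follows; looping both axes visits every patch location.

    import numpy as np

    def circular_shifts(image):
        """Yield every row-wise and column-wise circular shift of a training
        image, so that a local patch is trained at each possible location;
        a 32 x 32 image yields 32 * 32 shifted copies (including the
        unshifted original)."""
        h, w = image.shape[:2]
        for dr in range(h):
            for dc in range(w):
                yield np.roll(np.roll(image, dr, axis=0), dc, axis=1)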
In the NORB dataset, the object of interest is already centered; thus, the "where" information in the DN gets no room for improvement. However, in the CIFAR-10 dataset, the object of interest appears at different locations within a scene. Thus, the "where" information from the LM ensures that the object is present in the scene as a configuration (not necessarily rigid) that is consistent with the training experience, not as scattered broken parts; for instance, the head and the tail of an airplane are present in an observed configuration and not as separate, disassembled parts, which might be a reason why the recognition rate is slightly better than that of the other methods.

Figure 4.3: Visualization of weights in the Y area and Z area of the Weizmann dataset: (a) bottom-up weights for Y (20 × 20 Y neurons in which each cell has dimension 88 × 64); (b) top-down weights (TM) for Y (20 × 20 Y (TM) neurons in which each cell has dimension 10 × 10); (c) bottom-up weights of two type neurons ("Amir-1" and "Amir-2") in the Z area (1 × 28 Z (TM) neurons in which each cell has dimension 20 × 20).

Figure 4.4: Recognition rate versus the number of training epochs on the NORB and CIFAR-10 datasets.

Figure 4.5: Visualization of weights in one depth (of 30 depths) in the Y area of CIFAR-10: (a) bottom-up weights (23 × 23 Y neurons in which each cell has dimension 10 × 10); (b) top-down weights (TM) (23 × 23 TM neurons in which each cell has dimension 10 × 10); (c) top-down weights (LM) (23 × 23 LM neurons in which each cell has dimension 23 × 23).

Figure 4.6: Visualization of the bottom-up weights for the Z area: two type neurons ("Ship" and "Automobile") in one depth (of 30 depths) in TM (1 × 10 TM neurons in which each cell has dimension 23 × 23) and two location neurons (Row 5 Col 5 and Row 6 Col 5) in LM (23 × 23 LM neurons in which each cell has dimension 23 × 23).

Figure 4.7: Location error (in pixels) versus the number of training epochs on the CIFAR-10 dataset.

Chapter 5

Conclusions

In this work, the DN algorithm has been compared with major well-known pattern recognition algorithms on global and local template based image matching problems. The results showed that the performance of the DN method is better than or comparable to that of the global template based methods, and comparable to that of the local template based methods. In the local template based problems, the DN allows different firing feature neurons to vote for a single neuron in TM and LM, which represent type and location, respectively (a minimal sketch of this voting step is given below). This experience-based voting has reached pixel-level location accuracy on the CIFAR-10 dataset. The work here indicates that the pay-off from finding better spatial features seems to have diminished. The DN method, tested in this work on space-only problems, is meant for a tight integration of space and time information for visual events in natural cluttered backgrounds.
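The following sketch illustrates the voting step under stated assumptions: the Y response vector after top-k competition and the bottom-up weight matrices of the TM and LM areas are given, and the winner in each motor area is taken by argmax. The names are our own illustration, not the thesis code.

    import numpy as np

    def vote_type_and_location(y_response, w_tm, w_lm):
        """Firing Y neurons vote for Z neurons through their bottom-up weights.

        y_response : (n_y,) sparse Y response vector after top-k competition
        w_tm       : (n_types, n_y) bottom-up weights of the type-motor neurons
        w_lm       : (n_locations, n_y) bottom-up weights of the location-motor neurons
        Returns the winning type index and location index.
        """
        type_votes = w_tm @ y_response  # each TM neuron accumulates weighted votes
        loc_votes = w_lm @ y_response   # each LM neuron accumulates weighted votes
        return int(np.argmax(type_votes)), int(np.argmax(loc_votes))

Because only the few firing Y neurons contribute, each vote reflects a learned association between a specific local feature and a specific type and location, which is how the experience-based voting reaches pixel-level location accuracy.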
In the work of [15, 16], it has been shown that using temporal information for spatial recognition in a DN considerably improves its recognition rate without imposing rigid constraints on object appearance through time, as is typical of model-based object tracking.

A possible future direction of research is the detection of object contours against richly textured backgrounds using a technique called synapse maintenance, as reported in [2]. The synapse maintenance function of the DN was not used in the experiments here because publicly available image datasets, like the ones used in this work, consist of snapshots of static scenes (a static object fixed against a static background), which are very different from what the human eye sees in the dynamic physical world, where the contours of unknown objects manifest themselves as an object moves relative to its cluttered background. This perspective indicates that learning directly from dynamic scenes can take into account information that has been overlooked by many publicly available computer vision datasets.

BIBLIOGRAPHY

[1] J. Weng and M. Luciw [2009] "Dually optimal neuronal layers: Lobe component analysis," IEEE Transactions on Autonomous Mental Development, Vol. 1, No. 1.

[2] Y. Wang, X. Wu, and J. Weng [2011] "Synapse maintenance in the 'where-what' networks," Proceedings of the International Joint Conference on Neural Networks, San Jose, California, USA.

[3] A. Coates, H. Lee, and A. Ng [2011] "An analysis of single-layer networks in unsupervised feature learning," Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, Volume 15 of JMLR: W&CP 15.

[4] D. Swets and J. Weng [1996] "Using discriminant eigenfeatures for image retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18.

[5] C.-W. Hsu, C.-C. Chang, and C.-J. Lin [2010] "A practical guide to support vector classification," Technical Report, National Taiwan University.

[6] Y. LeCun, F. Huang, and L. Bottou [2004] "Learning methods for generic object recognition with invariance to pose and lighting," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).

[7] A. Krizhevsky [2009] "Learning multiple layers of features from tiny images," Master's Thesis, Department of Computer Science, University of Toronto.

[8] J. Weng [2010] "A 5-chunk developmental brain-mind network model for multiple events in complex backgrounds," Proc. Int'l Joint Conf. Neural Networks, Barcelona, Spain, Pages 1-8.

[9] J. Weng and M. Luciw [2011] "Brain-like emergent spatial processing," IEEE Transactions on Autonomous Mental Development, accepted.

[10] B. Olshausen and D. Field [1996] "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, Vol. 381, Pages 607-609.

[11] D. Fitzpatrick [2000] "Seeing beyond the receptive field in primary visual cortex," Current Opinion in Neurobiology, Vol. 10, No. 4, Pages 438-443.

[12] J. Weng and M. Luciw [2006] "Optimal in-place self-organization for cortical development: Limited cells, sparse coding and cortical topography," Proc. 5th Int'l Conference on Development and Learning (ICDL'06), Bloomington, IN, Pages 1-7.

[13] M. Luciw, J. Weng, and S.
Zeng [2008] "Motor initiated expectation through top-down connections as abstract context in a physical world," IEEE Int'l Conference on Development and Learning, Monterey, CA, Pages 1-6.

[14] J. Weng [2011] "Three theorems: brain-like networks logically reason and optimally generalize," Proc. Int'l Joint Conference on Neural Networks, San Jose, CA, Pages 2983-2990.

[15] M. Luciw and J. Weng [2010] "Where-What Network 3: Developmental top-down attention with multiple meaningful foregrounds," Proc. International Joint Conference on Neural Networks, Barcelona, Spain, Pages 4233-4240.

[16] J. Weng [2011] "Why have we passed 'neural networks do not abstract well'?," Natural Intelligence: the INNS Magazine, Vol. 1, No. 1, Pages 13-22.

[17] M. Luciw and J. Weng [2009] "Laterally connected lobe component analysis: Precision and topography," Proc. IEEE 8th Int'l Conference on Development and Learning, Shanghai, China, Pages 1-8.

[18] M. Jordan and C. Bishop [1997] "Neural networks," CRC Handbook of Computer Science, CRC Press, Boca Raton, FL, Pages 536-556.

[19] Z. Tu, X. Chen, A. Yuille, and S.-C. Zhu [2005] "Image parsing: Unifying segmentation, detection, and recognition," Int'l J. of Computer Vision, Vol. 63, No. 2, Pages 113-140.

[20] B. Yao and L. Fei-Fei [2010] "Modeling mutual context of object and human pose in human-object interaction activities," Proc. Computer Vision and Pattern Recognition, San Francisco, CA, Pages 1-8.

[21] A. Gupta, A. Kembhavi, and L. S. Davis [2009] "Observing human-object interactions: Using spatial and functional compatibility for recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 10, Pages 1775-1789.

[22] R. Fisher [1936] "The use of multiple measurements in taxonomic problems," Annals of Eugenics, Vol. 7, Pages 179-188.

[23] S. Mika, et al. [1999] "Fisher discriminant analysis with kernels," IEEE Conference on Neural Networks for Signal Processing IX, Pages 41-48.

[24] R. Duda, P. Hart, and D. Stork [2000] "Pattern Classification (second edition)," Wiley Interscience, New York.

[25] C. Cortes and V. Vapnik [1995] "Support-vector networks," Machine Learning, Vol. 20, No. 3, Pages 273-297.

[26] T. Poggio and F. Girosi [1990] "Networks for approximation and learning," Proceedings of the IEEE, Vol. 78, No. 9, Pages 1481-1497.

[27] P. Phillips, H. Moon, P. Rauss, and S. Rizvi [2000] "The FERET evaluation methodology for face recognition algorithms," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 10.

[28] I. Goodfellow, Q. Le, A. Saxe, H. Lee, and A. Ng [2009] "Measuring invariances in deep networks," NIPS.

[29] H. Lee, A. Battle, R. Raina, and A. Ng [2007] "Efficient sparse coding algorithms," NIPS.

[30] H. Lee, C. Ekanadham, and A. Ng [2008] "Sparse deep belief net model for visual area V2," NIPS.

[31] J. Yang, K. Yu, Y. Gong, and T. Huang [2009] "Linear spatial pyramid matching using sparse coding for image classification," Computer Vision and Pattern Recognition.

[32] Y. Wang, X. Wu, and J. Weng [2011] "Brain-like learning directly from dynamic cluttered natural video," Proceedings of the 18th International Conference on Neural Information Processing.