This is to certify that the dissertation entitled SPARSE REPRESENTATIONS FOR IMAGE CLASSIFICATION presented by Ke Huang has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical and Computer Engineering.

Major Professor's Signature

Date

SPARSE REPRESENTATIONS FOR IMAGE CLASSIFICATION

By

Ke Huang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical and Computer Engineering

2007

ABSTRACT

SPARSE REPRESENTATIONS FOR IMAGE CLASSIFICATION

By

Ke Huang

In recent years, filter bank based approaches such as wavelet and wavelet packet transforms have been used extensively in image classification. One key issue in wavelet based classification methods is how to choose the best set of features (subbands). In the existing methods, each subband is evaluated separately or only the children subbands are compared with their parent subband for selection. In this thesis, we show that this subband selection method does not consider the dependence between different subbands in the selection process. In order to address this issue, we propose subband selection methods that take the dependence into account. We offer a theoretical and experimental analysis of the dependence among features from different subbands. Based on this analysis, the mutual information based subband selection (MISS) algorithm is proposed for subband selection based on feature dependence. The MISS algorithm is further improved by the subband grouping and selection (SGS) algorithm, which combines the dependence between subbands and the evaluation score of each subband. All of these methods result in a compact set of features for efficient image classification.

The development of efficient subband selection methods for image classification motivates us to consider the more general problem of sparse representation of images for classification. We propose a new approach in the framework of sparse representations by combining the reconstruction error, classification power and sparseness in a single cost function.

The formulation of the proposed sparse representation for image classification method is further improved by using the large margin method for the measure of discrimination. Based on this new and improved formulation, we can model the robust and sparse feature extraction with an optimization problem that can be solved by iterative quadratic programming.
In order to reduce the computational complexity required for the iterative quadratic programming, we propose decomposing the robust and sparse feature extraction into two steps, with the first step being sparse recon- struction and the second step being sparse feature selection and dimension reduction. For the second step, we propose a new method called large margin dimension reduc- tion (LMDR). LMDR integrates the idea of Ll-norm support vector machine (SVM) and distance metric learning for obtaining feature representation in a low dimensional space. Finally, to measure the goodness of features obtained by different methods, a mutual information based feature evaluation criterion is proposed. The proposed measure is independent of distance metric and classifier. Computation of mutual information in a high dimensional space is addressed by using the uncorrelated lin- ear discrimination analysis (ULDA). The proposed computational model effectively reduces the computational complexity of computing mutual information for feature evaluation. ACKNOWLEDGMENTS I would like to express my sincere gratitude to my advisor, Prof. Selin Aviyente, for her guidance, encouragement, and support in every stage of my graduate study. Her knowledge, kindness, patience, and vision have provided me with lifetime benefits. I am grateful to Profs. John Deller, Rong Jin and Hyder Radha in my thesis committee for their valuable comments and suggestions on the thesis drafts, as well as for the experience as a student with these three outstanding teachers. I would also like to thank many faculty members of MSU who were the instructors for the courses I took. The course works have greatly enriched my knowledge and provided the background and foundations for my thesis research. I want to take this opportunity to thank Dr. Michelle Yan, Hong Shen, Sam Zheng and Jason Tyan for their support before, during and after my internship at Siemens Corporate Research. The amazing internship provided me very valuable experience in both research and development, and aligned my expectation for a future job. Friends made at Siemens Corporate Research are another valuable treasure I got. My memory of graduate studies at MSU can never be complete without mentioning my fellow graduate students at MSU. The survivor story would be much tougher if without the accompanying by the students coming to MSU in the same year as me. Special thanks to Rensheng Sun couple for their countless help, to Yunhao Liu, Chen Wang and Zhenyun Zhuang for providing me encouragements and accommodation in the coldest winter. This thesis draws a period for my more than 20 years of education in schools. In addition to the thanks to my teachers and classmates in the old days, I must mention my parents, without whose love, nurturing, and support I could never accomplish all these. I dedicate this thesis to my parents. Finally, thanks to my wife, Gangli, for her love and patience. iv TABLE OF CONTENTS LIST OF TABLES ' vii LIST OF FIGURES viii 1 Introduction 1 1.1 Overview of Contributions ........................ 3 Wavelet Subband Selection for Image Classification 6 2.1 Background on Wavelet and Wavelet Packet Transforms ....... 6 2.2 Wavelet Packet - Feature Extraction .................. 11 2.3 Dependence Analysis ........................... 12 2.3.1 Theoretical Analysis ....................... 14 2.3.2 Simulation ............................. 16 2.4 Subband Selection with Dependence ................... 18 2.4.1 Mutual Information ........................ 
20 2.4.2 Computation of Mutual Information .............. 21 2.5 Subband Selection Combining Dependence and Energy ........ 24 2.5.1 Statistical Dependency Between Subbands ........... 25 2.5.2 Subband Grouping and Selection ................ 27 2.6 Experiments ................................ 29 2.6.1 Experiment Setting ........................ 29 2.6.2 Performance with Different Wavelet Bases ........... 32 2.6.3 Performance Using Subbands Selected from Two Wavelet Bases 34 2.6.4 The Clustering of Subbands in the SGS Algorithm ...... 39 2.6.5 Performance on Unseen Data .................. 41 2.6.6 Discussion ............................. 42 2.7 Conclusion ................................. 46 A Sparse Representation Framework for Signal Classification 47 3.1 Background ................................ 47 3.1.1 Problem Formulation ....................... 48 3.1.2 Reconstruction and Discrimination ............... 50 3.2 Sparse Representation for Signal Classification ............. 52 3.2.1 Problem Formulation ....................... 52 3.2.2 Problem Solution ......................... 54 3.3 Experimental Results ........................... 55 3.3.1 Synthetic Example ........................ 55 3.3.2 Sparse Classification with Hand Written Digit Images ..... 56 3.3.3 Experiments with the Object Recognition ........... 59 3.4 Conclusions ................................ 63 4 Sparse Representation for Signal Classification: An Optimization Approach ' 64 4.1 Introduction ................................ 64 4.2 Data Model ................................ 65 4.3 SRSC with Large Margin: An Optimization Model .......... 66 4.3.1 SRSC with Large Margin: Formulation ............. 66 4.3.2 SRSC with Large Margin: Iterative Optimization ....... 70 4.4 SRSC with Large Margin: A Two-Step Approximation ........ 71 4.4.1 Ll-norm Support Vector Machine ................ 72 4.4.2 Distance Metric Learning .................... 73 4.4.3 Large Margin Dimension Reduction ............... 76 4.5 Experiments ................................ 79 4.5.1 Experiments with Synthetic Data ................ 79 4.5.2 Experiments with the UCI Data ................. 81 4.5.3 Experiments with the Texture Data ............... 85 4.6 Summary ................................. 86 5 Feature Evaluation with Information Theoretic Measures 89 5.1 Introduction ................................ 89 5.2 Quantitative Evaluation of Features with Mutual Information . . . . 92 5.2.1 Fano’s Inequality ......................... 93 5.3 Mutual Information Computation .................... 94 5.3.1 Overview of Uncorrelated Linear Discriminative Analysis . . . 97 5.4 Experiments ................................ 99 5.4.1 Experiments with the UCI data ................. 99 5.4.2 Experiments with Energy Features for Wavelet Based Texture Classification ........................... 100 5.5 Conclusion ................................. 105 6 Summary and Conclusions 106 6.1 Summary ................................. 106 6.2 Future Work ................................ 108 6.2.1 Basis Design for Sparse Image Classification .......... 108 6.2.2 Combination of Data-Driven and Model Based Methods for Sparse Signal Representation and Classification ........ 109 6.2.3 Saliency Analysis with Sparse Representation ......... 111 BIBLIOGRAPHY 113 2.1 2.2 2.3 2.4 3.1 3.2 3.3 3.4 4.1 5.1 5.2 LIST OF TABLES Normalized energy Covariance with haar basis ............. Normalized energy covariance with db4 basis .............. 
Normalized energy covariance between db4 basis and haar basis Minimum Error Rates Achieved With Different Wavelet Bases and Different Selection Methods ....................... Classification errors with different levels of white Gaussian noise . . . Classification errors with different sizes of occlusion .......... Error rates with white Gaussian noise .................. Error rates with occlusions ........................ The properties of the data sets “iris” and “glass” ........... Classification Error Rates (in percentage) on “wine”, “iris” and “glass” Classification Error Rates with Different Energy Functions (in percent- age) .................................... vii 17 17 17 61 61 83 99 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14 2.15 2.16 LIST OF FIGURES A wavelet packet decomposition tree: (a) The original image, (b) The 2—level decomposition tree, (c) Projection of the original image onto each leaf node of the tree. ........................ Illustration of MISS algorithm: At each step, a subband in SR with lowest mutual information with all subbands in SU is moved from SR to SU. In this figure, each square represents a single wavelet subband. Dimension reduction in mutual information through a many-to-one mapping function ............................. A simplified illustration of SGS: subbands are first partitioned into different groups (3 ellipses) and then the subband with highest energy (filled dot) is selected from each group. ................. Brodatz texture images used in the experiments ............. The error rates with six subband selection algorithms with the ’Haar’ basis ..................................... The error rates with six subband selection algorithms with the ’dblO’ basis ..................................... The error rates with six subband selection algorithms with the ’bio28’ basis ..................................... The error rates with six subband selection algorithms by selecting sub- bands from the ’bio28’ basis and the ’db10’ basis. ........... The error rates with six subband selection algorithms by selecting sub- bands from the ’Haar’ basis and the ’bio28’ basis. ........... The error rates with six subband selection algorithms by selecting sub- bands from the ’Haar’ basis and the ’db10’ basis. ........... The wavelet basis functions at the first decomposition level for “haar” and “bio28”, in the first row and second row, respectively. In this figure, white regions correspond to high values and dark regions corre- spond to low values. ........................... 13 20 22 28 30 34 35 35 39 The subband labelling scheme for a 2—1evel wavelet packet decomposition. 40 Texture images used for classification on unseen data. First row: bars; Second row: grass; Third row: knit pattern. .............. The error rates with six subband selection algorithms with the ’bio28’ basis on unseen data ............................ The error rates with six subband selection algorithms with the ’bio28’ basis and training data from all big texture images. .......... viii 44 44 3.1 3.2 3.3 3.4 3.5 4.1 4.2 4.3 4.4 4.5 4.6 4.7 5.1 5.2 5.3 Distributions of projection of Signals from two classes with the first atom selected by: J1(X,)\) (the left figure) and J2(X,/\) (the right figure) .................................... Samples of the USPS hand written digit images ............. Corrupted data for the experiments: (a) with different degrees of white Gaussian noise; (b) With different sizes of occlusion. .......... 
(a) Object images with different levels of white Gaussian noise; (b) Object images with different sizes of occlusion. The images in the first line of (a) and (b) are noiseless. ..................... The relation between error rate and sparsity ............... Illustration of the max—margin linear classifier .............. Illustration of the max-margin linear classifier with samples that are not linearly separable. .......................... Illustration of the effect of transform for feature selection: (a) the original data distribution from two classes; (b) the distribution of the data after rotation. ............................ Feature distributions: (a) Original features; (b) Features obtained with LMDR; (c) Features obtained with the generalized LDA ....... The relation between classification error rate and feature dimension: Comparison between LDA, Ll-norm SVM and large-margin dimension reduction on (a) “iris” and (b) “glass” data sets ............. The projection of “iris” and “glass” data sets with LDA to 2D space. The relation between classification error rate and feature dimension: Comparison between Ll-norm SVM, LMDR and the combination of Ll-norm SVM and LMDR on texture data ................ MI computation with dimension reduction. The original feature X is in a high-dimensional space and the transformed feature Y is in a low-dimensional space. .......................... Estimation of mutual information on 3 different data sets using equal weight and LDA based method. ..................... Estimation of mutual information on 5 different energy features for wavelet based texture classification. ................... ix 56 57 58 60 62 67 69 75 82 84 85 87 94 100 103 CHAPTER 1 Introduction Image classification is an important research problem that has applications including computer vision, medical imaging and biometrics. A standard image classification system consists of feature extraction, feature selection, and classifier design [1—8]. The research in this dissertation focuses on the feature extraction, selection and evaluation aspects of an image classification system in the framework of transform (filtering) based feature extraction methods. The major objective is to obtain a compact and robust set of features for a given image set using multi-resolution transforms such as the wavelets, wavelet packets, directional wavelets and Gabor functions. Image classification usually relies on features extracted from transform or filtering operations [9]. The image classification discussed in this dissertation is also based on the filtering features. The general paradigm for feature extraction is to filter images with a set of filters and to compute energy values or other statistical values from the filtered images as features. The filters can be either data dependent or data independent. Data dependent filters are derived from a training data set. Principal component analysis (PCA) [10], linear discriminative analysis (LDA) [10,11] and K- SVD [12] are examples of data dependent filters. Wavelets [13], wavelet packet [13], directional wavelets [14] and Gabor filters [3] are example of data independent filters. For images with different texture structures, multi-scale data independent filter- ing, such as wavelets, has been applied to image classification [1,2,7]. The multi—scale structure of the wavelet filters is a good fit for the multi—scale structure that often appears in texture images. 
In this sense, the filtering results with the wavelet filters can discriminatively describe the texture structure for classification. In this disserta— tion, feature extraction and selection with data independent filtering using wavelets and wavelet packet are studied. Some of the techniques discussed in the dissertation can also be applied to features obtained with data dependent filtering. The first part of the proposed work focuses on the extraction of a set of features from wavelet and wavelet packet transforms. The proposed approach first quantifies the dependence between wavelet subbands and then exploits this property to select a compact set of features that are discriminative for the given image set. This method is then further improved by incorporating the individual discrimination power provided by each subband into the selection process. The promising results obtained with the proposed feature selection methods mo- tivate the formulation of a more general feature selection problem for image clas— sification. This more general feature selection problem can be stated as “How can we choose the ‘best’ set of multi—resolution transforms for discriminating between different image classes?” The ‘best’ set of transforms is defined by the compactness (sparseness) of the selected feature set, the robustness of the features in noisy environ— ments and the accuracy of the classification. Combining all of these requirements, we pose this more general problem using sparse representations. Sparse representations aim at finding a sparse and close approximation to a given signal using a large collec- tion of functions, called a dictionary [15—19]. In the proposed work, this framework is modified to address the question of image classification by defining a cost function that incorporates the discrimination and reconstruction abilities of the elements in the dictionary, as well as the sparseness of the selected feature vector. As part of the proposed work, the different aspects of this problem will be investigated including the selection of the dictionary elements, different cost functions to quantify the discrimi- nation power of the selected features and the tradeoff between reconstruction power and discrimination power. In the third part, the formulation of the proposed Sparse representation for im- age classification method is further improved by using the large margin method for the measure of discrimination. Based on this new and improved formulation, we can model the robust and sparse feature extraction with an optimization problem that can be solved by iterative quadratic programming. In order to reduce the computational complexity required for the iterative quadratic programming, we propose decompos- ing the robust and sparse feature extraction into two steps, with the first step being sparse reconstruction and the secOnd step being sparse feature selection and dimen- sion reduction. For the second step, we propose a new method called large margin dimension reduction (LMDR). LMDR integrates the idea of Ll-norm support vector machine (SVM) and distance metric learning for obtaining feature representation in a low dimensional space. In the fourth part of this dissertation, we propose a mutual information based feature evaluation criterion and a corresponding computational model, such that fea- tures obtained from different selection methods can be quantitatively evaluated. 
Tra- ditional feature evaluation based on empirical study is subjected to the selection of distance metric and classifier. The proposed measure is independent of distance metric and classifier, and reflects more objectively the goodness of a feature representation. Computation of mutual information in a high dimensional Space is usually involved in the model. To deal with this problem, we propose computing mutual information with the uncorrelated linear discrimination analysis (ULDA). The proposed compu- tational model effectively reduces the computational complexity of computing mutual information for feature evaluation. 1.1 Overview of Contributions The contributions of the thesis can be divided into four parts: subband selection with the wavelet packet analysis, a general framework of sparse representation for image classification, sparse representation for image classification with the large mar- gin method and large margin dimension reduction and classifier independent feature evaluation with mutual information. In wavelet packet based signal classification, sparse representation is realized through subband selection, i.e., selection of a subset of subbands for signal repre- sentation. The major contributions of this work can be summarized as follows: 1. The dependence among features extracted from different subbands is quantified. 2. The dependence between subbands is incorporated into the selection process to improve classification accuracy. 3. The selected features are less redundant than the existing methods, thus re- ducing the dimensionality of the feature vector. The reduced dimensionality of the feature vectors increases the robustness of image classification in noisy environments. 4. The proposed feature selection method is easily generalizable to other multi- resolution and directional transforms. In the second part of the proposed research, a more general framework of sparse representations for signal classification is introduced with the following contributions: 1. The current framework for the general sparse representations is modified by in- corporating both the reconstruction error and the discrimination power into the optimization problem. This flexibility improves the robustness of classification in noise. 2. The proposed framework will allow the design of adaptive basis functions that can best represent and discriminate between the given signals. 3. The proposed framework will also have Significant impact on a variety of clas- sification problems such as texture classification and object discrimination. In the third part of the proposed research, an improved formulation of sparse representations for signal classification is introduced and a new dimension reduction method is introduced with the following contributions: 1. The proposed framework of incorporating both the reconstruction error and the discrimination power into the feature extraction process is reformulated based on the large margin method. The new formulation makes the feature extraction problem solvable with the iterative quadratic programming method. 2. A more computationally efficient algorithm, large margin dimension reduction, is proposed for extracting discriminative features in the low dimensional space. In the fourth part of the proposed research, a mutual information based mea- sure is proposed for feature evaluation. A computational model is also prOposed for practically computing mutual information in this case. The contributions can be summarized as: 1. 
Feature evaluation based on empirical study is first Analyzed. Based on this analysis, a mutual information based feature evaluation method that is inde- pendent of distance metric and classifier is proposed. 2. An uncorrelated linear discrimination analysis based method is proposed for computing mutual information in the high dimensional space. Compared with the existing mutual information computational model, the computational model retains the accuracy of computation with reasonable complexity. CHAPTER 2 Wavelet Subband Selection for Image Classification 2.1 Background on Wavelet and Wavelet Packet Transforms Wavelet decomposition and its extension, wavelet packet decomposition have gained popular applications in the field of signal/ image processing and classification. For example, speech [20], EEG [21], and texture images [2,22] have been successfully clas- sified by statistical models built in the transform domain obtained from the wavelet decomposition. Wavelet transforms enable the decomposition of the image into dif- ferent frequency subbands, similar to the way the human visual system operates. This property makes it especially suitable for the segmentation and classification of texture images [1,2, 7, 22—27]. For the purpose of texture classification, appropriate features need to be extracted to obtain a discriminative representation as possible in the transform domain. A widely used wavelet feature is the energy of each wavelet subband [1,2, 7, 20,21,23,27, 28]. This idea of extracting energy features from filtered images can be traced back to [29], where a bank of band-pass filters were used for image analysis. A more recent survey on filtering based methods for texture classifica- tion can be found in [9]. In early research, such as in [2,28], features from all wavelet subbands are used for texture classification. However, it is common knowledge in the area of pattern recognition that proper feature selection is likely to improve the classification accuracy with fewer number of features [10]. At the same time, the overcomplete structure of the wavelet packet transform motivates the selection of the wavelet features for classification. For the widely used energy feature, since one en— ergy value is extracted from each subband, wavelet feature selection is thus equivalent to selecting a set of subbands for texture decomposition. Therefore, “wavelet feature selection” and “wavelet subband selection” are interchangeably used in this chapter. Note that different from the general subband selection method that aims at signal reconstruction, such as the entropy based subband selection method proposed in [30], wavelet feature selection does not require the selected subbands to reconstruct the texture image. At this point, it is important to make a distinction between wavelet feature selec- tion and the general feature selection discussed in pattern recognition [10,31]. Both selection methods aim at obtaining a compact representation of the texture for clas- sification. However, the two techniques are not exactly the same. First, for general feature selection methods, the explicit knowledge of the feature extraction process may not always be available. The input to the general feature selection process is usually a vector of values representing the different features without any a priori in- formation about how these features were obtained. 
On the other hand, the wavelet feature selection methods can take advantage of the tree structure of the wavelet decomposition for the selection process. For example, in this chapter, the wavelet packet decomposition structure is analyzed and used in determining the optimal set of subbands. Second, some selection methods based on component analysis, such as the principle component analysis (PCA), independent component analysis (ICA), and linear discriminant analysis (LDA), usually require that all of the original fea- ture components are available at both the training and testing stages for projecting features into a selected subspace. In the setting of wavelet packet decomposition, this means that a full decomposition is required at both the training and testing stages. On the other hand, for the wavelet feature selection method, an texture is only needed be decomposed into the wavelet subbands selected by the wavelet feature selection method during the training stage. In this sense, wavelet feature selection selection also reduces the computational complexity of feature extraction. One of the most well-known subband selection algorithms is based on the en- tropy cost function [30], where the entropy of the coefficients at the children nodes is compared to the entropy at the parent node. In this method, the “best” wavelet packet tree is chosen to minimize the entropy of the representation coefficients. As such, the main goal of this method is signal reconstruction and not classification. For this reason, subband selection methods that particularly aim at achieving compact signal representations for classification have been proposed. For example, in [1], the features are only extracted from subbands with high energy values higher than a pre- determined threshold value. The energy distribution over these subbands with high energy is then used as features for classification. In [32], each subband is evaluated based on the discrimination power of the extracted energy value, and the subbands with high discrimination power are selected for subsequent classification. A similar evaluation method is also employed in [21], where the decomposition tree is pruned by comparing the discrimination score of the parent and its children nodes. In [7], a neuro—fuzzy method is used for subband selection to obtain a compact representation. One important and effective principle guiding the feature selection process is to exploit the dependence relation among different feature components [22, 25, 33—37]. For wavelet feature selection, this principle indicates that the dependence between different subbands should be investigated and utilized. However, in the existing research on wavelet feature selection [1,7,21,30,32], either each subband is evaluated separately, or only the parent subband and children subbands are compared based on a pre—determined criterion and the wavelet subband selection is based on the evaluation. When each subband is evaluated separately, it is implicitly assumed that the features from different subbands are independent. When selection is based on comparing parent and children subbands, only the dependence between parent and children subbands is considered. Usually the assumption of independence between subbands does not hold, thus degrading the classification accuracy. The dependence between wavelet coefficients that have the “parent-child” or “sib- ling” relationship has been successfully measured by mutual information [38], or mod- elled by hidden Markov Model (HMM) [39,40]. 
It should be noted that the depen- dence of energy features among subbands is different from the dependence of wavelet coefficients among subbands. The statistical properties of wavelet coefficients of an image have also been analyzed and modelled [41,42]. It has been shown that incor- porating the dependence among wavelet coefficients improves the image compression efliciency [43] and image denoising performance. In this chapter, we are interested in the dependence of the extracted features, not the wavelet coefficients. Since the extracted features are a function of all the coefficients in a subband, the dependence between features is more complicated than the dependence between individual wavelet coefficients. The dependence among subbands can also be interpreted as the amount of redun- dancy among subbands. In classification, one can take advantage of this redundancy in features by two different approaches. In the first approach, the structure of re- dundancy among the transform coefficients is modelled with a parametric model and the parameters of the model are extracted as features for classification. Examples of this first approach can be seen in [22, 25,35], where the dependency among wavelet subband coefficients that have “parent-children” relation is modelled with the HMM and used for texture classification. In the second approach, first features are extracted from the transform coefficients and then the redundant features are removed, since the removal does not reduce information useful for classification. Algorithms presented in [33, 34, 36,37] belong to this second type of approaches. In [34], the “minimax entropy principle” is proposed to require that a newly added feature should be “very different” from the existing features and the degree of difference is measured by the changes in the entropy caused by the inclusion of a new feature. To illustrate the difference between the two types of approaches, consider the following simple exam- ple. Let X = [X1,X2] be a 2-dimensional binary feature vector describing a data sample. Assume that as the prior knowledge, it is known that X2 E X1. In the first type of dependency analysis, this dependence feature can be extracted as “the sec- ond component is the complement of the first component”. For the second method, the component X2 is simply removed since the inclusion of X2 does not provide any extra information for classification. In this chapter, we adopt the second type of approach, i.e., identifying the redundant features and then removing them. In this sense, the models developed in [38—40] are not suitable for the problem of wavelet feature selection for texture classification studied in this chapter. In this chapter, the dependence between energy values extracted from differ- ent subbands are analyzed and incorporated for wavelet feature selection. Differ- ent from the HMM model that only models dependence between “parent-children” subbands [22, 25, 35], in this chapter the dependence between features from all sub- bands are taken into account in the selection process. The analysis and the proposed algorithms can be easily extended for other features, such as entropy, kurtosis, etc. Applying the same analysis to other filter bank based methods, such as Gabor decom- position, is also straightforward. The dependence between energy features extracted from different subbands is theoretically analyzed and verified by the simulation re- sults. 
Based on this analysis, the Mutual Information based Subband Selection (MISS) algorithm is proposed for subband selection based on feature dependence. Experimental results show that dependence is an effective criterion for selecting subbands to obtain a compact feature representation. In order to further combine the dependence between the subbands and the evaluation score of individual subbands, the Subband Grouping and Selection (SGS) algorithm is proposed to incorporate both factors into the subband selection process.

The contributions of this chapter include demonstrating the dependence among extracted features; demonstrating that dependence can be used for effective subband selection; and combining the dependence between features from different subbands and the evaluation score of each individual feature for better subband selection. Although classification of texture images is used to show the effectiveness of the proposed methods, the focus of this chapter is not to propose a new feature extraction method for texture classification, but to identify the dependence among different components of wavelet features and to propose new subband selection methods for texture classification. This analysis and the proposed methods can be similarly applied to the classification of other signals (such as EEG and speech) using the wavelet packet transform.

The organization of the rest of this chapter is as follows. Section 2.2 briefly reviews the standard 2D wavelet packet decomposition and feature extraction. In Section 2.3, the dependence between energy features from different subbands is analyzed. Due to the lack of an accurate statistical model for natural images, simulation results for the covariance between energy values from different subbands are presented. Based on the analysis provided in Section 2.3, Section 2.4 proposes an algorithm for subband selection based on dependence. Section 2.5 proposes a second algorithm that combines the dependence between subbands and the evaluation score of each individual subband. Experimental results and related discussions are given in Section 2.6. Section 2.7 concludes the chapter with suggestions for possible future research.

2.2 Wavelet Packet - Feature Extraction

As an extension of the standard wavelets, wavelet packets represent a generalization of the multi-resolution analysis and use the entire family of subband decompositions to generate an over-complete representation of signals. In the 2D discrete wavelet packet transform (2D-DWPT), an image is decomposed into one approximation and three detail images. The approximation and the detail images are then decomposed into a second-level approximation and detail images, and the process is repeated. The standard 2D-DWPT can be implemented with a low-pass filter h and a high-pass filter g [13]. The 2D-DWPT of an N x M discrete image A up to level P+1 (P \leq \min(\log_2 N, \log_2 M)) is recursively defined in terms of the coefficients at level p as follows:

C_{p+1}^{4k}(i,j)   = \sum_m \sum_n h(m)\,h(n)\, C_p^k(m+2i,\, n+2j),      (2.1)
C_{p+1}^{4k+1}(i,j) = \sum_m \sum_n h(m)\,g(n)\, C_p^k(m+2i,\, n+2j),      (2.2)
C_{p+1}^{4k+2}(i,j) = \sum_m \sum_n g(m)\,h(n)\, C_p^k(m+2i,\, n+2j),      (2.3)
C_{p+1}^{4k+3}(i,j) = \sum_m \sum_n g(m)\,g(n)\, C_p^k(m+2i,\, n+2j),      (2.4)

where C_0^0 is the image A. At each step, the image C_p^k (with the value C_p^k(i,j) at position (i,j)) is decomposed into four quarter-size images C_{p+1}^{4k}, C_{p+1}^{4k+1}, C_{p+1}^{4k+2}, C_{p+1}^{4k+3}. A two-level wavelet packet decomposition is illustrated in Figure 2.1.

2D-DWPT decomposition allows us to analyze an image simultaneously at different resolution levels and orientations. Several energy functions can be used to extract features from each subband for classification. Commonly used energy functions include the magnitude |.|, the magnitude square |.|^2 and the rectified sigmoid |tanh(a .)| [9]. Both the magnitude and the magnitude square are parameter-free, while the rectified sigmoid function can be adjusted by the parameter a. The mean and the standard deviation of the subband coefficients can also be extracted as features [24,44]. All these definitions of energy are highly correlated. In this chapter, the definition of energy based on squaring is used. The energy in different subbands is computed from the subband coefficient matrix as

\sigma_p^2(k) = \sum_i \sum_j \left[ C_p^k(i,j) \right]^2,      (2.5)

where \sigma_p^2(k) is the energy of the image projected onto the subspace at node (p, k). The energy of each subband provides a measure of the image characteristics in that subband. The energy distribution has important discriminatory properties for images and as such can be used as a feature for texture classification.
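The decomposition in (2.1)-(2.4) and the energy feature in (2.5) are straightforward to prototype with standard tools. The sketch below is only an illustration of this feature-extraction step, not the implementation used in the thesis; it assumes that NumPy and the PyWavelets package (pywt) are available and that `image` is a 2-D grey-level array.

    import numpy as np
    import pywt

    def wavelet_packet_energies(image, wavelet="haar", level=2):
        """Decompose an image with the 2-D wavelet packet transform and return
        the squared-coefficient energy of every subband, as in equation (2.5).
        Subbands are keyed by their pywt node path, e.g. 'a', 'h', 'aa', 'dh'."""
        wp = pywt.WaveletPacket2D(data=image, wavelet=wavelet,
                                  mode="symmetric", maxlevel=level)
        energies = {}
        for lev in range(1, level + 1):
            for node in wp.get_level(lev):        # all subbands at this level
                coeffs = np.asarray(node.data, dtype=float)
                energies[node.path] = float(np.sum(coeffs ** 2))
        return energies

    # Example: a 2-level decomposition yields 4 + 16 = 20 subband energies.
    rng = np.random.default_rng(0)
    image = rng.normal(128.0, 12.0, size=(64, 64))
    features = wavelet_packet_energies(image, wavelet="haar", level=2)
    print(len(features))

In the thesis the decomposition depth is limited so that no subband is smaller than 16 x 16; in this sketch the parameter `level` plays that role and is left to the caller.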
2.3 Dependence Analysis

Wavelet packet decomposition is an orthogonal transform. However, it is observed that considerable dependence exists between coefficients in different subbands, especially for coefficients from subbands having "parent-child" or "sibling" relations in the decomposition tree. The dependence between coefficients has been modelled by the hidden Markov model (HMM) [39,40] and utilized for better image compression efficiency [43]. However, no analysis has been done on the dependence among features extracted from different subbands. In this section, covariance is used to analyze the dependence between energy values from any two subbands. The analysis is first conducted theoretically and then a simulation is used for a given image model.

Figure 2.1. A wavelet packet decomposition tree: (a) The original image, (b) The 2-level decomposition tree, (c) Projection of the original image onto each leaf node of the tree.

2.3.1 Theoretical Analysis

Based on equations (2.1) to (2.4), the wavelet coefficients at level (p+1) are obtained by convolving the wavelet coefficients at level p with the wavelet filters. The highest level in the wavelet packet decomposition tree is the original image A. Therefore, the coefficients in any subband can be written as a linear combination of pixel values in the original image A. The weighting coefficients in the linear combination are determined by the properties of the wavelet basis, i.e., the lengths and the values of the filters g and h. Considering the definition of subband energy in equation (2.5), the energy of the wavelet coefficients in a subband is a second-order polynomial in terms of the pixel values of the image A. The coefficients of the polynomial are determined by the filters g and h and the decomposition level. Denote the coefficient for the term A(i,j)A(i',j') in the representation of an energy function \sigma_p^2(k) as f_p^k(i,j,i',j'), so that

\sigma_p^2(k) = \sum_i \sum_j \sum_{i'} \sum_{j'} f_p^k(i,j,i',j')\, A(i,j)\, A(i',j').      (2.6)

Due to the downsampling, some of the coefficients f_p^k(i,j,i',j') are zero. Hence, the covariance between two energy values is defined as

\mathrm{Cov}(\sigma_{p_1}^2(k_1), \sigma_{p_2}^2(k_2)) = E[\sigma_{p_1}^2(k_1)\,\sigma_{p_2}^2(k_2)] - E[\sigma_{p_1}^2(k_1)]\, E[\sigma_{p_2}^2(k_2)].      (2.7)

Cov(\sigma_{p_1}^2(k_1), \sigma_{p_2}^2(k_2)) will be a fourth-order polynomial in terms of the pixel values in the image A.
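Before expanding (2.7) analytically below, note that the same covariance can also be estimated directly from data. The fragment that follows is a sketch rather than part of the thesis: it assumes a list of sample images is available, reuses the hypothetical `wavelet_packet_energies` helper from the previous sketch, and reports the covariance normalized as in (2.13) further below.

    import numpy as np

    def normalized_energy_covariance(images, band_a, band_b,
                                     wavelet="haar", level=1):
        """Sample estimate of the normalized covariance between the energy
        features of two wavelet-packet subbands over a set of images."""
        pairs = []
        for img in images:
            e = wavelet_packet_energies(img, wavelet=wavelet, level=level)
            pairs.append((e[band_a], e[band_b]))
        ea, eb = np.asarray(pairs).T
        cov = np.cov(ea, eb)                      # 2x2 sample covariance matrix
        return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])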
Using definitions (2.1) to (2.4), the correlation between the energy values of two children nodes, \sigma_1^2(0) and \sigma_1^2(1), is

E[\sigma_1^2(0)\,\sigma_1^2(1)] = E\left[ \sum_i \sum_j \left( \sum_m \sum_n h(m)\,h(n)\, x(m+2i,\, n+2j) \right)^2 \; \sum_p \sum_q \left( \sum_l \sum_r h(l)\,g(r)\, x(l+2p,\, r+2q) \right)^2 \right].      (2.8)

Given that g and h are FIR filters, the random variables in equation (2.8) are the different pixel values. Therefore, the covariance is a function of the 4th-order statistics of the original image A. If we have a statistical model for the pixels, the correlation value can be easily calculated. To illustrate this, suppose that the image A is of size 4 x 4. The image model can be described using the AR-1 Gaussian model, which is simple but often used in digital image processing [45]. A useful property of the AR-1 Gaussian model is that the marginal distribution of each pixel is also Gaussian [46]. The Gaussian AR-1 process with mean \mu is usually written in terms of a series of white noise innovation processes {E_n}:

X_n - \mu = a\,(X_{n-1} - \mu) + E_n,      (2.9)

where E_n \sim N(0, \sigma^2) are i.i.d. and |a| < 1. The marginal distribution is also normal:

X_n \sim N\left(\mu, \frac{\sigma^2}{1-a^2}\right).      (2.10)

Given the marginal distribution, equation (2.8) can be expanded as follows:

E[\sigma_1^2(0)\,\sigma_1^2(1)] = v_1(h,g)\,E[X_1^4] + v_2(h,g)\,E[X_1^3 X_2] + v_3(h,g)\,E[X_1^2 X_2^2] + v_4(h,g)\,E[X_1^2 X_2 X_3] + v_5(h,g)\,E[X_1 X_2 X_3 X_4],      (2.11)

where {X_1, X_2, X_3, X_4} are i.i.d. Gaussian random variables corresponding to different pixels distributed as given by equation (2.10), and {v_1(h,g), v_2(h,g), v_3(h,g), v_4(h,g), v_5(h,g)} are functions of the wavelet filters. Suppose that the size of the original image X is N x M, and define K = min(N, M). v_1(h,g) E[X_1^4] represents those terms on the right hand side of equation (2.8) where the indices of the four pixel values are exactly the same. Thus, v_1(h,g) E[X_1^4] contains exactly K terms. Similarly, v_2(h,g) E[X_1^3 X_2] contains 4K(K-1), v_3(h,g) E[X_1^2 X_2^2] contains 3K(K-1), v_4(h,g) E[X_1^2 X_2 X_3] contains 6K(K-1)(K-2) and v_5(h,g) E[X_1 X_2 X_3 X_4] contains K(K-1)(K-2)(K-3) terms from the right hand side of equation (2.8). Based on the marginal distribution, the expected values are given as follows:

E[X_1^4] = \mu^4 + 6\mu^2 \frac{\sigma^2}{1-a^2} + 3\left(\frac{\sigma^2}{1-a^2}\right)^2
E[X_1^3 X_2] = \mu^4 + 3\mu^2 \frac{\sigma^2}{1-a^2}
E[X_1^2 X_2^2] = \left(\mu^2 + \frac{\sigma^2}{1-a^2}\right)^2      (2.12)
E[X_1^2 X_2 X_3] = \mu^4 + \mu^2 \frac{\sigma^2}{1-a^2}
E[X_1 X_2 X_3 X_4] = \mu^4

Therefore, we can obtain the covariance between the energy features from two subbands analytically. In the next section, we compute the derived covariance values for a specific set of wavelet filters.

2.3.2 Simulation

Suppose that the Haar wavelet basis is used, i.e., g = {0.7071, 0.7071} and h = {-0.7071, 0.7071}. The parameters in equation (2.10) are chosen as follows: \mu = 128, \sigma = 12 and a = 0.8, which are standard values for a 256-valued grey image. The normalized covariance

\mathrm{N\text{-}Cov}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Cov}(X,X)}\,\sqrt{\mathrm{Cov}(Y,Y)}}      (2.13)

is used for measuring the dependence between the energies of different subbands. With these assumptions, the covariance matrix for energy values from any two subbands in the wavelet packet decomposition tree can be easily calculated. The size of the images at the first decomposition level is 2 x 2. The covariance matrix between the four energy values from the first decomposition level, \sigma_1^2(0), \sigma_1^2(1), \sigma_1^2(2), \sigma_1^2(3), is shown in Table 2.1. For the Daubechies 4-tap (2 vanishing moments) filters, where g = {0.4830, 0.8365, 0.2241, -0.1294} and h = {0.1294, 0.2241, -0.8365, 0.4830}, the same 4 x 4 covariance matrix is shown in Table 2.2.

Table 2.1. Normalized energy covariance with the Haar basis

              h_s1^2(0)  h_s1^2(1)  h_s1^2(2)  h_s1^2(3)
h_s1^2(0)       1.0000     0.5656     0.2765    -0.0669
h_s1^2(1)       0.5656     1.0000     0.2782    -0.0610
h_s1^2(2)       0.2765     0.2782     1.0000    -0.2705
h_s1^2(3)      -0.0669    -0.0610    -0.2705     1.0000

Table 2.2. Normalized energy covariance with the db4 basis

              d_s1^2(0)  d_s1^2(1)  d_s1^2(2)  d_s1^2(3)
d_s1^2(0)       1.0000     0.5560     0.5635     0.0191
d_s1^2(1)       0.5560     1.0000    -0.1294     0.2210
d_s1^2(2)       0.5635    -0.1294     1.0000    -0.0029
d_s1^2(3)       0.0191     0.2210    -0.0029     1.0000

(Here h_s1^2(k) and d_s1^2(k) denote the level-1 subband energies \sigma_1^2(k) computed with the Haar and db4 bases, respectively.)
In Table 2.3, the covariance between the four energy values computed from the subbands in the first decomposition level of the Haar basis and the four energy values for the corresponding subbands in the Daubechies 4-tap basis is given.

Table 2.3. Normalized energy covariance between the db4 basis and the Haar basis

              h_s1^2(0)  h_s1^2(1)  h_s1^2(2)  h_s1^2(3)
d_s1^2(0)      -0.0516    -0.0214    -0.0707     0.9870
d_s1^2(1)       0.1279     0.0295     0.3520     0.4322
d_s1^2(2)       0.0665     0.3608     0.0189     0.6090
d_s1^2(3)       0.2174     0.1972     0.0339     0.0666

These results show that various degrees of dependence exist between the energy features extracted from different subbands within a wavelet basis and from different wavelet bases. The wavelet basis, i.e., the high-pass and low-pass filter pair h and g, affects the degree of dependence. To analyze the dependence between energy values more accurately, more complicated statistical image models are needed.

The analysis in this section illustrates that, given the statistical model for the image pixels, we can calculate the covariance between the energy values of any two subbands. The results indicate that the wavelet basis has a great impact on the degree of dependence between energy values from different subbands. The energy values from subbands with different wavelet bases may have different degrees of dependence. Therefore, the assumption used in many image processing algorithms that different wavelet subbands yield independent energy values is incorrect. Thus, the independent subbands can be combined to form a sparse representation for classification.
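The entries of Tables 2.1-2.3 are obtained analytically from (2.11)-(2.12). As a rough numerical cross-check, which is not the procedure used in this chapter, one can also draw synthetic 4 x 4 images from the AR-1 model of (2.9) with mu = 128, sigma = 12 and a = 0.8 and estimate the same normalized covariances by Monte Carlo, reusing the two helpers sketched earlier. The row-wise AR-1 scan order below is an assumption, and the mapping between pywt node paths ('a', 'h', 'v', 'd') and the subband indices 0-3 of (2.1)-(2.4) depends on the filter ordering, so the estimates are only indicative and will not reproduce the table values exactly.

    import numpy as np

    def ar1_image(rows, cols, mu=128.0, sigma=12.0, a=0.8, rng=None):
        """Draw an image whose pixels follow the AR-1 Gaussian model of (2.9);
        each row is an independent AR-1 chain along the column index, and each
        pixel is marginally N(mu, sigma^2 / (1 - a^2))."""
        rng = np.random.default_rng() if rng is None else rng
        x = np.empty((rows, cols))
        x[:, 0] = rng.normal(mu, sigma / np.sqrt(1.0 - a * a), size=rows)
        for j in range(1, cols):
            x[:, j] = mu + a * (x[:, j - 1] - mu) + rng.normal(0.0, sigma, size=rows)
        return x

    rng = np.random.default_rng(1)
    images = [ar1_image(4, 4, rng=rng) for _ in range(5000)]
    # Roughly comparable to one off-diagonal entry of Table 2.1 (Haar, level 1).
    print(normalized_energy_covariance(images, "a", "h", wavelet="haar", level=1))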
2.4 Subband Selection with Dependence

In Section 2.3, an analytical procedure is given for quantifying the dependence between energy values from different subbands using different wavelet bases, given that the underlying image model is known. Although in most applications an accurate statistical image model is not available [47], the simulation in Section 2.3 still strongly supports the hypothesis that the energy values from different subbands, and even from different wavelet bases, may be dependent. This observation motivates us to select subbands based on dependence to generate a compact representation of the wavelet features for texture classification.

In real applications, due to the lack of suitable statistical image models, it is not likely that the dependence between features can be easily calculated analytically. However, the dependence relationship between energy values from subbands can still be estimated with nonparametric methods based on a training set. More specifically, given training samples, the dependence between the energy values from different subbands can be estimated empirically from the training data. Based on this estimation, independent subbands can be chosen to achieve a compact representation for the subsequent classification. In this chapter, Mutual Information based Subband Selection (MISS) is proposed to implement this subband selection process. Given a training texture set T_n, a testing texture set T_c for classification, and a set of subbands U = {S_1, S_2, ..., S_N} from the wavelet packet decomposition, the MISS algorithm can be formulated as follows.

1. For each texture A_i in T_n, extract the energy values for all subbands. Notice that each time the texture is decomposed, the subband size is decreased by a factor of 4. When the subband size is small, the energy of the subband will not be a robust measure. In the implementation, the minimum subband size is set to 16 x 16.

2. Divide the subbands into 2 sets: SU and SR. Initialize SU to contain the subband that has the highest energy and SR = U \ SU.

3. Move a subband S_i from SR to SU based on the following criterion: S_i is chosen such that the feature corresponding to S_i has the smallest mutual information with the features extracted from the subbands in the set SU, i.e.,

i = \arg\min_j \{ I(S_j; SU) : S_j \in SR \},      (2.14)

where S_i is the i-th wavelet subband. Here we use the mutual information I to measure the dependence. This process stops when the number of subbands in SU reaches a predefined value, which can be determined by cross validation. In the experiments presented in this chapter, this value is set as a parameter and the relation between the value of this parameter and the classification accuracy is studied. Finally, the subbands in the set SU are used to construct the compact representation for the given images. Note that since the goal is to classify the images, it is not required that images are reconstructed from SU, i.e., the compact representation does not necessarily form a basis.

4. Extract the energy values corresponding to the sparse representation SU for all images in the testing set T_c.

This algorithm tries to select subbands such that the energy features from these subbands are as independent as possible. It does not require that the chosen subbands construct a valid wavelet packet decomposition tree, as in [2,27,30]. A simple illustration of the MISS algorithm is given in Figure 2.2.

Figure 2.2. Illustration of the MISS algorithm: At each step, the subband in SR with the lowest mutual information with all subbands in SU is moved from SR to SU. In this figure, each square represents a single wavelet subband.

In step 3), the mutual information is used to quantify the dependence between a set of subbands in U and a single subband. Traditional measures of dependence, such as correlation and covariance, are not defined between a set of variables and a single variable, while mutual information is well-defined to measure such dependencies. The computation of mutual information will be discussed in more detail in the following two subsections.

2.4.1 Mutual Information

This subsection discusses the computation of mutual information, which is used in step 3) of the MISS algorithm. Consider two random variables X in \mathcal{X} and Y in \mathcal{Y} with a joint distribution p(x,y). The mutual information between X and Y is defined as

I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} = E_{xy}\left[\log \frac{p(x,y)}{p(x)p(y)}\right] = D(p(x,y) \,\|\, p(x)p(y)),      (2.15)

where D(.\|.) is the Kullback-Leibler distance between the two probability mass functions p(x,y) and p(x)p(y). When the logarithm function in the definition uses a base
Similar to the definition of con- ditional entropy, the conditional mutual information between random variables X and Y given Z is defined as: I(X; Y|Z) = H(X|Z) — H(XlY, z) = E,(,,,,,.,[log W] (2.17) The conditional information has an interpretation similar to that of mutual infor- mation. Given the above definitions, the mutual information between a set of random variables {X 1, X2, ..., X.) and a single random variable Y can be defined as: n I(X1,X2, ---, Xn; Y) = Z I(Xz'; YIXi—1,Xi—2, ---,X1)- (2-18) .=1 This definition of mutual information can be used for evaluating the independence between the features extracted from a set of subbands SU and a single subband 3,. The higher the value of the mutual information, the easier it is to estimate the distribution of S,- given SU. The mutual information thus provides a criterion for the subband selection process. 2.4.2 Computation of Mutual Information Considering the right hand side of equation (5.4), one practical difficulty in compu- tation is the estimation of high-dimensional joint pdfs (or equivalently, conditional pdf), due to the possible large Size of the subband set U. For example, if the texture 21 2. ~ f(o) _ P040 = Figure 2.3. Dimension reduction in mutual information through a many-to-one map- ping function is decomposed up to level 3 using wavelet packets, then there are 85 subbands. It is impractical and unreliable to estimate such high-dimensional distributions. To avoid the problem of “curse of dimensionality”, we assume that the set of random variables {X1, X2, ..., Xn} provides information to Y only through a many-to-one scalar map- ping T : f(X1,X2, ..., X"), in this sense, the mutual information I(X1,X2, ...,Xn; Y) can be calculated as I (T; Y), as illustrated in Figure 2.3. In this model, T is obtained by processing X. Thus, applying the data-processing theorem [48] yields the following inequality: I(X1,X2, ...,Xn;Y) 2 I(T; Y). (2.19) The equality is achieved if and only if T = f (X 1, X2, ..., X") is the sufficient statis- tics for {X 1 , X2, ..., Xn} [48]. Since this is rarely true in practice, finding a proper form for the function f (o) is important to minimize the error caused by the inaccurate as- sumption. Since the true mutual information I (X 1, X2, ..., Xn; Y) is the upper bound for the estimated mutual information I (T; Y), the function f (0) should maximize the estimation I (T; Y). For the convenience of computation, f (o) is usually assumed to be a linear function. With this assumption, the optimization is still nontrivial. In [49—51], more approximations are proposed for computing mutual information. The approximations include replacing the definition of mutual information in equa- 22 tion (2.15) with Renyi’s definition, and using Parzen’s window method for modelling pdfs in high dimensional spaces. Even with all these assumptions and approximations, the complex iterative optimization steps can only achieve locally Optimal solution that highly depends on the initialization of the numerical procedure. A similar mutual information computation problem arises in the statistical mod- elling of wavelet coefficients and is solved in [38,42]. In [38], a simple linear model was adopted, as follows: T = f(X1,X2, ...,Xn) = Z W.X.-. (2.20) i=1 Optimizing this model, i.e., the weighting coefficients {Wi}, incurs similar diffi- culties discussed in [49—51]. 
To avoid these difficulties, a simple equal weight model, i.e., W_i = 1/n, is used in [38], and the results from the computational model match the statistical properties of actual images well. An intuitive improvement on the equal weight model that exploits the "parent-children" and "sibling" relations among wavelet coefficients is also proposed in [38], but it shows results similar to those of the equal weight model. In this chapter, equal weights are used for the mutual information computation, i.e., W_i = 1/n. In this case, T is the unbiased estimate of the mean of {X_i}.

At this point, the accuracy of this simplified model for mutual information computation and its effect on the MISS algorithm deserve further discussion. In [38], different weighting models, such as equal weights and optimal linear weights, have been proposed. The experimental results show that the mutual information estimate does not vary much based on the weighting model. By using the equal weight model, the energy values in the set SU are averaged. In this process, the subbands with higher energy values play a more important role in the mutual information computation that determines the next selected subband. This approximation of the mutual information computation may cause some bias in the selection of the subbands: the next selected subband is the one least dependent on those subbands in SU with high energy values. This is consistent with the current knowledge that the wavelet subbands with higher energy values are more informative for texture classification. As shown in the experimental results in the later sections, the simplified model is able to achieve the goal of wavelet feature selection in the sense that a small number of subbands selected by the MISS algorithm achieve a relatively low classification error rate.

In the actual implementation, the value of the mutual information between two discrete vectors X and T needs to be computed in the discrete domain. Usually, the ranges of X and T are partitioned into N_X and N_T intervals, respectively [52,53]. The continuous pdf p(x,t) can be approximated by the 2D histogram {P_{X,T}(i,j), 1 \leq i \leq N_X, 1 \leq j \leq N_T}. Similarly, the marginal distributions p_X(x) and p_T(t) can be estimated by P_X(i) and P_T(j), respectively. Based on the estimated discrete probability distributions, the mutual information can be calculated as

\hat{I}(X;T) = \sum_i \sum_j P_{X,T}(i,j) \log \frac{P_{X,T}(i,j)}{P_X(i)\, P_T(j)}.      (2.21)

If the random processes X, T are stationary and ergodic, then the histogram reliably approaches the pdf and \hat{I}(X;T) converges to I(X;T).
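Equation (2.21) is a plug-in estimate computed from a joint histogram, and it is the only ingredient needed to run the MISS procedure of Section 2.4 as a greedy forward search. The sketch below is schematic rather than the implementation used in the thesis: the number of bins is a free parameter chosen here arbitrarily, `energies` is assumed to be an (n_images x n_subbands) matrix of training subband energies, and the dependence of a candidate subband on the selected set is measured against the equal-weight average T of (2.20), as described above.

    import numpy as np

    def mutual_information(x, t, bins=16):
        """Histogram (plug-in) estimate of I(X;T) as in equation (2.21), in bits."""
        joint, _, _ = np.histogram2d(x, t, bins=bins)
        p_xt = joint / joint.sum()                   # P_{X,T}(i, j)
        p_x = p_xt.sum(axis=1, keepdims=True)        # P_X(i)
        p_t = p_xt.sum(axis=0, keepdims=True)        # P_T(j)
        nz = p_xt > 0                                # empty cells contribute 0
        return float(np.sum(p_xt[nz] * np.log2(p_xt[nz] / (p_x @ p_t)[nz])))

    def miss_select(energies, n_select):
        """Greedy MISS subband selection (Section 2.4) on an
        (n_images x n_subbands) matrix of training subband energies."""
        n_subbands = energies.shape[1]
        # Step 2: initialize SU with the subband of highest average energy.
        selected = [int(np.argmax(energies.mean(axis=0)))]
        remaining = set(range(n_subbands)) - set(selected)
        # Step 3: repeatedly move the least dependent subband from SR to SU.
        while len(selected) < n_select and remaining:
            t = energies[:, selected].mean(axis=1)   # equal-weight surrogate T, eq. (2.20)
            j_star = min(remaining,
                         key=lambda j: mutual_information(energies[:, j], t))
            selected.append(j_star)
            remaining.remove(j_star)
        return selected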
SGS achieves this objective through two steps. First, the algorithm discovers the structure of statistical dependency between subbands; then it selects the subbands based on this structure using the evaluation scores of individual subbands. In order to discover the structure of statistical dependency between subbands, SGS employs a grouping algorithm that partitions the subbands into different sets based on their dependency on each other, i.e., ideally the subbands from the same set are dependent, while subbands from different sets are independent. The proper partitioning of all the subbands reveals the structure of statistical dependency between subbands. In the second step, the subband with the highest energy from each subset is selected for classifying images. With these two steps, SGS incorporates the previous research results in [1], and exploits the statistical dependency among subbands for a concise representation of the images.

2.5.1 Statistical Dependency Between Subbands

The general problem of discovering the structure of statistical dependency can be described as follows. Given a set of subbands U = {S_1, S_2, ..., S_N}, which is the full wavelet packet decomposition in our problem, find a partition {U_1, U_2, ..., U_M} of the set U such that

\bigcup_{i=1}^{M} U_i = U   and   U_i \cap U_j = \emptyset,  i \neq j   (2.22)

where M is the number of subsets. It is desired that the subbands from the same subset U_i are dependent on each other, while the subbands from different subsets are independent. If this property holds, then we can choose only one feature from each subset to construct a concise representation of the texture.

Discovering the structure of statistical dependency has been addressed in the literature. One of the recent research results in this area was introduced in [54], where a method based on hypothesis testing is proposed. Different hypotheses on subset partitions were compared based on log-likelihood. However, the number of all possible partitions of a set with M random variables is on the order of O(M^M), which is practically intractable. The hypothesis testing also requires evaluating the joint pdf of all the random variables in the same set. The estimation of the pdf in high-dimensional spaces is usually unstable and inaccurate, due to the sparsity of the training data [10]. To avoid this problem of estimating high-dimensional joint pdfs, we propose an approximate method for statistically partitioning the subbands based on pairwise dependency. More specifically, we can estimate the marginal pdf of the features extracted from each subband. Using these estimated marginal pdfs, the pairwise dependency between two subbands can be quantified. A data grouping algorithm is subsequently applied to the features from all subbands, generating the partition of the subbands. This grouping algorithm only requires pairwise dependencies, thus avoiding computations in high-dimensional spaces. With this method, the dependent subbands tend to be grouped into the same subset. Note that this method is an approximation to the discovery of the structure of statistical dependency. For example, given three random variables X, Y, Z, it is possible that P(X|Y) = P(X) and P(X|Z) = P(X), while P(X|Y, Z) \neq P(X) [55]. If only pairwise dependence is measured, X will not be grouped into the same subset as Y and Z, though the dependence is evident.
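A concrete instance of this failure mode (a toy illustration of my own, not from the original text) is the exclusive-or construction: with Y and Z independent fair bits and X = Y XOR Z, X is pairwise independent of both Y and Z, yet fully determined by the pair (Y, Z).

```python
import numpy as np

def mi_discrete(a, b):
    """Mutual information (nats) between two discrete sample vectors."""
    vals_a, vals_b = np.unique(a), np.unique(b)
    p = np.zeros((len(vals_a), len(vals_b)))
    for i, va in enumerate(vals_a):
        for j, vb in enumerate(vals_b):
            p[i, j] = np.mean((a == va) & (b == vb))
    pa = p.sum(1, keepdims=True)
    pb = p.sum(0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (pa @ pb)[nz])).sum())

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100_000)
z = rng.integers(0, 2, 100_000)
x = y ^ z                              # X is a deterministic function of (Y, Z)

print(mi_discrete(x, y))               # ~0: X pairwise independent of Y
print(mi_discrete(x, z))               # ~0: X pairwise independent of Z
print(mi_discrete(x, 2 * y + z))       # ~log 2: jointly, (Y, Z) determines X
```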
Another limitation lies in the data grouping algorithm itself. Research on data grouping, though conducted for several decades, is still far from perfect [10]. This indicates that an accurate partitioning of the subbands, such that the ones in the same subset are dependent while the ones from different subsets are independent, can only be approximated. Despite these limitations, the grouping algorithm, measuring pairwise dependency, is still able to uncover the dependence structure in the data.

In this chapter, the K-Means grouping algorithm is used to generate the partition of subbands. Given N subbands, M partition subsets (N > M), and the mean of each subset denoted {\mu_1, \mu_2, ..., \mu_M}, the K-Means algorithm can be described as follows [10].

1. Randomly select M subbands from the N subbands and set them as the initial class means {\mu_1, \mu_2, ..., \mu_M}.
2. Classify the N subbands. Each subband is assigned to the mean \mu_i with which it has the highest mutual information.
3. Recompute the means {\mu_1, \mu_2, ..., \mu_M} based on the partition of subbands from step 2.
4. Repeat steps 2 and 3 until the means do not change.

It is clear that K-Means is an iterative algorithm. The initial selection of means ("seeds") can affect the final partition results. To avoid this randomness in the final performance evaluation, we run the K-Means algorithm multiple times and the final evaluation is based on the average value from all runs.

2.5.2 Subband Grouping and Selection

The problem of discovering the structure of statistical dependency among subbands has been addressed in Section 2.5.1. Incorporating this partitioning algorithm into our method, the subband with the highest energy value in each subset is chosen for texture classification. In summary, given a training texture set T_n and a testing texture set T_c for classification, and a subband set U = {S_1, S_2, ..., S_N} from the wavelet packet decomposition, the SGS algorithm can be described as follows.

1. For each texture A_i \in T_n, extract the energy values of all subbands from the wavelet packet decomposition. Note that each time the texture is decomposed, the subband size is decreased by a factor of 4. When the size is too small, the energy of the subband is not a robust measure. In the implementation, the minimum subband size is set to 16 x 16.
2. For each subband S_i, estimate the marginal pdf of its energy value using the Parzen method with a Gaussian kernel [10] based on the training set.
3. For any two subbands S_i, S_j, use equation (2.21) to compute the mutual information between the extracted features using the pdfs from step 2.
4. Apply the K-Means grouping algorithm with the pairwise mutual information as the metric to generate the partition of the subband set U: {U_1, U_2, ..., U_M}.
5. Select the subband with the highest energy value from each subset. From the subset U_i, a subband S_{s_i} is selected such that:

s_i = \arg\max_{j} \{ E(S_j), S_j \in U_i \}   (2.23)

where E(S_j) is the energy feature extracted from the subband S_j and is defined by equation (2.5). The set of subbands used for representing the images is thus S_U = {S_{s_1}, S_{s_2}, ..., S_{s_M}}.
6. Extract the subband energies of all images in the testing set T_c. Use only the energy values corresponding to the subbands in the set S_U for classification.

The SGS algorithm attempts to select independent subbands with high energy values for classification. This point is reflected in steps 4 and 5 of the algorithm, where subbands are grouped according to the dependence relation and only subbands with high energy values are selected. Note that SGS does not require the chosen subbands to construct a valid wavelet packet decomposition tree, as in [2]. A simple illustration of SGS for a three-dimensional space is given in Figure 2.4.

Figure 2.4. A simplified illustration of SGS: subbands are first partitioned into different groups (3 ellipses) and then the subband with the highest energy (filled dot) is selected from each group.

Note that this figure is simplified for illustration. In the actual application, the extracted features are in a much higher-dimensional space and the different subbands may not be well separated.
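A compact sketch of the SGS pipeline (steps 1-6 above) is given below. This is an illustration under simplifying assumptions — a histogram estimate of the pairwise mutual information stands in for the Parzen estimate of step 2, and the K-Means-style grouping follows the description in Section 2.5.1 — not the thesis implementation.

```python
import numpy as np

def mi_hist(x, t, n_bins=16):
    """Histogram estimate of I(x; t), as in equation (2.21)."""
    p, _, _ = np.histogram2d(x, t, bins=n_bins)
    p /= p.sum()
    px, pt = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ pt)[nz])).sum())

def group_subbands(feats, n_clusters, n_iter=50, seed=0):
    """K-Means-style grouping: each subband joins the cluster mean with which
    it has the highest mutual information (steps 1-4 of the K-Means description)."""
    rng = np.random.default_rng(seed)
    n = feats.shape[0]
    means = feats[rng.choice(n, n_clusters, replace=False)].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        for s in range(n):
            labels[s] = int(np.argmax([mi_hist(feats[s], m) for m in means]))
        new_means = np.array([feats[labels == c].mean(axis=0) if np.any(labels == c)
                              else means[c] for c in range(n_clusters)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels

def sgs_select(feats, energies, n_clusters):
    """SGS: group the subbands, then keep the highest-energy subband per cluster."""
    labels = group_subbands(feats, n_clusters)
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if members.size:
            selected.append(int(members[np.argmax(energies[members])]))
    return sorted(selected)

# toy usage: 85 subbands, 432 training images, made-up energy features
rng = np.random.default_rng(1)
feats = rng.random((85, 432))          # energy feature of each subband per training image
energies = feats.mean(axis=1)          # average energy used in the selection step
print(sgs_select(feats, energies, n_clusters=20))
```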
2.6 Experiments

2.6.1 Experiment Setting

To evaluate the proposed wavelet subband selection algorithms, classification experiments are conducted on the Brodatz texture database [56]. All of the images used in the experiments are shown in Figure 2.5. All of the images are gray-valued and have a size of 512 x 512. Each texture is divided into 16 non-overlapping subimages with a size of 128 x 128. In this way, the data set for the experiments contains 54 different texture classes, with 16 images in each class, resulting in 864 images. This experimental setting of using training and testing images from the same large texture image is commonly used in the evaluation of texture classification methods, such as in [1,5,23,25,26]. It is known that the size of the subimages affects the classification results. Since the focus of this chapter is to demonstrate the effect of subband selection methods rather than to find a set of optimal parameters for texture classification, we fix the texture size to 128 x 128. For images of different sizes, the subband selection methods can be applied without any significant change.

Figure 2.5. Brodatz texture images used in the experiments.

The gray-scale images are first normalized to a given mean and variance. Denote A(i, j) as the gray-level value of pixel (i, j), and \mu and \sigma^2 as the mean and variance of A, respectively. The normalized image A' is given by:

A'(i, j) = \mu_0 + \sqrt{ \frac{\sigma_0^2 (A(i, j) - \mu)^2}{\sigma^2} }   if A(i, j) > \mu;   A'(i, j) = \mu_0 - \sqrt{ \frac{\sigma_0^2 (A(i, j) - \mu)^2}{\sigma^2} }   if A(i, j) \leq \mu   (2.24)

where \mu_0 and \sigma_0^2 are the predefined mean and variance of the adjusted image A', respectively. In the experiment, we set \mu_0 = 128 and \sigma_0 = 30.

The experiment includes two stages: the training stage and the testing stage. Half of the data set (27 x 16 = 432 images) is used for training and the other half is used for testing. In the training stage, each texture in the training set is decomposed with the wavelet packet transform up to 3 levels. Thus, for each texture, the number of subbands from a single wavelet basis is 1 + 4 + 16 + 64 = 85. A pre-determined number of subbands is selected by applying the subband selection algorithms on the training set T_n. In the training stage, due to the randomness of selecting the initial seeds for the K-Means grouping algorithm, SGS was run 10 times for the same training and testing data set and the results were averaged. For this experimental setting, it is found empirically that averaging over 10 runs gives a relatively stable result. Classification experiments are conducted on the testing set T_c, with the images decomposed only at the subbands selected in the training stage. In the testing stage, the classification error rates for different numbers of selected subbands are computed for performance evaluation. The classification error rate is computed using the KNN classifier with leave-one-out cross-validation [10]. All of the images in the testing set T_c are classified.
In each round, one texture from T_c is taken out and the normalized Euclidean distance defined in equation (2.25) between this texture and all the other images from T_c is computed based on the energy features extracted from the selected subbands. The distance on energy features is defined by the normalized Euclidean distance, which was shown to be effective for measuring texture similarity [57]:

d(F_1, F_2) = \sum_{i=1}^{n} \left( \frac{F_1(i) - F_2(i)}{\sigma_i} \right)^2   (2.25)

where F_1 and F_2 are two different wavelet energy feature vectors, and \sigma_i is the standard deviation of the feature extracted from subband i, estimated from the training set T_n. The K most similar images (i.e., the images with the smallest distances) determine the class label of the texture to be classified by a majority vote. After each texture in the testing set T_c is classified in this way, the ratio of the number of misclassified images to the total number of images in T_c is computed as the error rate. The value K in "KNN" is chosen to be 16. Considering that the number of images in each class is 16, we expect that most of the 16 most similar images come from the same class as the texture to be classified.

For comparison, existing algorithms for wavelet subband selection are implemented and tested. The different methods that are implemented for comparison
Therefore, using different wavelet bases will generate different energy distributions over subbands. The relationship between wavelet bases and classification performance was empirically discussed in [5]. In this chapter, three 32 wavelet bases are used for experiments: Haar, ’db10’ (Daubechies’ basis with 10 vanishing moments) and ’bio28’ (biorthogonal spline wavelets having 2 vanishing mo- ments in the decomposition filter and 8 vanishing moments in the reconstruction filter.) [13]. Since the total number of subbands for each wavelet basis is 85, the number of selected subbands is set to change from 5 to 85, with increments of 5. Figs. 2.6, 2.7 and 2.8 show the classification performance with the ’Haar’, ’db10’ and ’bio28’ wavelet bases, respectively. Several observations can be made from these figures. First of all, the dependence between features extracted from different subbands is useful for subband selection. This is verified by the performance of the MISS algorithm. For all of the three wavelet bases, MISS algorithm can achieve a classification accuracy comparable to the accuracy with the full decomposition using only 20 to 30 subbands. Second, the evaluation of individual subbands is also effective in subband selection. The “En- ergy”, “Entropy” and “Fisher” methods select subbands based on the evaluation of single subbands with different evaluation functions. For the “Haar” and “dblO” bases, MISS algorithm performs better than the three algorithms when smaller num- ber (less than 30) of subbands are selected. However for the “bio28” basis, “Fisher” method performs better for the same range. This shows that both factors (depen- dence between different subbands and the evaluation of individual subbands) affect the subband selection for classification. Third, the most prominent result is that SGS consistently outperforms the other algorithms. This is due to the fact that both dependence between subbands and the evaluation of individual subbands are incorpo- rated into the selection process. The advantage is more obvious when smaller number of subbands are selected. The number of selected subbands is the number of clusters in SGS algorithm. When the number of clusters is relatively small compared to the total number of subbands, the distribution of clusters reflect more reliably the un- derlying dependence structure, since more data points can be used to construct each cluster. Finally, the “Best Dee” algorithm does not achieve the best classification accuracy. The “Best Tree” is optimal in the sense of image reconstruction, not in the sense of classification. A small reconstruction error does not necessarily mean 33 Haar Basis 0.8 - , - o - Energy ‘, - o - MISS ‘, + Fisher ‘, - 0 - Entropy + SGS . + Best Tree 9 0’) -o Error Rate P h P N 100 20 40 so so Number of Selected Subbands Figure 2.6. The error rates with six subband selection algorithms with the ’Haar’ basis. a small classification error. For the wavelet packet decomposition of the Brodatz texture image, we find that the optimality criterion used in the “Best Tree” method always chooses to fully decompose an image, resulting in 5, 21 and 85 subbands for level 1,2 and 3 decompositions, respectively. This is why the “Best Tree” curves in the figures have only 3 data points. 2.6.3 Performance Using Subbands Selected from Two Wavelet Bases Traditional subband selection algorithm confines to selecting subbands that are gen- erated from a single wavelet basis. 
However, for classification, the pool of candidate subbands can be expanded to subbands generated by multiple wavelet bases. For example, we can combine subbands generated by the “Haar” basis and the “deO” basis. As long as the selected subbands form a set of features that are not correlated to each other and are discriminant in classification, it is not required that the se- lected subbands can be used to perfectly reconstruct the original image. Extension to selecting subbands from multiple wavelet bases does not require the modification of 34 db10 Basis .7 - 0 '3 - ° - Energy o_5 - 3 - 0 - MISS 3 ‘+ Fisher 0.5 - 9‘ ', - 0 - Entropy 3 “3 + SGS g 0.4 ‘3 + Best Tree 20 40 60 80 100 Number of Selected Subbands Figure 2.7. The error rates with six subband selection algorithms with the ’db10’ basis. bl028 Basis - 0 - Energy - o - MISS + Fisher - 6 - Entropy +SGS + Best Tree 20 4o 60 30 100 Number of Selected Subbands Figure 2.8. The error rates with six subbandselection algorithms with the ’bio28’ basis. 35 the algorithms described in Section 2.6.1 except that the pool of subbands to choose from has been enlarged. For the “Best Tree” algorithm, the features from subbands selected by the algorithm on different decomposition trees with diflerent bases are simply concatenated. _ In this subsection, experiments on selecting subbands from two wavelet bases are conducted for all of the subband selection methods mentioned in Section 2.6.1. Figs. 2.9, 2.10, and 2.11 Show the error rates obtained by the algorithms selecting subbands from “bio28” and “db10”,“Haar” and “bi028”, and “deO” and “Haar”, respectively. Each decomposition generates 85 subbands, so the total number of subbands for selection is 170. In order to make the results directly comparable with subband selection from a single wavelet basis, the number of selected subbands is still set to change from 5 to 85, with an increment of 5. In this modified experiment setting, the SGS still consistently outperforms other algorithms and the observations in Section 2.6.2 still hold here. It should be noted that the error rate obtained by selecting 85 subbands out of a possible 170 subbands from the two wavelet bases, regardless of the algorithm and the wavelet bases pairs used, is lower than the error rate using all 85 subbands from a single wavelet basis. This shows the advantage of selecting subbands from multiple wavelet bases or redundant dictionaries in general. More subbands can provide more information for classification and the proper selection algorithm can select the most informative subbands for classification. The minimum error rates achieved with different wavelet bases and combination of bases by applying different subband selection methods are summarized in Table 2.4 for comparison. Note that the different minimum error rates are not necessarily obtained with the same number of subbands. Based on the results in this table, the best classification performance is achieved by applying the SGS algorithm on the combination of “bio28” and “haar” bases. An empirical study on the relation between wavelet basis and the classification results on texture classification was discussed in [5]. Based on the conclusion from [5], the amount of shift variance in the decomposition filters in the wavelet basis is much more important than the regularity of the filters. 36 Table 2.4. 
Minimum Error Rates Achieved With Different Wavelet Bases and Different Selection Methods

                    Best Tree   Energy   Entropy   Fisher   MISS     SGS
  bio28             0.0766      0.0688   0.0766    0.0779   0.0675   0.0649
  db10              0.0831      0.0831   0.0688    0.0792   0.0714   0.0481
  haar              0.0809      0.0603   0.0809    0.0765   0.0750   0.0603
  bio28 and db10    0.1116      0.0785   0.0872    0.0802   0.0767   0.0488
  bio28 and haar    0.1228      0.0632   0.0842    0.0895   0.0596   0.0368
  haar and db10     0.1029      0.0663   0.0837    0.0767   0.0593   0.0558

Figure 2.9. The error rates with six subband selection algorithms by selecting subbands from the 'bio28' basis and the 'db10' basis.

This was used to explain why the "haar" basis performed better than the Daubechies wavelets for texture classification in [5]. Another empirical conclusion drawn from the experimental results in [5] is that even-length biorthogonal filters perform better than odd-length ones. These conclusions also support our finding that the combination of the "bio28" and "haar" bases achieves the best performance. Another support comes from the difference of the spaces spanned by the wavelet functions of "haar" and "bio28". As shown in Figure 2.12, the two wavelet functions have quite different shapes, and therefore the combination of the two is able to capture a variety of structures in texture images, such as smooth curves and sharp discontinuities.
The grouping algorithm is run 1000 times. Each time half of the 39 0000000000000009 Figure 2.13. The subband labelling scheme for a 2-level wavelet packet decomposition. 864 texture images are randomly selected for decomposition and the energy features are used for grouping. When the number of clusters is set to 2, 998 out of the 1000 runs give the same grouping results: All approximation subbands (subband 1, 2 and 6 in Figure 2.13) are in one cluster and the rest of the subbands are in another clus- ter. This result indicates that subbands corresponding to the same spatial frequency range usually have higher degree of dependence and tend to be grouped into the same cluster. We further change the number of clusters in the K-Means algorithm to 4 and conduct the grouping algorithm. The number of clusters is set to 4 be- cause there are 4 types of subbands in wavelet packet decomposition: LL, LH, HL, HH. The resulting subband clusters are used to build a co—occurrence CM with a size of 21 x 21, where the value at the entry (i, j) is the number of times the i-th and j-th subbands are in the same cluster, with the subband la- belling scheme given in Figure 2.13. Since the grouping algorithm is run 1000 times, the maximum value in the matrix is 1000 and the minimum value is 0. The phenomenon observed in the 2-cluster case appears again, i.e., the approxi- mation subbands are always grouped in the same cluster with CM(1,2),C'M(1,6) and CM(2, 6) larger than 990 and other co—occurrence values in rows correspond- ing to subbands 1,2,6 are small (less than 20). Some subbands corresponding to 40 similar spatial frequency are grouped into the same cluster most of the time. For example, CM(7, 10), CM(8, 14), CM(9, 18), CM(12, 15), CM(13, 19), CM(17, 20) all have value larger than 990, which means that more than 99% of the time, these subbands are grouped into the same cluster. The common property of these subband pairs is that the paths from the root to the two subbands are swapped. For example, the path from the root to subband 7 is “LL+LH” and the path to subband 10 is “LH+LL”. The empirical analysis of the grouping results is consistent with the filtering structure of the wavelet packet analysis and verifies the validity of the proposed SGS algorithm. 2.6.5 Performance on Unseen Data The experimental results discussed so far are based on the setting that the training and testing subimages are cropped from different regions in the same large texture image. Although this experimental setting is commonly used for conducting experi- ments on texture classification, such as in [1,5,23,25,26], having training and testing texture images from the same large image may cause bias in the classification result. Therefore, experiments are conducted for studying the performance of the proposed algorithms on unseen data to see if the proposed algorithms are still effective. For this purpose, 9 large texture images are selected and divided into 3 classes, as shown in Figure 2.14, with one class in each row. Each of these 512 x 512 images is divided into 16 non-overlapping 128 x 128 subimages. Subimages from the first larger image in each row are used for training and the subimages from the other two images are used for testing. In this case, the training set includes 48 samples and the testing set includes 96 samples. The wavelet basis “bio28” is selected for decomposing the images. Other experimental settings are the same as in the previous sections. 
Given these conditions, the classification results are shown in Figure 2.15. For comparison, experiments are also conducted by using subimages from each large texture image in the training set. In this case, 5 subimages are selected from the 16 subimages of each larger texture into the training set and the remaining 11 subimages are used for testing. The classification results of this setting are shown in Figure 2.16. From Figure 2.15 and Figure 2.16, similar conclusions can be drawn as in the previous ex- 41 periment setting, i.e., selection based on either the evaluation of individual subbands or the dependence between subbands is effective, and subband selection based on the two factors results in the best performance. The classification on the unseen data introduces some fluctuations in the performance curve, but does not change the con- clusion on the comparison of different selection algorithms. The minimum error rates obtained by the two experimental settings are comparable. The major difference in the results is that a small error rate is achieved with fewer subbands in the second experimental setting (Figure 2.16) compared to the modified experiments with unseen data (Figure 2.15). The reason for the observed fluctuation in the error rates and the slight increase in the number of subbands selected is because of the fact that the tested images have more variability in the unseen data case. Experiments are also conducted with other two wavelet bases, i.e., “hear” and “db10”, and with the three combinations of two wavelet bases. The results from these experiments are similar to those that use training and testing subimages from the same larger image except that the performance curve has more fluctuations. This consistence Shows the applicabil- ity and validity of the proposed algorithms to a wider setting of texture classification problems. 2.6.6 Discussion In most applications of wavelet transforms such as denoising and compression, the goal is to reconstruct the original signal/ image. The perfect reconstruction goal con- straints how the subbands are chosen. However, for classification, it is not required that the selected subbands can be used to reconstruct the original image. The major requirement for classification applications is that the features extracted from the se- lected subbands are uncorrelated and discriminative. With this criterion, the features from the selected subbands will form a compact and informative representation of the images for the purpose of classification. The work presented in this chapter is motivated by the observation that most existing subband selection methods implicitly assume that features from different subbands are independent. MISS algorithm Shows that the dependence can be used 42 Figure 2.14. Texture images used for classification on unseen data. [First row: bars; Second row: grass; Third row: knit pattern. 43 Classification Results 0.35 - —-— Energy 0.3 - +MISS -- Fisher 0.25 - + Entropy 2 -°-SGS E 0-2 ’ +Best Tree 2 0.1 5 — In 0.1 - 0.05 "o 20 40 so so 100 Feature Dimension Figure 2.15. The error rates with six subband selection algorithms with the ’bio28’ basis on unseen data. Classification Results 0.5r ‘-°- Energy + MISS 0-4 ’ -- Fisher -s— Entropy -°-SGS 0'3 ‘ + Best Tree Error Rate o 20 40 610 so 100 Feature Dimension Figure 2.16. The error rates with six subband selection algorithms with the ’bio28’ basis and training data from all big texture images. 44 for selecting a compact set of subbands for classification. 
However, independent features do not necessarily imply that the features are informative for classification. To select subbands that are both compact and informative, SGS algorithm is proposed to combine dependence and the evaluation of individual subbands for better subband selection. Experimental results show that SGS outperforms the algorithm based on dependence only (MISS) and the algorithms based on the evaluation of individual subbands only (“Energy”, “Fisher” and “Entropy”). It has also been shown that the subband selection algorithms can be easily extended to select subbands from multiple wavelet bases. The idea of incrementally selecting an informative subset of wavelet subbands used in the MISS algorithm bears some similarities to “feature pursuit” with “minimax entrOpy principle” [34]. In [34], the feature pursuit process selects a new feature that maximizes the decrease in entropy between the existing feature set and the new feature set obtained by adding the new feature to the existing feature set. The feature pursuit is proposed for texture synthesis with a parametric probability model, while the objective of the MISS algorithm is texture classification. This difference directly results in the difference in the criterion used in the feature pursuit and the MISS algorithm. For the feature pursuit, the change in entropy values of different feature sets is used for selecting new features, whereas in the MISS algorithm, the new subband is selected based on change in the mutual information. It is also important to note that different experimental settings for the texture classification can affect the classification accuracy. For example, the distance measure, changes in the texture size and extracting features other than energy from wavelet subbands may all affect the classification accuracy. However, the focus of this chapter is not to propose a new method that can outperform the state-of-the—art texture classification methods, but rather to investigate one factor (dependence) that affects subband selection and incorporate it into the selection process. The proposed subband selection method is applicable to filterbank based classification of other signal and texture. . 45 2.7 Conclusion In this chapter, the dependence between features extracted from wavelet packet de- composition is investigated and incorporated into subband selection for classification. The traditional methods implicitly assume independence among features extracted from different subbands or only consider the dependence between the parent subband and the corresponding children subbands. We first demonstrate that the dependence between features extracted from different subbands exist by theoretical analysis and simulation. An algorithm exploiting the dependence, MISS, is proposed for subband selection. Experimental results show that dependence among features from different subbands is effective for subband selection. By exploiting the dependence among fea- tures from different subbands and incorporating subband selection based on individual evaluation of subbands, SGS has been shown to be more effective in subband selec- tion. Experimental results indicate that SGS can effectively select smaller number of subbands to achieve lower classification error rates than existing subband selection methods. 46 CHAPTER 3 A Sparse Representation Framework for Signal Classification 3. 1 Background Sparse representations of signals have received a great deal of attentions in recent years [15—18, 35, 58,59]. 
Sparse representations address the problem of finding the most compact representation of a signal in terms of a linear combination of atoms in an overcomplete dictionary. Recent developments in multi-scale and multi-orientation representations of signals, such as the wavelet, ridgelet, curvelet and contourlet transforms, are an important incentive for the research on sparse representations. Compared to methods based on orthonormal transforms or direct time-domain processing, sparse representations usually offer better performance with their capacity for efficient signal modelling. Current research has focused on three aspects of sparse representations: pursuit methods for solving the optimization problem, such as FOCUSS [60], matching pursuit [15], orthogonal matching pursuit [16], basis pursuit [17], and LARS/homotopy methods [61]; the design of the dictionary, such as the K-SVD method [12]; and the applications of sparse representations to different tasks, such as signal separation, denoising, coding, and image inpainting [18,19,35,58,59,62]. For instance, in [59], sparse representations are used for image separation. The overcomplete dictionary is generated by combining multiple standard transforms, including the curvelet transform, ridgelet transform and discrete cosine transform. In [35,58], the application of sparse representations to blind source separation is discussed and experimental results on EEG data analysis are demonstrated. In [62], a sparse image coding method with the wavelet transform is presented. In [18], sparse representation with an adaptive dictionary is shown to have state-of-the-art performance in image denoising. The widely used shrinkage method for image denoising is shown to be the first iteration of basis pursuit that solves the sparse representation problem [19].

In the standard framework of sparse representations, the objective is to reduce the signal reconstruction error with as few atoms as possible. On the other hand, discriminative analysis methods, such as LDA, are more suitable for the task of classification. However, discriminative methods are usually sensitive to corruption in the signals due to a lack of the properties crucial for signal reconstruction. In this thesis, we propose a method for sparse representations for signal classification (SRSC), which modifies the standard sparse representation framework for signal classification. We first show that replacing the reconstruction error with discrimination power in the objective function of the sparse representation is more suitable for the task of classification. When the signal is corrupted, the discriminative methods may fail since the information contained in the discriminative analysis may be easily distorted by noise, missing data and outliers. To address this robustness problem, the proposed SRSC approach combines discrimination power, signal reconstruction and sparsity in the objective function for classification. Our ultimate goal is to achieve a sparse and robust representation of signals in noise for effective classification.

3.1.1 Problem Formulation

The problem of finding the sparse representation of a signal in a given overcomplete dictionary can be formulated as follows. Given an N x M matrix A containing the elements of an overcomplete dictionary in its columns, with M > N and usually M >> N, and a signal y \in R^N, the problem of sparse representation is to find an M x 1 coefficient vector x such that y = Ax and ||x||_0 is minimized, i.e.,
x = \arg\min_{x'} ||x'||_0   s.t.   y = Ax'   (3.1)

where ||x||_0 is the \ell_0 norm and is equal to the number of non-zero components in the vector x. Finding the solution to equation (3.1) is NP-hard due to its combinatorial nature. Suboptimal solutions to this problem can be found by iterative methods such as matching pursuit and orthogonal matching pursuit. An approximate solution is obtained by replacing the \ell_0 norm in equation (3.1) with the \ell_1 norm, as follows:

x = \arg\min_{x'} ||x'||_1   s.t.   y = Ax'   (3.2)

where ||x||_1 is the \ell_1 norm. In [63], it is proved that if a certain condition on the sparsity is satisfied, i.e., the solution is sparse enough, the solution of equation (3.1) is equivalent to the solution of equation (3.2), which can be efficiently solved by basis pursuit using linear programming. A generalized version of equation (3.2), which allows noise, is to find x such that the following objective function is minimized:

J_1(x; \lambda) = ||y - Ax||_2^2 + \lambda ||x||_1   (3.3)

where the parameter \lambda > 0 is a scalar regularization parameter that balances the tradeoff between reconstruction error and sparsity. In [64], a Bayesian approach is proposed for learning the optimal value of \lambda. The problem formulated in equation (3.3) has an equivalent interpretation in the framework of Bayesian decision theory, as follows [65]. The signal y is assumed to be generated by the following model:

y = Ax + \epsilon   (3.4)

where \epsilon is white Gaussian noise. Moreover, the prior distribution of x is assumed to be super-Gaussian:

p(x) \propto \exp\left( -\lambda \sum_{i=1}^{M} |x_i|^p \right)   (3.5)

where p \in [0, 1]. This prior has been shown to encourage sparsity in many situations, due to its heavy tails and sharp peak. Given this prior, the maximum a posteriori (MAP) estimate of x is formulated as

x_{MAP} = \arg\max_{x} p(x|y) = \arg\min_{x} \left[ -\log p(y|x) - \log p(x) \right] = \arg\min_{x} ||y - Ax||_2^2 + \lambda \sum_{i=1}^{M} |x_i|^p   (3.6)

When p = 0, equation (3.6) is equivalent to the generalized form of equation (3.1); when p = 1, equation (3.6) reduces to the \ell_1-penalized form in equation (3.3).

3.1.2 Reconstruction and Discrimination

Sparse representations work well in applications where the original signal y needs to be reconstructed as accurately as possible, such as denoising, image inpainting and coding. However, for applications like signal classification, it is more important that the representation is discriminative for the given signal classes than that the reconstruction error is small.

The difference between reconstruction and discrimination has been widely investigated in the literature. It is known that typical reconstructive methods, such as principal component analysis (PCA) and independent component analysis (ICA), aim at obtaining a representation that enables sufficient reconstruction, and thus are able to deal with signal corruption, i.e., noise, missing data and outliers. On the other hand, discriminative methods, such as LDA [10], generate a signal representation that maximizes the separation of the distributions of signals from different classes. While both types of methods have broad applications in classification, the discriminative methods have often outperformed the reconstructive methods for the classification task [11,66]. However, this comparison between the two types of methods assumes that the signals being classified are ideal, i.e., noiseless, complete (without missing data) and without outliers. When this assumption does not hold, the classification may suffer from the nonrobust nature of the discriminative methods, which contain insufficient information to successfully deal with signal corruptions.
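As a brief aside, the \ell_1-regularized problem in equation (3.3) can be solved with simple first-order methods. The following is a minimal iterative soft-thresholding (ISTA) sketch of my own, with a placeholder dictionary, step size and stopping rule; it is not the pursuit method developed later in this chapter.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, y, lam, n_iter=500):
    """Minimize ||y - A x||_2^2 + lam * ||x||_1  (equation (3.3)) by ISTA."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the quadratic term's gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)           # gradient of (1/2)||y - Ax||_2^2
        x = soft_threshold(x - grad / L, lam / (2 * L))
    return x

# toy usage: overcomplete random dictionary, sparse ground truth
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
x_true = np.zeros(256); x_true[[3, 70, 200]] = [1.5, -2.0, 0.8]
y = A @ x_true + 0.01 * rng.standard_normal(64)
x_hat = ista(A, y, lam=0.1)
print(np.count_nonzero(np.abs(x_hat) > 1e-3))      # a sparse estimate of x_true
```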
Specifically, the representation provided by the discriminative methods for optimal classification does not necessarily con- tain sufficient information for signal reconstruction, which is necessary for removing noise, recovering missing data and detecting outliers. This performance degrada- 50 tion of discriminative methods on corrupted signals is evident in the examples shown in [67]. On the other hand, reconstructive methods have shown successful perfor- mance in addressing these problems. In [18], the sparse representation is shown to achieve state-of-the-art performance in image denoising. In [68], missing pixels in im- ages are successfully recovered by inpainting method based on sparse representation. In [67,69], PCA method with subsampling effectively detects and excludes outliers for the following LDA analysis. All of these examples motivate the design of a new signal representation that com- bines the advantages of both reconstructive and discriminative methods to address the problem of robust classification when the signals are corrupted. The proposed method should generate a representation that contains discriminative information for classification, representative information for signal reconstruction and should be sparse. In current research [18,68], sparse representations have proven to be effective for signal reconstruction. Existing pursuit methods provide an efficient way to solve the sparse representation problem. Therefore, we choose the sparse representation as the basic framework for the SRSC and incorporate a measure of discrimination power into the objective function. With this new method, the sparse representation ob- tained by the proposed method contains both crucial information for reconstruction and discriminative information for classification, which enable a reasonable classifi- cation performance in the case of corrupted signals. The three objectives: sparsity, reconstruction and discrimination may not always be consistent. Therefore, weighting factors are introduced to adjust the tradeoff among these objectives, as the weighting factor A in equation (3.3). It should be noted that the aim of SRSC is not to improve the standard discriminative methods like‘LDA in the case of ideal signals, but to achieve comparable classification results with a few number of features when the sig- nals are corrupted. A recent work [67] that aims at robust classification shares some common ideas with the prOposed SRSC. In [67], PCA with subsampling proposed in [69] is applied to detect and exclude outliers in images and the rest of the pixels are used for computing LDA vectors. 51 3.2 Sparse Representation for Signal Classification In this section, the SRSC problem is formulated mathematically and a pursuit method is proposed to optimize the objective function. We first replace the term measuring reconstruction error with a term measuring discrimination power to show the different effects of reconstruction and discrimination. Further, we incorporate measure of dis— crimination power in the framework of standard sparse representations to effectively address the problem of classifying corrupted signals. The Fisher’s discrimination cri- terion [10] used in the LDA is applied to quantify the discrimination power. Other well-known discrimination criteria can easily be substituted. 3.2.1 Problem Formulation Given an N x 1 signal y and its M x 1 projection on an N x M (M > N) overcomplete dictionary A, i.e., y = Ax, we view x as the feature extracted from signal y for classification. 
The extracted feature should be as discriminative as possible between the different signal classes. Suppose that we have a set of K signals {y_1, y_2, ..., y_K}, with representations in the overcomplete dictionary {x_1, x_2, ..., x_K}, of which K_i samples are in the class C_i, for 1 \leq i \leq C. The mean m_i and variance s_i^2 for class C_i are computed in the feature space as follows:

m_i = \frac{1}{K_i} \sum_{j \in C_i} x_j ,   s_i^2 = \frac{1}{K_i} \sum_{j \in C_i} ||x_j - m_i||_2^2   (3.7)

The mean of all samples is defined as m = \frac{1}{K} \sum_{i=1}^{K} x_i. Finally, the Fisher discrimination power can be defined as:

F(X) = \frac{\sum_{i=1}^{C} K_i ||m_i - m||_2^2}{\sum_{i=1}^{C} K_i s_i^2}   (3.8)

Figure 3.1. Distributions of the projections of signals from two classes onto the first atom selected by J_1(X, \lambda) (left) and J_2(X, \lambda) (right).

In this simulation, most of the signal energy can be described by the sine function and most of the discrimination power is in the cosine function. The signal component with the most energy is not necessarily the component with the most discrimination power. We consider the dictionary {sin t, cos t}. Optimizing the objective function J_1(X, \lambda) with the pursuit method described in Section 3.2 selects sin t as the first atom. On the other hand, optimizing the objective function J_2(X, \lambda) selects cos t as the first atom. In the simulation, 100 samples are generated for each class and the pursuit algorithm stops after the first run. The projections of the signals from both classes onto the first atom selected by J_1(X, \lambda) and J_2(X, \lambda) are shown in Figure 3.1. The difference in the distributions of the two classes shown in the figures has a direct impact on the classification: linear classification on these distributions generates a classification error rate around 0.5 for the first case and 0.0 for the second case.

3.3.2 Sparse Classification with Handwritten Digit Images

Classification with J_1, J_2 and J_3 (SRSC) is also conducted on the database of USPS handwritten digits. The database contains 8-bit grayscale images of "0" through "9" with a size of 16 x 16, and there are 1100 examples of each digit. Some samples of the digits are shown in Figure 3.2.

Figure 3.2. Samples of the USPS handwritten digit images.

Following the conclusion of [72], 10-fold stratified cross-validation is adopted. Classification is conducted with the decomposition coefficients ('X' in equation (3.10)) as the selected feature and a support vector machine (SVM) as the classifier. The evaluation of the Fisher discrimination score implicitly assumes a Gaussian distribution with equal variance for each class. Therefore, the simple Euclidean distance is used to measure the distance between different samples. In this implementation, the overcomplete dictionary is a combination of the Haar wavelet basis and the Gabor basis. The Haar basis is good at modelling discontinuities in the image, whereas the Gabor basis is good at modelling continuous image components.

In this experiment, noise and occlusion are added to the images to test the robustness of SRSC. First, white Gaussian noise with increasing levels of energy, and thus decreasing levels of signal-to-noise ratio (SNR), is added to each image.
Table 3.1 summarizes the classification errors obtained with different SNRs. Second, black squares of different sizes are overlaid on each image at a random location to generate occlusion (missing data). For the image size of 16 x 16, black squares with sizes of 3 x 3, 5 x 5, 7 x 7, 9 x 9 and 11 x 11 are overlaid as occlusion. Table 3.2 summarizes the classification errors obtained with occlusion.

Figure 3.3. Corrupted data for the experiments: (a) with different degrees of white Gaussian noise; (b) with different sizes of occlusion.

Table 3.1. Classification errors with different levels of white Gaussian noise

        Noiseless   20dB     15dB     10dB     5dB
  J1    0.0855      0.0975   0.1375   0.1895   0.2310
  J2    0.0605      0.0816   0.1475   0.2065   0.2785
  J3    0.0727      0.0803   0.1025   0.1490   0.2060

Table 3.2. Classification errors with different sizes of occlusion

        no occlusion   3x3      5x5      7x7      9x9      11x11
  J1    0.0855         0.0930   0.1270   0.1605   0.2020   0.2990
  J2    0.0605         0.0720   0.1095   0.1805   0.2405   0.3305
  J3    0.0727         0.0775   0.1135   0.1465   0.1815   0.2590

In Table 3.1 and Table 3.2, the minimum error rate at each noise level is highlighted with bold font. The results in Table 3.1 and Table 3.2 show that in the case that the signals are ideal (without missing data and noiseless) or nearly ideal, J_2(X, \lambda) is the best criterion for classification. This is consistent with the known conclusion that discriminative methods outperform reconstructive methods in classification. However, when the noise is increased or more data is missing (with a larger area of occlusion), the accuracy based on J_2(X, \lambda) degrades faster than the accuracy based on J_1(X, \lambda). This indicates that the signal structures recovered by the standard sparse representation are more robust to noise and occlusion, and thus yield less performance degradation. On the other hand, SRSC demonstrates lower error rates by combining the reconstruction property and the discrimination power in the case that signals are corrupted.

3.3.3 Experiments with Object Recognition

Experimental Settings

Experiments are also conducted on the classification of images from 2 different objects in the COIL20 data set [73]. The focus of this experiment is to investigate the effect of different weighting values on the cost function J_3(X; \lambda_1, \lambda_2, \lambda_3). In the training stage, 12 images of each object with a size of 16 x 16 are used, and the remaining 60 images from each object are used for testing. Classification is conducted using the decomposition coefficients ('X' in equation (3.10)) as the features and the K-nearest-neighbor (KNN) classifier with the "leave-one-out" method. The Euclidean distance is used to measure the distance between features from different samples. In this implementation, the overcomplete dictionary is formed by combining the Haar wavelet basis, the Gabor basis and the contourlet basis. The Haar basis is good at modelling discontinuities in the signal. The Gabor basis, on the other hand, is good at modelling continuous signal components and texture. The contourlet is a multi-resolution and multi-directional wavelet transform that can efficiently model object contours. In summary, the over-complete dictionary has 200 Gabor atoms, 441 Haar atoms and 454 contourlet atoms.

Random Gaussian noise and occlusions are added to the original images to test the performance of SRSC with different parameter settings. First, white Gaussian noise with different levels of signal-to-noise ratio (SNR) is added to each image.
Second, black squares (zero values) of different sizes are overlaid on each image at the image center to generate occlusions. For the image size of 16 x 16, black squares with sizes 3 x 3, 5 x 5 and 7 x 7 are overlaid as occlusions. Several example images with different levels of Gaussian noise and occlusion are shown in Figure 3.4.

Figure 3.4. (a) Object images with different levels of white Gaussian noise; (b) Object images with different sizes of occlusion. The images in the first line of (a) and (b) are noiseless.

Since the three components of J_3(X; \lambda_1, \lambda_2, \lambda_3) may not be on the same scale, we first normalize the three factors before setting the weighting factors. Given a vector Z = [z_1, z_2, ..., z_M], where Z can be any one of the three terms, the normalization is implemented by the following equation:

\tilde{z}_i = \frac{z_i - \mu_Z}{\sigma_Z}   (3.15)

where \mu_Z and \sigma_Z are the mean and standard deviation of Z, respectively.

Study on the Effect of the Different Factors

Table 3.3 summarizes the classification error rates for the different weighting factors for different SNR values. Table 3.4 summarizes the classification error rates with different sizes of occlusions. In both tables, the minimum error rate of each column is underlined. In this experiment, the pursuit algorithm stops when 547 atoms are chosen, i.e., half of the total number of atoms.

Table 3.3. Error rates with white Gaussian noise

  [\lambda_1, \lambda_2, \lambda_3]   Noiseless   20dB     15dB     10dB
  [0,0,1]                             0.1000      0.2583   0.3083   0.3417
  [1,0,0]                             —           0.2417   0.3417   0.3833
  [1,1,0]                             0.0833      0.2167   0.3000   0.3250
  [0,1,1]                             0.0917      0.2583   0.2750   —
  [1,0,1]                             0.0833      —        0.2833   0.3667
  [1,1,1]                             0.1000      0.2500   —        0.3333

Table 3.4. Error rates with occlusions

  [\lambda_1, \lambda_2, \lambda_3]   no occlusion   3 x 3    5 x 5    7 x 7
  [0,0,1]                             0.1000         0.1167   0.1500   0.2000
  [1,0,0]                             —              0.1667   0.1750   0.2167
  [1,1,0]                             0.0833         0.1250   0.2417   0.3250
  [0,1,1]                             0.0917         0.1333   0.1583   0.2667
  [1,0,1]                             0.0833         0.1417   0.2250   0.3000
  [1,1,1]                             0.1000         —        —        0.2833

From the experimental results, we can observe several phenomena. First, in a noiseless environment, the Fisher discrimination ratio by itself results in the lowest error rate. This result is consistent with the existing research that discriminative methods are better than reconstructive methods for classification [11]. Second, as the signal is corrupted by Gaussian noise, or as more pixels are lost, the performance of all weighting schemes deteriorates. The different weighting schemes have different responses to noise and occlusions. In a noisy environment, the weighting vectors that combine discrimination power and reconstruction power usually achieve the best performance, as shown in the third and fourth columns of both tables. When the images are highly corrupted, as shown in the last column of both tables, the inclusion of discriminative power actually results in a higher classification error rate: the minimum error rate is achieved with \lambda_1 = 0, which means that the discriminative power is not considered. These results indicate that when the signal is corrupted beyond a certain amount, the discriminative power estimated from the signal is no longer reliable, and the reconstruction error provided by the framework of sparse representations and harmonic analysis can reduce the effect of noise and missing data to improve the classification accuracy.
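As a small sketch of how the weighting study is assembled (illustrative only; the per-atom criterion values, the weights and the sign convention below are placeholders), equation (3.15) is applied to each of the three criteria over the candidate atoms before they are combined:

```python
import numpy as np

def zscore(z):
    """Equation (3.15): centre and scale one criterion over the candidate atoms."""
    return (z - z.mean()) / z.std()

def combined_score(discrim, sparsity, recon, lam=(1.0, 1.0, 1.0)):
    """Weighted sum of the three normalized criteria, one value per candidate atom.
    discrim, sparsity, recon: 1-D arrays of placeholder criterion values."""
    l1, l2, l3 = lam
    return l1 * zscore(discrim) + l2 * zscore(sparsity) + l3 * zscore(recon)

# toy usage: 1091 candidate atoms with made-up criterion values,
# using the [1, 0, 1] weighting setting from Tables 3.3 and 3.4
rng = np.random.default_rng(0)
scores = combined_score(rng.random(1091), rng.random(1091), rng.random(1091),
                        lam=(1.0, 0.0, 1.0))
best_atom = int(np.argmin(scores))   # pick the next atom (treating each term as a cost)
```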
Figure 3.5. The relation between error rate and sparsity.

Study on Sparsity

The sparsity of the representation is incorporated in the objective function with the term \lambda_2 \sum_{i=1}^{K} ||x_i||_0. Sparsity is preferred in SRSC because it simultaneously provides a compact representation and filters out noisy signal components. This idea is similar to that of feature selection in pattern recognition. In SRSC, the effect of sparsity can be controlled in two ways. First, we can increase the value of \lambda_2 in J_3(X; \lambda_1, \lambda_2, \lambda_3) to emphasize sparsity. Second, we can reduce the number of iterations in the pursuit algorithm, i.e., the number of selected atoms. For the first case, we can compare the error rates for \lambda_2 = 1 and \lambda_2 = 0 in Tables 3.3 and 3.4. In most cases, especially when the images are corrupted, emphasizing the sparsity of the representation by increasing \lambda_2 results in a more discriminative feature, improving the classification accuracy. For the second case, we study the relation between sparsity and the classification error of SRSC when the images are occluded with a 3 x 3 square and the weighting factors are fixed: \lambda_1 = \lambda_2 = \lambda_3 = 1. The experimental results are shown in Figure 3.5. In this experiment, we change the number of selected atoms from 100 to 1000, with increments of 100. The minimum error rate is achieved with only 200 atoms out of a total of 1091 atoms, i.e., with less than 20% of the atoms. We also note that beyond 500 atoms, the error rate monotonically increases with the number of atoms. The results indicate that not all atoms provide useful information for classification: decomposition of the signals onto certain atoms actually adds noise rather than improving the classification. The results also show the effectiveness of the pursuit algorithm in selecting a small number of atoms that achieve a low error rate.

3.4 Conclusions

SRSC is motivated by the ongoing research in the area of sparse representations in signal processing. SRSC incorporates reconstruction error, discrimination power and sparsity for robust classification. In the current implementation of SRSC, the weighting factors are set empirically to optimize the performance. Approaches to determine optimal values for the weighting factors are being investigated, following methods similar to the one introduced in [64].

The experimental results clearly show that the different parameters have a significant effect on the atom selection process and, correspondingly, on the subsequent feature extraction and classification performance. For example, when \lambda_1 = 1, \lambda_2 = 0 and \lambda_3 = 0, SRSC becomes a sequential version of the traditional LDA method. On the other hand, when \lambda_1 = 0, \lambda_2 = 1 and \lambda_3 = 1, SRSC becomes the traditional sparse representation. As reflected by the experimental results, a single parameter setting cannot achieve the best performance for all levels of signal corruption. It is therefore interesting to investigate the optimal values of the parameters for different situations. Investigating the optimal number of iterations for the pursuit algorithm is another interesting problem, since it also has a considerable effect on the classification accuracy.

CHAPTER 4

Sparse Representation for Signal Classification: An Optimization Approach

4.1 Introduction

The SRSC framework proposed in Chapter 3 presented a promising method for robust and sparse feature extraction from images for classification.
However, using Fisher's ratio to measure discrimination makes the objective function non-convex. This limitation makes it hard to find the optimal set of features. For this reason, an algorithm similar to matching pursuit was applied, selecting one atom at a time. This approach provides good classification accuracy, but it is still suboptimal. In this chapter, the SRSC method is improved by replacing Fisher's ratio with the class margin criterion used in the support vector machine to quantify the discrimination power. Using this new measure of discrimination, the SRSC problem can be formulated as a constrained optimization problem that can be iteratively optimized with two quadratic programming problems in each iteration. In order to further reduce the computational complexity, the SRSC problem is decomposed into two steps, with the first step being sparse reconstruction and the second step being sparse classification. For the sparse classification step, a new algorithm called Large Margin Dimension Reduction (LMDR) is proposed for obtaining sparse features. The formulation of LMDR incorporates the advantages of the L1-norm SVM [74] and distance metric learning [75] into one framework, using the idea of distance metric learning to search for an optimal linear transform of the original features and the idea of the L1-norm SVM to determine the significant feature components. In this way, LMDR can effectively identify the underlying features that compose the sparse feature representation for classification. LMDR is applicable to dimension reduction problems in general-purpose classification, and performs better than the L1-norm SVM and linear discriminative analysis (LDA) in certain circumstances.

The rest of this chapter is organized as follows. Section 4.2 briefly reviews the data representation model and defines the general notation used in this chapter. Section 4.3 proposes a new formulation of SRSC with the large margin method and its optimization with iterative quadratic programming. Dividing SRSC into two steps and employing LMDR in the second step are discussed in Section 4.4. Experiments and discussions are given in Section 4.5.

4.2 Data Model

A dictionary A ∈ R^{k×d} is a matrix in which each column corresponds to an atom. Usually the number of atoms is much larger than the dimension of each atom, i.e., k << d in traditional sparse representation problems. The set of observed signals Y = [y1, y2, ..., yn] is a set of k-dimensional (y_i ∈ R^k) training data with class labels C = {c1, c2, ..., cn} (c_i ∈ Z^+). Based on this training data set, we want to learn the representation of Y using selected atoms in A, such that Y is representable by the dictionary A and the representation also has some other desired properties. In this formulation, X = [x1, x2, ..., xn], and x_i ∈ R^d is the decomposition of y_i on the dictionary A. We can then use the decomposition coefficients X as features for classification. Mathematically, these goals can be formulated as a single objective function G(X) that needs to be optimized. For example, in the traditional sparse representation, G(X) includes both the signal reconstruction error and the sparsity measure. In the SRSC problem formulated in Chapter 3, G(X) includes the signal reconstruction error, the sparsity measure and a discrimination measure.
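As a rough illustration of this data model, the sketch below builds an overcomplete random dictionary and evaluates a G(X) of the traditional form (reconstruction error plus an ℓ1 sparsity surrogate). The dimensions, weights and data are arbitrary placeholders, not the settings used later in the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 16, 64, 40             # signal dimension, number of atoms, number of signals

A = rng.standard_normal((k, d))  # dictionary: one atom per column
A /= np.linalg.norm(A, axis=0)   # normalize atoms to unit length
Y = rng.standard_normal((k, n))  # observed signals (placeholder data)
X = rng.standard_normal((d, n))  # candidate decomposition coefficients

def g_traditional(X, lam=0.1):
    """Traditional sparse-representation objective: reconstruction error + sparsity."""
    recon_err = np.sum((Y - A @ X) ** 2)
    sparsity = np.sum(np.abs(X))          # l1 surrogate for the l0 count
    return recon_err + lam * sparsity

print(g_traditional(X))
```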
4.3 SRSC with Large Margin: An Optimization Model

The large margin principle used in the design of support vector machines (SVM) aims at maximizing the margin between the distributions of different classes. Given linearly separable labelled features from two classes, S = {(x1, c1), (x2, c2), ..., (xn, cn)} with c_i ∈ {−1, 1}, the hyperplane (w, b) realizes the maximum margin classifier by solving the following optimization problem:

\[
\min_{w,\,b} \ \langle w, w \rangle \quad \text{s.t.} \quad c_i(\langle w, x_i \rangle + b) \ge 1, \tag{4.1}
\]

where ⟨·,·⟩ is the inner product, i.e., ⟨w, x⟩ = wᵀx. An intuitive explanation of this formulation lies in the connection between w and the margin γ between the data distributions of the two classes: γ = 1/‖w‖₂. An illustration of the large margin principle is shown in Figure 4.1. In the SVM method, this optimization problem can be solved through the dual of equation (4.1) with quadratic programming.

4.3.1 SRSC with Large Margin: Formulation

Similar to Fisher's ratio, the margin between two classes provides a measure of feature discrimination. One advantage of large margin classification is that it usually has good generalization capacity, i.e., better performance than other classifiers on unseen data, as shown by statistical learning theory [76, 77]. This motivates the application of the large margin principle as the discrimination measure in SRSC, forming a new objective function for SRSC:

\[
\min_{w,\,b,\,x_i} \ \lambda_1 \langle w, w \rangle + \lambda_2 \sum_{i=1}^{n} \|x_i\|_1 + \lambda_3 \sum_{i=1}^{n} \|y_i - A x_i\|_2^2
\quad \text{s.t.} \quad c_i(\langle w, x_i \rangle + b) \ge 1, \tag{4.2}
\]

where λ1, λ2 and λ3 are non-negative weighting factors.

Figure 4.1. Illustration of the max-margin linear classifier.

The advantage of this optimization formulation is that it integrates feature extraction and classifier design into a single step. Once solved, the feature is sparse, robust to noise (by incorporating the reconstruction error) and discriminative (large margin). Since the ℓ1 norm is not differentiable, a common trick is to introduce two non-negative variables u_i and v_i: u_i accounts for the positive components of x_i, and v_i accounts for the absolute values of the negative components of x_i. With these two variables, x_i = u_i − v_i, ‖x_i‖₁ = 1ᵀ(u_i + v_i), u_i ≥ 0 and v_i ≥ 0. This conversion of the minimum-ℓ1 optimization to a linear representation is known as the "least absolute deviation" problem. Plugging this representation into equation (4.2), the problem can be formulated as:

\[
\begin{aligned}
\min_{w,\,b,\,u_i,\,v_i} \quad & \lambda_1 w^T w + \lambda_2 \sum_{i=1}^{n} 1^T(u_i + v_i) + \lambda_3 \sum_{i=1}^{n} [y_i - A(u_i - v_i)]^T [y_i - A(u_i - v_i)] \\
\text{s.t.} \quad & c_i[w^T(u_i - v_i) + b] \ge 1, \quad u_i \ge 0, \quad v_i \ge 0.
\end{aligned} \tag{4.3}
\]

Considering that the samples are not exactly linearly separable, slack variables ξ_i are introduced. The effect of ξ_i is illustrated in Figure 4.2. For the distribution shown in Figure 4.2, the data samples from the two classes are not linearly separable and the constraints c_i[wᵀ(u_i − v_i) + b] ≥ 1 cannot all be satisfied; therefore, the optimization problem has no solution. The slack variables ξ_i allow a controlled amount of misclassification and extend the applicability of the SVM to data that are not linearly separable. With the slack variables, the formulation can be rewritten as follows:

\[
\begin{aligned}
\min_{w,\,b,\,u_i,\,v_i,\,\xi_i} \quad & \lambda_1 w^T w + \lambda_2 \sum_{i=1}^{n} 1^T(u_i + v_i) + \lambda_3 \sum_{i=1}^{n} [y_i - A(u_i - v_i)]^T [y_i - A(u_i - v_i)] + \lambda_4 \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & c_i[w^T(u_i - v_i) + b] \ge 1 - \xi_i, \quad u_i \ge 0, \quad v_i \ge 0, \quad \xi_i \ge 0.
\end{aligned} \tag{4.4}
\]

Both the objective function and the constraints of this optimization problem are quadratic.

Figure 4.2. Illustration of the max-margin linear classifier with samples that are not linearly separable.
If there is only one constraint, the problem can be solved efficiently by the S-procedure (see Appendix B of [78]). However, the number of constraints is equal to the number of training samples, which is larger than one in all cases. Moreover, this is not a convex optimization problem: it can easily be verified that the constraint c_i[wᵀ(u_i − v_i) + b] ≥ 1 is not convex in the variables w, u_i, v_i and b. To solve this problem, we adopt the following alternating optimization approach.

4.3.2 SRSC with Large Margin: Iterative Optimization

The major source of difficulty in the optimization model of equation (4.2) comes from the objective function, which integrates feature extraction (deriving x_i from y_i) and classifier design (w and b are unknown) into a single step. Even though this integration has valid theoretical motivations, it introduces non-convexity into the optimization problem. To make the optimization tractable, an iterative optimization strategy is adopted. Mathematically, fixing the value of either w or X turns the problem into a quadratic programming problem, which is a convex and tractable optimization problem. The implementation of this iterative alternating optimization strategy can be formulated as:

1. Initialization: set w = w0 and b = b0, where w0 and b0 are random initial values.

2. Feature extraction step:

\[
\begin{aligned}
(u_i, v_i) = \arg\min_{u_i,\,v_i} \quad & \lambda_1 w^T w + \lambda_2 \sum_{i=1}^{n} 1^T(u_i + v_i) + \lambda_3 \sum_{i=1}^{n} [y_i - A(u_i - v_i)]^T [y_i - A(u_i - v_i)] + \lambda_4 \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & c_i[w^T(u_i - v_i) + b] \ge 1 - \xi_i, \quad u_i \ge 0, \quad v_i \ge 0.
\end{aligned} \tag{4.5}
\]

3. Update the parameters of the large margin classifier:

\[
\begin{aligned}
(w, b, \xi_i) = \arg\min_{w,\,b,\,\xi_i} \quad & \lambda_1 w^T w + \lambda_2 \sum_{i=1}^{n} 1^T(u_i + v_i) + \lambda_3 \sum_{i=1}^{n} [y_i - A(u_i - v_i)]^T [y_i - A(u_i - v_i)] + \lambda_4 \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & c_i[w^T(u_i - v_i) + b] \ge 1 - \xi_i, \quad \xi_i \ge 0.
\end{aligned} \tag{4.6}
\]

4. Repeat steps 2 and 3 until a given number of alternating steps is reached or the changes in the parameters w, b, ξ_i, u_i, v_i are less than a predefined value.

One practical problem with this approach is the computational complexity of the quadratic programming with the high-dimensional vectors u_i and v_i, which have dimension d, i.e., the number of atoms. Setting appropriate values for the four weighting parameters λ1 to λ4 is also a nontrivial problem. Even in the traditional sparse representation, where there is only one weighting parameter λ, setting its value is not trivial [64]. With an inappropriate parameter setting, the classification result is likely to deviate from the optimal case shown in the theoretical analysis. Preliminary numerical simulations confirm our concerns about the computational complexity of the proposed method. The simulation, conducted with the YALMIP optimization toolbox [79] running in the Matlab environment on a Pentium IV PC, needs more than 10 hours to optimize a data set of 80 samples with a dictionary of 32 atoms. These practical problems in the implementation motivate a more computationally efficient solution to the SRSC problem, as presented in the next section.
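For completeness, the feature-extraction step of equation (4.5) can be prototyped in a few lines with an off-the-shelf convex solver. The sketch below is a per-sample version with w and b held fixed; it assumes the cvxpy package and random placeholder data, and is not the YALMIP implementation used in the simulations above.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
k, d = 16, 32
A = rng.standard_normal((k, d))          # dictionary
y = rng.standard_normal(k)               # one training signal
c = 1                                     # its class label in {-1, +1}
w = rng.standard_normal(d)                # classifier weights fixed from the previous step
b = 0.0
lam2, lam3, lam4 = 1.0, 1.0, 1.0          # lambda1 * w'w is constant when w is fixed

u = cp.Variable(d, nonneg=True)
v = cp.Variable(d, nonneg=True)
xi = cp.Variable(nonneg=True)

objective = cp.Minimize(lam2 * cp.sum(u + v)
                        + lam3 * cp.sum_squares(y - A @ (u - v))
                        + lam4 * xi)
constraints = [c * (w @ (u - v) + b) >= 1 - xi]   # margin constraint of (4.5)
cp.Problem(objective, constraints).solve()
x = u.value - v.value                              # sparse code for this sample
```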
4.4 SRSC with Large Margin: A Two-Step Approximation

The optimization model and solution proposed in Section 4.3 for the SRSC problem have a solid theoretical foundation, but suffer from several difficulties at the implementation stage. The difficulties arise from the ambitious goal of integrating both signal reconstruction (decomposition of a signal over the dictionary) and dimension reduction (large margin and sparsity) into a single step, as shown in equation (4.4). To make the problem more tractable, in this section the two goals of signal reconstruction and dimension reduction are implemented separately in two steps. In the first step, sparse representation with a dictionary is used for denoising, so that noise and missing data do not affect the classification performance. In the second step, sparsity regularization with the large margin method is applied for feature selection and dimension reduction. Research on sparse reconstruction has been conducted extensively in the area of traditional sparse representation, such as [18, 19, 58, 59, 62]. In this section, the focus is on the second step, i.e., sparse feature selection and dimension reduction. For this purpose, the large margin dimension reduction (LMDR) method is proposed for the implementation of the second step. LMDR is based on combining the ideas of the L1-norm SVM and distance metric learning. In the rest of this chapter, Section 4.4.1 introduces the formulation and application of the L1-norm SVM; Section 4.4.2 reviews the idea of distance metric learning; Section 4.4.3 presents the proposed LMDR method; and finally Section 4.5 conducts experimental studies.

4.4.1 L1-norm Support Vector Machine

Suppose that the features and class labels are already available for a two-class supervised learning task. X = {x1, x2, ..., xn} ∈ R^{d×n} is the collection of features and C = {c1, c2, ..., cn} ∈ R^{1×n}, where c_i ∈ {−1, +1}, is the collection of labels. In SRSC, sparsity is enforced on the feature set X to derive sparse and discriminative features. One key observation is that enforcing sparsity on a feature vector x_i is equivalent to enforcing sparsity on the coefficient vector w of a linear classifier, as in the SVM. In the context of large margin linear classifiers, this is exactly what the L1-norm SVM [74] does. Formally, the L1-norm SVM can be formulated as the following optimization problem:

\[
\min_{w,\,b,\,\xi_i} \ \|w\|_1 + \lambda \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad c_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0. \tag{4.7}
\]

Compared to the standard SVM, the L1-norm SVM replaces the L2-norm, ‖w‖₂, with the L1-norm, ‖w‖₁, which enforces sparsity [17, 74]. In this setting, enforcing sparsity is equivalent to feature selection: if a component w_j of w has a value close to 0, the jth component x_{ij} of the feature vector is uninformative for classification and can be removed. This paradigm of feature selection has been applied to various applications, such as medical structure classification and content based image classification [80-82]. The modification from the L2-norm to the L1-norm also makes the optimization problem solvable by linear programming, whereas quadratic programming is required to solve the L2-norm SVM problem. Note that with this feature selection strategy, kernel methods cannot be used, since there is no one-to-one correspondence between the feature components and the weighting components in the kernel method [80-82]. To see this clearly, note that for the kernel method the decision function is given by wᵀφ(x_i) + b, where φ(·) is the feature map induced by the kernel. Therefore, the components of w do not directly correspond to the components of the feature vector x_i.
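Using the u-v split described in Section 4.3.1, equation (4.7) becomes a standard linear program. The following sketch solves it with SciPy's linprog on toy random data; the variable stacking and data are illustrative assumptions, not the solver used in the experiments of this chapter.

```python
import numpy as np
from scipy.optimize import linprog

def l1_svm(X, c, lam=1.0):
    """Solve the L1-norm SVM of equation (4.7) as a linear program.

    X: (n, d) feature matrix, c: labels in {-1, +1}.
    Decision variables are stacked as [u (d), v (d), b (1), xi (n)], with w = u - v.
    """
    n, d = X.shape
    cost = np.concatenate([np.ones(2 * d), [0.0], lam * np.ones(n)])
    # Margin constraints c_i(w^T x_i + b) >= 1 - xi_i, written as A_ub z <= b_ub.
    A_ub = np.hstack([-c[:, None] * X, c[:, None] * X, -c[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(0, None)] * (2 * d) + [(None, None)] + [(0, None)] * n
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    z = res.x
    w = z[:d] - z[d:2 * d]
    b = z[2 * d]
    return w, b

# Toy usage with random data (illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))
c = np.sign(X[:, 0] + 0.1 * rng.standard_normal(40))
w, b = l1_svm(X, c)
print(np.round(w, 3))   # most components are driven toward zero
```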
4.4.2 Distance Metric Learning

The dimension reduction achieved with the L1-norm SVM works through feature selection with the classifier coefficients, not through any transform of the original feature space. However, feature selection in a transformed feature space may yield better results in terms of dimension reduction. To see this point, consider the simple example illustrated in Figure 4.3, where the distributions of two-dimensional features from two classes are plotted. Figure 4.3 (a) shows the original distribution of the two classes, and Figure 4.3 (b) is obtained by rotating the feature values by 45 degrees in the 2D space. For the distribution shown in Figure 4.3 (a), the L1-norm SVM will select both feature components, which are almost equally important for classification. After the rotation, however, the L1-norm SVM will select only one feature component in the new representation, since the feature in the direction of x1 is much more important than x2.

The idea of applying a transform to the original features before selection and classification is widely adopted in machine learning and pattern recognition. Typical examples of feature transforms include PCA, LDA and, more recently, distance metric learning [75, 83-85]. The basic idea of distance metric learning is to search for a linear transform of the original features such that, in the transformed feature space, samples from the same class are close to each other and samples from different classes are far away from each other. In this sense, distance metric learning shares the same motivation as LDA. The difference is that distance metric learning adopts a different mathematical formulation to achieve this goal [75]:

\[
\begin{aligned}
\min_{A} \quad & \sum_{(x_i, x_j) \in S} (x_i - x_j)^T A (x_i - x_j) \\
\text{s.t.} \quad & \sum_{(x_i, x_j) \in D} (x_i - x_j)^T A (x_i - x_j) \ge 1, \quad A \succeq 0.
\end{aligned} \tag{4.8}
\]

The requirement that A ⪰ 0, i.e., that A is positive semi-definite, is necessary since negative distances between feature vectors are not allowed. In this formulation, S is the set of feature pairs (x_i, x_j) from the same class and D is the set of feature pairs (x_i, x_j) from different classes. A is a positive semi-definite matrix to be optimized such that, in the transformed space, samples from the same class are close to each other, subject to the constraint that the distances between samples from different classes are larger than a fixed value. Denoting A = BᵀB, where B = A^{1/2}, the distance between two feature vectors can be represented as:

\[
(x_i - x_j)^T A (x_i - x_j) = \big(B(x_i - x_j)\big)^T \big(B(x_i - x_j)\big). \tag{4.9}
\]

Figure 4.3. Illustration of the effect of a transform on feature selection: (a) the original data distribution from two classes; (b) the distribution of the data after rotation.

Therefore, the distance metric learning problem is equivalent to finding a linear transform of the original features with the matrix B and then using the Euclidean distance on the transformed features. Application of the linear transform B can be viewed as extracting new features from the original feature space. The successful application of distance metric learning to general classification and clustering tasks, such as classifying data sets from the UCI repository [86], indicates that transforming the original features helps to derive more discriminative features. This property is useful for dimension reduction.
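The identity in equation (4.9) is easy to verify numerically. The sketch below computes B = A^{1/2} for a hypothetical learned metric via an eigendecomposition and checks that the metric distance equals the Euclidean distance after applying B; the metric values are placeholders.

```python
import numpy as np

def metric_sqrt(A):
    """Matrix square root B of a positive semi-definite metric A, so that A = B^T B."""
    eigval, eigvec = np.linalg.eigh(A)
    eigval = np.clip(eigval, 0.0, None)        # guard against tiny negative eigenvalues
    return eigvec @ np.diag(np.sqrt(eigval)) @ eigvec.T

# A hypothetical learned metric on 2-D features (placeholder values).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
B = metric_sqrt(A)

xi = np.array([1.0, 2.0])
xj = np.array([0.0, 1.0])
d_metric = (xi - xj) @ A @ (xi - xj)            # left-hand side of (4.9)
d_euclid = np.sum((B @ (xi - xj)) ** 2)          # right-hand side of (4.9)
print(np.isclose(d_metric, d_euclid))            # True
```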
4.4.3 Large Margin Dimension Reduction

As discussed in Sections 4.4.1 and 4.4.2, the L1-norm SVM and distance metric learning have different properties regarding feature selection and transformation. The L1-norm SVM can identify useful feature components but operates only on the original feature space. Distance metric learning searches for a linear transform B ∈ R^{d×d} that optimizes the feature distribution in the transformed space but does not reduce the feature dimension. An intuitive generalization is to incorporate dimension reduction into distance metric learning by assuming B ∈ R^{d×d'}, where d' < d. However, determining an appropriate value for d' is data dependent and nontrivial.

In this section, the large margin dimension reduction (LMDR) algorithm is proposed to combine the advantages of both the L1-norm SVM and distance metric learning. The basic idea is to search for a linear transform of the original feature space and then apply the L1-norm SVM in the transformed feature space for dimension reduction. Formally, this problem can be formulated as follows:

\[
\begin{aligned}
\min_{w,\,b,\,B,\,\xi_i} \quad & \|w\|_1 + \lambda \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & c_i(w^T z_i + b) \ge 1 - \xi_i, \quad z_i = B x_i, \quad \xi_i \ge 0,
\end{aligned} \tag{4.10}
\]

where B ∈ R^{d×d} and z_i = Bx_i is the feature in the transformed space. Selection of the components of z_i is based on the values of w, i.e., components corresponding to w_j ≈ 0 are removed.

Note that this formulation is ill-posed. Given a solution w' and B' to the optimization problem, a better solution can always be found by letting w'' = w'/a and B'' = aB', where a > 1. The solution (w'', B'') satisfies the constraints but has a smaller ℓ1 norm. One direct solution to this problem is to enforce regularization on B. Since the ambiguity is caused by a scaling factor, a constraint on the norm or trace of B can be applied. If the trace of B is normalized, the problem can be formulated as follows:

\[
\begin{aligned}
\min_{w,\,b,\,B,\,\xi_i} \quad & \|w\|_1 + \lambda \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & c_i(w^T z_i + b) \ge 1 - \xi_i, \quad z_i = B x_i, \quad \operatorname{trace}(B) = 1, \quad \xi_i \ge 0.
\end{aligned} \tag{4.11}
\]

Compared with the L1-norm SVM, LMDR is able to generate lower dimensional, i.e., sparser, features with the help of the optimized linear transform B. Compared with distance metric learning, LMDR incorporates the L1-norm SVM to determine the dimension of the low dimensional feature. Another advantage of LMDR over distance metric learning lies in the measure of separation in the transformed feature space. For distance metric learning, the measure is the total distance between all sample pairs from different classes and from the same class; distance metric learning does not include a scheme to detect outliers that may exist in the data distribution. On the other hand, by using the margin between the classes and the slack variables ξ_i to deal with outliers, LMDR has more generalization capacity than the distance metric learning approach. Samples with large values of ξ_i can be detected as outliers and excluded from training.

It is also interesting to compare LMDR with the general LDA, which derives low dimensional feature representations by maximizing the ratio of inter-class scatter to intra-class scatter. In this sense, LDA also provides a method for simultaneous feature extraction and selection. In LDA, feature extraction is performed through a matrix containing the eigenvectors of S_w^{-1} S_B, where S_w is the intra-class scatter matrix and S_B is the inter-class scatter matrix [10]. Feature selection is based on the relative magnitudes of the eigenvalues of S_w^{-1} S_B.
By using second order statistics of the data samples for dimension reduction, LDA assumes a Gaussian distribution for the original feature space. When the Gaussianity assumption is violated, which often happens in real applications, the performance of LDA suffers. On the other hand, LMDR does not enforce any assumptions on the distribution of the underlying data, which extends the applicability of LMDR. Another difference between LDA and LMDR lies in the discrimination measure. The margin used in LMDR as the discrimination measure has been shown to have better generalization capacity, as demonstrated for the general SVM [76, 77].

The problem with the optimization formulation of LMDR is that the constraints c_i(wᵀz_i + b) ≥ 1 − ξ_i are not convex, since the optimization variables w and B are entangled. Here we adopt a simple alternating optimization algorithm that breaks the entanglement between w and B as follows.

1. Initialize B = B0, where the columns of B0 are the generalized eigenvectors of the equation S_B x = λ S_w x. This means that the transform matrix is initialized by the result of the generalized LDA problem.

2. Search for the large margin classifier given B = B0:

\[
\begin{aligned}
(w_0, b, \xi_i) = \arg\min_{w,\,b,\,\xi_i} \quad & \|w\|_1 + \lambda \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & c_i(w^T z_i + b) \ge 1 - \xi_i, \quad z_i = B_0 x_i, \quad \xi_i \ge 0.
\end{aligned} \tag{4.12}
\]

3. Search for the transform matrix given w = w_0:

\[
\begin{aligned}
(B_1, b, \xi_i) = \arg\min_{B,\,b,\,\xi_i} \quad & \|w_0\|_1 + \lambda \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & c_i(w_0^T z_i + b) \ge 1 - \xi_i, \quad z_i = B x_i, \quad \operatorname{trace}(B) = 1, \quad \xi_i \ge 0.
\end{aligned} \tag{4.13}
\]

4. Set B0 = B1 and repeat steps 2 and 3 until the change in w and B is less than a threshold value or a given number of iterations has been reached.

The optimization formulated in equation (4.13) is clearly a linear programming problem. The problem formulated in equation (4.12) is also a linear programming problem. To see this, let w = u − v, where u ∈ R^{d×1}, v ∈ R^{d×1}, u ≥ 0 and v ≥ 0; then ‖w‖₁ = 1ᵀ(u + v). Each solution of u and v corresponds to a solution of w. The same trick was previously adopted in the basis pursuit algorithm [17] and in solving the L1-norm SVM with linear programming [74, 80].

4.5 Experiments

4.5.1 Experiments with Synthetic Data

We start with a simple synthetic numerical example. Two classes of two-dimensional data are generated, as illustrated in Figure 4.4 (a). Denote each 2D data sample as [x1, x2]. Given 0 ≤ x1 ≤ 5, samples are generated as x2 = x1 + a + w for the first class and x2 = x1 − a + w for the second class, where a = 2.1 is a constant shift and w is white Gaussian noise with mean 0 and standard deviation 1. The data distribution is designed such that both feature components x1 and x2 play equally important roles in classification. Running the L1-norm SVM on this data set results in w = [−0.5585, 0.6619]ᵀ, which indicates that we cannot reduce the dimension of this feature set: both dimensions have similar weights. On the other hand, running the iterative large margin dimension reduction of equations (4.12) and (4.13) for two iterations results in the following solution:

\[
B = \begin{bmatrix} -2.0463 & 2.1336 \\ 2.1336 & 3.0463 \end{bmatrix}, \qquad
w = \begin{bmatrix} 1.5338 & 0 \end{bmatrix}^T. \tag{4.14}
\]

Applying B to the original features yields the feature distribution shown in Figure 4.4 (b). As a comparison, the B0 obtained by the generalized LDA and the corresponding w from the L1-norm SVM are:

\[
B_0 = \begin{bmatrix} -0.1060 & 0.0673 \\ 0.1059 & 0 \end{bmatrix}, \qquad
w = \begin{bmatrix} 4.4456 & 0 \end{bmatrix}^T. \tag{4.15}
\]

The feature distribution obtained by applying B0 to the original features is shown in Figure 4.4 (c).
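For reference, the two-class synthetic set described above can be generated with a few lines of NumPy. This is only a sketch of the stated construction (uniform x1 on [0, 5], shift a = 2.1, unit-variance Gaussian noise); the sample count is a placeholder and the snippet is not the exact script used for Figure 4.4.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, a = 100, 2.1

x1 = rng.uniform(0.0, 5.0, size=(2, n_per_class))
noise = rng.normal(0.0, 1.0, size=(2, n_per_class))

class1 = np.column_stack([x1[0], x1[0] + a + noise[0]])   # x2 = x1 + a + w
class2 = np.column_stack([x1[1], x1[1] - a + noise[1]])   # x2 = x1 - a + w

X = np.vstack([class1, class2])                            # features [x1, x2]
c = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
```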
Applying the transform B to the synthetic data set results in the data distribution shown in Figure 4.4 (b). From this figure, it is clear that in the transformed feature space the first feature component x1 is much more important for classification than the second component x2. At the same time, the value of w correctly reflects this difference between the two feature components. In this case, the feature dimension can be effectively reduced to 1 based on the value of w.

From these numerical results and the distributions shown in Figure 4.4, it is clear that both LDA and LMDR are effective for dimension reduction. From the coefficients of the L1-norm SVM, as shown in equations (4.14) and (4.15), it is seen that both methods can reduce the feature dimension from 2 to 1. In order to compare the classification performance in the reduced dimensional space, the features are projected onto the 1D space and the minimum Bayesian classification errors are determined. As shown in Figure 4.4 and indicated by equations (4.14) and (4.15), the x1 feature component is retained for both the LMDR and the generalized LDA based method, and the x2 component is discarded. In this setting, 2 samples (1.96%) are misclassified by the LMDR method and 5 samples (2.90%) are misclassified by the generalized LDA method. This comparison shows that the optimization steps in LMDR improve the distribution of the features in the low dimensional space.

4.5.2 Experiments with the UCI Data

Experiments are also conducted on two data sets, "iris" and "glass", from the UCI data repository [86]. The properties of the two data sets are summarized in Table 4.1. For comparison purposes, linear discriminative analysis (LDA) is implemented for dimension reduction. Since LDA's dimension depends directly on the number of classes, it can only derive up to C − 1 directions for feature projection, where C is the number of different classes in the training set. This means that LDA can generate up to 2-dimensional features for "iris" and 5-dimensional features for "glass".

The experimental setting is as follows. The large-margin dimension reduction is based on a scenario of pairwise classification, as indicated by the constraints in equations (4.12) and (4.13). The generalization of LMDR to the case of multiple classes follows the same method as the general SVM, where pairwise classification ("1 vs 1") or setting one class as the positive label and the others as negative ("1 vs rest") is commonly used. In our method, LMDR based dimension reduction and classification are conducted pairwise: based on the data samples from each pair of classes c1 and c2, a large margin dimension reduction scheme is trained, and feature selection and classification are based on the classifier built on this pairwise data. For a data set with C different classes, a total of C(C − 1)/2 dimension reduction schemes and classifiers are built. Each classifier assigns either c1 or c2 to each sample in the testing set. The final class label of a testing sample is determined by a simple majority vote over the class labels, as sketched below. For the large-margin dimension reduction, the feature is first subjected to the linear transform B and the feature components are then selected by the magnitudes of the components of w. In the training process, the parameter λ in equations (4.12) and (4.13) is set by cross validation. The iterative optimization algorithm for the large-margin dimension reduction is run for six iterations each time.
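The pairwise ("1 vs 1") voting just described can be summarized by the following sketch. The pairwise classifier here is a simple nearest-class-mean placeholder standing in for the LMDR-based classifier, so only the voting logic is illustrated.

```python
import itertools
import numpy as np

def train_pair(X, y):
    """Placeholder pairwise classifier (nearest class mean)."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    return classes, means

def predict_pair(model, X):
    classes, means = model
    dists = np.stack([np.linalg.norm(X - means[c], axis=1) for c in classes])
    return classes[np.argmin(dists, axis=0)]

def one_vs_one_predict(X_train, y_train, X_test):
    """Train C(C-1)/2 pairwise classifiers and combine them by majority vote."""
    classes = np.unique(y_train)
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    for i, j in itertools.combinations(range(len(classes)), 2):
        mask = np.isin(y_train, classes[[i, j]])
        model = train_pair(X_train[mask], y_train[mask])
        pred = predict_pair(model, X_test)
        for k, c in enumerate(classes):
            votes[:, k] += (pred == c)
    return classes[np.argmax(votes, axis=1)]

# Toy usage with three Gaussian blobs (illustration only).
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(m, 1.0, size=(20, 2)) for m in (0, 4, 8)])
y_train = np.repeat([0, 1, 2], 20)
print(one_vs_one_predict(X_train, y_train, X_train)[:10])
```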
Half of the data set is randomly selected for training and the other half is used for testing. The KNN classifier with the Euclidean distance is used for classification. The number of nearest neighbors, the "K" in "KNN", is set to a quarter of the number of training samples in each pairwise classifier.

Figure 4.4. Feature distributions: (a) original features; (b) features obtained with LMDR; (c) features obtained with the generalized LDA.

Table 4.1. The properties of the data sets "iris" and "glass"

Data set   Number of classes   Samples per class   Feature dimension
Iris       3                   50                  4
Glass      6                   from 9 to 76        9

Given the above experimental settings, the classification error rates at the different dimensions selected by LDA and the large-margin dimension reduction are plotted in Figure 4.5, where (a) is the result on the "iris" data set and (b) is the result on the "glass" data set. The large-margin dimension reduction method performs differently on the two data sets. For the "iris" data set, the performance of LDA is comparable with that of the large-margin dimension reduction. For the "glass" data set, LMDR performs better than LDA. The performance difference on the two data sets can be explained by looking at the data sample distributions, as illustrated in Figure 4.6, which plots the distributions of "iris" and "glass" in the 2D space obtained by LDA. From this figure, it is clear that the data samples in "iris" are clearly separated. Another important observation is that the distribution of each class in the 2D space can be roughly approximated by a Gaussian distribution. Therefore, the distribution of "iris" satisfies LDA's assumption on the data distribution quite well, which explains why LDA performs well on the "iris" data. On the other hand, Figure 4.6 (b) shows that the distribution of the "glass" data is more complex than that of the "iris" data, in the sense that the different classes are not well separated and the distribution of each class cannot be accurately approximated as a Gaussian distribution. In this case, the Gaussianity assumption required by LDA does not hold, and classification based on LDA therefore yields high error rates. Since LMDR does not assume any prior information on the data distribution and uses the class margin as the measure of separation, it yields much better classification performance on the "glass" data set.

Figure 4.5. The relation between classification error rate and feature dimension: comparison between LDA, L1-norm SVM and large-margin dimension reduction on (a) the "iris" and (b) the "glass" data sets.
Figure 4.6. The projection of the "iris" and "glass" data sets onto the 2D space obtained by LDA.

4.5.3 Experiments with the Texture Data

Experiments are also conducted on texture classification with texture images from the Brodatz data set [56]. Images from 10 texture classes, with 16 images per class, are used for classification. Half of the images are used for training and the other half for testing. The experiments are conducted by decomposing each image into 3 wavelet levels, so that each image has a feature vector of 85 dimensions. Based on the experiments conducted in Chapter 2, the features from most of the subbands are noise, in the sense that only a small portion of the features is needed to achieve a relatively low classification error rate. The implication of this phenomenon is two-sided. First, the involvement of a large number of noisy feature components greatly interferes with the optimization process of the LMDR method. Second, the computational complexity of the optimization process is increased. To tackle these problems, we propose the combination of the L1-norm SVM and LMDR for high dimensional feature selection and transformation. For a high dimensional feature, the L1-norm SVM is first applied to identify the important feature components and discard the noisy ones; the LMDR is then applied to the resulting low dimensional feature. Applying the LMDR directly to the 85-dimensional features of this data set results in a very high classification error rate (higher than 70%), due to the interference of the noisy features.

In Figure 4.7, the curve "LMDR" is obtained by first applying the L1-norm SVM to select 35 features from the 85 features and then applying the LMDR to the selected 35 features. Features of different dimensions are then selected by the value of w in equation (4.11). The curve "L1SVM+LMDR" is obtained by first applying the L1-norm SVM to select n features from the 85 features and then applying the LMDR to the selected n features, generating n-dimensional features in the transformed space. In this case, the value of n changes from 5 to 35, with an increment of 5. From this figure, the performance of "LMDR" is worse than that of the L1-norm SVM. As discussed in Chapter 2, the 35 features selected by the L1-norm SVM contain a considerable amount of noise that misleads the LMDR process. On the other hand, "L1SVM+LMDR" always performs better than "LMDR", and better than "L1SVM" in the low dimensional space. This illustrates that searching for an optimal linear transform of the features can substantially improve the classification accuracy. The fact that "L1SVM+LMDR" does not perform better than "L1SVM" in the high dimensional space illustrates that noisy features can interfere with the optimization of LMDR and deteriorate its performance. In this experimental setting, the minimum classification error rate is achieved by "L1SVM+LMDR" at a dimension of 15. Therefore, the combination of the L1-norm SVM and LMDR is an effective strategy for dimension reduction on high dimensional features. The advantage is two-sided: removing the noisy features and reducing the computational complexity.
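The screening stage of this two-stage strategy can be sketched as follows, using scikit-learn's l1-penalized linear SVM as a stand-in for the chapter's L1-norm SVM; the subsequent LMDR step is not reproduced here, and the data are random placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def l1_screen(X, y, n_keep):
    """Keep the n_keep feature components with the largest |w| from an
    l1-penalized linear SVM (a stand-in for the L1-norm SVM of equation (4.7))."""
    clf = LinearSVC(penalty="l1", dual=False, C=1.0, max_iter=10000).fit(X, y)
    w = np.abs(clf.coef_).max(axis=0)       # largest absolute weight per feature
    keep = np.argsort(w)[::-1][:n_keep]
    return np.sort(keep)

# Toy usage: 85-dimensional random features with two informative components.
rng = np.random.default_rng(0)
X = rng.standard_normal((160, 85))
y = (X[:, 3] + X[:, 40] > 0).astype(int)
print(l1_screen(X, y, n_keep=10))
```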
4.6 Summary

In this chapter, the SRSC problem proposed in the previous chapter is reformulated as a constrained optimization problem by using the large margin as the measure of discrimination. The formulation has a solid theoretical foundation but faces practical numerical problems, since multiple steps of iterative quadratic programming are involved. To deal with this problem, the SRSC is divided into two steps, with the first step being sparse reconstruction and the second step being sparse classification. For the sparse classification, the large margin dimension reduction (LMDR), which incorporates the L1-norm SVM and distance metric learning, is proposed. An efficient optimization algorithm based on iterative linear programming is proposed to solve the constrained optimization formulation. The proposed algorithm inherits the advantages of distance metric learning for finding the optimal linear feature transform and of the L1-norm SVM for automatically determining the dimensionality of the low-dimensional space. Experimental results on the UCI data sets show that, compared with the LDA method, the proposed LMDR method achieves comparable dimension reduction results on data samples following a Gaussian distribution and better results on data samples that do not follow a Gaussian distribution.

Figure 4.7. The relation between classification error rate and feature dimension: comparison between the L1-norm SVM, LMDR and the combination of L1-norm SVM and LMDR on the texture data.

CHAPTER 5

Feature Evaluation with Information Theoretic Measures

5.1 Introduction

All of the previous chapters in this dissertation deal with obtaining a low dimensional (sparse) feature set that is discriminative. In order to show that the features derived from the proposed methods are discriminative, empirical experiments are conducted to evaluate the "goodness" of different feature representations. This chapter discusses the problem of feature evaluation and proposes a new method for feature evaluation based on information theoretic measures that can be computed efficiently.

Traditional pattern recognition problems usually include two consecutive steps: feature extraction and classification. Feature extraction aims at generating a compact and discriminative description of the data in terms of a feature vector. Multiple features describing different properties of the sample data can be extracted. For example, color and texture features can be extracted from an image for content based image retrieval. Therefore, there is a need to evaluate the different features. For example, in face recognition, features based on different linear transforms, such as PCA and LDA, have been widely investigated [11, 66]. At this point, it is important to note that although the problem of feature evaluation shares some common characteristics with feature selection [31], the two methods differ in their basic goal. Feature selection evaluates and selects a subset of the feature elements, while feature evaluation quantifies the performance of different feature vectors. Taking the color and texture features of an image as an example, feature selection evaluates the different components of the color feature, such as red, green and blue, while feature evaluation focuses on the performance of the color feature versus the performance of the texture feature.
Feature evaluation is usually conducted by empirical testing, i.e., by measuring the classification accuracy on given testing data sets. Feature evaluation with empirical tests can be seen in various applications, such as text categorization [87, 88] and image retrieval [89]. However, this paradigm of feature evaluation does not depend only on the feature itself. The results of empirical tests are often subject to the distance metric, which determines how to measure the similarity between two features, and the choice of the classifier, which determines how to assign class labels. The impact of the distance metric on classification results is so substantial that distance metric learning, which learns a suitable distance metric for a specific classification task, has recently attracted a lot of attention [90-92]. It is also clear that the choice of classifier and the parameter settings within a given classifier have a strong impact on classification results. Various examples of how the classifier affects classification results can be found in the literature, such as the experimental results on the benchmark UCI data [86] in a recent paper [93]. Feature evaluation based on empirical tests is complicated by the fact that tuning the distance metric and the classifier may benefit certain features and degrade the performance of others. Overly tuning either the distance metric or the classifier parameters can result in a distorted evaluation of different features. One interesting discussion of empirical tests can be found in [94], which shows an extreme case: given any algorithm, a data set can be designed such that the algorithm outperforms other algorithms on this data set in terms of classification accuracy. This can also be interpreted as meaning that even a "bad" feature in the general sense can result in a high classification accuracy with certain classification algorithms.

The limitations of the empirical test for feature evaluation motivate the need for an alternative feature evaluation method that is independent of the distance metric and the classifier. Fisher's ratio, used in linear discriminative analysis (LDA), is one such choice [10]. Fisher's ratio measures the ratio of the inter-class scatter to the intra-class scatter. For samples drawn from Gaussian distributions, a large value of Fisher's ratio indicates a large distance between samples from different classes and a small distance among samples from the same class. In [95], the mutual information (MI) between the class label and the features is computed for feature selection. A higher MI value indicates that certain feature components provide more information about the class label. This selection measure can also provide a criterion for feature evaluation. However, the computational model for estimating the MI value proposed in [95] was only applied to 1D and 2D features and may be difficult to adapt to high dimensional features.
An uncorrelated linear discriminative analysis (ULDA) [96] based MI computational model is proposed such that the evaluation criterion is applicable for high dimensional features. The proposed method avoids the empirical test and thus is independent of the distance metric and classifier. Therefore, the empirical test, when combined with the proposed MI based measure, can provide a more complete and accurate criterion for feature evaluation. The rest of this chapter is organized as follows. Section 5.2 reviews the MI and Fano’s inequality that motivates the usage of MI as the criterion for feature evalu- ation. Section 5.3 analyzes the existing computational models for MI computation and proposes a new MI computation model based on uncorrelated linear discrimina- tive analysis. Section 5.4 conducts extensive experiments to test the validity of the proposed MI based feature evaluation method. Finally, Section 5.5 concludes this chapter and discusses the findings. 91 5.2 Quantitative Evaluation of Features with Mu- tual Information In the following analysis, the feature is modelled as a random vector: X E X C R“ and the class label is modelled as a discrete random variable: C E C C Z1. The joint distribution Of X and C is described by p(:r,c). The MI between X and C is defined as: I(X;C)= Z 2190‘ C) 103 p‘J(x)p' — EXC [log p“ l xEX CEC P(X)P( (C) (5.1) = 1703 (x,C)llp(X)P (C)). where D(o||o) is the relative entropy or Kullback-Leibler distance. When the log- arithm function in the definition uses a base of 2, the unit for the I (X ;C) is bits. MI is symmetric in X and C, nonnegative, and is equal to zero if and only if X and C are independent. I (X; C) indicates how much information X conveys about C. Given X, the extra information required to fully describe C is given by the conditional entropy H (C IX) [48]. I(X; C) = H(C) — H(CIX), (5.2) where H (C) is the entropy of the random variable C. Given this equation, mutual information computation is equivalent to the computation Of two entropy values. Similar to the definition of conditional entropy, the conditional MI between random variables X and C given a new random variable Z is defined as: I(X; CIZ) = H(XIZ) — H(X|C’, Z) p,x C Z = F(xvcvz) [logp p(X]Z)p(C]Z) l (5.3) The conditional MI has an interpretation similar to that of MI. Given the above definitions, the mutual information between a random vector X = [X1,X2, ...,Xd] and a random variable C can be defined as: 92 I(Xa C) = I(X11X21 "')Xd; C) d (5.4) = ZI(X;;CIX,‘_1,X{_2,...,Xl). i= 1 5.2. 1 Fano’s Inequality Intuitively, I (X; C) indicates how much information the feature X contains about the class label C. The larger I (X; C) is, the more accurate is the estimation Of the class label C from the feature X. Mathematically, the relation between the MI and the probability of misclassification is given by Fano’s inequality [48], as follows: H(CIX) —1 _ H(C) - I(X;C) —1 105(N) _ 103(Nc) P(C aé 0’) 2 (56) where NC is the number of possible values of C (i.e., the number of classes), and C’ is the estimation of C based on the feature X predicted by a classifier. For a given classification task, H (C) and log(Nc) are fixed. Thus, a large value of I (X; C) directly results in a small value of P(C 75 C’). Note that Fano’s inequality is not confined to any specific type of classifier or distance metric and is a suitable measure for feature evaluation. 
Due to the connection between P(C 75 C’) and I (X; C) revealed by Fano’s in- equality, MI has been widely applied to feature extraction [50, 51,97] and feature selection [98,99]. At this point, it should be noted that MI based feature extraction differs from the MI based feature evaluation. In the former case, the feature is un- known and chosen based on MI. In this chapter, the feature is already extracted and MI is used for the evaluation purposes. For MI based feature extraction, a distance metric and the classifier are usually known a priori to evaluate the effect of feature extraction, thus providing feedback for parameter adjustment for the extraction pro- cess. For MI based feature evaluation discussed in this chapter, no distance metric and classifier are involved. 93 X. _ f(.) T p(ylt) 1. Figure 5.1. MI computation with dimension reduction. The original feature X is in a high-dimensional space and the transformed feature Y is in a low-dimensional space. 5.3 Mutual Information Computation The difficulty Of MI computation lies in estimating the joint probability density func- tion (pdf), or equivalently, the conditional pdf between C and X, especially in a high dimensional space with limited number of data samples. A variety Of methods aiming at reducing the complexity by dimension reduction have been proposed. For example, in [100,101], the relationship between the different components of X is modelled by the first order Markov chain that simplifies the MI computation, as follows: I(Xilchi—13X1—2)"',X1) = I(Xz'; ClXi—1)i (5-6) A more general approach for dimension reduction is searching for a transform T(o), such that Y = T(X) E l7 C R“, that satisfies (1’ < d. Rather than computing I (X; C), I (Y; C) is computed as an approximation for I (X; C). This approach has been adopted in a variety of applications, such as in [38,49—51,102]. This transform process can be illustrated in Figure 5.1. With this transform, a Markov chain is formed as: X —+ Y —> C. According to the data processing theo- rem [48], I (X; C) 2 I (Y; C). The equality holds if and only if Y is the sufficient statistics Of X. Based on this ineqaulity, the transform T(o) that maximizes I (Y; C) should be chosen so that the error in the approximation is minimized. 94 A common method for solving this dimension reduction problem is to first as- sume a parametric form of T(o) and then optimize the parameters for T(o) such that I (Y; C) is maximized. For example, the linear transform Y = WX, where W 6 Rd,“ is the transformation matrix, is Often used for its simplicity. In order to esti- mate the Optimal value for W, I (Y; C) is explicitly expressed as a function of W and the standard optimization techniques, such as gradient descent, are then used to get an optimal value for W. This paradigm for searching for T(e) is adopted in [49—51] for feature extraction, i.e., deriving an informative low dimensional feature Y from a high dimensional feature X. In [49—51], the pdf p(Y, C) is modelled with the Parzen window method with a Gaussian kernel. Thus, p(Y,C) is written as a mixture of multiple Gaussian functions, which is non-convex. The combination of non-convexity and the gradient descent method used in [49—51] can only guarantee a locally Optimal solution for W. Due to the complexity in Optimizing W, simplified transforms are used in some applications. For example, the equal weight model where W = 7:10pm, is used for computing MI between wavelet coefficients in [38]. 
Note that the goal Of the transform T(e) is dimension reduction. A small value Of d’ reduces the complexity Of MI computation. However, a larger value Of d — d’ usually indicates a larger difference between I (X; C) and I (Y; C), since Y is not a sufficient statistics of X in the general case. For feature extraction, the classifi- cation is conducted with this low dimensional feature Y. Therefore, the large dif- ference between I (X; C) and I (Y; C) is not a severe problem, as long as I (Y;C) can be maximized. However, this estimation error in MI is a problem for feature evaluation. Given two features, X1 6 Rd and X2 6 Rd and their low dimen- sional representations Y1 E R“, and Y; 6 Rd], the reliability of concluding that I (X1; C) > I (X2; C) based on I (Y1; C) > I (Y3; C) is influenced by the error intro— duced by the inaccuracy of the computational model. The tradeoff between the computational complexity and the error in MI estimation caused by the value Of d’ are due to the statistical dependence among different com- ponents Of Y, i.e., Y1, Y2, ...de. If these components were statistically independent, the computation of I (Y; C) would become much easier: 95 ,, I(Y; 0) = 2101-; C) (5.7) If equation (5.7) holds, the complexity of computing MI does not substantially increase with an increase in (1’. Therefore, a large value of d’ can be chosen to reduce the error between I (X; C) and I (Y; C) and the computation of the MI is still tractable. Generally, it is difficult to search for a transform that results in statistical independence among the different components of Y. An approximate solution is to search for a linear transform that results in uncorrelated projections, i.e, cov(Y,-, Y3) = 0 for 1 S i 79 j g (1’. Since statistical uncorrelation is equivalent to statistical independence only when Y is distributed as a Gaussian, which is not satisfied in the general case, equation (5.7) does not strictly hold and should be rewritten as: ,. I(Y; C) s 2 104-; C) (5.8) The approximation error depends on how the distribution of Y deviates from a Gaussian distribution. As an approximation, this can be Obtained by measuring the Gaussianity Of the individual components Of Y, i.e, Y,- for 1 g i S d’. The measure Of the Gaussianity Of a random variable can be conducted by computing the kurtosis or negentropy Of the variable, as in the method of independent component analysis (ICA) [103]. Motivated by the above discussion, the uncorrelated linear discriminative analysis (ULDA) is chosen to Obtain the transform matrix W in this dissertation. ULDA was proposed in [104] and improved in [96] as a generalized linear discriminative analysis method that projects a feature into statistically uncorrelated feature components. The improved ULDA is also applicable to the undersampled problem, where the di- mension of the data is much larger than the number of samples. This property makes it especially suitable for processing high dimensional features. ULDA has been suc- cessfully applied to a variety of classification tasks, such as text categorization [96], face reCognition [105] and gene data classification [106]. ULDA derives up to Nc — 1 optimal discriminative vectors that maximizes the Fisher’s ratio, with an extra con- 96 straint that the different eigenvectors are St-Orthogonal [96], where St is the total scatter matrix of the features. 
Since ULDA also maximizes the Fisher's ratio, the distribution of the samples along each individual feature component Y_i achieves the goal that samples from different classes are far away from each other and samples from the same class are close to each other. In this sense, the value of I(Y_i; C) is optimized compared to other linear projections. Since I(Y; C) ≤ I(X; C) according to the data processing theorem, optimizing I(Y_i; C) helps reduce the estimation error between I(Y; C) and I(X; C). When using ULDA for feature evaluation, each row of the transform matrix W is given by one of the discriminative vectors obtained by ULDA.

5.3.1 Overview of Uncorrelated Linear Discriminative Analysis

Uncorrelated linear discriminative analysis (ULDA) was proposed to extract statistically uncorrelated discriminative features [96, 104]. Assume that there are k classes in the training data set. Denote the mean vector, covariance matrix and a priori probability of the ith class as m_i, S_i and p_i, respectively. Then the within-class scatter S_w, the between-class scatter S_b and the total scatter matrix S_t are defined as follows:

\[
S_w = \sum_{i=1}^{k} p_i S_i, \qquad
S_b = \sum_{i=1}^{k} p_i (m_i - m)(m_i - m)^T, \qquad
S_t = S_b + S_w. \tag{5.9}
\]

In the traditional linear discriminative analysis (LDA), the objective is to minimize the ratio of the within-class scatter to the between-class scatter. This goal can be achieved by solving an eigenvalue problem on S_w^{-1} S_b, provided that the within-class scatter matrix S_w is nonsingular. In this case, there are at most k − 1 discriminative vectors available, since the rank of S_b is bounded above by k − 1. Although S_w and S_b are symmetric matrices, S_w^{-1} S_b is generally not symmetric. Therefore, the discriminative vectors, which are the eigenvectors of S_w^{-1} S_b, are not orthogonal and the extracted features are not uncorrelated.

ULDA aims at finding optimal discriminative vectors that are S_t-orthogonal. Two vectors Y1 and Y2 are S_t-orthogonal if Y1ᵀ S_t Y2 = 0. Specifically, after r vectors φ1, φ2, ..., φr have been extracted, the (r + 1)th vector φ_{r+1} is obtained by solving the following optimization problem:

\[
\max_{\varphi} \ f(\varphi) = \frac{\varphi^T S_b\, \varphi}{\varphi^T S_w\, \varphi}
\quad \text{s.t.} \quad \varphi^T S_t\, \varphi_i = 0, \quad i = 1, 2, \ldots, r. \tag{5.10}
\]

One solution to this optimization problem was proposed in [104] by iteratively solving a generalized eigenvalue problem. A more computationally efficient solution, called the ULDA/QR algorithm, was proposed in [96] based on QR decomposition and singular value decomposition (SVD). More details on solving this optimization problem can be found in [96, 104]. In our implementation, the ULDA/QR algorithm is used for generating the discriminative vectors. It was shown that the feature vectors transformed by ULDA are mutually uncorrelated [96, 104]. The proof of this uncorrelation property is straightforward, as shown below. Given that

\[
Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{d'} \end{bmatrix}
  = \begin{bmatrix} \varphi_1^T \\ \varphi_2^T \\ \vdots \\ \varphi_{d'}^T \end{bmatrix} X, \tag{5.11}
\]

the covariance between any two feature components y_i and y_j is computed as:

\[
E\big[(y_i - E(y_i))(y_j - E(y_j))\big] = \varphi_i^T S_t\, \varphi_j. \tag{5.12}
\]

Based on the property given in equation (5.10), φ_iᵀ S_t φ_j = 0. Therefore, the feature components in the transformed space obtained with ULDA are uncorrelated.
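A minimal NumPy sketch of the scatter matrices in equation (5.9) is given below. It only makes the definitions concrete and is not the ULDA/QR implementation cited above; the toy data are placeholders.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class, between-class and total scatter matrices of equation (5.9).

    X: (n, d) samples, y: (n,) integer class labels.
    """
    n, d = X.shape
    m = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        p_c = len(Xc) / n                            # empirical a priori probability
        m_c = Xc.mean(axis=0)
        Sw += p_c * np.cov(Xc, rowvar=False, bias=True)
        Sb += p_c * np.outer(m_c - m, m_c - m)
    return Sw, Sb, Sw + Sb

# Toy usage with two Gaussian classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])
y = np.repeat([0, 1], 50)
Sw, Sb, St = scatter_matrices(X, y)
print(np.allclose(St, np.cov(X, rowvar=False, bias=True)))   # total scatter check
```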
5.4 Experiments

5.4.1 Experiments with the UCI Data

For the first part of the experiment, the goal is to show that "good" features generally have a high MI between the feature vector and the class label. For this purpose, the data sets "wine", "iris" and "glass" from the UCI machine learning repository [86] are used. Based on previously published experimental results with various distance metrics and classifiers on these data sets, such as [93], we know that the classification error rates on the three data sets fall in the different ranges shown in Table 5.1. Based on these results, it can be concluded that the features for "wine" are better than the features for "iris", which are in turn better than those for "glass".

Table 5.1. Classification error rates (in percentage) on "wine", "iris" and "glass"

Data set     Wine       Iris       Glass
Error rate   1% to 5%   4% to 9%   25% to 30%

We apply both the equal weight model, where W = (1/d) 1_{1×d}, and the ULDA based model for estimating the MI values on the three data sets. Note that for the equal weight model, Y is one dimensional. The results are plotted in Figure 5.2. For the ULDA based method, the feature X can be projected onto a space of dimension up to N_c − 1. The data sets "wine" and "iris" have 3 classes and the "glass" data set has 6 classes; therefore, the projected feature Y ∈ R^2 for "wine" and "iris" and Y ∈ R^5 for "glass".

Figure 5.2 shows that the MI estimated from the equal weight model is inconsistent with the empirical results. The inconsistency in this example indicates that the equal weight model is not suitable for feature evaluation. Figure 5.2 also shows that if only the first component, Y_1, obtained by the ULDA is retained, the MI-based ranking in descending order is "iris, wine, glass", which is still inconsistent with the empirical results. However, it should be noted that the MI values with Y_1 from the ULDA on the three data sets are larger than the corresponding MI values obtained with the equal weight method. This indicates that the ULDA optimizes the MI estimation better than the equal weight model.

Figure 5.2. Estimation of mutual information on three different data sets using the equal weight and the ULDA based method.

Finally, Figure 5.2 shows that the ULDA based method with the additive approximation model I(Y; C) ≈ Σ_{i=1}^{d'} I(Y_i; C) provides a feature evaluation measure that is consistent with the empirical classification results: "wine" is better than "iris", which is better than "glass".

5.4.2 Experiments with Energy Features for Wavelet Based Texture Classification

In the previous experimental section, the three sets of features came from three different types of objects. In this section, feature evaluation using the proposed method is conducted on five different wavelet features extracted from the same set of texture images for classification. Wavelet feature extraction is a two-step process. In the first step, an image is decomposed into multiple wavelet subbands. In the second step, one feature component is extracted from each subband and a feature vector is formed by concatenating all of the feature components [9, 23, 107]. Given a subband coefficient matrix R of size M × N, five different definitions of the energy function are used to extract features from each subband: E1 (standard deviation), E2 (residual), E3 (L1 energy), E4 (L2 energy) and E5 (entropy). The feature vectors based on the following five energy functions are extracted and evaluated in the experiment:

\[
E_1 = \sqrt{\frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \big[R(i, j) - \bar{R}\big]^2}, \tag{5.13}
\]

\[
E_2 = \sum_{i=1}^{M} \sum_{j=1}^{N} |R
\]