This is to certify that the dissertation entitled SPARSE REPRESENTATIONS FOR IMAGE CLASSIFICATION presented by Ke Huang has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical and Computer Engineering.

Major Professor's Signature

Date

SPARSE REPRESENTATIONS FOR IMAGE CLASSIFICATION

By

Ke Huang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical and Computer Engineering

2007

ABSTRACT

SPARSE REPRESENTATIONS FOR IMAGE CLASSIFICATION

By

Ke Huang

In recent years, filter bank based approaches such as wavelet and wavelet packet transforms have been used extensively in image classification. One key issue in wavelet based classification methods is how to choose the best set of features (subbands). In the existing methods, each subband is evaluated separately or only the children subbands are compared with their parent subband for selection. In this thesis, we show that this subband selection method does not consider the dependence between different subbands in the selection process. In order to address this issue, we propose subband selection methods that take the dependence into account. We offer a theoretical and experimental analysis of the dependence among features from different subbands. Based on this analysis, the mutual information based subband selection (MISS) algorithm is proposed for subband selection based on feature dependence. The MISS algorithm is further improved by the subband grouping and selection (SGS) algorithm, which combines the dependence between subbands and the evaluation score of each subband. All of these methods result in a compact set of features for efficient image classification.

The development of efficient subband selection methods for image classification motivates us to consider the more general problem of sparse representation of images for classification. We propose a new approach in the framework of sparse representations by combining the reconstruction error, classification power and sparseness in a single cost function.

The formulation of the proposed sparse representation for image classification method is further improved by using the large margin method for the measure of discrimination. Based on this new and improved formulation, we can model the robust and sparse feature extraction with an optimization problem that can be solved by iterative quadratic programming.
In order to reduce the computational complexity required for the iterative quadratic programming, we propose decomposing the robust and sparse feature extraction into two steps, with the first step being sparse recon- struction and the second step being sparse feature selection and dimension reduction. For the second step, we propose a new method called large margin dimension reduc- tion (LMDR). LMDR integrates the idea of Ll-norm support vector machine (SVM) and distance metric learning for obtaining feature representation in a low dimensional space. Finally, to measure the goodness of features obtained by different methods, a mutual information based feature evaluation criterion is proposed. The proposed measure is independent of distance metric and classifier. Computation of mutual information in a high dimensional space is addressed by using the uncorrelated lin- ear discrimination analysis (ULDA). The proposed computational model effectively reduces the computational complexity of computing mutual information for feature evaluation. ACKNOWLEDGMENTS I would like to express my sincere gratitude to my advisor, Prof. Selin Aviyente, for her guidance, encouragement, and support in every stage of my graduate study. Her knowledge, kindness, patience, and vision have provided me with lifetime benefits. I am grateful to Profs. John Deller, Rong Jin and Hyder Radha in my thesis committee for their valuable comments and suggestions on the thesis drafts, as well as for the experience as a student with these three outstanding teachers. I would also like to thank many faculty members of MSU who were the instructors for the courses I took. The course works have greatly enriched my knowledge and provided the background and foundations for my thesis research. I want to take this opportunity to thank Dr. Michelle Yan, Hong Shen, Sam Zheng and Jason Tyan for their support before, during and after my internship at Siemens Corporate Research. The amazing internship provided me very valuable experience in both research and development, and aligned my expectation for a future job. Friends made at Siemens Corporate Research are another valuable treasure I got. My memory of graduate studies at MSU can never be complete without mentioning my fellow graduate students at MSU. The survivor story would be much tougher if without the accompanying by the students coming to MSU in the same year as me. Special thanks to Rensheng Sun couple for their countless help, to Yunhao Liu, Chen Wang and Zhenyun Zhuang for providing me encouragements and accommodation in the coldest winter. This thesis draws a period for my more than 20 years of education in schools. In addition to the thanks to my teachers and classmates in the old days, I must mention my parents, without whose love, nurturing, and support I could never accomplish all these. I dedicate this thesis to my parents. Finally, thanks to my wife, Gangli, for her love and patience. iv TABLE OF CONTENTS LIST OF TABLES ' vii LIST OF FIGURES viii 1 Introduction 1 1.1 Overview of Contributions ........................ 3 Wavelet Subband Selection for Image Classification 6 2.1 Background on Wavelet and Wavelet Packet Transforms ....... 6 2.2 Wavelet Packet - Feature Extraction .................. 11 2.3 Dependence Analysis ........................... 12 2.3.1 Theoretical Analysis ....................... 14 2.3.2 Simulation ............................. 16 2.4 Subband Selection with Dependence ................... 18 2.4.1 Mutual Information ........................ 
20 2.4.2 Computation of Mutual Information .............. 21 2.5 Subband Selection Combining Dependence and Energy ........ 24 2.5.1 Statistical Dependency Between Subbands ........... 25 2.5.2 Subband Grouping and Selection ................ 27 2.6 Experiments ................................ 29 2.6.1 Experiment Setting ........................ 29 2.6.2 Performance with Different Wavelet Bases ........... 32 2.6.3 Performance Using Subbands Selected from Two Wavelet Bases 34 2.6.4 The Clustering of Subbands in the SGS Algorithm ...... 39 2.6.5 Performance on Unseen Data .................. 41 2.6.6 Discussion ............................. 42 2.7 Conclusion ................................. 46 A Sparse Representation Framework for Signal Classification 47 3.1 Background ................................ 47 3.1.1 Problem Formulation ....................... 48 3.1.2 Reconstruction and Discrimination ............... 50 3.2 Sparse Representation for Signal Classification ............. 52 3.2.1 Problem Formulation ....................... 52 3.2.2 Problem Solution ......................... 54 3.3 Experimental Results ........................... 55 3.3.1 Synthetic Example ........................ 55 3.3.2 Sparse Classification with Hand Written Digit Images ..... 56 3.3.3 Experiments with the Object Recognition ........... 59 3.4 Conclusions ................................ 63 4 Sparse Representation for Signal Classification: An Optimization Approach ' 64 4.1 Introduction ................................ 64 4.2 Data Model ................................ 65 4.3 SRSC with Large Margin: An Optimization Model .......... 66 4.3.1 SRSC with Large Margin: Formulation ............. 66 4.3.2 SRSC with Large Margin: Iterative Optimization ....... 70 4.4 SRSC with Large Margin: A Two-Step Approximation ........ 71 4.4.1 Ll-norm Support Vector Machine ................ 72 4.4.2 Distance Metric Learning .................... 73 4.4.3 Large Margin Dimension Reduction ............... 76 4.5 Experiments ................................ 79 4.5.1 Experiments with Synthetic Data ................ 79 4.5.2 Experiments with the UCI Data ................. 81 4.5.3 Experiments with the Texture Data ............... 85 4.6 Summary ................................. 86 5 Feature Evaluation with Information Theoretic Measures 89 5.1 Introduction ................................ 89 5.2 Quantitative Evaluation of Features with Mutual Information . . . . 92 5.2.1 Fano’s Inequality ......................... 93 5.3 Mutual Information Computation .................... 94 5.3.1 Overview of Uncorrelated Linear Discriminative Analysis . . . 97 5.4 Experiments ................................ 99 5.4.1 Experiments with the UCI data ................. 99 5.4.2 Experiments with Energy Features for Wavelet Based Texture Classification ........................... 100 5.5 Conclusion ................................. 105 6 Summary and Conclusions 106 6.1 Summary ................................. 106 6.2 Future Work ................................ 108 6.2.1 Basis Design for Sparse Image Classification .......... 108 6.2.2 Combination of Data-Driven and Model Based Methods for Sparse Signal Representation and Classification ........ 109 6.2.3 Saliency Analysis with Sparse Representation ......... 111 BIBLIOGRAPHY 113 2.1 2.2 2.3 2.4 3.1 3.2 3.3 3.4 4.1 5.1 5.2 LIST OF TABLES Normalized energy Covariance with haar basis ............. Normalized energy covariance with db4 basis .............. 
Normalized energy covariance between db4 basis and haar basis Minimum Error Rates Achieved With Different Wavelet Bases and Different Selection Methods ....................... Classification errors with different levels of white Gaussian noise . . . Classification errors with different sizes of occlusion .......... Error rates with white Gaussian noise .................. Error rates with occlusions ........................ The properties of the data sets “iris” and “glass” ........... Classification Error Rates (in percentage) on “wine”, “iris” and “glass” Classification Error Rates with Different Energy Functions (in percent- age) .................................... vii 17 17 17 61 61 83 99 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14 2.15 2.16 LIST OF FIGURES A wavelet packet decomposition tree: (a) The original image, (b) The 2—level decomposition tree, (c) Projection of the original image onto each leaf node of the tree. ........................ Illustration of MISS algorithm: At each step, a subband in SR with lowest mutual information with all subbands in SU is moved from SR to SU. In this figure, each square represents a single wavelet subband. Dimension reduction in mutual information through a many-to-one mapping function ............................. A simplified illustration of SGS: subbands are first partitioned into different groups (3 ellipses) and then the subband with highest energy (filled dot) is selected from each group. ................. Brodatz texture images used in the experiments ............. The error rates with six subband selection algorithms with the ’Haar’ basis ..................................... The error rates with six subband selection algorithms with the ’dblO’ basis ..................................... The error rates with six subband selection algorithms with the ’bio28’ basis ..................................... The error rates with six subband selection algorithms by selecting sub- bands from the ’bio28’ basis and the ’db10’ basis. ........... The error rates with six subband selection algorithms by selecting sub- bands from the ’Haar’ basis and the ’bio28’ basis. ........... The error rates with six subband selection algorithms by selecting sub- bands from the ’Haar’ basis and the ’db10’ basis. ........... The wavelet basis functions at the first decomposition level for “haar” and “bio28”, in the first row and second row, respectively. In this figure, white regions correspond to high values and dark regions corre- spond to low values. ........................... 13 20 22 28 30 34 35 35 39 The subband labelling scheme for a 2—1evel wavelet packet decomposition. 40 Texture images used for classification on unseen data. First row: bars; Second row: grass; Third row: knit pattern. .............. The error rates with six subband selection algorithms with the ’bio28’ basis on unseen data ............................ The error rates with six subband selection algorithms with the ’bio28’ basis and training data from all big texture images. .......... viii 44 44 3.1 3.2 3.3 3.4 3.5 4.1 4.2 4.3 4.4 4.5 4.6 4.7 5.1 5.2 5.3 Distributions of projection of Signals from two classes with the first atom selected by: J1(X,)\) (the left figure) and J2(X,/\) (the right figure) .................................... Samples of the USPS hand written digit images ............. Corrupted data for the experiments: (a) with different degrees of white Gaussian noise; (b) With different sizes of occlusion. .......... 
(a) Object images with different levels of white Gaussian noise; (b) Object images with different sizes of occlusion. The images in the first line of (a) and (b) are noiseless. ..................... The relation between error rate and sparsity ............... Illustration of the max—margin linear classifier .............. Illustration of the max-margin linear classifier with samples that are not linearly separable. .......................... Illustration of the effect of transform for feature selection: (a) the original data distribution from two classes; (b) the distribution of the data after rotation. ............................ Feature distributions: (a) Original features; (b) Features obtained with LMDR; (c) Features obtained with the generalized LDA ....... The relation between classification error rate and feature dimension: Comparison between LDA, Ll-norm SVM and large-margin dimension reduction on (a) “iris” and (b) “glass” data sets ............. The projection of “iris” and “glass” data sets with LDA to 2D space. The relation between classification error rate and feature dimension: Comparison between Ll-norm SVM, LMDR and the combination of Ll-norm SVM and LMDR on texture data ................ MI computation with dimension reduction. The original feature X is in a high-dimensional space and the transformed feature Y is in a low-dimensional space. .......................... Estimation of mutual information on 3 different data sets using equal weight and LDA based method. ..................... Estimation of mutual information on 5 different energy features for wavelet based texture classification. ................... ix 56 57 58 60 62 67 69 75 82 84 85 87 94 100 103 CHAPTER 1 Introduction Image classification is an important research problem that has applications including computer vision, medical imaging and biometrics. A standard image classification system consists of feature extraction, feature selection, and classifier design [1—8]. The research in this dissertation focuses on the feature extraction, selection and evaluation aspects of an image classification system in the framework of transform (filtering) based feature extraction methods. The major objective is to obtain a compact and robust set of features for a given image set using multi-resolution transforms such as the wavelets, wavelet packets, directional wavelets and Gabor functions. Image classification usually relies on features extracted from transform or filtering operations [9]. The image classification discussed in this dissertation is also based on the filtering features. The general paradigm for feature extraction is to filter images with a set of filters and to compute energy values or other statistical values from the filtered images as features. The filters can be either data dependent or data independent. Data dependent filters are derived from a training data set. Principal component analysis (PCA) [10], linear discriminative analysis (LDA) [10,11] and K- SVD [12] are examples of data dependent filters. Wavelets [13], wavelet packet [13], directional wavelets [14] and Gabor filters [3] are example of data independent filters. For images with different texture structures, multi-scale data independent filter- ing, such as wavelets, has been applied to image classification [1,2,7]. The multi—scale structure of the wavelet filters is a good fit for the multi—scale structure that often appears in texture images. 
In this sense, the filtering results with the wavelet filters can discriminatively describe the texture structure for classification. In this disserta— tion, feature extraction and selection with data independent filtering using wavelets and wavelet packet are studied. Some of the techniques discussed in the dissertation can also be applied to features obtained with data dependent filtering. The first part of the proposed work focuses on the extraction of a set of features from wavelet and wavelet packet transforms. The proposed approach first quantifies the dependence between wavelet subbands and then exploits this property to select a compact set of features that are discriminative for the given image set. This method is then further improved by incorporating the individual discrimination power provided by each subband into the selection process. The promising results obtained with the proposed feature selection methods mo- tivate the formulation of a more general feature selection problem for image clas— sification. This more general feature selection problem can be stated as “How can we choose the ‘best’ set of multi—resolution transforms for discriminating between different image classes?” The ‘best’ set of transforms is defined by the compactness (sparseness) of the selected feature set, the robustness of the features in noisy environ— ments and the accuracy of the classification. Combining all of these requirements, we pose this more general problem using sparse representations. Sparse representations aim at finding a sparse and close approximation to a given signal using a large collec- tion of functions, called a dictionary [15—19]. In the proposed work, this framework is modified to address the question of image classification by defining a cost function that incorporates the discrimination and reconstruction abilities of the elements in the dictionary, as well as the sparseness of the selected feature vector. As part of the proposed work, the different aspects of this problem will be investigated including the selection of the dictionary elements, different cost functions to quantify the discrimi- nation power of the selected features and the tradeoff between reconstruction power and discrimination power. In the third part, the formulation of the proposed Sparse representation for im- age classification method is further improved by using the large margin method for the measure of discrimination. Based on this new and improved formulation, we can model the robust and sparse feature extraction with an optimization problem that can be solved by iterative quadratic programming. In order to reduce the computational complexity required for the iterative quadratic programming, we propose decompos- ing the robust and sparse feature extraction into two steps, with the first step being sparse reconstruction and the secOnd step being sparse feature selection and dimen- sion reduction. For the second step, we propose a new method called large margin dimension reduction (LMDR). LMDR integrates the idea of Ll-norm support vector machine (SVM) and distance metric learning for obtaining feature representation in a low dimensional space. In the fourth part of this dissertation, we propose a mutual information based feature evaluation criterion and a corresponding computational model, such that fea- tures obtained from different selection methods can be quantitatively evaluated. 
Tra- ditional feature evaluation based on empirical study is subjected to the selection of distance metric and classifier. The proposed measure is independent of distance metric and classifier, and reflects more objectively the goodness of a feature representation. Computation of mutual information in a high dimensional Space is usually involved in the model. To deal with this problem, we propose computing mutual information with the uncorrelated linear discrimination analysis (ULDA). The proposed compu- tational model effectively reduces the computational complexity of computing mutual information for feature evaluation. 1.1 Overview of Contributions The contributions of the thesis can be divided into four parts: subband selection with the wavelet packet analysis, a general framework of sparse representation for image classification, sparse representation for image classification with the large mar- gin method and large margin dimension reduction and classifier independent feature evaluation with mutual information. In wavelet packet based signal classification, sparse representation is realized through subband selection, i.e., selection of a subset of subbands for signal repre- sentation. The major contributions of this work can be summarized as follows: 1. The dependence among features extracted from different subbands is quantified. 2. The dependence between subbands is incorporated into the selection process to improve classification accuracy. 3. The selected features are less redundant than the existing methods, thus re- ducing the dimensionality of the feature vector. The reduced dimensionality of the feature vectors increases the robustness of image classification in noisy environments. 4. The proposed feature selection method is easily generalizable to other multi- resolution and directional transforms. In the second part of the proposed research, a more general framework of sparse representations for signal classification is introduced with the following contributions: 1. The current framework for the general sparse representations is modified by in- corporating both the reconstruction error and the discrimination power into the optimization problem. This flexibility improves the robustness of classification in noise. 2. The proposed framework will allow the design of adaptive basis functions that can best represent and discriminate between the given signals. 3. The proposed framework will also have Significant impact on a variety of clas- sification problems such as texture classification and object discrimination. In the third part of the proposed research, an improved formulation of sparse representations for signal classification is introduced and a new dimension reduction method is introduced with the following contributions: 1. The proposed framework of incorporating both the reconstruction error and the discrimination power into the feature extraction process is reformulated based on the large margin method. The new formulation makes the feature extraction problem solvable with the iterative quadratic programming method. 2. A more computationally efficient algorithm, large margin dimension reduction, is proposed for extracting discriminative features in the low dimensional space. In the fourth part of the proposed research, a mutual information based mea- sure is proposed for feature evaluation. A computational model is also prOposed for practically computing mutual information in this case. The contributions can be summarized as: 1. 
Feature evaluation based on empirical study is first Analyzed. Based on this analysis, a mutual information based feature evaluation method that is inde- pendent of distance metric and classifier is proposed. 2. An uncorrelated linear discrimination analysis based method is proposed for computing mutual information in the high dimensional space. Compared with the existing mutual information computational model, the computational model retains the accuracy of computation with reasonable complexity. CHAPTER 2 Wavelet Subband Selection for Image Classification 2.1 Background on Wavelet and Wavelet Packet Transforms Wavelet decomposition and its extension, wavelet packet decomposition have gained popular applications in the field of signal/ image processing and classification. For example, speech [20], EEG [21], and texture images [2,22] have been successfully clas- sified by statistical models built in the transform domain obtained from the wavelet decomposition. Wavelet transforms enable the decomposition of the image into dif- ferent frequency subbands, similar to the way the human visual system operates. This property makes it especially suitable for the segmentation and classification of texture images [1,2, 7, 22—27]. For the purpose of texture classification, appropriate features need to be extracted to obtain a discriminative representation as possible in the transform domain. A widely used wavelet feature is the energy of each wavelet subband [1,2, 7, 20,21,23,27, 28]. This idea of extracting energy features from filtered images can be traced back to [29], where a bank of band-pass filters were used for image analysis. A more recent survey on filtering based methods for texture classifica- tion can be found in [9]. In early research, such as in [2,28], features from all wavelet subbands are used for texture classification. However, it is common knowledge in the area of pattern recognition that proper feature selection is likely to improve the classification accuracy with fewer number of features [10]. At the same time, the overcomplete structure of the wavelet packet transform motivates the selection of the wavelet features for classification. For the widely used energy feature, since one en— ergy value is extracted from each subband, wavelet feature selection is thus equivalent to selecting a set of subbands for texture decomposition. Therefore, “wavelet feature selection” and “wavelet subband selection” are interchangeably used in this chapter. Note that different from the general subband selection method that aims at signal reconstruction, such as the entropy based subband selection method proposed in [30], wavelet feature selection does not require the selected subbands to reconstruct the texture image. At this point, it is important to make a distinction between wavelet feature selec- tion and the general feature selection discussed in pattern recognition [10,31]. Both selection methods aim at obtaining a compact representation of the texture for clas- sification. However, the two techniques are not exactly the same. First, for general feature selection methods, the explicit knowledge of the feature extraction process may not always be available. The input to the general feature selection process is usually a vector of values representing the different features without any a priori in- formation about how these features were obtained. 
On the other hand, the wavelet feature selection methods can take advantage of the tree structure of the wavelet decomposition for the selection process. For example, in this chapter, the wavelet packet decomposition structure is analyzed and used in determining the optimal set of subbands. Second, some selection methods based on component analysis, such as the principle component analysis (PCA), independent component analysis (ICA), and linear discriminant analysis (LDA), usually require that all of the original fea- ture components are available at both the training and testing stages for projecting features into a selected subspace. In the setting of wavelet packet decomposition, this means that a full decomposition is required at both the training and testing stages. On the other hand, for the wavelet feature selection method, an texture is only needed be decomposed into the wavelet subbands selected by the wavelet feature selection method during the training stage. In this sense, wavelet feature selection selection also reduces the computational complexity of feature extraction. One of the most well-known subband selection algorithms is based on the en- tropy cost function [30], where the entropy of the coefficients at the children nodes is compared to the entropy at the parent node. In this method, the “best” wavelet packet tree is chosen to minimize the entropy of the representation coefficients. As such, the main goal of this method is signal reconstruction and not classification. For this reason, subband selection methods that particularly aim at achieving compact signal representations for classification have been proposed. For example, in [1], the features are only extracted from subbands with high energy values higher than a pre- determined threshold value. The energy distribution over these subbands with high energy is then used as features for classification. In [32], each subband is evaluated based on the discrimination power of the extracted energy value, and the subbands with high discrimination power are selected for subsequent classification. A similar evaluation method is also employed in [21], where the decomposition tree is pruned by comparing the discrimination score of the parent and its children nodes. In [7], a neuro—fuzzy method is used for subband selection to obtain a compact representation. One important and effective principle guiding the feature selection process is to exploit the dependence relation among different feature components [22, 25, 33—37]. For wavelet feature selection, this principle indicates that the dependence between different subbands should be investigated and utilized. However, in the existing research on wavelet feature selection [1,7,21,30,32], either each subband is evaluated separately, or only the parent subband and children subbands are compared based on a pre—determined criterion and the wavelet subband selection is based on the evaluation. When each subband is evaluated separately, it is implicitly assumed that the features from different subbands are independent. When selection is based on comparing parent and children subbands, only the dependence between parent and children subbands is considered. Usually the assumption of independence between subbands does not hold, thus degrading the classification accuracy. The dependence between wavelet coefficients that have the “parent-child” or “sib- ling” relationship has been successfully measured by mutual information [38], or mod- elled by hidden Markov Model (HMM) [39,40]. 
It should be noted that the depen- dence of energy features among subbands is different from the dependence of wavelet coefficients among subbands. The statistical properties of wavelet coefficients of an image have also been analyzed and modelled [41,42]. It has been shown that incor- porating the dependence among wavelet coefficients improves the image compression efliciency [43] and image denoising performance. In this chapter, we are interested in the dependence of the extracted features, not the wavelet coefficients. Since the extracted features are a function of all the coefficients in a subband, the dependence between features is more complicated than the dependence between individual wavelet coefficients. The dependence among subbands can also be interpreted as the amount of redun- dancy among subbands. In classification, one can take advantage of this redundancy in features by two different approaches. In the first approach, the structure of re- dundancy among the transform coefficients is modelled with a parametric model and the parameters of the model are extracted as features for classification. Examples of this first approach can be seen in [22, 25,35], where the dependency among wavelet subband coefficients that have “parent-children” relation is modelled with the HMM and used for texture classification. In the second approach, first features are extracted from the transform coefficients and then the redundant features are removed, since the removal does not reduce information useful for classification. Algorithms presented in [33, 34, 36,37] belong to this second type of approaches. In [34], the “minimax entropy principle” is proposed to require that a newly added feature should be “very different” from the existing features and the degree of difference is measured by the changes in the entropy caused by the inclusion of a new feature. To illustrate the difference between the two types of approaches, consider the following simple exam- ple. Let X = [X1,X2] be a 2-dimensional binary feature vector describing a data sample. Assume that as the prior knowledge, it is known that X2 E X1. In the first type of dependency analysis, this dependence feature can be extracted as “the sec- ond component is the complement of the first component”. For the second method, the component X2 is simply removed since the inclusion of X2 does not provide any extra information for classification. In this chapter, we adopt the second type of approach, i.e., identifying the redundant features and then removing them. In this sense, the models developed in [38—40] are not suitable for the problem of wavelet feature selection for texture classification studied in this chapter. In this chapter, the dependence between energy values extracted from differ- ent subbands are analyzed and incorporated for wavelet feature selection. Differ- ent from the HMM model that only models dependence between “parent-children” subbands [22, 25, 35], in this chapter the dependence between features from all sub- bands are taken into account in the selection process. The analysis and the proposed algorithms can be easily extended for other features, such as entropy, kurtosis, etc. Applying the same analysis to other filter bank based methods, such as Gabor decom- position, is also straightforward. The dependence between energy features extracted from different subbands is theoretically analyzed and verified by the simulation re- sults. 
Based on this analysis, the Mutual Information based Subband Selection (MISS) algorithm is proposed for subband selection based on feature dependence. Experimental results show that dependence is an effective criterion for selecting subbands to obtain a compact feature representation. In order to further combine the dependence between the subbands and the evaluation score of individual subbands, the Subband Grouping and Selection (SGS) algorithm is proposed to incorporate both factors into the subband selection process.

The contributions of this chapter include demonstrating the dependence among extracted features; demonstrating that dependence can be used for effective subband selection; and combining the dependence between features from different subbands and the evaluation score of each individual feature for better subband selection. Although classification of texture images is used to show the effectiveness of the proposed methods, the focus of this chapter is not to propose a new feature extraction method for texture classification, but to identify the dependence among different components of wavelet features and to propose new subband selection methods for texture classification. This analysis and the proposed methods can be similarly applied to the classification of other signals (such as EEG and speech) using the wavelet packet transform.

The organization of the rest of this chapter is as follows. Section 2.2 briefly reviews the standard 2D wavelet packet decomposition and feature extraction. In Section 2.3, the dependence between energy features from different subbands is analyzed. Due to the lack of an accurate statistical model for natural images, simulation results for the covariance between energy values from different subbands are presented. Based on the analysis provided in Section 2.3, Section 2.4 proposes an algorithm for subband selection based on dependence. Section 2.5 proposes a second algorithm that combines the dependence between subbands and the evaluation score of each individual subband. Experimental results and related discussions are given in Section 2.6. Section 2.7 concludes the chapter with suggestions for possible future research.

2.2 Wavelet Packet - Feature Extraction

As an extension of the standard wavelets, wavelet packets represent a generalization of the multi-resolution analysis and use the entire family of subband decompositions to generate an over-complete representation of signals. In the 2D discrete wavelet packet transform (2D-DWPT), an image is decomposed into one approximation and three detail images. The approximation and the detail images are then decomposed into a second-level approximation and detail images, and the process is repeated. The standard 2D-DWPT can be implemented with a low-pass filter h and a high-pass filter g [13]. The 2D-DWPT of an N x M discrete image A up to level P+1 (P \leq \min(\log_2 N, \log_2 M)) is recursively defined in terms of the coefficients at level p as follows:

C_{p+1}^{4k}(i,j)   = \sum_m \sum_n h(m)\,h(n)\, C_p^k(m+2i,\, n+2j),      (2.1)
C_{p+1}^{4k+1}(i,j) = \sum_m \sum_n h(m)\,g(n)\, C_p^k(m+2i,\, n+2j),      (2.2)
C_{p+1}^{4k+2}(i,j) = \sum_m \sum_n g(m)\,h(n)\, C_p^k(m+2i,\, n+2j),      (2.3)
C_{p+1}^{4k+3}(i,j) = \sum_m \sum_n g(m)\,g(n)\, C_p^k(m+2i,\, n+2j),      (2.4)

where C_0^0 is the image A. At each step, the image C_p^k (with the value C_p^k(i,j) at position (i,j)) is decomposed into four quarter-size images C_{p+1}^{4k}, C_{p+1}^{4k+1}, C_{p+1}^{4k+2}, C_{p+1}^{4k+3}. A two-level wavelet packet decomposition is illustrated in Figure 2.1.

2D-DWPT decomposition allows us to analyze an image simultaneously at different resolution levels and orientations. Several energy functions can be used to extract features from each subband for classification. Commonly used energy functions include the magnitude |.|, the magnitude square |.|^2 and the rectified sigmoid |tanh(a .)| [9]. Both the magnitude and the magnitude square are parameter-free, while the rectified sigmoid function can be adjusted by the parameter a. The mean and the standard deviation of the subband coefficients can also be extracted as features [24,44]. All these definitions of energy are highly correlated. In this chapter, the definition of energy based on squaring is used. The energy in different subbands is computed from the subband coefficient matrix as

\sigma_p^2(k) = \sum_i \sum_j \left[ C_p^k(i,j) \right]^2,      (2.5)

where \sigma_p^2(k) is the energy of the image projected onto the subspace at node (p, k). The energy of each subband provides a measure of the image characteristics in that subband. The energy distribution has important discriminatory properties for images and as such can be used as a feature for texture classification.
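The decomposition in (2.1)-(2.4) and the energy feature in (2.5) are straightforward to prototype with standard tools. The sketch below is only an illustration of this feature-extraction step, not the implementation used in the thesis; it assumes that NumPy and the PyWavelets package (pywt) are available and that `image` is a 2-D grey-level array.

    import numpy as np
    import pywt

    def wavelet_packet_energies(image, wavelet="haar", level=2):
        """Decompose an image with the 2-D wavelet packet transform and return
        the squared-coefficient energy of every subband, as in equation (2.5).
        Subbands are keyed by their pywt node path, e.g. 'a', 'h', 'aa', 'dh'."""
        wp = pywt.WaveletPacket2D(data=image, wavelet=wavelet,
                                  mode="symmetric", maxlevel=level)
        energies = {}
        for lev in range(1, level + 1):
            for node in wp.get_level(lev):        # all subbands at this level
                coeffs = np.asarray(node.data, dtype=float)
                energies[node.path] = float(np.sum(coeffs ** 2))
        return energies

    # Example: a 2-level decomposition yields 4 + 16 = 20 subband energies.
    rng = np.random.default_rng(0)
    image = rng.normal(128.0, 12.0, size=(64, 64))
    features = wavelet_packet_energies(image, wavelet="haar", level=2)
    print(len(features))

In the thesis the decomposition depth is limited so that no subband is smaller than 16 x 16; in this sketch the parameter `level` plays that role and is left to the caller.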
2.3 Dependence Analysis

Wavelet packet decomposition is an orthogonal transform. However, it is observed that considerable dependence exists between coefficients in different subbands, especially for coefficients from subbands having "parent-child" or "sibling" relations in the decomposition tree. The dependence between coefficients has been modelled by the hidden Markov model (HMM) [39,40] and utilized for better image compression efficiency [43]. However, no analysis has been done on the dependence among features extracted from different subbands. In this section, covariance is used to analyze the dependence between energy values from any two subbands. The analysis is first conducted theoretically and then a simulation is used for a given image model.

Figure 2.1. A wavelet packet decomposition tree: (a) The original image, (b) The 2-level decomposition tree, (c) Projection of the original image onto each leaf node of the tree.

2.3.1 Theoretical Analysis

Based on equations (2.1) to (2.4), the wavelet coefficients at level (p+1) are obtained by convolving the wavelet coefficients at level p with the wavelet filters. The highest level in the wavelet packet decomposition tree is the original image A. Therefore, the coefficients in any subband can be written as a linear combination of pixel values in the original image A. The weighting coefficients in the linear combination are determined by the properties of the wavelet basis, i.e., the lengths and the values of the filters g and h. Considering the definition of subband energy in equation (2.5), the energy of the wavelet coefficients in a subband is a second-order polynomial in terms of the pixel values of the image A. The coefficients of the polynomial are determined by the filters g and h and the decomposition level. Denote the coefficient for the term A(i,j)A(i',j') in the representation of an energy function \sigma_p^2(k) as f_p^k(i,j,i',j'), so that

\sigma_p^2(k) = \sum_i \sum_j \sum_{i'} \sum_{j'} f_p^k(i,j,i',j')\, A(i,j)\, A(i',j').      (2.6)

Due to the downsampling, some of the coefficients f_p^k(i,j,i',j') are zero. Hence, the covariance between two energy values is defined as

\mathrm{Cov}(\sigma_{p_1}^2(k_1), \sigma_{p_2}^2(k_2)) = E[\sigma_{p_1}^2(k_1)\,\sigma_{p_2}^2(k_2)] - E[\sigma_{p_1}^2(k_1)]\, E[\sigma_{p_2}^2(k_2)].      (2.7)

Cov(\sigma_{p_1}^2(k_1), \sigma_{p_2}^2(k_2)) will be a fourth-order polynomial in terms of the pixel values in the image A.
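Before expanding (2.7) analytically below, note that the same covariance can also be estimated directly from data. The fragment that follows is a sketch rather than part of the thesis: it assumes a list of sample images is available, reuses the hypothetical `wavelet_packet_energies` helper from the previous sketch, and reports the covariance normalized as in (2.13) further below.

    import numpy as np

    def normalized_energy_covariance(images, band_a, band_b,
                                     wavelet="haar", level=1):
        """Sample estimate of the normalized covariance between the energy
        features of two wavelet-packet subbands over a set of images."""
        pairs = []
        for img in images:
            e = wavelet_packet_energies(img, wavelet=wavelet, level=level)
            pairs.append((e[band_a], e[band_b]))
        ea, eb = np.asarray(pairs).T
        cov = np.cov(ea, eb)                      # 2x2 sample covariance matrix
        return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])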
Using definitions (2.1) to (2.4), the correlation between the energy values of two children nodes, \sigma_1^2(0) and \sigma_1^2(1), is

E[\sigma_1^2(0)\,\sigma_1^2(1)] = E\left[ \sum_i \sum_j \left( \sum_m \sum_n h(m)\,h(n)\, x(m+2i,\, n+2j) \right)^2 \; \sum_p \sum_q \left( \sum_l \sum_r h(l)\,g(r)\, x(l+2p,\, r+2q) \right)^2 \right].      (2.8)

Given that g and h are FIR filters, the random variables in equation (2.8) are the different pixel values. Therefore, the covariance is a function of the 4th-order statistics of the original image A. If we have a statistical model for the pixels, the correlation value can be easily calculated. To illustrate this, suppose that the image A is of size 4 x 4. The image model can be described using the AR-1 Gaussian model, which is simple but often used in digital image processing [45]. A useful property of the AR-1 Gaussian model is that the marginal distribution of each pixel is also Gaussian [46]. The Gaussian AR-1 process with mean \mu is usually written in terms of a series of white noise innovation processes {E_n}:

X_n - \mu = a\,(X_{n-1} - \mu) + E_n,      (2.9)

where E_n \sim N(0, \sigma^2) are i.i.d. and |a| < 1. The marginal distribution is also normal:

X_n \sim N\left(\mu, \frac{\sigma^2}{1-a^2}\right).      (2.10)

Given the marginal distribution, equation (2.8) can be expanded as follows:

E[\sigma_1^2(0)\,\sigma_1^2(1)] = v_1(h,g)\,E[X_1^4] + v_2(h,g)\,E[X_1^3 X_2] + v_3(h,g)\,E[X_1^2 X_2^2] + v_4(h,g)\,E[X_1^2 X_2 X_3] + v_5(h,g)\,E[X_1 X_2 X_3 X_4],      (2.11)

where {X_1, X_2, X_3, X_4} are i.i.d. Gaussian random variables corresponding to different pixels distributed as given by equation (2.10), and {v_1(h,g), v_2(h,g), v_3(h,g), v_4(h,g), v_5(h,g)} are functions of the wavelet filters. Suppose that the size of the original image X is N x M, and define K = min(N, M). v_1(h,g) E[X_1^4] represents those terms on the right hand side of equation (2.8) where the indices of the four pixel values are exactly the same. Thus, v_1(h,g) E[X_1^4] contains exactly K terms. Similarly, v_2(h,g) E[X_1^3 X_2] contains 4K(K-1), v_3(h,g) E[X_1^2 X_2^2] contains 3K(K-1), v_4(h,g) E[X_1^2 X_2 X_3] contains 6K(K-1)(K-2) and v_5(h,g) E[X_1 X_2 X_3 X_4] contains K(K-1)(K-2)(K-3) terms from the right hand side of equation (2.8). Based on the marginal distribution, the expected values are given as follows:

E[X_1^4] = \mu^4 + 6\mu^2 \frac{\sigma^2}{1-a^2} + 3\left(\frac{\sigma^2}{1-a^2}\right)^2
E[X_1^3 X_2] = \mu^4 + 3\mu^2 \frac{\sigma^2}{1-a^2}
E[X_1^2 X_2^2] = \left(\mu^2 + \frac{\sigma^2}{1-a^2}\right)^2      (2.12)
E[X_1^2 X_2 X_3] = \mu^4 + \mu^2 \frac{\sigma^2}{1-a^2}
E[X_1 X_2 X_3 X_4] = \mu^4

Therefore, we can obtain the covariance between the energy features from two subbands analytically. In the next section, we compute the derived covariance values for a specific set of wavelet filters.

2.3.2 Simulation

Suppose that the Haar wavelet basis is used, i.e., g = {0.7071, 0.7071} and h = {-0.7071, 0.7071}. The parameters in equation (2.10) are chosen as follows: \mu = 128, \sigma = 12 and a = 0.8, which are standard values for a 256-valued grey image. The normalized covariance

\mathrm{N\text{-}Cov}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Cov}(X,X)}\,\sqrt{\mathrm{Cov}(Y,Y)}}      (2.13)

is used for measuring the dependence between the energies of different subbands. With these assumptions, the covariance matrix for energy values from any two subbands in the wavelet packet decomposition tree can be easily calculated. The size of the images at the first decomposition level is 2 x 2. The covariance matrix between the four energy values from the first decomposition level, \sigma_1^2(0), \sigma_1^2(1), \sigma_1^2(2), \sigma_1^2(3), is shown in Table 2.1. For the Daubechies 4-tap (2 vanishing moments) filters, where g = {0.4830, 0.8365, 0.2241, -0.1294} and h = {0.1294, 0.2241, -0.8365, 0.4830}, the same 4 x 4 covariance matrix is shown in Table 2.2.

Table 2.1. Normalized energy covariance with the Haar basis

              h_s1^2(0)  h_s1^2(1)  h_s1^2(2)  h_s1^2(3)
h_s1^2(0)       1.0000     0.5656     0.2765    -0.0669
h_s1^2(1)       0.5656     1.0000     0.2782    -0.0610
h_s1^2(2)       0.2765     0.2782     1.0000    -0.2705
h_s1^2(3)      -0.0669    -0.0610    -0.2705     1.0000

Table 2.2. Normalized energy covariance with the db4 basis

              d_s1^2(0)  d_s1^2(1)  d_s1^2(2)  d_s1^2(3)
d_s1^2(0)       1.0000     0.5560     0.5635     0.0191
d_s1^2(1)       0.5560     1.0000    -0.1294     0.2210
d_s1^2(2)       0.5635    -0.1294     1.0000    -0.0029
d_s1^2(3)       0.0191     0.2210    -0.0029     1.0000

(Here h_s1^2(k) and d_s1^2(k) denote the level-1 subband energies \sigma_1^2(k) computed with the Haar and db4 bases, respectively.)
In Table 2.3, the covariance between the four energy values computed from the subbands in the first decomposition level of the Haar basis and the four energy values for the corresponding subbands in the Daubechies 4-tap basis is given.

Table 2.3. Normalized energy covariance between the db4 basis and the Haar basis

              h_s1^2(0)  h_s1^2(1)  h_s1^2(2)  h_s1^2(3)
d_s1^2(0)      -0.0516    -0.0214    -0.0707     0.9870
d_s1^2(1)       0.1279     0.0295     0.3520     0.4322
d_s1^2(2)       0.0665     0.3608     0.0189     0.6090
d_s1^2(3)       0.2174     0.1972     0.0339     0.0666

These results show that various degrees of dependence exist between the energy features extracted from different subbands within a wavelet basis and from different wavelet bases. The wavelet basis, i.e., the high-pass and low-pass filter pair h and g, affects the degree of dependence. To analyze the dependence between energy values more accurately, more complicated statistical image models are needed.

The analysis in this section illustrates that, given the statistical model for the image pixels, we can calculate the covariance between the energy values of any two subbands. The results indicate that the wavelet basis has a great impact on the degree of dependence between energy values from different subbands. The energy values from subbands with different wavelet bases may have different degrees of dependence. Therefore, the assumption used in many image processing algorithms that different wavelet subbands yield independent energy values is incorrect. Thus, the independent subbands can be combined to form a sparse representation for classification.
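The entries of Tables 2.1-2.3 are obtained analytically from (2.11)-(2.12). As a rough numerical cross-check, which is not the procedure used in this chapter, one can also draw synthetic 4 x 4 images from the AR-1 model of (2.9) with mu = 128, sigma = 12 and a = 0.8 and estimate the same normalized covariances by Monte Carlo, reusing the two helpers sketched earlier. The row-wise AR-1 scan order below is an assumption, and the mapping between pywt node paths ('a', 'h', 'v', 'd') and the subband indices 0-3 of (2.1)-(2.4) depends on the filter ordering, so the estimates are only indicative and will not reproduce the table values exactly.

    import numpy as np

    def ar1_image(rows, cols, mu=128.0, sigma=12.0, a=0.8, rng=None):
        """Draw an image whose pixels follow the AR-1 Gaussian model of (2.9);
        each row is an independent AR-1 chain along the column index, and each
        pixel is marginally N(mu, sigma^2 / (1 - a^2))."""
        rng = np.random.default_rng() if rng is None else rng
        x = np.empty((rows, cols))
        x[:, 0] = rng.normal(mu, sigma / np.sqrt(1.0 - a * a), size=rows)
        for j in range(1, cols):
            x[:, j] = mu + a * (x[:, j - 1] - mu) + rng.normal(0.0, sigma, size=rows)
        return x

    rng = np.random.default_rng(1)
    images = [ar1_image(4, 4, rng=rng) for _ in range(5000)]
    # Roughly comparable to one off-diagonal entry of Table 2.1 (Haar, level 1).
    print(normalized_energy_covariance(images, "a", "h", wavelet="haar", level=1))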
2.4 Subband Selection with Dependence

In Section 2.3, an analytical procedure is given for quantifying the dependence between energy values from different subbands using different wavelet bases, given that the underlying image model is known. Although in most applications an accurate statistical image model is not available [47], the simulation in Section 2.3 still strongly supports the hypothesis that the energy values from different subbands, and even from different wavelet bases, may be dependent. This observation motivates us to select subbands based on dependence to generate a compact representation of the wavelet features for texture classification.

In real applications, due to the lack of suitable statistical image models, it is not likely that the dependence between features can be easily calculated analytically. However, the dependence relationship between energy values from subbands can still be estimated with nonparametric methods based on a training set. More specifically, given training samples, the dependence between the energy values from different subbands can be estimated empirically from the training data. Based on this estimation, independent subbands can be chosen to achieve a compact representation for the subsequent classification. In this chapter, Mutual Information based Subband Selection (MISS) is proposed to implement this subband selection process. Given a training texture set T_n, a testing texture set T_c for classification, and a set of subbands U = {S_1, S_2, ..., S_N} from the wavelet packet decomposition, the MISS algorithm can be formulated as follows.

1. For each texture A_i in T_n, extract the energy values for all subbands. Notice that each time the texture is decomposed, the subband size is decreased by a factor of 4. When the subband size is small, the energy of the subband will not be a robust measure. In the implementation, the minimum subband size is set to 16 x 16.

2. Divide the subbands into 2 sets: SU and SR. Initialize SU to contain the subband that has the highest energy and SR = U \ SU.

3. Move a subband S_i from SR to SU based on the following criterion: S_i is chosen such that the feature corresponding to S_i has the smallest mutual information with the features extracted from the subbands in the set SU, i.e.,

i = \arg\min_j \{ I(S_j; SU) : S_j \in SR \},      (2.14)

where S_i is the i-th wavelet subband. Here we use the mutual information I to measure the dependence. This process stops when the number of subbands in SU reaches a predefined value, which can be determined by cross validation. In the experiments presented in this chapter, this value is set as a parameter and the relation between the value of this parameter and the classification accuracy is studied. Finally, the subbands in the set SU are used to construct the compact representation for the given images. Note that since the goal is to classify the images, it is not required that images are reconstructed from SU, i.e., the compact representation does not necessarily form a basis.

4. Extract the energy values corresponding to the sparse representation SU for all images in the testing set T_c.

This algorithm tries to select subbands such that the energy features from these subbands are as independent as possible. It does not require that the chosen subbands construct a valid wavelet packet decomposition tree, as in [2,27,30]. A simple illustration of the MISS algorithm is given in Figure 2.2.

Figure 2.2. Illustration of the MISS algorithm: At each step, the subband in SR with the lowest mutual information with all subbands in SU is moved from SR to SU. In this figure, each square represents a single wavelet subband.

In step 3), the mutual information is used to quantify the dependence between a set of subbands in U and a single subband. Traditional measures of dependence, such as correlation and covariance, are not defined between a set of variables and a single variable, while mutual information is well-defined to measure such dependencies. The computation of mutual information will be discussed in more detail in the following two subsections.

2.4.1 Mutual Information

This subsection discusses the computation of mutual information, which is used in step 3) of the MISS algorithm. Consider two random variables X in \mathcal{X} and Y in \mathcal{Y} with a joint distribution p(x,y). The mutual information between X and Y is defined as

I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} = E_{xy}\left[\log \frac{p(x,y)}{p(x)p(y)}\right] = D(p(x,y) \,\|\, p(x)p(y)),      (2.15)

where D(.\|.) is the Kullback-Leibler distance between the two probability mass functions p(x,y) and p(x)p(y). When the logarithm function in the definition uses a base
Similar to the definition of con- ditional entropy, the conditional mutual information between random variables X and Y given Z is defined as: I(X; Y|Z) = H(X|Z) — H(XlY, z) = E,(,,,,,.,[log W] (2.17) The conditional information has an interpretation similar to that of mutual infor- mation. Given the above definitions, the mutual information between a set of random variables {X 1, X2, ..., X.) and a single random variable Y can be defined as: n I(X1,X2, ---, Xn; Y) = Z I(Xz'; YIXi—1,Xi—2, ---,X1)- (2-18) .=1 This definition of mutual information can be used for evaluating the independence between the features extracted from a set of subbands SU and a single subband 3,. The higher the value of the mutual information, the easier it is to estimate the distribution of S,- given SU. The mutual information thus provides a criterion for the subband selection process. 2.4.2 Computation of Mutual Information Considering the right hand side of equation (5.4), one practical difficulty in compu- tation is the estimation of high-dimensional joint pdfs (or equivalently, conditional pdf), due to the possible large Size of the subband set U. For example, if the texture 21 2. ~ f(o) _ P040 = Figure 2.3. Dimension reduction in mutual information through a many-to-one map- ping function is decomposed up to level 3 using wavelet packets, then there are 85 subbands. It is impractical and unreliable to estimate such high-dimensional distributions. To avoid the problem of “curse of dimensionality”, we assume that the set of random variables {X1, X2, ..., Xn} provides information to Y only through a many-to-one scalar map- ping T : f(X1,X2, ..., X"), in this sense, the mutual information I(X1,X2, ...,Xn; Y) can be calculated as I (T; Y), as illustrated in Figure 2.3. In this model, T is obtained by processing X. Thus, applying the data-processing theorem [48] yields the following inequality: I(X1,X2, ...,Xn;Y) 2 I(T; Y). (2.19) The equality is achieved if and only if T = f (X 1, X2, ..., X") is the sufficient statis- tics for {X 1 , X2, ..., Xn} [48]. Since this is rarely true in practice, finding a proper form for the function f (o) is important to minimize the error caused by the inaccurate as- sumption. Since the true mutual information I (X 1, X2, ..., Xn; Y) is the upper bound for the estimated mutual information I (T; Y), the function f (0) should maximize the estimation I (T; Y). For the convenience of computation, f (o) is usually assumed to be a linear function. With this assumption, the optimization is still nontrivial. In [49—51], more approximations are proposed for computing mutual information. The approximations include replacing the definition of mutual information in equa- 22 tion (2.15) with Renyi’s definition, and using Parzen’s window method for modelling pdfs in high dimensional spaces. Even with all these assumptions and approximations, the complex iterative optimization steps can only achieve locally Optimal solution that highly depends on the initialization of the numerical procedure. A similar mutual information computation problem arises in the statistical mod- elling of wavelet coefficients and is solved in [38,42]. In [38], a simple linear model was adopted, as follows: T = f(X1,X2, ...,Xn) = Z W.X.-. (2.20) i=1 Optimizing this model, i.e., the weighting coefficients {Wi}, incurs similar diffi- culties discussed in [49—51]. 
To avoid these difficulties, a simple equal weight model, i.e., W_i = 1/n, is used in [38], and the results from the computational model match the statistical properties of actual images well. An intuitive improvement on the equal weight model that exploits the "parent-children" and "sibling" relations among wavelet coefficients is also proposed in [38], but it shows results similar to those of the equal weight model. In this chapter, equal weights are used for the mutual information computation, i.e., W_i = 1/n. In this case, T is the unbiased estimate of the mean of {X_i}.

At this point, the accuracy of this simplified model for mutual information computation and its effect on the MISS algorithm deserve further discussion. In [38], different weighting models, such as equal weights and optimal linear weights, have been proposed. The experimental results show that the mutual information estimate does not vary much based on the weighting model. By using the equal weight model, the energy values in the set SU are averaged. In this process, the subbands with higher energy values play a more important role in the mutual information computation that determines the next selected subband. This approximation of the mutual information computation may cause some bias in the selection of the subbands: the next selected subband is the one least dependent on those subbands in SU with high energy values. This is consistent with the current knowledge that the wavelet subbands with higher energy values are more informative for texture classification. As shown in the experimental results in the later sections, the simplified model is able to achieve the goal of wavelet feature selection in the sense that a small number of subbands selected by the MISS algorithm achieve a relatively low classification error rate.

In the actual implementation, the value of the mutual information between two discrete vectors X and T needs to be computed in the discrete domain. Usually, the ranges of X and T are partitioned into N_X and N_T intervals, respectively [52,53]. The continuous pdf p(x,t) can be approximated by the 2D histogram {P_{X,T}(i,j), 1 \leq i \leq N_X, 1 \leq j \leq N_T}. Similarly, the marginal distributions p_X(x) and p_T(t) can be estimated by P_X(i) and P_T(j), respectively. Based on the estimated discrete probability distributions, the mutual information can be calculated as

\hat{I}(X;T) = \sum_i \sum_j P_{X,T}(i,j) \log \frac{P_{X,T}(i,j)}{P_X(i)\, P_T(j)}.      (2.21)

If the random processes X, T are stationary and ergodic, then the histogram reliably approaches the pdf and \hat{I}(X;T) converges to I(X;T).
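Equation (2.21) is a plug-in estimate computed from a joint histogram, and it is the only ingredient needed to run the MISS procedure of Section 2.4 as a greedy forward search. The sketch below is schematic rather than the implementation used in the thesis: the number of bins is a free parameter chosen here arbitrarily, `energies` is assumed to be an (n_images x n_subbands) matrix of training subband energies, and the dependence of a candidate subband on the selected set is measured against the equal-weight average T of (2.20), as described above.

    import numpy as np

    def mutual_information(x, t, bins=16):
        """Histogram (plug-in) estimate of I(X;T) as in equation (2.21), in bits."""
        joint, _, _ = np.histogram2d(x, t, bins=bins)
        p_xt = joint / joint.sum()                   # P_{X,T}(i, j)
        p_x = p_xt.sum(axis=1, keepdims=True)        # P_X(i)
        p_t = p_xt.sum(axis=0, keepdims=True)        # P_T(j)
        nz = p_xt > 0                                # empty cells contribute 0
        return float(np.sum(p_xt[nz] * np.log2(p_xt[nz] / (p_x @ p_t)[nz])))

    def miss_select(energies, n_select):
        """Greedy MISS subband selection (Section 2.4) on an
        (n_images x n_subbands) matrix of training subband energies."""
        n_subbands = energies.shape[1]
        # Step 2: initialize SU with the subband of highest average energy.
        selected = [int(np.argmax(energies.mean(axis=0)))]
        remaining = set(range(n_subbands)) - set(selected)
        # Step 3: repeatedly move the least dependent subband from SR to SU.
        while len(selected) < n_select and remaining:
            t = energies[:, selected].mean(axis=1)   # equal-weight surrogate T, eq. (2.20)
            j_star = min(remaining,
                         key=lambda j: mutual_information(energies[:, j], t))
            selected.append(j_star)
            remaining.remove(j_star)
        return selected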
SGS achieves this objective through two steps. First, the algorithm discovers the structure of statistical dependency between subbands; then it selects the subbands based on this structure using the evaluation scores of individual subbands. In order to discover the structure of statistical dependency between subbands, SGS employs a grouping algorithm that partitions the subbands into different sets based on their dependency on each other, i.e., ideally the subbands from the same set are dependent, while subbands from different sets are independent. The proper partitioning of all the subbands reveals the structure of statistical dependency between subbands. In the second step, the subband with the highest energy from each subset is selected for classifying images. With these two steps, SGS incorporates the previous research results in [1], and exploits the statistical dependency among subbands for a concise representation of the images.

2.5.1 Statistical Dependency Between Subbands

The general problem of discovering the structure of statistical dependency can be described as follows. Given a set of subbands U = {S_1, S_2, ..., S_N}, which is the full wavelet packet decomposition in our problem, find a partition {U_1, U_2, ..., U_M} of the set U such that

\bigcup_{i=1}^{M} U_i = U   and   U_i \cap U_j = \emptyset,  i \neq j   (2.22)

where M is the number of subsets. It is desired that the subbands from the same subset U_i are dependent on each other, while the subbands from different subsets are independent. If this property holds, then we can choose only one feature from each subset to construct a concise representation of the texture.

Discovering the structure of statistical dependency has been addressed in the literature. One of the recent research results in this area was introduced in [54], where a method based on hypothesis testing is proposed. Different hypotheses on subset partitions were compared based on log-likelihood. However, the number of all possible partitions of a set with M random variables is on the order of O(M^M), which is practically intractable. The hypothesis testing also requires evaluating the joint pdf of all the random variables in the same set. The estimation of the pdf in high-dimensional spaces is usually unstable and inaccurate, due to the sparsity of the training data [10]. To avoid this problem of estimating high-dimensional joint pdfs, we propose an approximate method for statistically partitioning the subbands based on pairwise dependency. More specifically, we can estimate the marginal pdf of the features extracted from each subband. Using these estimated marginal pdfs, the pairwise dependency between two subbands can be quantified. A data grouping algorithm is subsequently applied to the features from all subbands, generating the partition of the subbands. This grouping algorithm only requires pairwise dependencies, thus avoiding computations in high-dimensional spaces. With this method, the dependent subbands tend to be grouped into the same subset. Note that this method is an approximation to the discovery of the structure of statistical dependency. For example, given three random variables X, Y, Z, it is possible that P(X|Y) = P(X) and P(X|Z) = P(X), while P(X|Y, Z) \neq P(X) [55]. If only pairwise dependence is measured, X will not be grouped into the same subset as Y and Z, though the dependence is evident.
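A concrete instance of this failure mode (a toy illustration of my own, not from the original text) is the exclusive-or construction: with Y and Z independent fair bits and X = Y XOR Z, X is pairwise independent of both Y and Z, yet fully determined by the pair (Y, Z).

```python
import numpy as np

def mi_discrete(a, b):
    """Mutual information (nats) between two discrete sample vectors."""
    vals_a, vals_b = np.unique(a), np.unique(b)
    p = np.zeros((len(vals_a), len(vals_b)))
    for i, va in enumerate(vals_a):
        for j, vb in enumerate(vals_b):
            p[i, j] = np.mean((a == va) & (b == vb))
    pa = p.sum(1, keepdims=True)
    pb = p.sum(0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (pa @ pb)[nz])).sum())

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100_000)
z = rng.integers(0, 2, 100_000)
x = y ^ z                              # X is a deterministic function of (Y, Z)

print(mi_discrete(x, y))               # ~0: X pairwise independent of Y
print(mi_discrete(x, z))               # ~0: X pairwise independent of Z
print(mi_discrete(x, 2 * y + z))       # ~log 2: jointly, (Y, Z) determines X
```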
Another limitation lies in the data grouping algorithm itself. Research on data grouping, though conducted for several decades, is still far from perfect [10]. This indicates that an accurate partitioning of the subbands, such that the ones in the same subset are dependent while the ones from different subsets are independent, can only be approximated. Despite these limitations, the grouping algorithm, measuring pairwise dependency, is still able to uncover the dependence structure in the data.

In this chapter, the K-Means grouping algorithm is used to generate the partition of subbands. Given N subbands, M partition subsets (N > M), and the mean of each subset denoted {\mu_1, \mu_2, ..., \mu_M}, the K-Means algorithm can be described as follows [10].

1. Randomly select M subbands from the N subbands and set them as the initial class means {\mu_1, \mu_2, ..., \mu_M}.
2. Classify the N subbands. Each subband is assigned to the mean \mu_i with which it has the highest mutual information.
3. Recompute the means {\mu_1, \mu_2, ..., \mu_M} based on the partition of subbands from step 2.
4. Repeat steps 2 and 3 until the means do not change.

It is clear that K-Means is an iterative algorithm. The initial selection of means ("seeds") can affect the final partition results. To avoid this randomness in the final performance evaluation, we run the K-Means algorithm multiple times and the final evaluation is based on the average value from all runs.

2.5.2 Subband Grouping and Selection

The problem of discovering the structure of statistical dependency among subbands has been addressed in Section 2.5.1. Incorporating this partitioning algorithm into our method, the subband with the highest energy value in each subset is chosen for texture classification. In summary, given a training texture set T_n and a testing texture set T_c for classification, and a subband set U = {S_1, S_2, ..., S_N} from the wavelet packet decomposition, the SGS algorithm can be described as follows.

1. For each texture A_i \in T_n, extract the energy values of all subbands from the wavelet packet decomposition. Note that each time the texture is decomposed, the subband size is decreased by a factor of 4. When the size is too small, the energy of the subband is not a robust measure. In the implementation, the minimum subband size is set to 16 x 16.
2. For each subband S_i, estimate the marginal pdf of its energy value using the Parzen method with a Gaussian kernel [10] based on the training set.
3. For any two subbands S_i, S_j, use equation (2.21) to compute the mutual information between the extracted features using the pdfs from step 2.
4. Apply the K-Means grouping algorithm with the pairwise mutual information as the metric to generate the partition of the subband set U: {U_1, U_2, ..., U_M}.
5. Select the subband with the highest energy value from each subset. From the subset U_i, a subband S_{s_i} is selected such that:

s_i = \arg\max_{j} \{ E(S_j), S_j \in U_i \}   (2.23)

where E(S_j) is the energy feature extracted from the subband S_j and is defined by equation (2.5). The set of subbands used for representing the images is thus S_U = {S_{s_1}, S_{s_2}, ..., S_{s_M}}.
6. Extract the subband energies of all images in the testing set T_c. Use only the energy values corresponding to the subbands in the set S_U for classification.

The SGS algorithm attempts to select independent subbands with high energy values for classification. This point is reflected in steps 4 and 5 of the algorithm, where subbands are grouped according to the dependence relation and only subbands with high energy values are selected. Note that SGS does not require the chosen subbands to construct a valid wavelet packet decomposition tree, as in [2]. A simple illustration of SGS for a three-dimensional space is given in Figure 2.4.

Figure 2.4. A simplified illustration of SGS: subbands are first partitioned into different groups (3 ellipses) and then the subband with the highest energy (filled dot) is selected from each group.

Note that this figure is simplified for illustration. In the actual application, the extracted features are in a much higher-dimensional space and the different subbands may not be well separated.
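A compact sketch of the SGS pipeline (steps 1-6 above) is given below. This is an illustration under simplifying assumptions — a histogram estimate of the pairwise mutual information stands in for the Parzen estimate of step 2, and the K-Means-style grouping follows the description in Section 2.5.1 — not the thesis implementation.

```python
import numpy as np

def mi_hist(x, t, n_bins=16):
    """Histogram estimate of I(x; t), as in equation (2.21)."""
    p, _, _ = np.histogram2d(x, t, bins=n_bins)
    p /= p.sum()
    px, pt = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ pt)[nz])).sum())

def group_subbands(feats, n_clusters, n_iter=50, seed=0):
    """K-Means-style grouping: each subband joins the cluster mean with which
    it has the highest mutual information (steps 1-4 of the K-Means description)."""
    rng = np.random.default_rng(seed)
    n = feats.shape[0]
    means = feats[rng.choice(n, n_clusters, replace=False)].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        for s in range(n):
            labels[s] = int(np.argmax([mi_hist(feats[s], m) for m in means]))
        new_means = np.array([feats[labels == c].mean(axis=0) if np.any(labels == c)
                              else means[c] for c in range(n_clusters)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels

def sgs_select(feats, energies, n_clusters):
    """SGS: group the subbands, then keep the highest-energy subband per cluster."""
    labels = group_subbands(feats, n_clusters)
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if members.size:
            selected.append(int(members[np.argmax(energies[members])]))
    return sorted(selected)

# toy usage: 85 subbands, 432 training images, made-up energy features
rng = np.random.default_rng(1)
feats = rng.random((85, 432))          # energy feature of each subband per training image
energies = feats.mean(axis=1)          # average energy used in the selection step
print(sgs_select(feats, energies, n_clusters=20))
```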
2.6 Experiments

2.6.1 Experiment Setting

To evaluate the proposed wavelet subband selection algorithms, classification experiments are conducted on the Brodatz texture database [56]. All of the images used in the experiments are shown in Figure 2.5. All of the images are gray-valued and have a size of 512 x 512. Each texture is divided into 16 non-overlapping subimages with a size of 128 x 128. In this way, the data set for the experiments contains 54 different texture classes, with 16 images in each class, resulting in 864 images. This experimental setting of using training and testing images from the same large texture image is commonly used in the evaluation of texture classification methods, such as in [1,5,23,25,26]. It is known that the size of the subimages affects the classification results. Since the focus of this chapter is to demonstrate the effect of subband selection methods rather than to find a set of optimal parameters for texture classification, we fix the texture size to 128 x 128. For images of different sizes, the subband selection methods can be applied without any significant change.

Figure 2.5. Brodatz texture images used in the experiments.

The gray-scale images are first normalized to a given mean and variance. Denote A(i, j) as the gray-level value of pixel (i, j), and \mu and \sigma^2 as the mean and variance of A, respectively. The normalized image A' is given by:

A'(i, j) = \mu_0 + \sqrt{ \frac{\sigma_0^2 (A(i, j) - \mu)^2}{\sigma^2} }   if A(i, j) > \mu;   A'(i, j) = \mu_0 - \sqrt{ \frac{\sigma_0^2 (A(i, j) - \mu)^2}{\sigma^2} }   if A(i, j) \leq \mu   (2.24)

where \mu_0 and \sigma_0^2 are the predefined mean and variance of the adjusted image A', respectively. In the experiment, we set \mu_0 = 128 and \sigma_0 = 30.

The experiment includes two stages: the training stage and the testing stage. Half of the data set (27 x 16 = 432 images) is used for training and the other half is used for testing. In the training stage, each texture in the training set is decomposed with the wavelet packet transform up to 3 levels. Thus, for each texture, the number of subbands from a single wavelet basis is 1 + 4 + 16 + 64 = 85. A pre-determined number of subbands is selected by applying the subband selection algorithms on the training set T_n. In the training stage, due to the randomness of selecting the initial seeds for the K-Means grouping algorithm, SGS was run 10 times for the same training and testing data set and the results were averaged. For this experimental setting, it is found empirically that averaging over 10 runs gives a relatively stable result. Classification experiments are conducted on the testing set T_c, with the images decomposed only at the subbands selected in the training stage. In the testing stage, the classification error rates for different numbers of selected subbands are computed for performance evaluation. The classification error rate is computed using the KNN classifier with leave-one-out cross-validation [10]. All of the images in the testing set T_c are classified.
In each round, one texture from T_c is taken out and the normalized Euclidean distance defined in equation (2.25) between this texture and all the other images from T_c is computed based on the energy features extracted from the selected subbands. The distance on energy features is defined by the normalized Euclidean distance, which was shown to be effective for measuring texture similarity [57]:

d(F_1, F_2) = \sum_{i=1}^{n} \left( \frac{F_1(i) - F_2(i)}{\sigma_i} \right)^2   (2.25)

where F_1 and F_2 are two different wavelet energy feature vectors, and \sigma_i is the standard deviation of the feature extracted from subband i, estimated from the training set T_n. The K most similar images (i.e., the images with the smallest distances) determine the class label of the texture to be classified by a majority vote. After each texture in the testing set T_c is classified in this way, the ratio of the number of misclassified images to the total number of images in T_c is computed as the error rate. The value K in "KNN" is chosen to be 16. Considering that the number of images in each class is 16, we expect that most of the 16 most similar images come from the same class as the texture to be classified.

For comparison, existing algorithms for wavelet subband selection are implemented and tested. The different methods that are implemented for comparison
Therefore, using different wavelet bases will generate different energy distributions over subbands. The relationship between wavelet bases and classification performance was empirically discussed in [5]. In this chapter, three 32 wavelet bases are used for experiments: Haar, ’db10’ (Daubechies’ basis with 10 vanishing moments) and ’bio28’ (biorthogonal spline wavelets having 2 vanishing mo- ments in the decomposition filter and 8 vanishing moments in the reconstruction filter.) [13]. Since the total number of subbands for each wavelet basis is 85, the number of selected subbands is set to change from 5 to 85, with increments of 5. Figs. 2.6, 2.7 and 2.8 show the classification performance with the ’Haar’, ’db10’ and ’bio28’ wavelet bases, respectively. Several observations can be made from these figures. First of all, the dependence between features extracted from different subbands is useful for subband selection. This is verified by the performance of the MISS algorithm. For all of the three wavelet bases, MISS algorithm can achieve a classification accuracy comparable to the accuracy with the full decomposition using only 20 to 30 subbands. Second, the evaluation of individual subbands is also effective in subband selection. The “En- ergy”, “Entropy” and “Fisher” methods select subbands based on the evaluation of single subbands with different evaluation functions. For the “Haar” and “dblO” bases, MISS algorithm performs better than the three algorithms when smaller num- ber (less than 30) of subbands are selected. However for the “bio28” basis, “Fisher” method performs better for the same range. This shows that both factors (depen- dence between different subbands and the evaluation of individual subbands) affect the subband selection for classification. Third, the most prominent result is that SGS consistently outperforms the other algorithms. This is due to the fact that both dependence between subbands and the evaluation of individual subbands are incorpo- rated into the selection process. The advantage is more obvious when smaller number of subbands are selected. The number of selected subbands is the number of clusters in SGS algorithm. When the number of clusters is relatively small compared to the total number of subbands, the distribution of clusters reflect more reliably the un- derlying dependence structure, since more data points can be used to construct each cluster. Finally, the “Best Dee” algorithm does not achieve the best classification accuracy. The “Best Tree” is optimal in the sense of image reconstruction, not in the sense of classification. A small reconstruction error does not necessarily mean 33 Haar Basis 0.8 - , - o - Energy ‘, - o - MISS ‘, + Fisher ‘, - 0 - Entropy + SGS . + Best Tree 9 0’) -o Error Rate P h P N 100 20 40 so so Number of Selected Subbands Figure 2.6. The error rates with six subband selection algorithms with the ’Haar’ basis. a small classification error. For the wavelet packet decomposition of the Brodatz texture image, we find that the optimality criterion used in the “Best Tree” method always chooses to fully decompose an image, resulting in 5, 21 and 85 subbands for level 1,2 and 3 decompositions, respectively. This is why the “Best Tree” curves in the figures have only 3 data points. 2.6.3 Performance Using Subbands Selected from Two Wavelet Bases Traditional subband selection algorithm confines to selecting subbands that are gen- erated from a single wavelet basis. 
However, for classification, the pool of candidate subbands can be expanded to subbands generated by multiple wavelet bases. For example, we can combine subbands generated by the “Haar” basis and the “deO” basis. As long as the selected subbands form a set of features that are not correlated to each other and are discriminant in classification, it is not required that the se- lected subbands can be used to perfectly reconstruct the original image. Extension to selecting subbands from multiple wavelet bases does not require the modification of 34 db10 Basis .7 - 0 '3 - ° - Energy o_5 - 3 - 0 - MISS 3 ‘+ Fisher 0.5 - 9‘ ', - 0 - Entropy 3 “3 + SGS g 0.4 ‘3 + Best Tree 20 40 60 80 100 Number of Selected Subbands Figure 2.7. The error rates with six subband selection algorithms with the ’db10’ basis. bl028 Basis - 0 - Energy - o - MISS + Fisher - 6 - Entropy +SGS + Best Tree 20 4o 60 30 100 Number of Selected Subbands Figure 2.8. The error rates with six subbandselection algorithms with the ’bio28’ basis. 35 the algorithms described in Section 2.6.1 except that the pool of subbands to choose from has been enlarged. For the “Best Tree” algorithm, the features from subbands selected by the algorithm on different decomposition trees with diflerent bases are simply concatenated. _ In this subsection, experiments on selecting subbands from two wavelet bases are conducted for all of the subband selection methods mentioned in Section 2.6.1. Figs. 2.9, 2.10, and 2.11 Show the error rates obtained by the algorithms selecting subbands from “bio28” and “db10”,“Haar” and “bi028”, and “deO” and “Haar”, respectively. Each decomposition generates 85 subbands, so the total number of subbands for selection is 170. In order to make the results directly comparable with subband selection from a single wavelet basis, the number of selected subbands is still set to change from 5 to 85, with an increment of 5. In this modified experiment setting, the SGS still consistently outperforms other algorithms and the observations in Section 2.6.2 still hold here. It should be noted that the error rate obtained by selecting 85 subbands out of a possible 170 subbands from the two wavelet bases, regardless of the algorithm and the wavelet bases pairs used, is lower than the error rate using all 85 subbands from a single wavelet basis. This shows the advantage of selecting subbands from multiple wavelet bases or redundant dictionaries in general. More subbands can provide more information for classification and the proper selection algorithm can select the most informative subbands for classification. The minimum error rates achieved with different wavelet bases and combination of bases by applying different subband selection methods are summarized in Table 2.4 for comparison. Note that the different minimum error rates are not necessarily obtained with the same number of subbands. Based on the results in this table, the best classification performance is achieved by applying the SGS algorithm on the combination of “bio28” and “haar” bases. An empirical study on the relation between wavelet basis and the classification results on texture classification was discussed in [5]. Based on the conclusion from [5], the amount of shift variance in the decomposition filters in the wavelet basis is much more important than the regularity of the filters. 36 Table 2.4. 
Minimum Error Rates Achieved With Different Wavelet Bases and Different Selection Methods

                    Best Tree   Energy   Entropy   Fisher   MISS     SGS
  bio28             0.0766      0.0688   0.0766    0.0779   0.0675   0.0649
  db10              0.0831      0.0831   0.0688    0.0792   0.0714   0.0481
  haar              0.0809      0.0603   0.0809    0.0765   0.0750   0.0603
  bio28 and db10    0.1116      0.0785   0.0872    0.0802   0.0767   0.0488
  bio28 and haar    0.1228      0.0632   0.0842    0.0895   0.0596   0.0368
  haar and db10     0.1029      0.0663   0.0837    0.0767   0.0593   0.0558

Figure 2.9. The error rates with six subband selection algorithms by selecting subbands from the 'bio28' basis and the 'db10' basis.

This was used to explain why the "haar" basis performed better than the Daubechies wavelets for texture classification in [5]. Another empirical conclusion drawn from the experimental results in [5] is that even-length biorthogonal filters perform better than odd-length ones. These conclusions also support our finding that the combination of the "bio28" and "haar" bases achieves the best performance. Another support comes from the difference of the spaces spanned by the wavelet functions of "haar" and "bio28". As shown in Figure 2.12, the two wavelet functions have quite different shapes, and therefore the combination of the two is able to capture a variety of structures in texture images, such as smooth curves and sharp discontinuities.
The grouping algorithm is run 1000 times. Each time half of the 39 0000000000000009 Figure 2.13. The subband labelling scheme for a 2-level wavelet packet decomposition. 864 texture images are randomly selected for decomposition and the energy features are used for grouping. When the number of clusters is set to 2, 998 out of the 1000 runs give the same grouping results: All approximation subbands (subband 1, 2 and 6 in Figure 2.13) are in one cluster and the rest of the subbands are in another clus- ter. This result indicates that subbands corresponding to the same spatial frequency range usually have higher degree of dependence and tend to be grouped into the same cluster. We further change the number of clusters in the K-Means algorithm to 4 and conduct the grouping algorithm. The number of clusters is set to 4 be- cause there are 4 types of subbands in wavelet packet decomposition: LL, LH, HL, HH. The resulting subband clusters are used to build a co—occurrence CM with a size of 21 x 21, where the value at the entry (i, j) is the number of times the i-th and j-th subbands are in the same cluster, with the subband la- belling scheme given in Figure 2.13. Since the grouping algorithm is run 1000 times, the maximum value in the matrix is 1000 and the minimum value is 0. The phenomenon observed in the 2-cluster case appears again, i.e., the approxi- mation subbands are always grouped in the same cluster with CM(1,2),C'M(1,6) and CM(2, 6) larger than 990 and other co—occurrence values in rows correspond- ing to subbands 1,2,6 are small (less than 20). Some subbands corresponding to 40 similar spatial frequency are grouped into the same cluster most of the time. For example, CM(7, 10), CM(8, 14), CM(9, 18), CM(12, 15), CM(13, 19), CM(17, 20) all have value larger than 990, which means that more than 99% of the time, these subbands are grouped into the same cluster. The common property of these subband pairs is that the paths from the root to the two subbands are swapped. For example, the path from the root to subband 7 is “LL+LH” and the path to subband 10 is “LH+LL”. The empirical analysis of the grouping results is consistent with the filtering structure of the wavelet packet analysis and verifies the validity of the proposed SGS algorithm. 2.6.5 Performance on Unseen Data The experimental results discussed so far are based on the setting that the training and testing subimages are cropped from different regions in the same large texture image. Although this experimental setting is commonly used for conducting experi- ments on texture classification, such as in [1,5,23,25,26], having training and testing texture images from the same large image may cause bias in the classification result. Therefore, experiments are conducted for studying the performance of the proposed algorithms on unseen data to see if the proposed algorithms are still effective. For this purpose, 9 large texture images are selected and divided into 3 classes, as shown in Figure 2.14, with one class in each row. Each of these 512 x 512 images is divided into 16 non-overlapping 128 x 128 subimages. Subimages from the first larger image in each row are used for training and the subimages from the other two images are used for testing. In this case, the training set includes 48 samples and the testing set includes 96 samples. The wavelet basis “bio28” is selected for decomposing the images. Other experimental settings are the same as in the previous sections. 
Given these conditions, the classification results are shown in Figure 2.15. For comparison, experiments are also conducted by using subimages from each large texture image in the training set. In this case, 5 subimages are selected from the 16 subimages of each larger texture into the training set and the remaining 11 subimages are used for testing. The classification results of this setting are shown in Figure 2.16. From Figure 2.15 and Figure 2.16, similar conclusions can be drawn as in the previous ex- 41 periment setting, i.e., selection based on either the evaluation of individual subbands or the dependence between subbands is effective, and subband selection based on the two factors results in the best performance. The classification on the unseen data introduces some fluctuations in the performance curve, but does not change the con- clusion on the comparison of different selection algorithms. The minimum error rates obtained by the two experimental settings are comparable. The major difference in the results is that a small error rate is achieved with fewer subbands in the second experimental setting (Figure 2.16) compared to the modified experiments with unseen data (Figure 2.15). The reason for the observed fluctuation in the error rates and the slight increase in the number of subbands selected is because of the fact that the tested images have more variability in the unseen data case. Experiments are also conducted with other two wavelet bases, i.e., “hear” and “db10”, and with the three combinations of two wavelet bases. The results from these experiments are similar to those that use training and testing subimages from the same larger image except that the performance curve has more fluctuations. This consistence Shows the applicabil- ity and validity of the proposed algorithms to a wider setting of texture classification problems. 2.6.6 Discussion In most applications of wavelet transforms such as denoising and compression, the goal is to reconstruct the original signal/ image. The perfect reconstruction goal con- straints how the subbands are chosen. However, for classification, it is not required that the selected subbands can be used to reconstruct the original image. The major requirement for classification applications is that the features extracted from the se- lected subbands are uncorrelated and discriminative. With this criterion, the features from the selected subbands will form a compact and informative representation of the images for the purpose of classification. The work presented in this chapter is motivated by the observation that most existing subband selection methods implicitly assume that features from different subbands are independent. MISS algorithm Shows that the dependence can be used 42 Figure 2.14. Texture images used for classification on unseen data. [First row: bars; Second row: grass; Third row: knit pattern. 43 Classification Results 0.35 - —-— Energy 0.3 - +MISS -- Fisher 0.25 - + Entropy 2 -°-SGS E 0-2 ’ +Best Tree 2 0.1 5 — In 0.1 - 0.05 "o 20 40 so so 100 Feature Dimension Figure 2.15. The error rates with six subband selection algorithms with the ’bio28’ basis on unseen data. Classification Results 0.5r ‘-°- Energy + MISS 0-4 ’ -- Fisher -s— Entropy -°-SGS 0'3 ‘ + Best Tree Error Rate o 20 40 610 so 100 Feature Dimension Figure 2.16. The error rates with six subband selection algorithms with the ’bio28’ basis and training data from all big texture images. 44 for selecting a compact set of subbands for classification. 
However, independent features do not necessarily imply that the features are informative for classification. To select subbands that are both compact and informative, SGS algorithm is proposed to combine dependence and the evaluation of individual subbands for better subband selection. Experimental results show that SGS outperforms the algorithm based on dependence only (MISS) and the algorithms based on the evaluation of individual subbands only (“Energy”, “Fisher” and “Entropy”). It has also been shown that the subband selection algorithms can be easily extended to select subbands from multiple wavelet bases. The idea of incrementally selecting an informative subset of wavelet subbands used in the MISS algorithm bears some similarities to “feature pursuit” with “minimax entrOpy principle” [34]. In [34], the feature pursuit process selects a new feature that maximizes the decrease in entropy between the existing feature set and the new feature set obtained by adding the new feature to the existing feature set. The feature pursuit is proposed for texture synthesis with a parametric probability model, while the objective of the MISS algorithm is texture classification. This difference directly results in the difference in the criterion used in the feature pursuit and the MISS algorithm. For the feature pursuit, the change in entropy values of different feature sets is used for selecting new features, whereas in the MISS algorithm, the new subband is selected based on change in the mutual information. It is also important to note that different experimental settings for the texture classification can affect the classification accuracy. For example, the distance measure, changes in the texture size and extracting features other than energy from wavelet subbands may all affect the classification accuracy. However, the focus of this chapter is not to propose a new method that can outperform the state-of-the—art texture classification methods, but rather to investigate one factor (dependence) that affects subband selection and incorporate it into the selection process. The proposed subband selection method is applicable to filterbank based classification of other signal and texture. . 45 2.7 Conclusion In this chapter, the dependence between features extracted from wavelet packet de- composition is investigated and incorporated into subband selection for classification. The traditional methods implicitly assume independence among features extracted from different subbands or only consider the dependence between the parent subband and the corresponding children subbands. We first demonstrate that the dependence between features extracted from different subbands exist by theoretical analysis and simulation. An algorithm exploiting the dependence, MISS, is proposed for subband selection. Experimental results show that dependence among features from different subbands is effective for subband selection. By exploiting the dependence among fea- tures from different subbands and incorporating subband selection based on individual evaluation of subbands, SGS has been shown to be more effective in subband selec- tion. Experimental results indicate that SGS can effectively select smaller number of subbands to achieve lower classification error rates than existing subband selection methods. 46 CHAPTER 3 A Sparse Representation Framework for Signal Classification 3. 1 Background Sparse representations of signals have received a great deal of attentions in recent years [15—18, 35, 58,59]. 
Sparse representations address the problem of finding the most compact representation of a signal in terms of a linear combination of atoms in an overcomplete dictionary. Recent developments in multi-scale and multi-orientation representations of signals, such as the wavelet, ridgelet, curvelet and contourlet transforms, are an important incentive for the research on sparse representations. Compared to methods based on orthonormal transforms or direct time-domain processing, sparse representations usually offer better performance with their capacity for efficient signal modelling. Current research has focused on three aspects of sparse representations: pursuit methods for solving the optimization problem, such as FOCUSS [60], matching pursuit [15], orthogonal matching pursuit [16], basis pursuit [17], and LARS/homotopy methods [61]; the design of the dictionary, such as the K-SVD method [12]; and the applications of sparse representations to different tasks, such as signal separation, denoising, coding, and image inpainting [18,19,35,58,59,62]. For instance, in [59], sparse representations are used for image separation. The overcomplete dictionary is generated by combining multiple standard transforms, including the curvelet transform, ridgelet transform and discrete cosine transform. In [35,58], the application of sparse representations to blind source separation is discussed and experimental results on EEG data analysis are demonstrated. In [62], a sparse image coding method with the wavelet transform is presented. In [18], sparse representation with an adaptive dictionary is shown to have state-of-the-art performance in image denoising. The widely used shrinkage method for image denoising is shown to be the first iteration of basis pursuit that solves the sparse representation problem [19].

In the standard framework of sparse representations, the objective is to reduce the signal reconstruction error with as few atoms as possible. On the other hand, discriminative analysis methods, such as LDA, are more suitable for the task of classification. However, discriminative methods are usually sensitive to corruption in the signals due to a lack of the properties crucial for signal reconstruction. In this thesis, we propose a method for sparse representations for signal classification (SRSC), which modifies the standard sparse representation framework for signal classification. We first show that replacing the reconstruction error with discrimination power in the objective function of the sparse representation is more suitable for the task of classification. When the signal is corrupted, the discriminative methods may fail since the information contained in the discriminative analysis may be easily distorted by noise, missing data and outliers. To address this robustness problem, the proposed SRSC approach combines discrimination power, signal reconstruction and sparsity in the objective function for classification. Our ultimate goal is to achieve a sparse and robust representation of signals in noise for effective classification.

3.1.1 Problem Formulation

The problem of finding the sparse representation of a signal in a given overcomplete dictionary can be formulated as follows. Given an N x M matrix A containing the elements of an overcomplete dictionary in its columns, with M > N and usually M >> N, and a signal y \in R^N, the problem of sparse representation is to find an M x 1 coefficient vector x such that y = Ax and ||x||_0 is minimized, i.e.,
x = \arg\min_{x'} ||x'||_0   s.t.   y = Ax'   (3.1)

where ||x||_0 is the \ell_0 norm and is equal to the number of non-zero components in the vector x. Finding the solution to equation (3.1) is NP-hard due to its combinatorial nature. Suboptimal solutions to this problem can be found by iterative methods such as matching pursuit and orthogonal matching pursuit. An approximate solution is obtained by replacing the \ell_0 norm in equation (3.1) with the \ell_1 norm, as follows:

x = \arg\min_{x'} ||x'||_1   s.t.   y = Ax'   (3.2)

where ||x||_1 is the \ell_1 norm. In [63], it is proved that if a certain condition on the sparsity is satisfied, i.e., the solution is sparse enough, the solution of equation (3.1) is equivalent to the solution of equation (3.2), which can be efficiently solved by basis pursuit using linear programming. A generalized version of equation (3.2), which allows noise, is to find x such that the following objective function is minimized:

J_1(x; \lambda) = ||y - Ax||_2^2 + \lambda ||x||_1   (3.3)

where the parameter \lambda > 0 is a scalar regularization parameter that balances the tradeoff between reconstruction error and sparsity. In [64], a Bayesian approach is proposed for learning the optimal value of \lambda. The problem formulated in equation (3.3) has an equivalent interpretation in the framework of Bayesian decision theory, as follows [65]. The signal y is assumed to be generated by the following model:

y = Ax + \epsilon   (3.4)

where \epsilon is white Gaussian noise. Moreover, the prior distribution of x is assumed to be super-Gaussian:

p(x) \propto \exp\left( -\lambda \sum_{i=1}^{M} |x_i|^p \right)   (3.5)

where p \in [0, 1]. This prior has been shown to encourage sparsity in many situations, due to its heavy tails and sharp peak. Given this prior, the maximum a posteriori (MAP) estimate of x is formulated as

x_{MAP} = \arg\max_{x} p(x|y) = \arg\min_{x} \left[ -\log p(y|x) - \log p(x) \right] = \arg\min_{x} ||y - Ax||_2^2 + \lambda \sum_{i=1}^{M} |x_i|^p   (3.6)

When p = 0, equation (3.6) is equivalent to the generalized form of equation (3.1); when p = 1, equation (3.6) reduces to the \ell_1-penalized form in equation (3.3).

3.1.2 Reconstruction and Discrimination

Sparse representations work well in applications where the original signal y needs to be reconstructed as accurately as possible, such as denoising, image inpainting and coding. However, for applications like signal classification, it is more important that the representation is discriminative for the given signal classes than that the reconstruction error is small.

The difference between reconstruction and discrimination has been widely investigated in the literature. It is known that typical reconstructive methods, such as principal component analysis (PCA) and independent component analysis (ICA), aim at obtaining a representation that enables sufficient reconstruction, and thus are able to deal with signal corruption, i.e., noise, missing data and outliers. On the other hand, discriminative methods, such as LDA [10], generate a signal representation that maximizes the separation of the distributions of signals from different classes. While both types of methods have broad applications in classification, the discriminative methods have often outperformed the reconstructive methods for the classification task [11,66]. However, this comparison between the two types of methods assumes that the signals being classified are ideal, i.e., noiseless, complete (without missing data) and without outliers. When this assumption does not hold, the classification may suffer from the nonrobust nature of the discriminative methods, which contain insufficient information to successfully deal with signal corruptions.
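As a brief aside, the \ell_1-regularized problem in equation (3.3) can be solved with simple first-order methods. The following is a minimal iterative soft-thresholding (ISTA) sketch of my own, with a placeholder dictionary, step size and stopping rule; it is not the pursuit method developed later in this chapter.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, y, lam, n_iter=500):
    """Minimize ||y - A x||_2^2 + lam * ||x||_1  (equation (3.3)) by ISTA."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the quadratic term's gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)           # gradient of (1/2)||y - Ax||_2^2
        x = soft_threshold(x - grad / L, lam / (2 * L))
    return x

# toy usage: overcomplete random dictionary, sparse ground truth
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
x_true = np.zeros(256); x_true[[3, 70, 200]] = [1.5, -2.0, 0.8]
y = A @ x_true + 0.01 * rng.standard_normal(64)
x_hat = ista(A, y, lam=0.1)
print(np.count_nonzero(np.abs(x_hat) > 1e-3))      # a sparse estimate of x_true
```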
Specifically, the representation provided by the discriminative methods for optimal classification does not necessarily con- tain sufficient information for signal reconstruction, which is necessary for removing noise, recovering missing data and detecting outliers. This performance degrada- 50 tion of discriminative methods on corrupted signals is evident in the examples shown in [67]. On the other hand, reconstructive methods have shown successful perfor- mance in addressing these problems. In [18], the sparse representation is shown to achieve state-of-the-art performance in image denoising. In [68], missing pixels in im- ages are successfully recovered by inpainting method based on sparse representation. In [67,69], PCA method with subsampling effectively detects and excludes outliers for the following LDA analysis. All of these examples motivate the design of a new signal representation that com- bines the advantages of both reconstructive and discriminative methods to address the problem of robust classification when the signals are corrupted. The proposed method should generate a representation that contains discriminative information for classification, representative information for signal reconstruction and should be sparse. In current research [18,68], sparse representations have proven to be effective for signal reconstruction. Existing pursuit methods provide an efficient way to solve the sparse representation problem. Therefore, we choose the sparse representation as the basic framework for the SRSC and incorporate a measure of discrimination power into the objective function. With this new method, the sparse representation ob- tained by the proposed method contains both crucial information for reconstruction and discriminative information for classification, which enable a reasonable classifi- cation performance in the case of corrupted signals. The three objectives: sparsity, reconstruction and discrimination may not always be consistent. Therefore, weighting factors are introduced to adjust the tradeoff among these objectives, as the weighting factor A in equation (3.3). It should be noted that the aim of SRSC is not to improve the standard discriminative methods like‘LDA in the case of ideal signals, but to achieve comparable classification results with a few number of features when the sig- nals are corrupted. A recent work [67] that aims at robust classification shares some common ideas with the prOposed SRSC. In [67], PCA with subsampling proposed in [69] is applied to detect and exclude outliers in images and the rest of the pixels are used for computing LDA vectors. 51 3.2 Sparse Representation for Signal Classification In this section, the SRSC problem is formulated mathematically and a pursuit method is proposed to optimize the objective function. We first replace the term measuring reconstruction error with a term measuring discrimination power to show the different effects of reconstruction and discrimination. Further, we incorporate measure of dis— crimination power in the framework of standard sparse representations to effectively address the problem of classifying corrupted signals. The Fisher’s discrimination cri- terion [10] used in the LDA is applied to quantify the discrimination power. Other well-known discrimination criteria can easily be substituted. 3.2.1 Problem Formulation Given an N x 1 signal y and its M x 1 projection on an N x M (M > N) overcomplete dictionary A, i.e., y = Ax, we view x as the feature extracted from signal y for classification. 
The extracted feature should be as discriminative as possible between the different signal classes. Suppose that we have a set of K signals {y_1, y_2, ..., y_K}, with representations in the overcomplete dictionary {x_1, x_2, ..., x_K}, of which K_i samples are in the class C_i, for 1 \leq i \leq C. The mean m_i and variance s_i^2 for class C_i are computed in the feature space as follows:

m_i = \frac{1}{K_i} \sum_{j \in C_i} x_j ,   s_i^2 = \frac{1}{K_i} \sum_{j \in C_i} ||x_j - m_i||_2^2   (3.7)

The mean of all samples is defined as m = \frac{1}{K} \sum_{i=1}^{K} x_i. Finally, the Fisher discrimination power can be defined as:

F(X) = \frac{\sum_{i=1}^{C} K_i ||m_i - m||_2^2}{\sum_{i=1}^{C} K_i s_i^2}   (3.8)

Figure 3.1. Distributions of the projections of signals from two classes onto the first atom selected by J_1(X, \lambda) (left) and J_2(X, \lambda) (right).

In this simulation, most of the signal energy can be described by the sine function and most of the discrimination power is in the cosine function. The signal component with the most energy is not necessarily the component with the most discrimination power. We consider the dictionary {sin t, cos t}. Optimizing the objective function J_1(X, \lambda) with the pursuit method described in Section 3.2 selects sin t as the first atom. On the other hand, optimizing the objective function J_2(X, \lambda) selects cos t as the first atom. In the simulation, 100 samples are generated for each class and the pursuit algorithm stops after the first run. The projections of the signals from both classes onto the first atom selected by J_1(X, \lambda) and J_2(X, \lambda) are shown in Figure 3.1. The difference in the distributions of the two classes shown in the figures has a direct impact on the classification: linear classification on these distributions generates a classification error rate around 0.5 for the first case and 0.0 for the second case.

3.3.2 Sparse Classification with Handwritten Digit Images

Classification with J_1, J_2 and J_3 (SRSC) is also conducted on the database of USPS handwritten digits. The database contains 8-bit grayscale images of "0" through "9" with a size of 16 x 16, and there are 1100 examples of each digit. Some samples of the digits are shown in Figure 3.2.

Figure 3.2. Samples of the USPS handwritten digit images.

Following the conclusion of [72], 10-fold stratified cross-validation is adopted. Classification is conducted with the decomposition coefficients ('X' in equation (3.10)) as the selected feature and a support vector machine (SVM) as the classifier. The evaluation of the Fisher discrimination score implicitly assumes a Gaussian distribution with equal variance for each class. Therefore, the simple Euclidean distance is used to measure the distance between different samples. In this implementation, the overcomplete dictionary is a combination of the Haar wavelet basis and the Gabor basis. The Haar basis is good at modelling discontinuities in the image, whereas the Gabor basis is good at modelling continuous image components.

In this experiment, noise and occlusion are added to the images to test the robustness of SRSC. First, white Gaussian noise with increasing levels of energy, and thus decreasing levels of signal-to-noise ratio (SNR), is added to each image.
Table 3.1 summarizes the classification errors obtained with different SNRs. Second, black squares of different sizes are overlaid on each image at a random location to generate occlusion (missing data). For the image size of 16 x 16, black squares with sizes of 3 x 3, 5 x 5, 7 x 7, 9 x 9 and 11 x 11 are overlaid as occlusion. Table 3.2 summarizes the classification errors obtained with occlusion.

Figure 3.3. Corrupted data for the experiments: (a) with different degrees of white Gaussian noise; (b) with different sizes of occlusion.

Table 3.1. Classification errors with different levels of white Gaussian noise

        Noiseless   20dB     15dB     10dB     5dB
  J1    0.0855      0.0975   0.1375   0.1895   0.2310
  J2    0.0605      0.0816   0.1475   0.2065   0.2785
  J3    0.0727      0.0803   0.1025   0.1490   0.2060

Table 3.2. Classification errors with different sizes of occlusion

        no occlusion   3x3      5x5      7x7      9x9      11x11
  J1    0.0855         0.0930   0.1270   0.1605   0.2020   0.2990
  J2    0.0605         0.0720   0.1095   0.1805   0.2405   0.3305
  J3    0.0727         0.0775   0.1135   0.1465   0.1815   0.2590

In Table 3.1 and Table 3.2, the minimum error rate at each noise level is highlighted with bold font. The results in Table 3.1 and Table 3.2 show that in the case that the signals are ideal (without missing data and noiseless) or nearly ideal, J_2(X, \lambda) is the best criterion for classification. This is consistent with the known conclusion that discriminative methods outperform reconstructive methods in classification. However, when the noise is increased or more data is missing (with a larger area of occlusion), the accuracy based on J_2(X, \lambda) degrades faster than the accuracy based on J_1(X, \lambda). This indicates that the signal structures recovered by the standard sparse representation are more robust to noise and occlusion, and thus yield less performance degradation. On the other hand, SRSC demonstrates lower error rates by combining the reconstruction property and the discrimination power in the case that signals are corrupted.

3.3.3 Experiments with Object Recognition

Experimental Settings

Experiments are also conducted on the classification of images from 2 different objects in the COIL20 data set [73]. The focus of this experiment is to investigate the effect of different weighting values on the cost function J_3(X; \lambda_1, \lambda_2, \lambda_3). In the training stage, 12 images of each object with a size of 16 x 16 are used, and the remaining 60 images from each object are used for testing. Classification is conducted using the decomposition coefficients ('X' in equation (3.10)) as the features and the K-nearest-neighbor (KNN) classifier with the "leave-one-out" method. The Euclidean distance is used to measure the distance between features from different samples. In this implementation, the overcomplete dictionary is formed by combining the Haar wavelet basis, the Gabor basis and the contourlet basis. The Haar basis is good at modelling discontinuities in the signal. The Gabor basis, on the other hand, is good at modelling continuous signal components and texture. The contourlet is a multi-resolution and multi-directional wavelet transform that can efficiently model object contours. In summary, the over-complete dictionary has 200 Gabor atoms, 441 Haar atoms and 454 contourlet atoms.

Random Gaussian noise and occlusions are added to the original images to test the performance of SRSC with different parameter settings. First, white Gaussian noise with different levels of signal-to-noise ratio (SNR) is added to each image.
Second, black squares (zero values) of different sizes are overlaid on each image at the image center to generate occlusions. For the image size of 16 x 16, black squares with sizes 3 x 3, 5 x 5 and 7 x 7 are overlaid as occlusions. Several example images with different levels of Gaussian noise and occlusion are shown in Figure 3.4.

Figure 3.4. (a) Object images with different levels of white Gaussian noise; (b) Object images with different sizes of occlusion. The images in the first line of (a) and (b) are noiseless.

Since the three components of J_3(X; \lambda_1, \lambda_2, \lambda_3) may not be on the same scale, we first normalize the three factors before setting the weighting factors. Given a vector Z = [z_1, z_2, ..., z_M], where Z can be any one of the three terms, the normalization is implemented by the following equation:

\tilde{z}_i = \frac{z_i - \mu_Z}{\sigma_Z}   (3.15)

where \mu_Z and \sigma_Z are the mean and standard deviation of Z, respectively.

Study on the Effect of the Different Factors

Table 3.3 summarizes the classification error rates for the different weighting factors for different SNR values. Table 3.4 summarizes the classification error rates with different sizes of occlusions. In both tables, the minimum error rate of each column is underlined. In this experiment, the pursuit algorithm stops when 547 atoms are chosen, i.e., half of the total number of atoms.

Table 3.3. Error rates with white Gaussian noise

  [\lambda_1, \lambda_2, \lambda_3]   Noiseless   20dB     15dB     10dB
  [0,0,1]                             0.1000      0.2583   0.3083   0.3417
  [1,0,0]                             —           0.2417   0.3417   0.3833
  [1,1,0]                             0.0833      0.2167   0.3000   0.3250
  [0,1,1]                             0.0917      0.2583   0.2750   —
  [1,0,1]                             0.0833      —        0.2833   0.3667
  [1,1,1]                             0.1000      0.2500   —        0.3333

Table 3.4. Error rates with occlusions

  [\lambda_1, \lambda_2, \lambda_3]   no occlusion   3 x 3    5 x 5    7 x 7
  [0,0,1]                             0.1000         0.1167   0.1500   0.2000
  [1,0,0]                             —              0.1667   0.1750   0.2167
  [1,1,0]                             0.0833         0.1250   0.2417   0.3250
  [0,1,1]                             0.0917         0.1333   0.1583   0.2667
  [1,0,1]                             0.0833         0.1417   0.2250   0.3000
  [1,1,1]                             0.1000         —        —        0.2833

From the experimental results, we can observe several phenomena. First, in a noiseless environment, the Fisher discrimination ratio by itself results in the lowest error rate. This result is consistent with the existing research that discriminative methods are better than reconstructive methods for classification [11]. Second, as the signal is corrupted by Gaussian noise, or as more pixels are lost, the performance of all weighting schemes deteriorates. The different weighting schemes have different responses to noise and occlusions. In a noisy environment, the weighting vectors that combine discrimination power and reconstruction power usually achieve the best performance, as shown in the third and fourth columns of both tables. When the images are highly corrupted, as shown in the last column of both tables, the inclusion of discriminative power actually results in a higher classification error rate: the minimum error rate is achieved with \lambda_1 = 0, which means that the discriminative power is not considered. These results indicate that when the signal is corrupted beyond a certain amount, the discriminative power estimated from the signal is no longer reliable, and the reconstruction error provided by the framework of sparse representations and harmonic analysis can reduce the effect of noise and missing data to improve the classification accuracy.
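As a small sketch of how the weighting study is assembled (illustrative only; the per-atom criterion values, the weights and the sign convention below are placeholders), equation (3.15) is applied to each of the three criteria over the candidate atoms before they are combined:

```python
import numpy as np

def zscore(z):
    """Equation (3.15): centre and scale one criterion over the candidate atoms."""
    return (z - z.mean()) / z.std()

def combined_score(discrim, sparsity, recon, lam=(1.0, 1.0, 1.0)):
    """Weighted sum of the three normalized criteria, one value per candidate atom.
    discrim, sparsity, recon: 1-D arrays of placeholder criterion values."""
    l1, l2, l3 = lam
    return l1 * zscore(discrim) + l2 * zscore(sparsity) + l3 * zscore(recon)

# toy usage: 1091 candidate atoms with made-up criterion values,
# using the [1, 0, 1] weighting setting from Tables 3.3 and 3.4
rng = np.random.default_rng(0)
scores = combined_score(rng.random(1091), rng.random(1091), rng.random(1091),
                        lam=(1.0, 0.0, 1.0))
best_atom = int(np.argmin(scores))   # pick the next atom (treating each term as a cost)
```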
Figure 3.5. The relation between error rate and sparsity.

Study on Sparsity

The sparsity of the representation is incorporated in the objective function with the term \lambda_2 \sum_{i=1}^{K} ||x_i||_0. Sparsity is preferred in SRSC because it simultaneously provides a compact representation and filters out noisy signal components. This idea is similar to that of feature selection in pattern recognition. In SRSC, the effect of sparsity can be controlled in two ways. First, we can increase the value of \lambda_2 in J_3(X; \lambda_1, \lambda_2, \lambda_3) to emphasize sparsity. Second, we can reduce the number of iterations in the pursuit algorithm, i.e., the number of selected atoms. For the first case, we can compare the error rates for \lambda_2 = 1 and \lambda_2 = 0 in Tables 3.3 and 3.4. In most cases, especially when the images are corrupted, emphasizing the sparsity of the representation by increasing \lambda_2 results in a more discriminative feature, improving the classification accuracy. For the second case, we study the relation between sparsity and the classification error of SRSC when the images are occluded with a 3 x 3 square and the weighting factors are fixed: \lambda_1 = \lambda_2 = \lambda_3 = 1. The experimental results are shown in Figure 3.5. In this experiment, we change the number of selected atoms from 100 to 1000, with increments of 100. The minimum error rate is achieved with only 200 atoms out of a total of 1091 atoms, i.e., with less than 20% of the atoms. We also note that beyond 500 atoms, the error rate monotonically increases with the number of atoms. The results indicate that not all atoms provide useful information for classification: decomposition of the signals onto certain atoms actually adds noise rather than improving the classification. The results also show the effectiveness of the pursuit algorithm in selecting a small number of atoms that achieve a low error rate.

3.4 Conclusions

SRSC is motivated by the ongoing research in the area of sparse representations in signal processing. SRSC incorporates reconstruction error, discrimination power and sparsity for robust classification. In the current implementation of SRSC, the weighting factors are set empirically to optimize the performance. Approaches to determine optimal values for the weighting factors are being investigated, following methods similar to the one introduced in [64].

The experimental results clearly show that the different parameters have a significant effect on the atom selection process and, correspondingly, on the subsequent feature extraction and classification performance. For example, when \lambda_1 = 1, \lambda_2 = 0 and \lambda_3 = 0, SRSC becomes a sequential version of the traditional LDA method. On the other hand, when \lambda_1 = 0, \lambda_2 = 1 and \lambda_3 = 1, SRSC becomes the traditional sparse representation. As reflected by the experimental results, a single parameter setting cannot achieve the best performance for all levels of signal corruption. It is therefore interesting to investigate the optimal values of the parameters for different situations. Investigating the optimal number of iterations for the pursuit algorithm is another interesting problem, since it also has a considerable effect on the classification accuracy.

CHAPTER 4

Sparse Representation for Signal Classification: An Optimization Approach

4.1 Introduction

The SRSC framework proposed in Chapter 3 presented a promising method for robust and sparse feature extraction from images for classification.
However, using Fisher's ratio to measure discrimination makes the objective function non-convex. This limitation makes it hard to find the optimal set of features. For this reason, an algorithm similar to matching pursuit was applied, selecting one atom at a time. This approach provides good classification accuracy, but it is still suboptimal. In this chapter, the SRSC method is improved by replacing Fisher's ratio with the class margin criterion used in the support vector machine to quantify the discrimination power. Using this new measure of discrimination, the SRSC problem can be formulated as a constrained optimization problem that can be iteratively optimized with two quadratic programming problems in each iteration. In order to further reduce the computational complexity, the SRSC problem is decomposed into two steps, with the first step being sparse reconstruction and the second step being sparse classification. For the sparse classification step, a new algorithm called Large Margin Dimension Reduction (LMDR) is proposed for obtaining sparse features. The formulation of LMDR incorporates the advantages of the L1-norm SVM [74] and distance metric learning [75] into one framework, using the idea of distance metric learning to search for an optimal linear transform of the original features and the idea of the L1-norm SVM to determine the significant feature components. In this way, LMDR can effectively identify the underlying features that compose the sparse feature representation for classification. LMDR is applicable to dimension reduction problems in general-purpose classification, and performs better than the L1-norm SVM and linear discriminative analysis (LDA) in certain circumstances.

The rest of this chapter is organized as follows. Section 4.2 briefly reviews the data representation model and defines the general notation used in this chapter. Section 4.3 proposes a new formulation of SRSC with the large margin method and its optimization with iterative quadratic programming. Dividing SRSC into two steps and employing LMDR in the second step are discussed in Section 4.4. Experiments and discussions are given in Section 4.5.

4.2 Data Model

A dictionary A ∈ R^{k×d} is a matrix in which each column corresponds to an atom. Usually the number of atoms is much larger than the dimension of each atom, i.e., k << d in traditional sparse representation problems. The set of observed signals Y = [y1, y2, ..., yn] is a set of k-dimensional (y_i ∈ R^k) training data with class labels C = {c1, c2, ..., cn} (c_i ∈ Z^+). Based on this training data set, we want to learn the representation of Y using selected atoms in A, such that Y is representable by the dictionary A and the representation also has some other desired properties. In this formulation, X = [x1, x2, ..., xn], and x_i ∈ R^d is the decomposition of y_i on the dictionary A. We can then use the decomposition coefficients X as features for classification. Mathematically, these goals can be formulated as a single objective function G(X) that needs to be optimized. For example, in the traditional sparse representation, G(X) includes both the signal reconstruction error and the sparsity measure. In the SRSC problem formulated in Chapter 3, G(X) includes the signal reconstruction error, the sparsity measure and a discrimination measure.
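As a rough illustration of this data model, the sketch below builds an overcomplete random dictionary and evaluates a G(X) of the traditional form (reconstruction error plus an ℓ1 sparsity surrogate). The dimensions, weights and data are arbitrary placeholders, not the settings used later in the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 16, 64, 40             # signal dimension, number of atoms, number of signals

A = rng.standard_normal((k, d))  # dictionary: one atom per column
A /= np.linalg.norm(A, axis=0)   # normalize atoms to unit length
Y = rng.standard_normal((k, n))  # observed signals (placeholder data)
X = rng.standard_normal((d, n))  # candidate decomposition coefficients

def g_traditional(X, lam=0.1):
    """Traditional sparse-representation objective: reconstruction error + sparsity."""
    recon_err = np.sum((Y - A @ X) ** 2)
    sparsity = np.sum(np.abs(X))          # l1 surrogate for the l0 count
    return recon_err + lam * sparsity

print(g_traditional(X))
```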
4.3 SRSC with Large Margin: An Optimization Model

The large margin principle used in the design of support vector machines (SVM) aims at maximizing the margin between the distributions of different classes. Given linearly separable labelled features from two classes, S = {(x1, c1), (x2, c2), ..., (xn, cn)} with c_i ∈ {−1, 1}, the hyperplane (w, b) realizes the maximum margin classifier by solving the following optimization problem:

\[
\min_{w,\,b} \ \langle w, w \rangle \quad \text{s.t.} \quad c_i(\langle w, x_i \rangle + b) \ge 1, \tag{4.1}
\]

where ⟨·,·⟩ is the inner product, i.e., ⟨w, x⟩ = wᵀx. An intuitive explanation of this formulation lies in the connection between w and the margin γ between the data distributions of the two classes: γ = 1/‖w‖₂. An illustration of the large margin principle is shown in Figure 4.1. In the SVM method, this optimization problem can be solved through the dual of equation (4.1) with quadratic programming.

4.3.1 SRSC with Large Margin: Formulation

Similar to Fisher's ratio, the margin between two classes provides a measure of feature discrimination. One advantage of large margin classification is that it usually has good generalization capacity, i.e., better performance than other classifiers on unseen data, as shown by statistical learning theory [76, 77]. This motivates the application of the large margin principle as the discrimination measure in SRSC, forming a new objective function for SRSC:

\[
\min_{w,\,b,\,x_i} \ \lambda_1 \langle w, w \rangle + \lambda_2 \sum_{i=1}^{n} \|x_i\|_1 + \lambda_3 \sum_{i=1}^{n} \|y_i - A x_i\|_2^2
\quad \text{s.t.} \quad c_i(\langle w, x_i \rangle + b) \ge 1, \tag{4.2}
\]

where λ1, λ2 and λ3 are non-negative weighting factors.

Figure 4.1. Illustration of the max-margin linear classifier.

The advantage of this optimization formulation is that it integrates feature extraction and classifier design into a single step. Once solved, the feature is sparse, robust to noise (by incorporating the reconstruction error) and discriminative (large margin). Since the ℓ1 norm is not differentiable, a common trick is to introduce two non-negative variables u_i and v_i: u_i accounts for the positive components of x_i, and v_i accounts for the absolute values of the negative components of x_i. With these two variables, x_i = u_i − v_i, ‖x_i‖₁ = 1ᵀ(u_i + v_i), u_i ≥ 0 and v_i ≥ 0. This conversion of the minimum-ℓ1 optimization to a linear representation is known as the "least absolute deviation" problem. Plugging this representation into equation (4.2), the problem can be formulated as:

\[
\begin{aligned}
\min_{w,\,b,\,u_i,\,v_i} \quad & \lambda_1 w^T w + \lambda_2 \sum_{i=1}^{n} 1^T(u_i + v_i) + \lambda_3 \sum_{i=1}^{n} [y_i - A(u_i - v_i)]^T [y_i - A(u_i - v_i)] \\
\text{s.t.} \quad & c_i[w^T(u_i - v_i) + b] \ge 1, \quad u_i \ge 0, \quad v_i \ge 0.
\end{aligned} \tag{4.3}
\]

Considering that the samples are not exactly linearly separable, slack variables ξ_i are introduced. The effect of ξ_i is illustrated in Figure 4.2. For the distribution shown in Figure 4.2, the data samples from the two classes are not linearly separable and the constraints c_i[wᵀ(u_i − v_i) + b] ≥ 1 cannot all be satisfied; therefore, the optimization problem has no solution. The slack variables ξ_i allow a controlled amount of misclassification and extend the applicability of the SVM to data that are not linearly separable. With the slack variables, the formulation can be rewritten as follows:

\[
\begin{aligned}
\min_{w,\,b,\,u_i,\,v_i,\,\xi_i} \quad & \lambda_1 w^T w + \lambda_2 \sum_{i=1}^{n} 1^T(u_i + v_i) + \lambda_3 \sum_{i=1}^{n} [y_i - A(u_i - v_i)]^T [y_i - A(u_i - v_i)] + \lambda_4 \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & c_i[w^T(u_i - v_i) + b] \ge 1 - \xi_i, \quad u_i \ge 0, \quad v_i \ge 0, \quad \xi_i \ge 0.
\end{aligned} \tag{4.4}
\]

Both the objective function and the constraints of this optimization problem are quadratic.

Figure 4.2. Illustration of the max-margin linear classifier with samples that are not linearly separable.
If there is only one constraint, the problem can be solved efficiently by the S-procedure (see Appendix B of [78]). However, the number of constraints is equal to the number of training samples, which is larger than one in all cases. Moreover, this is not a convex optimization problem: it can easily be verified that the constraint c_i[wᵀ(u_i − v_i) + b] ≥ 1 is not convex in the variables w, u_i, v_i and b. To solve this problem, we adopt the following alternating optimization approach.

4.3.2 SRSC with Large Margin: Iterative Optimization

The major source of difficulty in the optimization model of equation (4.2) comes from the objective function, which integrates feature extraction (deriving x_i from y_i) and classifier design (w and b are unknown) into a single step. Even though this integration has valid theoretical motivations, it introduces non-convexity into the optimization problem. To make the optimization tractable, an iterative optimization strategy is adopted. Mathematically, fixing the value of either w or X turns the problem into a quadratic programming problem, which is a convex and tractable optimization problem. The implementation of this iterative alternating optimization strategy can be formulated as:

1. Initialization: set w = w0 and b = b0, where w0 and b0 are random initial values.

2. Feature extraction step:

\[
\begin{aligned}
(u_i, v_i) = \arg\min_{u_i,\,v_i} \quad & \lambda_1 w^T w + \lambda_2 \sum_{i=1}^{n} 1^T(u_i + v_i) + \lambda_3 \sum_{i=1}^{n} [y_i - A(u_i - v_i)]^T [y_i - A(u_i - v_i)] + \lambda_4 \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & c_i[w^T(u_i - v_i) + b] \ge 1 - \xi_i, \quad u_i \ge 0, \quad v_i \ge 0.
\end{aligned} \tag{4.5}
\]

3. Update the parameters of the large margin classifier:

\[
\begin{aligned}
(w, b, \xi_i) = \arg\min_{w,\,b,\,\xi_i} \quad & \lambda_1 w^T w + \lambda_2 \sum_{i=1}^{n} 1^T(u_i + v_i) + \lambda_3 \sum_{i=1}^{n} [y_i - A(u_i - v_i)]^T [y_i - A(u_i - v_i)] + \lambda_4 \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & c_i[w^T(u_i - v_i) + b] \ge 1 - \xi_i, \quad \xi_i \ge 0.
\end{aligned} \tag{4.6}
\]

4. Repeat steps 2 and 3 until a given number of alternating steps is reached or the changes in the parameters w, b, ξ_i, u_i, v_i are less than a predefined value.

One practical problem with this approach is the computational complexity of the quadratic programming with the high-dimensional vectors u_i and v_i, which have dimension d, i.e., the number of atoms. Setting appropriate values for the four weighting parameters λ1 to λ4 is also a nontrivial problem. Even in the traditional sparse representation, where there is only one weighting parameter λ, setting its value is not trivial [64]. With an inappropriate parameter setting, the classification result is likely to deviate from the optimal case shown in the theoretical analysis. Preliminary numerical simulations confirm our concerns about the computational complexity of the proposed method. The simulation, conducted with the YALMIP optimization toolbox [79] running in the Matlab environment on a Pentium IV PC, needs more than 10 hours to optimize a data set of 80 samples with a dictionary of 32 atoms. These practical problems in the implementation motivate a more computationally efficient solution to the SRSC problem, as presented in the next section.
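For completeness, the feature-extraction step of equation (4.5) can be prototyped in a few lines with an off-the-shelf convex solver. The sketch below is a per-sample version with w and b held fixed; it assumes the cvxpy package and random placeholder data, and is not the YALMIP implementation used in the simulations above.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
k, d = 16, 32
A = rng.standard_normal((k, d))          # dictionary
y = rng.standard_normal(k)               # one training signal
c = 1                                     # its class label in {-1, +1}
w = rng.standard_normal(d)                # classifier weights fixed from the previous step
b = 0.0
lam2, lam3, lam4 = 1.0, 1.0, 1.0          # lambda1 * w'w is constant when w is fixed

u = cp.Variable(d, nonneg=True)
v = cp.Variable(d, nonneg=True)
xi = cp.Variable(nonneg=True)

objective = cp.Minimize(lam2 * cp.sum(u + v)
                        + lam3 * cp.sum_squares(y - A @ (u - v))
                        + lam4 * xi)
constraints = [c * (w @ (u - v) + b) >= 1 - xi]   # margin constraint of (4.5)
cp.Problem(objective, constraints).solve()
x = u.value - v.value                              # sparse code for this sample
```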
4.4 SRSC with Large Margin: A Two-Step Approximation

The optimization model and solution proposed in Section 4.3 for the SRSC problem have a solid theoretical foundation, but suffer from several difficulties at the implementation stage. The difficulties arise from the ambitious goal of integrating both signal reconstruction (decomposition of a signal over the dictionary) and dimension reduction (large margin and sparsity) into a single step, as shown in equation (4.4). To make the problem more tractable, in this section the two goals of signal reconstruction and dimension reduction are implemented separately in two steps. In the first step, sparse representation with a dictionary is used for denoising, so that noise and missing data do not affect the classification performance. In the second step, sparsity regularization with the large margin method is applied for feature selection and dimension reduction. Research on sparse reconstruction has been conducted extensively in the area of traditional sparse representation, such as [18, 19, 58, 59, 62]. In this section, the focus is on the second step, i.e., sparse feature selection and dimension reduction. For this purpose, the large margin dimension reduction (LMDR) method is proposed for the implementation of the second step. LMDR is based on combining the ideas of the L1-norm SVM and distance metric learning. In the rest of this chapter, Section 4.4.1 introduces the formulation and application of the L1-norm SVM; Section 4.4.2 reviews the idea of distance metric learning; Section 4.4.3 presents the proposed LMDR method; and finally Section 4.5 conducts experimental studies.

4.4.1 L1-norm Support Vector Machine

Suppose that the features and class labels are already available for a two-class supervised learning task. X = {x1, x2, ..., xn} ∈ R^{d×n} is the collection of features and C = {c1, c2, ..., cn} ∈ R^{1×n}, where c_i ∈ {−1, +1}, is the collection of labels. In SRSC, sparsity is enforced on the feature set X to derive sparse and discriminative features. One key observation is that enforcing sparsity on a feature vector x_i is equivalent to enforcing sparsity on the coefficient vector w of a linear classifier, as in the SVM. In the context of large margin linear classifiers, this is exactly what the L1-norm SVM [74] does. Formally, the L1-norm SVM can be formulated as the following optimization problem:

\[
\min_{w,\,b,\,\xi_i} \ \|w\|_1 + \lambda \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad c_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0. \tag{4.7}
\]

Compared to the standard SVM, the L1-norm SVM replaces the L2-norm, ‖w‖₂, with the L1-norm, ‖w‖₁, which enforces sparsity [17, 74]. In this setting, enforcing sparsity is equivalent to feature selection: if a component w_j of w has a value close to 0, the jth component x_{ij} of the feature vector is uninformative for classification and can be removed. This paradigm of feature selection has been applied to various applications, such as medical structure classification and content based image classification [80-82]. The modification from the L2-norm to the L1-norm also makes the optimization problem solvable by linear programming, whereas quadratic programming is required to solve the L2-norm SVM problem. Note that with this feature selection strategy, kernel methods cannot be used, since there is no one-to-one correspondence between the feature components and the weighting components in the kernel method [80-82]. To see this clearly, note that for the kernel method the decision function is given by wᵀφ(x_i) + b, where φ(·) is the feature map induced by the kernel. Therefore, the components of w do not directly correspond to the components of the feature vector x_i.
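Using the u-v split described in Section 4.3.1, equation (4.7) becomes a standard linear program. The following sketch solves it with SciPy's linprog on toy random data; the variable stacking and data are illustrative assumptions, not the solver used in the experiments of this chapter.

```python
import numpy as np
from scipy.optimize import linprog

def l1_svm(X, c, lam=1.0):
    """Solve the L1-norm SVM of equation (4.7) as a linear program.

    X: (n, d) feature matrix, c: labels in {-1, +1}.
    Decision variables are stacked as [u (d), v (d), b (1), xi (n)], with w = u - v.
    """
    n, d = X.shape
    cost = np.concatenate([np.ones(2 * d), [0.0], lam * np.ones(n)])
    # Margin constraints c_i(w^T x_i + b) >= 1 - xi_i, written as A_ub z <= b_ub.
    A_ub = np.hstack([-c[:, None] * X, c[:, None] * X, -c[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(0, None)] * (2 * d) + [(None, None)] + [(0, None)] * n
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    z = res.x
    w = z[:d] - z[d:2 * d]
    b = z[2 * d]
    return w, b

# Toy usage with random data (illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))
c = np.sign(X[:, 0] + 0.1 * rng.standard_normal(40))
w, b = l1_svm(X, c)
print(np.round(w, 3))   # most components are driven toward zero
```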
4.4.2 Distance Metric Learning

The dimension reduction achieved with the L1-norm SVM works through feature selection with the classifier coefficients, not through any transform of the original feature space. However, feature selection in a transformed feature space may yield better results in terms of dimension reduction. To see this point, consider the simple example illustrated in Figure 4.3, where the distributions of two-dimensional features from two classes are plotted. Figure 4.3 (a) shows the original distribution of the two classes, and Figure 4.3 (b) is obtained by rotating the feature values by 45 degrees in the 2D space. For the distribution shown in Figure 4.3 (a), the L1-norm SVM will select both feature components, which are almost equally important for classification. After the rotation, however, the L1-norm SVM will select only one feature component in the new representation, since the feature in the direction of x1 is much more important than x2.

The idea of applying a transform to the original features before selection and classification is widely adopted in machine learning and pattern recognition. Typical examples of feature transforms include PCA, LDA and, more recently, distance metric learning [75, 83-85]. The basic idea of distance metric learning is to search for a linear transform of the original features such that, in the transformed feature space, samples from the same class are close to each other and samples from different classes are far away from each other. In this sense, distance metric learning shares the same motivation as LDA. The difference is that distance metric learning adopts a different mathematical formulation to achieve this goal [75]:

\[
\begin{aligned}
\min_{A} \quad & \sum_{(x_i, x_j) \in S} (x_i - x_j)^T A (x_i - x_j) \\
\text{s.t.} \quad & \sum_{(x_i, x_j) \in D} (x_i - x_j)^T A (x_i - x_j) \ge 1, \quad A \succeq 0.
\end{aligned} \tag{4.8}
\]

The requirement that A ⪰ 0, i.e., that A is positive semi-definite, is necessary since negative distances between feature vectors are not allowed. In this formulation, S is the set of feature pairs (x_i, x_j) from the same class and D is the set of feature pairs (x_i, x_j) from different classes. A is a positive semi-definite matrix to be optimized such that, in the transformed space, samples from the same class are close to each other, subject to the constraint that the distances between samples from different classes are larger than a fixed value. Denoting A = BᵀB, where B = A^{1/2}, the distance between two feature vectors can be represented as:

\[
(x_i - x_j)^T A (x_i - x_j) = \big(B(x_i - x_j)\big)^T \big(B(x_i - x_j)\big). \tag{4.9}
\]

Figure 4.3. Illustration of the effect of a transform on feature selection: (a) the original data distribution from two classes; (b) the distribution of the data after rotation.

Therefore, the distance metric learning problem is equivalent to finding a linear transform of the original features with the matrix B and then using the Euclidean distance on the transformed features. Application of the linear transform B can be viewed as extracting new features from the original feature space. The successful application of distance metric learning to general classification and clustering tasks, such as classifying data sets from the UCI repository [86], indicates that transforming the original features helps to derive more discriminative features. This property is useful for dimension reduction.
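The identity in equation (4.9) is easy to verify numerically. The sketch below computes B = A^{1/2} for a hypothetical learned metric via an eigendecomposition and checks that the metric distance equals the Euclidean distance after applying B; the metric values are placeholders.

```python
import numpy as np

def metric_sqrt(A):
    """Matrix square root B of a positive semi-definite metric A, so that A = B^T B."""
    eigval, eigvec = np.linalg.eigh(A)
    eigval = np.clip(eigval, 0.0, None)        # guard against tiny negative eigenvalues
    return eigvec @ np.diag(np.sqrt(eigval)) @ eigvec.T

# A hypothetical learned metric on 2-D features (placeholder values).
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
B = metric_sqrt(A)

xi = np.array([1.0, 2.0])
xj = np.array([0.0, 1.0])
d_metric = (xi - xj) @ A @ (xi - xj)            # left-hand side of (4.9)
d_euclid = np.sum((B @ (xi - xj)) ** 2)          # right-hand side of (4.9)
print(np.isclose(d_metric, d_euclid))            # True
```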
4.4.3 Large Margin Dimension Reduction

As discussed in Sections 4.4.1 and 4.4.2, the L1-norm SVM and distance metric learning have different properties regarding feature selection and transformation. The L1-norm SVM can identify useful feature components but operates only on the original feature space. Distance metric learning searches for a linear transform B ∈ R^{d×d} that optimizes the feature distribution in the transformed space but does not reduce the feature dimension. An intuitive generalization is to incorporate dimension reduction into distance metric learning by assuming B ∈ R^{d×d'}, where d' < d. However, determining an appropriate value for d' is data dependent and nontrivial.

In this section, the large margin dimension reduction (LMDR) algorithm is proposed to combine the advantages of both the L1-norm SVM and distance metric learning. The basic idea is to search for a linear transform of the original feature space and then apply the L1-norm SVM in the transformed feature space for dimension reduction. Formally, this problem can be formulated as follows:

\[
\begin{aligned}
\min_{w,\,b,\,B,\,\xi_i} \quad & \|w\|_1 + \lambda \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & c_i(w^T z_i + b) \ge 1 - \xi_i, \quad z_i = B x_i, \quad \xi_i \ge 0,
\end{aligned} \tag{4.10}
\]

where B ∈ R^{d×d} and z_i = Bx_i is the feature in the transformed space. Selection of the components of z_i is based on the values of w, i.e., components corresponding to w_j ≈ 0 are removed.

Note that this formulation is ill-posed. Given a solution w' and B' to the optimization problem, a better solution can always be found by letting w'' = w'/a and B'' = aB', where a > 1. The solution (w'', B'') satisfies the constraints but has a smaller ℓ1 norm. One direct solution to this problem is to enforce regularization on B. Since the ambiguity is caused by a scaling factor, a constraint on the norm or trace of B can be applied. If the trace of B is normalized, the problem can be formulated as follows:

\[
\begin{aligned}
\min_{w,\,b,\,B,\,\xi_i} \quad & \|w\|_1 + \lambda \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & c_i(w^T z_i + b) \ge 1 - \xi_i, \quad z_i = B x_i, \quad \operatorname{trace}(B) = 1, \quad \xi_i \ge 0.
\end{aligned} \tag{4.11}
\]

Compared with the L1-norm SVM, LMDR is able to generate lower dimensional, i.e., sparser, features with the help of the optimized linear transform B. Compared with distance metric learning, LMDR incorporates the L1-norm SVM to determine the dimension of the low dimensional feature. Another advantage of LMDR over distance metric learning lies in the measure of separation in the transformed feature space. For distance metric learning, the measure is the total distance between all sample pairs from different classes and from the same class; distance metric learning does not include a scheme to detect outliers that may exist in the data distribution. On the other hand, by using the margin between the classes and the slack variables ξ_i to deal with outliers, LMDR has more generalization capacity than the distance metric learning approach. Samples with large values of ξ_i can be detected as outliers and excluded from training.

It is also interesting to compare LMDR with the general LDA, which derives low dimensional feature representations by maximizing the ratio of inter-class scatter to intra-class scatter. In this sense, LDA also provides a method for simultaneous feature extraction and selection. In LDA, feature extraction is performed through a matrix containing the eigenvectors of S_w^{-1} S_B, where S_w is the intra-class scatter matrix and S_B is the inter-class scatter matrix [10]. Feature selection is based on the relative magnitudes of the eigenvalues of S_w^{-1} S_B.
By using second order statistics of the data samples for dimension reduction, LDA assumes a Gaussian distribution for the original feature space. When the Gaussianity assumption is violated, which often happens in real applications, the performance of LDA suffers. On the other hand, LMDR does not enforce any assumptions on the distribution of the underlying data, which extends the applicability of LMDR. Another difference between LDA and LMDR lies in the discrimination measure. The margin used in LMDR as the discrimination measure has been shown to have better generalization capacity, as demonstrated for the general SVM [76, 77].

The problem with the optimization formulation of LMDR is that the constraints c_i(wᵀz_i + b) ≥ 1 − ξ_i are not convex, since the optimization variables w and B are entangled. Here we adopt a simple alternating optimization algorithm that breaks the entanglement between w and B as follows.

1. Initialize B = B0, where the columns of B0 are the generalized eigenvectors of the equation S_B x = λ S_w x. This means that the transform matrix is initialized by the result of the generalized LDA problem.

2. Search for the large margin classifier given B = B0:

\[
\begin{aligned}
(w_0, b, \xi_i) = \arg\min_{w,\,b,\,\xi_i} \quad & \|w\|_1 + \lambda \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & c_i(w^T z_i + b) \ge 1 - \xi_i, \quad z_i = B_0 x_i, \quad \xi_i \ge 0.
\end{aligned} \tag{4.12}
\]

3. Search for the transform matrix given w = w_0:

\[
\begin{aligned}
(B_1, b, \xi_i) = \arg\min_{B,\,b,\,\xi_i} \quad & \|w_0\|_1 + \lambda \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & c_i(w_0^T z_i + b) \ge 1 - \xi_i, \quad z_i = B x_i, \quad \operatorname{trace}(B) = 1, \quad \xi_i \ge 0.
\end{aligned} \tag{4.13}
\]

4. Set B0 = B1 and repeat steps 2 and 3 until the change in w and B is less than a threshold value or a given number of iterations has been reached.

The optimization formulated in equation (4.13) is clearly a linear programming problem. The problem formulated in equation (4.12) is also a linear programming problem. To see this, let w = u − v, where u ∈ R^{d×1}, v ∈ R^{d×1}, u ≥ 0 and v ≥ 0; then ‖w‖₁ = 1ᵀ(u + v). Each solution of u and v corresponds to a solution of w. The same trick was previously adopted in the basis pursuit algorithm [17] and in solving the L1-norm SVM with linear programming [74, 80].

4.5 Experiments

4.5.1 Experiments with Synthetic Data

We start with a simple synthetic numerical example. Two classes of two-dimensional data are generated, as illustrated in Figure 4.4 (a). Denote each 2D data sample as [x1, x2]. Given 0 ≤ x1 ≤ 5, samples are generated as x2 = x1 + a + w for the first class and x2 = x1 − a + w for the second class, where a = 2.1 is a constant shift and w is white Gaussian noise with mean 0 and standard deviation 1. The data distribution is designed such that both feature components x1 and x2 play equally important roles in classification. Running the L1-norm SVM on this data set results in w = [−0.5585, 0.6619]ᵀ, which indicates that we cannot reduce the dimension of this feature set: both dimensions have similar weights. On the other hand, running the iterative large margin dimension reduction of equations (4.12) and (4.13) for two iterations results in the following solution:

\[
B = \begin{bmatrix} -2.0463 & 2.1336 \\ 2.1336 & 3.0463 \end{bmatrix}, \qquad
w = \begin{bmatrix} 1.5338 & 0 \end{bmatrix}^T. \tag{4.14}
\]

Applying B to the original features yields the feature distribution shown in Figure 4.4 (b). As a comparison, the B0 obtained by the generalized LDA and the corresponding w from the L1-norm SVM are:

\[
B_0 = \begin{bmatrix} -0.1060 & 0.0673 \\ 0.1059 & 0 \end{bmatrix}, \qquad
w = \begin{bmatrix} 4.4456 & 0 \end{bmatrix}^T. \tag{4.15}
\]

The feature distribution obtained by applying B0 to the original features is shown in Figure 4.4 (c).
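For reference, the two-class synthetic set described above can be generated with a few lines of NumPy. This is only a sketch of the stated construction (uniform x1 on [0, 5], shift a = 2.1, unit-variance Gaussian noise); the sample count is a placeholder and the snippet is not the exact script used for Figure 4.4.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, a = 100, 2.1

x1 = rng.uniform(0.0, 5.0, size=(2, n_per_class))
noise = rng.normal(0.0, 1.0, size=(2, n_per_class))

class1 = np.column_stack([x1[0], x1[0] + a + noise[0]])   # x2 = x1 + a + w
class2 = np.column_stack([x1[1], x1[1] - a + noise[1]])   # x2 = x1 - a + w

X = np.vstack([class1, class2])                            # features [x1, x2]
c = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
```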
Applying the transform B to the synthetic data set results in the data distribution shown in Figure 4.4 (b). From this figure, it is clear that in the transformed feature space the first feature component x1 is much more important for classification than the second component x2. At the same time, the value of w correctly reflects this difference between the two feature components. In this case, the feature dimension can be effectively reduced to 1 based on the value of w.

From these numerical results and the distributions shown in Figure 4.4, it is clear that both LDA and LMDR are effective for dimension reduction. From the coefficients of the L1-norm SVM, as shown in equations (4.14) and (4.15), it is seen that both methods can reduce the feature dimension from 2 to 1. In order to compare the classification performance in the reduced dimensional space, the features are projected onto the 1D space and the minimum Bayesian classification errors are determined. As shown in Figure 4.4 and indicated by equations (4.14) and (4.15), the x1 feature component is retained for both the LMDR and the generalized LDA based method, and the x2 component is discarded. In this setting, 2 samples (1.96%) are misclassified by the LMDR method and 5 samples (2.90%) are misclassified by the generalized LDA method. This comparison shows that the optimization steps in LMDR improve the distribution of the features in the low dimensional space.

4.5.2 Experiments with the UCI Data

Experiments are also conducted on two data sets, "iris" and "glass", from the UCI data repository [86]. The properties of the two data sets are summarized in Table 4.1. For comparison purposes, linear discriminative analysis (LDA) is implemented for dimension reduction. Since LDA's dimension depends directly on the number of classes, it can only derive up to C − 1 directions for feature projection, where C is the number of different classes in the training set. This means that LDA can generate up to 2-dimensional features for "iris" and 5-dimensional features for "glass".

The experimental setting is as follows. The large-margin dimension reduction is based on a scenario of pairwise classification, as indicated by the constraints in equations (4.12) and (4.13). The generalization of LMDR to the case of multiple classes follows the same method as the general SVM, where pairwise classification ("1 vs 1") or setting one class as the positive label and the others as negative ("1 vs rest") is commonly used. In our method, LMDR based dimension reduction and classification are conducted pairwise: based on the data samples from each pair of classes c1 and c2, a large margin dimension reduction scheme is trained, and feature selection and classification are based on the classifier built on this pairwise data. For a data set with C different classes, a total of C(C − 1)/2 dimension reduction schemes and classifiers are built. Each classifier assigns either c1 or c2 to each sample in the testing set. The final class label of a testing sample is determined by a simple majority vote over the class labels, as sketched below. For the large-margin dimension reduction, the feature is first subjected to the linear transform B and the feature components are then selected by the magnitudes of the components of w. In the training process, the parameter λ in equations (4.12) and (4.13) is set by cross validation. The iterative optimization algorithm for the large-margin dimension reduction is run for six iterations each time.
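The pairwise ("1 vs 1") voting just described can be summarized by the following sketch. The pairwise classifier here is a simple nearest-class-mean placeholder standing in for the LMDR-based classifier, so only the voting logic is illustrated.

```python
import itertools
import numpy as np

def train_pair(X, y):
    """Placeholder pairwise classifier (nearest class mean)."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    return classes, means

def predict_pair(model, X):
    classes, means = model
    dists = np.stack([np.linalg.norm(X - means[c], axis=1) for c in classes])
    return classes[np.argmin(dists, axis=0)]

def one_vs_one_predict(X_train, y_train, X_test):
    """Train C(C-1)/2 pairwise classifiers and combine them by majority vote."""
    classes = np.unique(y_train)
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    for i, j in itertools.combinations(range(len(classes)), 2):
        mask = np.isin(y_train, classes[[i, j]])
        model = train_pair(X_train[mask], y_train[mask])
        pred = predict_pair(model, X_test)
        for k, c in enumerate(classes):
            votes[:, k] += (pred == c)
    return classes[np.argmax(votes, axis=1)]

# Toy usage with three Gaussian blobs (illustration only).
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(m, 1.0, size=(20, 2)) for m in (0, 4, 8)])
y_train = np.repeat([0, 1, 2], 20)
print(one_vs_one_predict(X_train, y_train, X_train)[:10])
```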
Half of the data set is randomly selected for training and the other half is used for testing. The KNN classifier with the Euclidean distance is used for classification. The number of nearest neighbors, the "K" in "KNN", is set to a quarter of the number of training samples in each pairwise classifier.

Figure 4.4. Feature distributions: (a) original features; (b) features obtained with LMDR; (c) features obtained with the generalized LDA.

Table 4.1. The properties of the data sets "iris" and "glass"

Data set   Number of classes   Samples per class   Feature dimension
Iris       3                   50                  4
Glass      6                   from 9 to 76        9

Given the above experimental settings, the classification error rates at the different dimensions selected by LDA and the large-margin dimension reduction are plotted in Figure 4.5, where (a) is the result on the "iris" data set and (b) is the result on the "glass" data set. The large-margin dimension reduction method performs differently on the two data sets. For the "iris" data set, the performance of LDA is comparable with that of the large-margin dimension reduction. For the "glass" data set, LMDR performs better than LDA. The performance difference on the two data sets can be explained by looking at the data sample distributions, as illustrated in Figure 4.6, which plots the distributions of "iris" and "glass" in the 2D space obtained by LDA. From this figure, it is clear that the data samples in "iris" are clearly separated. Another important observation is that the distribution of each class in the 2D space can be roughly approximated by a Gaussian distribution. Therefore, the distribution of "iris" satisfies LDA's assumption on the data distribution quite well, which explains why LDA performs well on the "iris" data. On the other hand, Figure 4.6 (b) shows that the distribution of the "glass" data is more complex than that of the "iris" data, in the sense that the different classes are not well separated and the distribution of each class cannot be accurately approximated as a Gaussian distribution. In this case, the Gaussianity assumption required by LDA does not hold, and classification based on LDA therefore yields high error rates. Since LMDR does not assume any prior information on the data distribution and uses the class margin as the measure of separation, it yields much better classification performance on the "glass" data set.

Figure 4.5. The relation between classification error rate and feature dimension: comparison between LDA, L1-norm SVM and large-margin dimension reduction on (a) the "iris" and (b) the "glass" data sets.
Figure 4.6. The projection of the "iris" and "glass" data sets onto the 2D space obtained by LDA.

4.5.3 Experiments with the Texture Data

Experiments are also conducted on texture classification with texture images from the Brodatz data set [56]. Images from 10 texture classes, with 16 images per class, are used for classification. Half of the images are used for training and the other half for testing. The experiments are conducted by decomposing each image into 3 wavelet levels, so that each image has a feature vector of 85 dimensions. Based on the experiments conducted in Chapter 2, the features from most of the subbands are noise, in the sense that only a small portion of the features is needed to achieve a relatively low classification error rate. The implication of this phenomenon is two-sided. First, the involvement of a large number of noisy feature components greatly interferes with the optimization process of the LMDR method. Second, the computational complexity of the optimization process is increased. To tackle these problems, we propose the combination of the L1-norm SVM and LMDR for high dimensional feature selection and transformation. For a high dimensional feature, the L1-norm SVM is first applied to identify the important feature components and discard the noisy ones; the LMDR is then applied to the resulting low dimensional feature. Applying the LMDR directly to the 85-dimensional features of this data set results in a very high classification error rate (higher than 70%), due to the interference of the noisy features.

In Figure 4.7, the curve "LMDR" is obtained by first applying the L1-norm SVM to select 35 features from the 85 features and then applying the LMDR to the selected 35 features. Features of different dimensions are then selected by the value of w in equation (4.11). The curve "L1SVM+LMDR" is obtained by first applying the L1-norm SVM to select n features from the 85 features and then applying the LMDR to the selected n features, generating n-dimensional features in the transformed space. In this case, the value of n changes from 5 to 35, with an increment of 5. From this figure, the performance of "LMDR" is worse than that of the L1-norm SVM. As discussed in Chapter 2, the 35 features selected by the L1-norm SVM contain a considerable amount of noise that misleads the LMDR process. On the other hand, "L1SVM+LMDR" always performs better than "LMDR", and better than "L1SVM" in the low dimensional space. This illustrates that searching for an optimal linear transform of the features can substantially improve the classification accuracy. The fact that "L1SVM+LMDR" does not perform better than "L1SVM" in the high dimensional space illustrates that noisy features can interfere with the optimization of LMDR and deteriorate its performance. In this experimental setting, the minimum classification error rate is achieved by "L1SVM+LMDR" at a dimension of 15. Therefore, the combination of the L1-norm SVM and LMDR is an effective strategy for dimension reduction on high dimensional features. The advantage is two-sided: removing the noisy features and reducing the computational complexity.
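The screening stage of this two-stage strategy can be sketched as follows, using scikit-learn's l1-penalized linear SVM as a stand-in for the chapter's L1-norm SVM; the subsequent LMDR step is not reproduced here, and the data are random placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def l1_screen(X, y, n_keep):
    """Keep the n_keep feature components with the largest |w| from an
    l1-penalized linear SVM (a stand-in for the L1-norm SVM of equation (4.7))."""
    clf = LinearSVC(penalty="l1", dual=False, C=1.0, max_iter=10000).fit(X, y)
    w = np.abs(clf.coef_).max(axis=0)       # largest absolute weight per feature
    keep = np.argsort(w)[::-1][:n_keep]
    return np.sort(keep)

# Toy usage: 85-dimensional random features with two informative components.
rng = np.random.default_rng(0)
X = rng.standard_normal((160, 85))
y = (X[:, 3] + X[:, 40] > 0).astype(int)
print(l1_screen(X, y, n_keep=10))
```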
4.6 Summary

In this chapter, the SRSC problem proposed in the previous chapter is reformulated as a constrained optimization problem by using the large margin as the measure of discrimination. The formulation has a solid theoretical foundation but faces practical numerical problems, since multiple steps of iterative quadratic programming are involved. To deal with this problem, the SRSC is divided into two steps, with the first step being sparse reconstruction and the second step being sparse classification. For the sparse classification, the large margin dimension reduction (LMDR), which incorporates the L1-norm SVM and distance metric learning, is proposed. An efficient optimization algorithm based on iterative linear programming is proposed to solve the constrained optimization formulation. The proposed algorithm inherits the advantages of distance metric learning for finding the optimal linear feature transform and of the L1-norm SVM for automatically determining the dimensionality of the low-dimensional space. Experimental results on the UCI data sets show that, compared with the LDA method, the proposed LMDR method achieves comparable dimension reduction results on data samples following a Gaussian distribution and better results on data samples that do not follow a Gaussian distribution.

Figure 4.7. The relation between classification error rate and feature dimension: comparison between the L1-norm SVM, LMDR and the combination of L1-norm SVM and LMDR on the texture data.

CHAPTER 5

Feature Evaluation with Information Theoretic Measures

5.1 Introduction

All of the previous chapters in this dissertation deal with obtaining a low dimensional (sparse) feature set that is discriminative. In order to show that the features derived from the proposed methods are discriminative, empirical experiments are conducted to evaluate the "goodness" of different feature representations. This chapter discusses the problem of feature evaluation and proposes a new method for feature evaluation based on information theoretic measures that can be computed efficiently.

Traditional pattern recognition problems usually include two consecutive steps: feature extraction and classification. Feature extraction aims at generating a compact and discriminative description of the data in terms of a feature vector. Multiple features describing different properties of the sample data can be extracted. For example, color and texture features can be extracted from an image for content based image retrieval. Therefore, there is a need to evaluate the different features. For example, in face recognition, features based on different linear transforms, such as PCA and LDA, have been widely investigated [11, 66]. At this point, it is important to note that although the problem of feature evaluation shares some common characteristics with feature selection [31], the two methods differ in their basic goal. Feature selection evaluates and selects a subset of the feature elements, while feature evaluation quantifies the performance of different feature vectors. Taking the color and texture features of an image as an example, feature selection evaluates the different components of the color feature, such as red, green and blue, while feature evaluation focuses on the performance of the color feature versus the performance of the texture feature.
Feature evaluation is usually conducted by empirical testing, i.e., by measuring the classification accuracy on given testing data sets. Feature evaluation with empirical tests can be seen in various applications, such as text categorization [87, 88] and image retrieval [89]. However, this paradigm of feature evaluation does not depend only on the feature itself. The results of empirical tests are often subject to the distance metric, which determines how to measure the similarity between two features, and the choice of the classifier, which determines how to assign class labels. The impact of the distance metric on classification results is so substantial that distance metric learning, which learns a suitable distance metric for a specific classification task, has recently attracted a lot of attention [90-92]. It is also clear that the choice of classifier and the parameter settings within a given classifier have a strong impact on classification results. Various examples of how the classifier affects classification results can be found in the literature, such as the experimental results on the benchmark UCI data [86] in a recent paper [93]. Feature evaluation based on empirical tests is complicated by the fact that tuning the distance metric and the classifier may benefit certain features and degrade the performance of others. Overly tuning either the distance metric or the classifier parameters can result in a distorted evaluation of different features. One interesting discussion of empirical tests can be found in [94], which shows an extreme case: given any algorithm, a data set can be designed such that the algorithm outperforms other algorithms on this data set in terms of classification accuracy. This can also be interpreted as meaning that even a "bad" feature in the general sense can result in a high classification accuracy with certain classification algorithms.

The limitations of the empirical test for feature evaluation motivate the need for an alternative feature evaluation method that is independent of the distance metric and the classifier. Fisher's ratio, used in linear discriminative analysis (LDA), is one such choice [10]. Fisher's ratio measures the ratio of the inter-class scatter to the intra-class scatter. For samples drawn from Gaussian distributions, a large value of Fisher's ratio indicates a large distance between samples from different classes and a small distance among samples from the same class. In [95], the mutual information (MI) between the class label and the features is computed for feature selection. A higher MI value indicates that certain feature components provide more information about the class label. This selection measure can also provide a criterion for feature evaluation. However, the computational model for estimating the MI value proposed in [95] was only applied to 1D and 2D features and may be difficult to adapt to high dimensional features.
An uncorrelated linear discriminative analysis (ULDA) [96] based MI computational model is proposed such that the evaluation criterion is applicable for high dimensional features. The proposed method avoids the empirical test and thus is independent of the distance metric and classifier. Therefore, the empirical test, when combined with the proposed MI based measure, can provide a more complete and accurate criterion for feature evaluation. The rest of this chapter is organized as follows. Section 5.2 reviews the MI and Fano’s inequality that motivates the usage of MI as the criterion for feature evalu- ation. Section 5.3 analyzes the existing computational models for MI computation and proposes a new MI computation model based on uncorrelated linear discrimina- tive analysis. Section 5.4 conducts extensive experiments to test the validity of the proposed MI based feature evaluation method. Finally, Section 5.5 concludes this chapter and discusses the findings. 91 5.2 Quantitative Evaluation of Features with Mu- tual Information In the following analysis, the feature is modelled as a random vector: X E X C R“ and the class label is modelled as a discrete random variable: C E C C Z1. The joint distribution Of X and C is described by p(:r,c). The MI between X and C is defined as: I(X;C)= Z 2190‘ C) 103 p‘J(x)p' — EXC [log p“ l xEX CEC P(X)P( (C) (5.1) = 1703 (x,C)llp(X)P (C)). where D(o||o) is the relative entropy or Kullback-Leibler distance. When the log- arithm function in the definition uses a base of 2, the unit for the I (X ;C) is bits. MI is symmetric in X and C, nonnegative, and is equal to zero if and only if X and C are independent. I (X; C) indicates how much information X conveys about C. Given X, the extra information required to fully describe C is given by the conditional entropy H (C IX) [48]. I(X; C) = H(C) — H(CIX), (5.2) where H (C) is the entropy of the random variable C. Given this equation, mutual information computation is equivalent to the computation Of two entropy values. Similar to the definition of conditional entropy, the conditional MI between random variables X and C given a new random variable Z is defined as: I(X; CIZ) = H(XIZ) — H(X|C’, Z) p,x C Z = F(xvcvz) [logp p(X]Z)p(C]Z) l (5.3) The conditional MI has an interpretation similar to that of MI. Given the above definitions, the mutual information between a random vector X = [X1,X2, ...,Xd] and a random variable C can be defined as: 92 I(Xa C) = I(X11X21 "')Xd; C) d (5.4) = ZI(X;;CIX,‘_1,X{_2,...,Xl). i= 1 5.2. 1 Fano’s Inequality Intuitively, I (X; C) indicates how much information the feature X contains about the class label C. The larger I (X; C) is, the more accurate is the estimation Of the class label C from the feature X. Mathematically, the relation between the MI and the probability of misclassification is given by Fano’s inequality [48], as follows: H(CIX) —1 _ H(C) - I(X;C) —1 105(N) _ 103(Nc) P(C aé 0’) 2 (56) where NC is the number of possible values of C (i.e., the number of classes), and C’ is the estimation of C based on the feature X predicted by a classifier. For a given classification task, H (C) and log(Nc) are fixed. Thus, a large value of I (X; C) directly results in a small value of P(C 75 C’). Note that Fano’s inequality is not confined to any specific type of classifier or distance metric and is a suitable measure for feature evaluation. 
Due to the connection between P(C 75 C’) and I (X; C) revealed by Fano’s in- equality, MI has been widely applied to feature extraction [50, 51,97] and feature selection [98,99]. At this point, it should be noted that MI based feature extraction differs from the MI based feature evaluation. In the former case, the feature is un- known and chosen based on MI. In this chapter, the feature is already extracted and MI is used for the evaluation purposes. For MI based feature extraction, a distance metric and the classifier are usually known a priori to evaluate the effect of feature extraction, thus providing feedback for parameter adjustment for the extraction pro- cess. For MI based feature evaluation discussed in this chapter, no distance metric and classifier are involved. 93 X. _ f(.) T p(ylt) 1. Figure 5.1. MI computation with dimension reduction. The original feature X is in a high-dimensional space and the transformed feature Y is in a low-dimensional space. 5.3 Mutual Information Computation The difficulty Of MI computation lies in estimating the joint probability density func- tion (pdf), or equivalently, the conditional pdf between C and X, especially in a high dimensional space with limited number of data samples. A variety Of methods aiming at reducing the complexity by dimension reduction have been proposed. For example, in [100,101], the relationship between the different components of X is modelled by the first order Markov chain that simplifies the MI computation, as follows: I(Xilchi—13X1—2)"',X1) = I(Xz'; ClXi—1)i (5-6) A more general approach for dimension reduction is searching for a transform T(o), such that Y = T(X) E l7 C R“, that satisfies (1’ < d. Rather than computing I (X; C), I (Y; C) is computed as an approximation for I (X; C). This approach has been adopted in a variety of applications, such as in [38,49—51,102]. This transform process can be illustrated in Figure 5.1. With this transform, a Markov chain is formed as: X —+ Y —> C. According to the data processing theo- rem [48], I (X; C) 2 I (Y; C). The equality holds if and only if Y is the sufficient statistics Of X. Based on this ineqaulity, the transform T(o) that maximizes I (Y; C) should be chosen so that the error in the approximation is minimized. 94 A common method for solving this dimension reduction problem is to first as- sume a parametric form of T(o) and then optimize the parameters for T(o) such that I (Y; C) is maximized. For example, the linear transform Y = WX, where W 6 Rd,“ is the transformation matrix, is Often used for its simplicity. In order to esti- mate the Optimal value for W, I (Y; C) is explicitly expressed as a function of W and the standard optimization techniques, such as gradient descent, are then used to get an optimal value for W. This paradigm for searching for T(e) is adopted in [49—51] for feature extraction, i.e., deriving an informative low dimensional feature Y from a high dimensional feature X. In [49—51], the pdf p(Y, C) is modelled with the Parzen window method with a Gaussian kernel. Thus, p(Y,C) is written as a mixture of multiple Gaussian functions, which is non-convex. The combination of non-convexity and the gradient descent method used in [49—51] can only guarantee a locally Optimal solution for W. Due to the complexity in Optimizing W, simplified transforms are used in some applications. For example, the equal weight model where W = 7:10pm, is used for computing MI between wavelet coefficients in [38]. 
Note that the goal Of the transform T(e) is dimension reduction. A small value Of d’ reduces the complexity Of MI computation. However, a larger value Of d — d’ usually indicates a larger difference between I (X; C) and I (Y; C), since Y is not a sufficient statistics of X in the general case. For feature extraction, the classifi- cation is conducted with this low dimensional feature Y. Therefore, the large dif- ference between I (X; C) and I (Y; C) is not a severe problem, as long as I (Y;C) can be maximized. However, this estimation error in MI is a problem for feature evaluation. Given two features, X1 6 Rd and X2 6 Rd and their low dimen- sional representations Y1 E R“, and Y; 6 Rd], the reliability of concluding that I (X1; C) > I (X2; C) based on I (Y1; C) > I (Y3; C) is influenced by the error intro— duced by the inaccuracy of the computational model. The tradeoff between the computational complexity and the error in MI estimation caused by the value Of d’ are due to the statistical dependence among different com- ponents Of Y, i.e., Y1, Y2, ...de. If these components were statistically independent, the computation of I (Y; C) would become much easier: 95 ,, I(Y; 0) = 2101-; C) (5.7) If equation (5.7) holds, the complexity of computing MI does not substantially increase with an increase in (1’. Therefore, a large value of d’ can be chosen to reduce the error between I (X; C) and I (Y; C) and the computation of the MI is still tractable. Generally, it is difficult to search for a transform that results in statistical independence among the different components of Y. An approximate solution is to search for a linear transform that results in uncorrelated projections, i.e, cov(Y,-, Y3) = 0 for 1 S i 79 j g (1’. Since statistical uncorrelation is equivalent to statistical independence only when Y is distributed as a Gaussian, which is not satisfied in the general case, equation (5.7) does not strictly hold and should be rewritten as: ,. I(Y; C) s 2 104-; C) (5.8) The approximation error depends on how the distribution of Y deviates from a Gaussian distribution. As an approximation, this can be Obtained by measuring the Gaussianity Of the individual components Of Y, i.e, Y,- for 1 g i S d’. The measure Of the Gaussianity Of a random variable can be conducted by computing the kurtosis or negentropy Of the variable, as in the method of independent component analysis (ICA) [103]. Motivated by the above discussion, the uncorrelated linear discriminative analysis (ULDA) is chosen to Obtain the transform matrix W in this dissertation. ULDA was proposed in [104] and improved in [96] as a generalized linear discriminative analysis method that projects a feature into statistically uncorrelated feature components. The improved ULDA is also applicable to the undersampled problem, where the di- mension of the data is much larger than the number of samples. This property makes it especially suitable for processing high dimensional features. ULDA has been suc- cessfully applied to a variety of classification tasks, such as text categorization [96], face reCognition [105] and gene data classification [106]. ULDA derives up to Nc — 1 optimal discriminative vectors that maximizes the Fisher’s ratio, with an extra con- 96 straint that the different eigenvectors are St-Orthogonal [96], where St is the total scatter matrix of the features. 
Since ULDA also maximizes the Fisher's ratio, the distribution of the samples along each individual feature component Y_i achieves the goal that samples from different classes are far away from each other and samples from the same class are close to each other. In this sense, the value of I(Y_i; C) is optimized compared to other linear projections. Since I(Y; C) ≤ I(X; C) according to the data processing theorem, optimizing I(Y_i; C) helps reduce the estimation error between I(Y; C) and I(X; C). When using ULDA for feature evaluation, each row of the transform matrix W is given by one of the discriminative vectors obtained by ULDA.

5.3.1 Overview of Uncorrelated Linear Discriminative Analysis

Uncorrelated linear discriminative analysis (ULDA) was proposed to extract statistically uncorrelated discriminative features [96, 104]. Assume that there are k classes in the training data set. Denote the mean vector, covariance matrix and a priori probability of the ith class as m_i, S_i and p_i, respectively. Then the within-class scatter S_w, the between-class scatter S_b and the total scatter matrix S_t are defined as follows:

\[
S_w = \sum_{i=1}^{k} p_i S_i, \qquad
S_b = \sum_{i=1}^{k} p_i (m_i - m)(m_i - m)^T, \qquad
S_t = S_b + S_w. \tag{5.9}
\]

In the traditional linear discriminative analysis (LDA), the objective is to minimize the ratio of the within-class scatter to the between-class scatter. This goal can be achieved by solving an eigenvalue problem on S_w^{-1} S_b, provided that the within-class scatter matrix S_w is nonsingular. In this case, there are at most k − 1 discriminative vectors available, since the rank of S_b is bounded above by k − 1. Although S_w and S_b are symmetric matrices, S_w^{-1} S_b is generally not symmetric. Therefore, the discriminative vectors, which are the eigenvectors of S_w^{-1} S_b, are not orthogonal and the extracted features are not uncorrelated.

ULDA aims at finding optimal discriminative vectors that are S_t-orthogonal. Two vectors Y1 and Y2 are S_t-orthogonal if Y1ᵀ S_t Y2 = 0. Specifically, after r vectors φ1, φ2, ..., φr have been extracted, the (r + 1)th vector φ_{r+1} is obtained by solving the following optimization problem:

\[
\max_{\varphi} \ f(\varphi) = \frac{\varphi^T S_b\, \varphi}{\varphi^T S_w\, \varphi}
\quad \text{s.t.} \quad \varphi^T S_t\, \varphi_i = 0, \quad i = 1, 2, \ldots, r. \tag{5.10}
\]

One solution to this optimization problem was proposed in [104] by iteratively solving a generalized eigenvalue problem. A more computationally efficient solution, called the ULDA/QR algorithm, was proposed in [96] based on QR decomposition and singular value decomposition (SVD). More details on solving this optimization problem can be found in [96, 104]. In our implementation, the ULDA/QR algorithm is used for generating the discriminative vectors. It was shown that the feature vectors transformed by ULDA are mutually uncorrelated [96, 104]. The proof of this uncorrelation property is straightforward, as shown below. Given that

\[
Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{d'} \end{bmatrix}
  = \begin{bmatrix} \varphi_1^T \\ \varphi_2^T \\ \vdots \\ \varphi_{d'}^T \end{bmatrix} X, \tag{5.11}
\]

the covariance between any two feature components y_i and y_j is computed as:

\[
E\big[(y_i - E(y_i))(y_j - E(y_j))\big] = \varphi_i^T S_t\, \varphi_j. \tag{5.12}
\]

Based on the property given in equation (5.10), φ_iᵀ S_t φ_j = 0. Therefore, the feature components in the transformed space obtained with ULDA are uncorrelated.
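A minimal NumPy sketch of the scatter matrices in equation (5.9) is given below. It only makes the definitions concrete and is not the ULDA/QR implementation cited above; the toy data are placeholders.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class, between-class and total scatter matrices of equation (5.9).

    X: (n, d) samples, y: (n,) integer class labels.
    """
    n, d = X.shape
    m = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        p_c = len(Xc) / n                            # empirical a priori probability
        m_c = Xc.mean(axis=0)
        Sw += p_c * np.cov(Xc, rowvar=False, bias=True)
        Sb += p_c * np.outer(m_c - m, m_c - m)
    return Sw, Sb, Sw + Sb

# Toy usage with two Gaussian classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])
y = np.repeat([0, 1], 50)
Sw, Sb, St = scatter_matrices(X, y)
print(np.allclose(St, np.cov(X, rowvar=False, bias=True)))   # total scatter check
```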
5.4 Experiments

5.4.1 Experiments with the UCI Data

For the first part of the experiment, the goal is to show that "good" features generally have a high MI between the feature vector and the class label. For this purpose, the data sets "wine", "iris" and "glass" from the UCI machine learning repository [86] are used. Based on previously published experimental results with various distance metrics and classifiers on these data sets, such as [93], we know that the classification error rates on the three data sets fall in the different ranges shown in Table 5.1. Based on these results, it can be concluded that the features for "wine" are better than the features for "iris", which are in turn better than those for "glass".

Table 5.1. Classification error rates (in percentage) on "wine", "iris" and "glass"

Data set     Wine       Iris       Glass
Error rate   1% to 5%   4% to 9%   25% to 30%

We apply both the equal weight model, where W = (1/d) 1_{1×d}, and the ULDA based model for estimating the MI values on the three data sets. Note that for the equal weight model, Y is one dimensional. The results are plotted in Figure 5.2. For the ULDA based method, the feature X can be projected onto a space of dimension up to N_c − 1. The data sets "wine" and "iris" have 3 classes and the "glass" data set has 6 classes; therefore, the projected feature Y ∈ R^2 for "wine" and "iris" and Y ∈ R^5 for "glass".

Figure 5.2 shows that the MI estimated from the equal weight model is inconsistent with the empirical results. The inconsistency in this example indicates that the equal weight model is not suitable for feature evaluation. Figure 5.2 also shows that if only the first component, Y_1, obtained by the ULDA is retained, the MI-based ranking in descending order is "iris, wine, glass", which is still inconsistent with the empirical results. However, it should be noted that the MI values with Y_1 from the ULDA on the three data sets are larger than the corresponding MI values obtained with the equal weight method. This indicates that the ULDA optimizes the MI estimation better than the equal weight model.

Figure 5.2. Estimation of mutual information on three different data sets using the equal weight and the ULDA based method.

Finally, Figure 5.2 shows that the ULDA based method with the additive approximation model I(Y; C) ≈ Σ_{i=1}^{d'} I(Y_i; C) provides a feature evaluation measure that is consistent with the empirical classification results: "wine" is better than "iris", which is better than "glass".

5.4.2 Experiments with Energy Features for Wavelet Based Texture Classification

In the previous experimental section, the three sets of features came from three different types of objects. In this section, feature evaluation using the proposed method is conducted on five different wavelet features extracted from the same set of texture images for classification. Wavelet feature extraction is a two-step process. In the first step, an image is decomposed into multiple wavelet subbands. In the second step, one feature component is extracted from each subband and a feature vector is formed by concatenating all of the feature components [9, 23, 107]. Given a subband coefficient matrix R of size M × N, five different definitions of the energy function are used to extract features from each subband: E1 (standard deviation), E2 (residual), E3 (L1 energy), E4 (L2 energy) and E5 (entropy). The feature vectors based on the following five energy functions are extracted and evaluated in the experiment:

\[
E_1 = \sqrt{\frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \big[R(i, j) - \bar{R}\big]^2}, \tag{5.13}
\]

\[
E_2 = \sum_{i=1}^{M} \sum_{j=1}^{N} |R
\]