DISCRIMINATIVE SPARSE REPRESENTATIONS FOR IMAGE CLASSIFICATION By Suhaily Cardona-Romero A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Electrical Engineering 2012 ABSTRACT DISCRIMINATIVE SPARSE REPRESENTATIONS FOR IMAGE CLASSIFICATION By Suhaily Cardona-Romero Sparse representations and compressed sensing (CS) are two methods that have drawn the attention of the signal processing community due to their ability to reduce the dimensionality of signals while preserving enough information for signal representation. However, these compact representations do not necessarily preserve the most discriminative aspects of the signal. This thesis addresses this issue by developing a new discriminative framework to obtain a compact representation with high discriminative information for image classification applications. The first part of this thesis presents a greedy algorithm inspired by CoSaMP with the inclusion of a new cost function that quantifies the tradeoff between discrimination power and sparsity. The inclusion of this cost function helps to select a small number of atoms from an overcomplete dictionary that produces discriminative sparse representations of images from different classes. Through experiments, it was shown that such representations can be used as features to classify new sample images even under noisy environments or missing pixels. The second part of this thesis proposes a method to obtain discriminative measurements from CS and is motivated by the fact that the presence of irrelevant features may reduce the classification accuracy. To address this issue, a feature selection step was added to CS to eliminate irrelevant features from the measurements. As a result of the elimination of such features, an improvement in the classification accuracy is observed. In conclusion, it was demonstrated that a subset of incoherent projections with high discrimination power performs better than the whole set of CS measurements for classification purposes. Copyright by SUHAILY CARDONA-ROMERO 2012 To my parents, three brothers and boyfriend for their unconditional love and support iv ACKNOWLEDGMENTS I would like to thank my advisor, Dr. Selin Aviyente, for her guidance and support during my graduate studies. Not only did she guide me through my thesis but she was an integral part in providing me with the knowledge that I needed to be able to do my work. Also, I want to thank Dr. Lalita Udpa for her time and support while being a part of my committee. I would also like to take this opportunity to thank my lab mates: Marcos Bolanos, Ying Li and Ali Mutlu, for helping me when I needed it. In addition, I am very grateful to Dr. Barbara O’Kelly, Dr. Percy Pierre and the Sloan Engineering Program for the opportunity they gave me to pursue my graduate studies at MSU. Through their seminars and meetings not only did I get the opportunity to network with other people but I was also able to learn some valuable information for my life as a graduate student. Also, I want to thank them for their support and funding provided during these two years. Finally, I would like to thank my parents, Jose and Migdalia, and my three brothers, Jose, Eduar and Anthony, for their unconditional love and support through my entire life. I want to thank my parents for all the effort they made to give me the formation that I needed in order to succeed in life. Also, I want to thank my beloved boyfriend, Eduardo E. 
Montalvo, for all his support, encouragement and love. Thanks to his company and sense of humor, this chapter of my life was more pleasant. When I was going through hard times, he motivated me and helped me to continue achieving my goals. Thank you all, without your help this would have not been possible. v TABLE OF CONTENTS LIST OF TABLES........................................................................................................................ vii LIST OF FIGURES..................................................................................................................... viii ABBREVIATIONS..................................................................................................................... ix CHAPTER 1: INTRODUCTION.............................................................................................. 1 CHAPTER 2: BACKGROUND................................................................................................ 7 2.1 WAVELETS.............................................................................. ................................ 7 2.2 OVERCOMPLETE DICTIONARY.................................................................... 13 2.3 SPARSE REPRESENTATION.......................................................................... 14 2.3.1 Matching Pursuit.......................................................................................... 16 2.3.2 Orthogonal Matching Pursuit....................................................................... 18 2.3.3 Compressive Sampling Matching Pursuit.................................................... 19 2.4 COMPRESSED SENSING....................................................................................... 21 CHAPTER 3: DISCRIMINATIVE SPARSE REPRESENTATIONS FOR IMAGE CLASSIFICATION USING PURSUIT ALGORITHMS .................................................... 24 3.1 DISCRIMINATIVE SPARSE REPRESENTATIONS……………….................... 25 3.1.1 Objective Function……………………………………………………...… 26 3.1.2 Discriminative CoSaMP vs. Reconstructive CoSaMP………………...…. 27 3.1.3 Discriminative Sparse Representations Algorithm……………………….. 30 3.2 EXPERIMENTS AND RESULTS............................................................................ 32 3.2.1 Databases and Experimental Setup.............................................................. 33 3.2.2 Representative vs. Discriminative Representation....................................... 36 3.2.3 Robustness of the Discriminative Sparse Representation Algorithm…….. 39 3.2.3 Comparison with LDA................................................................................. 42 3.3 CONCLUSIONS........................................................................................................ 46 CHAPTER 4: DISCRIMINATIVE FEATURE SELECTION FROM COMPRESSED SENSING MEASUREMENTS FOR IMAGE CLASSIFICATION………………............ 47 4.1 DISCRIMINATIVE COMPRESSED MEASUREMENTS..................................... 49 4.2 EXPERIMENTS AND RESULTS............................................................................ 51 4.3 CONCLUSIONS........................................................................................................ 54 CHAPTER 5: CONCLUSIONS AND FUTURE WORK..................................................... 55 5.1 CONCLUSIONS………………………………….……………………………….. 55 5.2 FUTURE WORK…………………………………………………………...……… 56 REFERENCES........................................................................................................................... 
59 vi LIST OF TABLES Table 1. Classification results using DiscCoSaMP and RecCoSaMP from Coil-1……..…… 38 Table 2. Classification results using a modified OMP algorithm and the proposed greedy algorithm from Coil-1 with Gaussian Noise………..……………………………………...… 41 Table 3. Classification results using a modified OMP algorithm and the proposed greedy algorithm from Coil-1 with occlusion…………………………………………………………42 Table 4. Classification results using LDA and the proposed greedy algorithm from the ETHZ database with Gaussian Noise……………………………………….……………………….. 45 Table 5. Classification results using LDA and the proposed greedy algorithm from the ETHZ database with occlusion…………………………………….………………………………… 45 Table 6. Number of measurements needed with the proposed method to achieve better results than using the whole set of measurements…………………………………………………… 53 vii LIST OF FIGURES Figure 1. Scaling (a) and Wavelet (b) function for Haar (left), Daubechies (middle) and Coiflets (right) wavelet families...…………………..…………………………………………. 8 Figure 2. Analysis filter bank for 1D signals………….………………………………...……… 10 Figure 3. Analysis filter bank for 2D signals…..……………………………………………..… 11 Figure 4. Subimages for one (left) and two (right) levels of decomposition…...….…............... 12 Figure 5. Matching Pursuit Algorithm………………………………………………...…......…. 17 Figure 6. Orthogonal Matching Pursuit Algorithm…………………………………………....... 19 Figure 7. Compressive Sampling Matching Pursuit Algorithm…………………..…………….. 21 Figure 8. Compressed sensing model (For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this thesis.)……………… 23 Figure 9. Representative CoSaMP and Discriminative CoSaMP algorithms………………..… 30 Figure 10. Discriminative Sparse Representations Algorithm………………………………… 31 Figure 11. Sample images from COIL-20 database……………...…..……………….………... 34 Figure 12. Sample images from ETHZ database……….………………………………............ 35 Figure 13. A modified version of OMP by adding a cost function that quantifies discrimination power and sparsity………………..………………………………………………………..……40 Figure 14. Haupt et al. algorithm (left) and the proposed algorithm (right) for image classification from compressive measurements…………..…………………………………….. 51 Figure 15. Classification results of the proposed method using different number of measurements: (a) 20 measurements, (b) 40 measurements, (c) 60 measurements, (d) 80 measurements, (e) 100 measurements and (f) 256 measurement…..…………………………… 52 viii ABBREVIATIONS BiCGStab Biconjugate Gradient Stabilized BP Basis Pursuit COIL Columbia Object Image Library CoSaMP Compressive Sampling Matching Pursuit DiscCoSaMP Discriminative CoSaMP DWT Discrete Wavelet Transform i.i.d. Independent Identically Disctributed KPCA Kernel Principal Component Analysis LASSO Least Absolute Shrinkage and Selection Operator LDA Linear Discriminative Analysis LLE Locally-Linear Embedding MP Matching Pursuit OMP Orthogonal Matching Pursuit PC Principal Component PCA Principal Component Analysis RecCoSaMP Reconstructive CoSaMP RIP Restricted Isometric Property SNR Signal-to-Noise Ratio SVM Support Vector Machine ix CHAPTER 1 INTRODUCTION Through the years, the innovations in sensor technology have led to the collection of massive amounts of data with high dimensionality. 
The analysis of large amounts of high dimensional data can pose a problem in the computation time for many applications in the area of image processing such as texture classification, object detection, and recognition. This problem can be addressed by eliminating dimensions that seem to be redundant or irrelevant to the desired application. Dimensionality reduction is often used as a preprocessing technique that looks for a low dimensional representation from a high dimensional signal, using linear or nonlinear methods, such that the structure of the signal is preserved. The goal of this thesis is to address the dimensionality reduction problem to extract a low dimensional feature vector with high discrimination power for image classification applications. The most widely used linear methods for dimensionality reduction are Principal Component Analysis (PCA) and Linear Discriminative Analysis (LDA). The goal of PCA is to, citing Jolliffe [1]: “…reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables.” The implementation of PCA results in orthogonal projections that represent the data set with a few components or dimensions, i.e. PCs, without taking into account the differences between the 1 classes making such a projection unsuitable for discrimination purposes. Contrary to PCA, LDA reduces the dimensionality by seeking the best projection that separates the different classes in the data set by maximizing the between-class scatter while at the same time minimizing the within-class scatter. However, LDA is known to be very sensitive to noise in the data. Nonlinear methods used to reduce the dimensionality of a data set include isomaps, locally-linear embedding (LLE) and kernel mapping [2, 3]. Isomaps reduce the dimensionality by mapping high dimensional data into a lower dimensional space while it preserves the neighborhood distances and the geodesic distances between all pair of features. Similar to isomaps, LLE reduces the dimensionality by preserving the neighborhood distances; the difference is that LLE tries to minimize the least squares error of the geodesic distances [2]. Kernel methods map the data from the high dimensional space  d into a dot product space F via a nonlinear function, using methods such as Kernel Principal Component Analysis (KPCA) [3] which is a generalization of PCA. In recent years, sparse representations and compressed sensing have been used to reduce the dimensionality of data sets. Sparse representations reduce the dimensionality by projecting the data against a small number of elements (called atoms) from an overcomplete dictionary such that the reconstruction error is minimal. In this case, the atoms to obtain the low dimensional projection of the data can be selected through the implementation of greedy algorithms such as Matching Pursuit [30], Orthogonal Matching Pursuit [31], Compressive Sampling Matching Pursuit [32] and convex optimization methods such as the l1 -minimization problem and LASSO [27, 34]. On the other hand, compressed sensing represents a sparse signal with a small number of measurements through random projections. 
Using this method, the number of measurements obtained is much less than the number of samples required from the Nyquist sampling theorem 2 while still allowing the reconstruction of the original signal with no or little loss of information. These are two methods that have been widely used for signal representation and compression purposes. Recently, the signal processing community has been interested in expanding these methods to applications such as object detection, face recognition and image classification. In [4], a method based on sparse representation for text detection is proposed using a learned dictionary obtained using text-like edge features extracted from a database of text images. This dictionary in conjunction with OMP is used to obtain sparse representations of the edge features extracted from the test images using a 16  16 window. The sparse representation obtained from the 16  16 window is detected as text using a threshold based on the number of nonzero elements in the representation. A similar method for text detection can be found in [5], where two different dictionaries are learned (one from the text images and the other from the background) and the subimage within the 16  16 window is projected using each dictionary resulting into two different projections. The subimage is detected as text or background depending on which projection produces the smaller reconstruction error. Sparse representations have also been used for the detection of humans in images [6] thanks to its multi-scale nature. In [32], Wright et al. proposed an algorithm for face recognition that use the training images as the atoms of the dictionary to solve the sparse representation problem and classify the test image to the class of the training image (atom) that minimizes the reconstruction error. In [7], it is shown that compressed sensing measurements can be used to subtract the background of test images. Given a background scene image xb and a test image x t of the same scene, any discrepancy present can be obtained by the pixel-wise subtraction of both images ( xb  xt ). However, if the images are available in the compressed space, Cevher et al. showed that the discrepancies can be 3 obtained by reconstructing the difference between the background measurements y b and the test measurements y t ( yb  yt  xb  xt ) . In this thesis, a new theoretical framework is developed for using sparse representations and compressed sensing for image classification applications. Currently, sparse representations are mostly limited to reconstructive representations of signal. For classification purposes, it is more important to obtain discriminative sparse representations instead of reconstructive ones, i.e. minimizing reconstruction error. To achieve this goal a new optimization cost function that combines discrimination power and sparsity is proposed along with a modified greedy pursuit algorithm. This cost function allows a tradeoff between discrimination and sparsity that helps with the selection of the smallest number of atoms from an overcomplete dictionary that best discriminate a set of training images. The indices of the atoms identified as the most discriminative in the training stage are used to obtain the discriminative sparse representation of the test images. The performance and robustness of the proposed method is evaluated by performing classification experiments with two different image databases under different levels of Gaussian noise and different sizes of occlusion. 
The first experiment is performed with a standard image database with low intra-class variability and objects in a black background. The results of this experiment are compared to a modified OMP algorithm. The proposed method selects multiple atoms per iteration as opposed to OMP which selects a single atom at each iteration. This modification enables the proposed algorithm to select features in a more computationally efficient way while at the same time achieving comparable or better classification accuracy than the modified OMP algorithm. The second experiment uses a more challenging database with high intra-class variability to show the effect of class variability on sparseness. In this experiment, a comparison with LDA is presented to illustrate the superior 4 performance of the proposed algorithm in terms of accuracy and sparsity over conventional methods. Similar to the sparse representation methods, compressed sensing has been mostly used for signal compression and reconstruction. Recently, Haupt et al. have shown that these measurements can also be used for signal classification purposes [63]. In addition, they presented and validated a theoretical misclassification bound that shows that the error probability decays exponentially as the number of measurements increase. In this thesis, this approach is extended to image classification by using a cost function based on the Fisher score to select the most relevant compressed sensing measurements (features) for classification purposes instead of using all the measurements. The cost function proposed will measure how well the compressed sensing measurements maximize the between-class scatter while at the same time minimize the withinclass scatter. The motivation to include this cost function comes from the study made in [46] where Almaullin and Dietterich showed through experiments that the presence of redundant or irrelevant features can drop the classification accuracy significantly. With the inclusion of this cost function, it is expected to obtain higher classification accuracy using only a small number of measurements rather than using the whole set of measurements. The organization of this thesis is as follows. Chapter 2 briefly reviews some basic concepts on wavelets, overcomplete dictionaries, sparse representations, compressed sensing and some common greedy algorithms used to solve the sparse representation problem as well as some applications in these areas. Chapter 3 presents a new greedy algorithm to classify images using sparse representations. A new cost function combining discrimination power and sparsity to obtain discriminative sparse representations is proposed. The performance and the robustness of the proposed algorithm are evaluated through experiments and it was shown that the algorithm 5 can work under noisy and occluded environments and can perform better than existing dimensionality reduction methods. In addition, the performance of the proposed algorithm is compared with LDA and a modified version of OMP. Chapter 4 presents a feature selection method from compressed sensing samples to improve the classification accuracy without the presence of irrelevant or redundant features. 6 CHAPTER 2 BACKGROUND This chapter presents basic concepts and current applications in the area of transform based image feature extraction and classification that will serve as background for the work presented in this thesis. 
First, a brief overview of wavelets and wavelet transforms with some applications in the area of image processing will be presented. After this overview, the extension of wavelets to sparse representations using overcomplete dictionaries will be introduced. Then, some of the optimization methods and greedy algorithms that have been proposed to solve the sparse representation problem will be described. Finally, some background information about compressed sensing will be presented as well as some applications in the area of signal/image processing. 2.1 WAVELETS An orthogonal family of wavelets is a collection of orthogonal basis functions created by dilating and translating a “mother wavelet”  (t ) [8]:  j ,n t   1 2 j t 2 jn   2j     for ( j, n)  Z 2 (2.1) where j defines the scale of the basis and n defines its translation. In the same way, an orthogonal scaling family can be defined as:  j ,n t   1 2 j t 2 jn   2j     7 for ( j, n)  Z 2 (2.2) where  (t ) corresponds to the scaling function and  j, n t  is orthogonal to  j, n t  . These two functions usually are used together to perform multiresolution analysis of signals where the representations obtained with scaling functions  j, n t  correspond to the low frequency information in the signal and the representations obtained with wavelet functions  j, n t  correspond to the high frequency information. Some of the most common orthogonal wavelet families include Haar, Daubechies, Symlets and Coiflets. Examples of 1D wavelets and scaling functions can be seen in Figure 1. (a) (b) Haar Daubechies Coiflets Figure 1. Scaling (a) and Wavelet (b) functions for Haar (left), Daubechies (middle) and Coiflets (right) wavelet families A decomposition of any finite energy signal f can be obtained through the inner product of the signal and the scaling and wavelet families respectively as: 8 a j ( n)  f ,  j , n  d j (n)  f , j , n    f (l )  j, n (l ) (2.3) l     f (l )  j, n (l ) (2.4) l   where a j (n) corresponds to the approximation coefficients (low frequency information) and d j (n) corresponds to the wavelet/detail coefficients (high frequency information). Therefore, an orthogonal expansion of f (t ) can be obtained as: f (t )   a j0 (n) j0,n (t )  n    d j (n) j,n (t ) (2.5) j  j0 k Since the set { j0 , n t , n  Z } spans the same subspace as { j , n t , j  j0 , n  Z } [9] this  expansion can also be obtained as: f (t )   d j (n) j ,n (t )  (2.6) j k A fast implementation of the wavelet decomposition can be obtained by applying the filter bank theory. Filter banks decompose a signal into approximation and details coefficients by convolving the signal with low-pass filters h(n) and high-pass filters g (n) and downsampling the output by two as: a1 (n)    h(l  2n)a0 (l ) l   and d1 (n)    g(l  2n)ao (l ) (2.7) l   where a 0 (l ) corresponds to the input signal or the approximation coefficients at the highest scale. Multiple levels of decomposition can be achieved by iterating the analysis stage on the approximation coefficients as shown in Figure 2. This decomposition provides a description of 9 the signal at different scales such that coarse and fine features can be obtained simultaneously to analyze the signal. h(n) ↓2 ↓2 dj+2(n) aj+1(n) aj(n) g(n) aj+2(n) g(n) h(n) ↓2 ↓2 dj+1(n) Figure 2. Analysis filter bank for 1D signals This decomposition has been extended to two dimensional signals such as images using separable wavelets. 
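Before moving to the two-dimensional case, the short sketch below illustrates one analysis stage of the filter bank in Figure 2, i.e., the convolve-and-downsample relation in equation (2.7), and its iteration over multiple levels. It is only an illustration assuming Haar filters and NumPy; it is not code from this thesis.

    import numpy as np

    def haar_analysis_stage(a):
        """One analysis stage of Figure 2: filter with the Haar low-pass h and
        high-pass g, then downsample by two (equation (2.7))."""
        h = np.array([1.0, 1.0]) / np.sqrt(2)    # Haar low-pass filter
        g = np.array([1.0, -1.0]) / np.sqrt(2)   # Haar high-pass filter
        approx = np.convolve(a, h)[1::2]         # approximation coefficients a_{j+1}(n)
        detail = np.convolve(a, g)[1::2]         # detail coefficients d_{j+1}(n)
        return approx, detail

    def wavelet_decompose(signal, levels=2):
        """Iterate the analysis stage on the approximation branch to obtain a
        multi-level decomposition (detail coefficients plus final approximation)."""
        coeffs = []
        a = np.asarray(signal, dtype=float)
        for _ in range(levels):
            a, d = haar_analysis_stage(a)
            coeffs.append(d)
        coeffs.append(a)
        return coeffs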
The 1D scaling and wavelets functions are extended to 2D through the combination of products of these functions as follows:  ( x1 , x2 )   ( x1 ) ( x2 )  H ( x1 , x 2 )   ( x1 ) ( x 2 )  V ( x1 , x2 )   ( x1 ) ( x2 )  D ( x1 , x 2 )   ( x1 ) ( x 2 ) (2.8) where  H ,  V and  D correspond to wavelets in the horizontal, vertical and diagonal directions, respectively. Given 2D separable scaling and wavelet functions, respectively defined as:  j ,m,n x1 , x 2    ij ,m,n x1 , x 2   1 2j 1 2j  x1  2 j m x 2  2 j  n   , j  2j  2      i  x1    2 j m x2  2 j  n   for i  {H , V , D} , j j  2 2  10 (2.9) An image F ( x1 , x2 ) of size M  N can be decomposed into the approximation and detail coefficients as: M 1 N 1 a ( j0 , m, n)    F ( x1, x2 ) j0 , m, n x1, x2  MN x  0 x  0 1 2 d i ( j, m, n)  1 1 M 1 N 1   F ( x1, x2 ) ij, m, n x1, x2  MN x  0 x  0 1 2 for i  {H , V , D} (2.10) respectively. Similar to the 1D case, a filter bank can be implemented to obtain the approximation and details coefficients of an image by applying high-pass ( g ) and low-pass ( h ) filters to the rows and the columns of the input image and dowsampling the outputs by two as shown in Figure 3. ↓2 aj+1 g ↓2 dVj+1 h ↓2 dHj+1 ↓2 dDj+1 h h aj Rows ↓2 Columns g ↓2 Rows g Figure 3. Analysis filter bank for 2D signals In this case, the results are subimages with the approximation and detail coefficients of the original image. To obtain multiple levels of decomposition, the same process can be applied to the approximation coefficients which results in four additional subimages (Figure 4). For N levels of decomposition, the image is decomposed into 3N + 1 subimages. 11 aj+2 aj+1 dVj+1 dVj+1 dHj+2 dHj+1 dVj+2 dDj+1 dDj+2 dHj+1 dDj+1 Figure 4. Subimages for one (left) and two (right) levels of decomposition Wavelet decomposition has been very popular in areas such as signal processing, computer vision, pattern recognition and medical applications, among others. This popularity is due to its ability to produce compact representations of signals/images at different resolution levels. Wavelet decomposition has been used in the area of signal processing for speech compression [10]. In [10], Joseph et al. analyze the performance of different wavelets for compression purposes using a threshold to drop the wavelet coefficients with small amplitudes which are considered to be insignificant. In the area of image processing, wavelets have been used for compression [11-13], watermarking [14, 15], denoising [16, 17], detection [19, 23, 24], object recognition [18], and image classification [14-22], among others. In [18] and [19], the wavelet coefficients obtained from the discrete wavelet transform are fed into a classifier as features for object recognition and airplane detection and tracking, respectively. Wavelets have also been used to classify texture images using either energy features computed directly from the DWT coefficients [20] or a combination of wavelet statistical features and wavelet co-occurrence features such as energy, entropy or homogeneity [21]. In [22], the dependence between wavelets features from different subbands is studied to select the best set of wavelets features for classification purposes. Other features that can be used for image classification are edges which 12 can be detected by identifying local maximum from the wavelet transform coefficients using a threshold to remove the noise that interfere with such identification [23]. 
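As a concrete illustration of the separable decomposition in Figure 3 and equation (2.10), the snippet below computes one and two levels of the 2D wavelet transform of an image. It assumes the PyWavelets package (pywt) is available and is not part of the original work.

    import numpy as np
    import pywt  # PyWavelets, assumed available

    image = np.random.rand(128, 128)                 # stand-in for an input image

    # One decomposition level: approximation plus horizontal, vertical and
    # diagonal detail subimages, each of half the size (Figure 4, left).
    cA, (cH, cV, cD) = pywt.dwt2(image, 'haar')
    print(cA.shape, cH.shape, cV.shape, cD.shape)    # each subimage is 64 x 64

    # Iterating on the approximation yields the 3N + 1 subimages for N levels.
    coeffs = pywt.wavedec2(image, 'haar', level=2)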
Wavelets have also been used in medical imaging applications such as the detection of microcalcifications in mammograms, where wavelet filters are applied to remove the background noise and enhance the microcalcifications present in the mammograms [24].

2.2 OVERCOMPLETE DICTIONARY

A dictionary is a collection of elementary signals, called atoms, given by:

    \Phi = \{\varphi_\gamma : \gamma \in \Gamma\}, \qquad \Gamma = \{1, 2, \ldots, M\}    (2.11)

where the atoms φ_γ are discrete-time signals of length N. A dictionary can be classified as undercomplete, complete or overcomplete depending on whether it spans the signal space or not. If the atoms of the dictionary entirely span the signal space and form a basis, the dictionary is called a complete dictionary. When the number of atoms is larger than the dimension of the signal space (M > N) and there is a subset of the dictionary that forms a basis, it is called an overcomplete dictionary. Otherwise, the dictionary is called an undercomplete dictionary; in this case, the number of atoms composing the dictionary is less than the dimension of the signal space (M < N). Overcomplete dictionaries are constructed using combinations of bases or by adding basis functions to a complete dictionary. The functions most commonly used to construct dictionaries are Gabor functions [57], wavelets [10, 11], contourlets [25], curvelets [16], ridgelets, or combinations of these. Overcomplete dictionaries have become an important tool in the signal processing area due to their capacity to generate sparse representations of signals [26, 27].

A signal y ∈ ℝ^N can be represented as a linear combination of the elements of a dictionary as follows:

    y = \sum_{n=0}^{N-1} \alpha_n \varphi_{\gamma_n}    (2.12)

where the α_n are the expansion coefficients of the signal and γ ∈ Γ is the index of the atom φ_γ. For overcomplete dictionaries, such a representation is not unique. This gives us the possibility of finding the combination that works best for the desired problem. For the sparse representation problem, the goal is to find the most compact representation that reconstructs the signal with the minimum reconstruction error.

2.3 SPARSE REPRESENTATION

Given a signal y ∈ ℝ^m, an overcomplete dictionary Φ ∈ ℝ^{m×k} that contains k atoms, and a vector x ∈ ℝ^k that contains the representation coefficients of the signal y, the sparse representation problem can be posed as follows:

    \min_x \|x\|_0 \quad \text{s.t.} \quad \Phi x = y    (2.13)

where ‖x‖_0 is the ℓ0-norm, which counts the nonzero elements of the vector x. In order for the signal reconstruction to be robust to noise, equation (2.13) can be relaxed to:

    \min_x \|x\|_0 \quad \text{s.t.} \quad \|y - \Phi x\|_2 \le \epsilon    (2.14)

where ε is the permitted error in the reconstruction. The solution of the ℓ0-norm problem has been shown to be NP-hard. Several algorithms have been studied to obtain an approximate solution to this problem [27-32]. The two most common approaches are greedy algorithms and convex optimization methods.

The convex optimization methods solve the combinatorial problem by replacing the ℓ0-norm with a convex function, usually the ℓ1-norm. In [27], Chen et al. proposed an algorithm called Basis Pursuit (BP) to solve the sparse representation problem in overcomplete dictionaries using a convex optimization method which finds the decomposition that minimizes the ℓ1-norm:

    \min_x \|x\|_1 \quad \text{s.t.} \quad \Phi x = y    (2.15)

One drawback of convex optimization methods is that they tend to be computationally expensive when the system to be solved is very large [33].
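As an aside, the ℓ1 problem in (2.15) can be recast as a linear program by splitting x into nonnegative parts, x = u − v with u, v ≥ 0, so that the ℓ1-norm becomes the sum of u and v at the optimum. The sketch below is a minimal illustration of this reformulation (assuming NumPy and SciPy are available); it is not the solver used in this thesis.

    import numpy as np
    from scipy.optimize import linprog

    def basis_pursuit(Phi, y):
        """Solve min ||x||_1 s.t. Phi x = y by recasting it as a linear program."""
        m, k = Phi.shape
        c = np.ones(2 * k)                  # objective: sum(u) + sum(v)
        A_eq = np.hstack([Phi, -Phi])       # equality constraint: Phi (u - v) = y
        res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
        uv = res.x
        return uv[:k] - uv[k:]              # recover x = u - v

    # Toy example: recover a 1-sparse vector from an overcomplete dictionary.
    rng = np.random.default_rng(0)
    Phi = rng.standard_normal((20, 50))
    Phi /= np.linalg.norm(Phi, axis=0)      # unit-norm atoms
    x_true = np.zeros(50)
    x_true[7] = 2.0
    y = Phi @ x_true
    x_hat = basis_pursuit(Phi, y)
    print(np.count_nonzero(np.abs(x_hat) > 1e-6))   # ideally very few nonzeros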
Least Absolute Shrinkage and Selection Operator (LASSO) is a method proposed by Tibshirani [34] to solve the ℓ1-minimization problem more efficiently. LASSO finds an estimate of x by minimizing the least squares error subject to an ℓ1-norm constraint on the solution vector, formulated as:

    \min_x \frac{1}{2}\|y - \Phi x\|_2^2 + \lambda \|x\|_1    (2.16)

where λ ≥ 0 is the parameter that controls the tradeoff between the least squares error and the sparsity of x. This optimization problem converges to the solution of the ℓ1-minimization problem as the value of λ approaches zero.

On the other hand, greedy algorithms try to find the "best" solution to the problem iteratively. Greedy algorithms may not find the optimal solution, but they find a local solution that approximates the global one. Some of the algorithms proposed to solve the sparse representation problem are Matching Pursuit (MP), Orthogonal Matching Pursuit (OMP), and Compressive Sampling Matching Pursuit (CoSaMP).

2.3.1 Matching Pursuit

Mallat and Zhang [30] introduced an algorithm, called Matching Pursuit, to represent a signal as a linear combination of elements from an overcomplete dictionary. These elements are chosen iteratively, one by one, such that they represent the structure of the signal.

Given a dictionary Φ whose atoms φ_γ ∈ Φ have unit norm, let R^n be the residue of an n-term approximation of a given signal y. MP decomposes the residue R^n by projecting it onto the elements of Φ to find the element that is most highly correlated with R^n. For example, the first iteration of the algorithm represents the signal as:

    y = \langle y, \varphi_{\gamma_0} \rangle \varphi_{\gamma_0} + R^1    (2.17)

where R^1 is the residue after approximating y in the direction of φ_{γ_0}. Since the norm of y can be calculated as

    \|y\|_2^2 = |\langle y, \varphi_{\gamma_0} \rangle|^2 + \|R^1\|_2^2,    (2.18)

the atom φ_{γ_n} ∈ Φ to be chosen at the n-th iteration is the one that solves the following optimization problem:

    \varphi_{\gamma_n} = \arg\max_{\varphi_\gamma \in \Phi} |\langle R^{n-1}, \varphi_\gamma \rangle|.    (2.19)

Then, using the atom that solved the optimization problem, the approximation vector α^n and the residue of the signal are updated as

    \alpha^n = \alpha^{n-1} + \langle R^{n-1}, \varphi_{\gamma_n} \rangle, \qquad R^n = R^{n-1} - \langle R^{n-1}, \varphi_{\gamma_n} \rangle \varphi_{\gamma_n}    (2.20)

respectively. This process is repeated until the stop criterion is met. A description of the MP algorithm can be found in Figure 5.

Input: Signal y ∈ ℝ^m, dictionary Φ ∈ ℝ^{m×k}
Output: Sparse approximation vector α

    R^0 = y                                              (residue initialization)
    α^0 = 0                                              (initial approximation)
    n = 1
    repeat
        γ_n = arg max_γ |⟨R^{n-1}, φ_γ⟩|                  (greedy selection)
        α^n = α^{n-1} + ⟨R^{n-1}, φ_{γ_n}⟩                 (approximation update)
        R^n = R^{n-1} − ⟨R^{n-1}, φ_{γ_n}⟩ φ_{γ_n}         (residue update)
        n = n + 1
    until stop criterion

Figure 5. Matching Pursuit Algorithm

Depending on the problem to be solved, the algorithm can be stopped when one of the following stopping criteria is met: (i) after a fixed number of iterations l, (ii) when the residue ‖R^n‖ = 0, or (iii) when ‖R^n‖ ≤ ε for some ε. One shortcoming of the MP algorithm is that it requires a large number of iterations to converge; it is therefore computationally complex. Pati and Krishnaprasad [31] introduced a new algorithm called Orthogonal Matching Pursuit that overcomes this issue.

2.3.2 Orthogonal Matching Pursuit

Orthogonal Matching Pursuit, a modified version of MP, is a recursive algorithm that computes representations of functions with respect to nonorthogonal and overcomplete dictionaries [31]. MP was modified by adding a least-squares minimization to improve the convergence of the algorithm.
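Before detailing the OMP steps, the sketch below spells out the MP loop of Figure 5 and, for comparison, the least-squares re-fit that the OMP variant described next adds. It is a minimal NumPy illustration (the dictionary atoms are assumed to have unit norm), not the implementation used in this thesis; the stopping rules correspond to criteria (i) and (iii) above.

    import numpy as np

    def matching_pursuit(y, Phi, n_iter=50, tol=1e-6):
        """MP (Figure 5): pick the atom most correlated with the residue,
        accumulate its coefficient, and update the residue."""
        k = Phi.shape[1]
        alpha = np.zeros(k)
        residue = y.astype(float).copy()
        for _ in range(n_iter):
            corr = Phi.T @ residue                          # <R, phi_gamma> for all atoms
            gamma = int(np.argmax(np.abs(corr)))            # greedy selection (2.19)
            alpha[gamma] += corr[gamma]                     # approximation update (2.20)
            residue = residue - corr[gamma] * Phi[:, gamma] # residue update (2.20)
            if np.linalg.norm(residue) <= tol:
                break
        return alpha

    def orthogonal_matching_pursuit(y, Phi, n_iter=20, tol=1e-6):
        """OMP (Figure 6): after adding each atom, re-fit all selected
        coefficients by least squares so the residue stays orthogonal to them."""
        k = Phi.shape[1]
        support = []
        residue = y.astype(float).copy()
        coeffs = np.zeros(0)
        for _ in range(n_iter):
            gamma = int(np.argmax(np.abs(Phi.T @ residue)))
            if gamma not in support:
                support.append(gamma)
            coeffs, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
            residue = y - Phi[:, support] @ coeffs
            if np.linalg.norm(residue) <= tol:
                break
        alpha = np.zeros(k)
        alpha[support] = coeffs
        return alpha

Because the re-fit keeps the residue orthogonal to every selected atom, the OMP variant never re-selects an atom and typically needs far fewer iterations than plain MP.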
The first step of the OMP algorithm is the same as in MP: find the index of the atom that is most correlated with the residue. This index is then stored in the set Λ that will contain the indices of all the atoms selected through the iterations. The third step, which is the part that differs from MP, finds the coefficient vector α^n that solves the following least-squares minimization:

    \alpha^n = \arg\min_x \|y - \Phi|_{\Lambda_n} x\|_2    (2.21)

where the symbol Φ|_{Λ_n} represents the dictionary Φ with its columns restricted to the indices in Λ_n. This step ensures the orthogonality between the residue and the atoms selected in the previous iterations. Therefore, the correlation of the residue with the atoms selected in previous iterations will be equal to zero, avoiding the re-selection of previously selected atoms and leading to faster convergence of the algorithm. The last step updates the current residue by subtracting the projection of the signal, obtained using the restricted dictionary Φ|_{Λ_n}, from the original signal. These steps are repeated until the desired stopping criterion is met. The same stopping criteria mentioned in the previous section can be applied to the OMP algorithm. A mathematical description of the algorithm can be found in Figure 6.

Input: Signal y ∈ ℝ^m, dictionary Φ ∈ ℝ^{m×k}
Output: Sparse coefficient vector α

    R^0 = y                                              (residue initialization)
    α^0 = 0                                              (initial approximation)
    Λ_0 = ∅
    n = 1
    repeat
        λ_n = arg max_γ |⟨R^{n-1}, φ_γ⟩|                  (atom index)
        Λ_n = Λ_{n-1} ∪ {λ_n}                             (atom index merge)
        α^n = arg min_x ‖y − Φ|_{Λ_n} x‖_2                (least-squares minimization)
        R^n = y − Φ|_{Λ_n} α^n                            (residue update)
        n = n + 1
    until stop criterion

Figure 6. Orthogonal Matching Pursuit Algorithm

2.3.3 Compressive Sampling Matching Pursuit

Compressive Sampling Matching Pursuit [32] is one of the modern methods used to obtain sparse representations of signals, based on OMP and the compressed sensing theory. One of the differences between this algorithm and the algorithms previously described is its ability to choose multiple atoms per iteration, allowing faster convergence. CoSaMP iteratively searches for the largest elements of the target signal using a proxy signal. Needell and Tropp state that, given an s-sparse signal x ∈ ℝ^k (where a signal is s-sparse if it has only s ≪ k nonzero elements) and a dictionary Φ ∈ ℝ^{m×k} whose transpose is Φ^t, the vector p = Φ^t Φ x can serve as a proxy signal for x, due to the fact that the s largest elements in p correspond to the s largest elements in x. Based on this, it is only necessary to apply Φ^t to the sample measurements y = Φ x to obtain the proxy signal.

Given a signal y ∈ ℝ^m, a dictionary Φ ∈ ℝ^{m×k} that meets the restricted isometry property [see 32] and a residue r initialized as y, CoSaMP uses an observation vector p = Φ^t r to select the indices of the 2s atoms that are most correlated with the residue, where s is the desired sparsity level. Then, this set of indices is merged with the indices of the atoms used to obtain the previous s-sparse coefficient (or representation) vector α^{j-1}. This new set of indices is stored in T and used to obtain the solution of the following least squares problem:

    b|_T = \Phi|_T^{\dagger} y    (2.22)

where b|_T is the solution of the least squares problem with nonzero values corresponding to the indices in T, Φ|_T is the dictionary restricted to the set of indices T, and Φ|_T^† = (Φ|_T^t Φ|_T)^{-1} Φ|_T^t is its pseudoinverse. The new coefficient vector α^j is obtained by retaining only the s largest elements in b|_T.
Finally, the residue is updated to be used in the proxy signal at the next iteration. These steps are repeated until the desired stopping criterion is met. In [32], three different stopping criteria were analyzed, which depend on: (i) a fixed number of iterations, (ii) the norm of the residue, ‖r‖_2, and (iii) the maximum magnitude of the entries of the proxy, ‖p‖_∞. A list of the assumptions made and variations of this algorithm can be found in [32]. Figure 7 presents the mathematical description of the algorithm.

Input: Signal y ∈ ℝ^m, dictionary Φ ∈ ℝ^{m×k}, sparsity level s
Output: s-sparse coefficient vector α

    r = y                                                (residue initialization)
    α^0 = 0                                              (initial approximation)
    j = 1
    repeat
        p = Φ^t r                                        (generate the proxy)
        Ω = supp(p_{2s})                                 (indices of the 2s largest coefficients)
        T = Ω ∪ supp(α^{j-1})                            (update the set of indices)
        b|_T = Φ|_T^† y                                  (least squares problem)
        b|_{T^c} = 0
        α^j = b_s                                        (pruning step: keep the s largest entries)
        r = y − Φ α^j                                    (residue update)
        j = j + 1
    until stop criterion

Figure 7. Compressive Sampling Matching Pursuit Algorithm

2.4 COMPRESSED SENSING

The Nyquist sampling theorem states that a continuous-time signal can be reconstructed from its samples if the sampling rate is greater than twice the signal bandwidth. In many applications, such as bioinformatics, astronomy, machine learning, hyperspectral and medical imaging, the number of samples required by the Nyquist sampling theorem can be too high, which leads to the problem of having to handle high dimensional data. In recent years, it has been shown that a sparse signal can be reconstructed from a small number of random measurements, fewer than the minimum number of samples required by the Nyquist sampling theorem, using the compressed sensing framework [35].

Given a signal y ∈ ℝ^m that is s-sparse in a dictionary Φ ∈ ℝ^{m×k}, with representation coefficients contained in the s-sparse vector x ∈ ℝ^k, and a sampling matrix Ψ ∈ ℝ^{N×m}, the signal y can be recovered from its measurements

    z = \Psi y = \Psi \Phi x    (2.23)

by solving the following optimization problem:

    \hat{x} = \arg\min_x \|x\|_0 \quad \text{subject to} \quad z = \Psi y = \Psi \Phi x    (2.24)

where Ψ satisfies the restricted isometry property (RIP) [36, 40]. A matrix Ψ is said to satisfy the RIP of order s if there exists a constant δ_s ∈ (0, 1) such that

    (1 - \delta_s)\|x\|_2^2 \le \|\Psi x\|_2^2 \le (1 + \delta_s)\|x\|_2^2    (2.25)

holds for all s-sparse vectors (with at most s nonzero entries). This property ensures near-optimal performance of reconstruction algorithms [38]. In [39], it has been proven that random matrices whose entries are independent and identically distributed (i.i.d.) random variables, such as Gaussian, Bernoulli or related distributions, will satisfy the RIP. Therefore, if the sampling matrix Ψ is chosen as one of these random matrices, there is no need to know the representation matrix Φ and it can be chosen arbitrarily [39, 40]. As mentioned, the solution of the ℓ0-norm problem in equation (2.24) is known to be NP-hard, and it can be obtained either by greedy algorithms or by convex optimization methods.

From equation (2.23), it can be seen that the sampling matrix Ψ reduces the dimension of the signal y by mapping the signal from the m-dimensional space to the N-dimensional space, where N is much smaller than m. This reduction is obtained through the inner products of the random vectors of the sampling matrix with the sparse signal, producing a low dimensional vector z of random measurements as shown in Figure 8.

Figure 8. Compressed sensing model: the N × m sampling matrix Ψ maps the sparse signal y ∈ ℝ^m to the measurement vector z ∈ ℝ^N. (For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this thesis.)
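A small numerical illustration of the sensing model (2.23) is given below: it draws an i.i.d. Gaussian sampling matrix, forms the measurements, and recovers the sparse coefficients by applying the orthogonal_matching_pursuit sketch from Section 2.3.2 to the effective dictionary ΨΦ. The sizes and names are illustrative assumptions, not values used in this thesis.

    import numpy as np

    rng = np.random.default_rng(1)
    m, k, N, s = 256, 512, 64, 5              # signal length, atoms, measurements, sparsity

    Phi = rng.standard_normal((m, k))         # representation dictionary (arbitrary here)
    Phi /= np.linalg.norm(Phi, axis=0)

    x = np.zeros(k)                           # s-sparse coefficient vector
    x[rng.choice(k, s, replace=False)] = rng.standard_normal(s)
    y = Phi @ x                               # signal that is s-sparse in Phi

    Psi = rng.standard_normal((N, m)) / np.sqrt(N)   # i.i.d. Gaussian sampling matrix
    z = Psi @ y                               # compressed measurements (2.23), N << m

    # Approximate the recovery problem (2.24) with a greedy pursuit on the
    # effective dictionary Psi @ Phi (orthogonal_matching_pursuit as sketched
    # in Section 2.3.2).
    x_hat = orthogonal_matching_pursuit(z, Psi @ Phi, n_iter=s)
    y_hat = Phi @ x_hat
    print(np.linalg.norm(y - y_hat) / np.linalg.norm(y))   # relative reconstruction error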
Compressed sensing has been used in the area of medical imaging to reduce the acquisition time of MRI images by reducing the number of measurements that must be acquired for reconstruction purposes [41]. It has also been used to reduce the complexity of radar systems by eliminating the need for matched filtering in the radar receiver and reducing the sampling rate of the receiver [42]. The compressed sensing theory has also led to the design of the "single pixel" compressed sensing camera, which computes random linear measurements (inner products between the scene and a set of test functions) of the scene under view instead of pixel samples [43]. This camera compresses the scene under view at the same time that it is acquired; thus, it has the ability to handle high-dimensional data sets.

CHAPTER 3
DISCRIMINATIVE SPARSE REPRESENTATIONS FOR IMAGE CLASSIFICATION USING PURSUIT ALGORITHMS

Sparse representations have been actively used in areas such as noise removal [44], inpainting [45], compression [46] and reconstruction [30-32], among other areas of signal processing, due to the fact that many natural signals can be sparse with respect to the proper dictionary. Therefore, a signal can be represented with a small number of dictionary elements such that its reconstruction error is minimal. Recently, there has been an effort to expand sparse representations to the area of image classification (e.g., [47-49]). In [47], Wright et al. use the training images as the atoms of the dictionary to obtain the sparse representation of the test images for the purpose of face recognition. The classification is performed by assigning the test image to the class that minimizes the residual. This algorithm performs well when the objects to be classified have minimal pose variation. However, for objects with high intra-class variation, the representation may no longer be sparse. Also, it requires a large number of training images per class to be able to represent the test samples as a linear combination of only the training images from the same class. In [48], Huang and Aviyente proposed a framework that combines reconstruction error and discrimination power to obtain sparse and discriminative representations of signals using MP. A similar method can be found in [49], where a metric that includes both reconstruction and discrimination terms is proposed to learn adaptive dictionaries, which leads to sparse, discriminative and reconstructive image representations. This method learns one dictionary per class, which can be computationally complex.

The most widely used method to obtain discriminative representations of images from different classes is Linear Discriminative Analysis (LDA). However, as mentioned before, LDA is known to be highly sensitive to noise in the data. On the other hand, the sparse representation method is robust to noise, but its main objective is to look for representations that are suitable for reconstruction. To be able to classify images with high classification accuracy, it is more important to obtain discriminative representations rather than reconstructive representations. To reach this goal, a robust algorithm is proposed based on greedy pursuit algorithms that, instead of choosing atoms for reconstruction purposes, chooses the atoms that are suitable for image classification.
3.1 DISCRIMINATIVE SPARSE REPRESENTATIONS

In this chapter, an objective function that combines discrimination power and sparsity to obtain discriminative representations of images is introduced. To demonstrate the effectiveness of using discriminative representations instead of reconstructive representations, the original CoSaMP was modified so that classification results can be obtained using reconstructive features as well as discriminative features. Then, an algorithm derived from CoSaMP, in conjunction with the objective function, is proposed to select the smallest possible number of atoms that produce the best discriminative representation of a set of images. This algorithm was inspired by CoSaMP and the simultaneous sparse approximation algorithms described in [50, 51]. Finally, some classification experiments that demonstrate the effectiveness of the objective function and the robustness of the proposed algorithm are presented.

3.1.1 Objective Function

Given an overcomplete dictionary Φ ∈ ℝ^{m×k} with k atoms of dimensionality m and an input matrix Y = [Y_1, Y_2, ..., Y_c], where Y_i = [y_1, y_2, ..., y_{n_i}], 1 ≤ i ≤ c, corresponds to the set of n_i vectorized training images from the i-th class, given that there are C = {1, 2, ..., c} different classes and n = Σ_{i=1}^{c} n_i is the total number of images, the feature matrix X = [x_1, x_2, ..., x_n] can be obtained by projecting the training images onto the atoms in the dictionary, where x_j is the k-dimensional feature vector that corresponds to the j-th training image. To measure how well each atom in the dictionary produces a discriminative representation of the n training images, the following discrimination measure is used:

    F(X) = \frac{\mathrm{trace}(S_b)}{\mathrm{trace}(S_w)}    (3.1)

where S_b and S_w are the between-class and the within-class scatter matrices, respectively defined as:

    S_b = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^t    (3.2)

    S_w = \sum_{i=1}^{c} \sum_{x_j \in C_i} (x_j - \mu_i)(x_j - \mu_i)^t    (3.3)

where

    \mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j    (3.4)

and μ is the total mean vector defined as:

    \mu = \frac{1}{n} \sum_{j=1}^{n} x_j    (3.5)

Evaluated on each row of X (i.e., on each atom's coefficients across the training images), this measure F(X) produces a vector of size 1 × k that contains the discrimination power of each atom. The highest values in this vector correspond to the atoms that produce a discriminative representation of the training images by maximizing the between-class scatter and minimizing the within-class scatter. Given that X̃ ∈ ℝ^{s×n} is a subset of X ∈ ℝ^{k×n} (s < k), the combination of discrimination and sparsity measures is proposed to identify the atoms that discriminate and create a sparse representation of the training signals by maximizing the following objective function:

    \tilde{X} = \arg\max_X \; F(X) - \lambda \sum_{i=1}^{n} \|x_i\|_0    (3.6)

where the rows of the subset matrix X̃ are the most discriminative representations. The indices of the atoms that produce these discriminative representations are stored in the set Λ and used to obtain the low dimensional features of the test images.

3.1.2 Discriminative CoSaMP vs. Reconstructive CoSaMP

In the literature, it has been shown that discriminative representations perform better than reconstructive representations for classification purposes [68, 69]. To illustrate that this is also true for the case of sparse representations, CoSaMP was used to evaluate the effect of selecting atoms that produce discriminative representations of the images versus atoms that produce reconstructive representations.
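A minimal sketch of the discrimination measure (3.1)-(3.5) is given below, evaluated row by row so that each atom receives its own score, which is one way to realize the 1 × k vector described above. The helper names and the greedy top-s selection used as a stand-in for maximizing (3.6) at a fixed sparsity level are illustrative assumptions, not the thesis code.

    import numpy as np

    def fisher_scores(X, labels):
        """Per-atom discrimination measure (3.1)-(3.5).

        X      : k x n matrix; row r holds atom r's coefficients for the n images.
        labels : length-n array of class labels.
        Returns a length-k vector; larger values mean more discriminative atoms."""
        mu = X.mean(axis=1, keepdims=True)                  # total mean (3.5)
        s_b = np.zeros(X.shape[0])
        s_w = np.zeros(X.shape[0])
        for c in np.unique(labels):
            Xc = X[:, labels == c]
            mu_c = Xc.mean(axis=1, keepdims=True)           # class mean (3.4)
            s_b += Xc.shape[1] * (mu_c - mu).ravel() ** 2   # between-class scatter (3.2)
            s_w += ((Xc - mu_c) ** 2).sum(axis=1)           # within-class scatter (3.3)
        return s_b / (s_w + 1e-12)                          # ratio (3.1), per row

    def select_atoms(X, labels, s):
        """Indices of the s highest-scoring atoms, a greedy surrogate for (3.6)."""
        return np.argsort(fisher_scores(X, labels))[::-1][:s]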
The original CoSaMP algorithm was modified to be able to obtain the atoms that simultaneously produce a reconstructive representation of a set of images from different classes. Then, the objective function (3.6) was introduced to the algorithm to obtain the atoms that produce a discriminative representation of the same set of images.

Given a signal Y ∈ ℝ^{m×n} and a dictionary Φ ∈ ℝ^{m×k} with k atoms, the Reconstructive CoSaMP (RecCoSaMP) algorithm starts by initializing the residue R as the input matrix Y and the representation matrix A ∈ ℝ^{k×n} as a matrix of zeros. Then, the algorithm obtains the set of indices of the atoms that produce a reconstructive representation of the images in Y by:

(1) Generating a proxy matrix P(r, c) = Φ^t R, where c ∈ {1, 2, ..., n} and r ∈ {1, 2, ..., k}.
(2) Obtaining the largest coefficient from each row of the proxy matrix and storing it in the vector p(r).
(3) Identifying the indices of the 2s largest coefficients in the vector p(r).
(4) Merging the indices identified in step (3) with the set used to obtain the previous representation matrix A. At the first iteration, the set used to obtain the representation matrix is empty.
(5) Solving the least squares problem by restricting the columns of the dictionary to the indices in the merged set.
(6) Obtaining the largest coefficient from each row of the approximation matrix obtained in (5) and storing it in the vector b(r).
(7) Identifying the indices of the s largest coefficients in the vector b(r).
(8) Retaining only the dimensions of the approximation matrix obtained in step (5) that correspond to the indices identified in step (7), and setting all other dimensions to zero, to obtain the s-sparse representation matrix A.
(9) Updating the residue by subtracting the part of the input matrix that has already been approximated from the input matrix.

These steps are repeated until the stop criterion is met.

For the Discriminative CoSaMP (DiscCoSaMP) algorithm, the first five steps (1)-(5) are similar to the RecCoSaMP algorithm. Then, the objective function (3.6) is introduced to identify the s indices of the atoms that produce a discriminative representation of the input images. This set of indices is used to retain only the most significant dimensions of the representation matrix B, setting all other dimensions to zero. Finally, similar to the RecCoSaMP algorithm, the residue is updated. A mathematical description of these algorithms can be found in Figure 9.

In the first four steps of the DiscCoSaMP, a set of 2s atoms is selected based on the magnitude of the coefficients of the proxy signal to obtain the result of the least squares problem. These steps, however, may be redundant since this set of atoms is pruned to obtain only the s indices of the most discriminative atoms. The indices of the atoms that produce a discriminative representation of the images can be selected directly from the solution of the least squares problem while still obtaining sparse representations with high discrimination power for image classification.

Input: Signal Y ∈ ℝ^{m×n}, dictionary Φ ∈ ℝ^{m×k}, sparsity level s
Output: Set of indices Λ

    R = Y, A^0 = 0, Λ_0 = ∅, j = 1                       (residue and approximation initialization)
    repeat
        P(r, c) = Φ^t R                                  (generate the proxy)
        p(r) = max_c P(r, c)                             (largest coefficient in each row)
        Ω = arg max(p(r), 2s)                            (indices of the 2s largest coefficients)
        T = Ω ∪ Λ_{j-1}                                  (update the set of indices)
        B = Φ|_T^† Y                                     (least squares problem)
        RecCoSaMP:  b(r) = max_c B(r, c), Λ_j = arg max(b(r), s)   (indices of the s largest coefficients)
        DiscCoSaMP: Λ_j = arg max F(B)                   (indices of the s most discriminative atoms)
        A^j = B|_{Λ_j}                                   (pruning step)
        R = Y − Φ A^j                                    (residue update)
        j = j + 1
    until stop criterion

Figure 9. Reconstructive CoSaMP and Discriminative CoSaMP algorithms (the two algorithms differ only in how the set of indices Λ_j is selected)
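The selection logic of Figure 9 can be sketched as follows, reusing the hypothetical fisher_scores helper from the sketch in Section 3.1.1: candidate atoms come from the proxy, but the final s indices are chosen by the discrimination measure F instead of by coefficient magnitude. This is an illustration under stated assumptions (a fixed number of iterations, NumPy least squares), not the implementation used in the experiments.

    import numpy as np

    def disc_cosamp_indices(Y, Phi, labels, s, n_iter=10):
        """DiscCoSaMP-style selection (Figure 9): proxy -> 2s candidates ->
        least squares on the merged support -> keep the s most discriminative."""
        m, k = Phi.shape
        support = np.array([], dtype=int)
        R = Y.copy()
        for _ in range(n_iter):
            P = Phi.T @ R                                   # proxy matrix
            p = np.abs(P).max(axis=1)                       # largest coefficient per row
            candidates = np.argsort(p)[::-1][:2 * s]        # 2s candidate atoms
            T = np.union1d(candidates, support)             # merge with previous support
            B_T, *_ = np.linalg.lstsq(Phi[:, T], Y, rcond=None)   # least squares on T
            scores = fisher_scores(B_T, labels)             # discriminative pruning
            keep = np.argsort(scores)[::-1][:s]             # positions of the s best rows in T
            support = T[keep]
            A = np.zeros((k, Y.shape[1]))
            A[support] = B_T[keep]                          # pruned approximation
            R = Y - Phi @ A                                 # residue update
        return support

The returned indices play the role of the set Λ described above: they are fixed at training time and then used to obtain the low dimensional features of the test images.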
3.1.3 Discriminative Sparse Representations Algorithm

In this section, the proposed method derived from CoSaMP to obtain the indices of the atoms that produce a discriminative representation of a set of input images from different classes is presented [52]. As input, the proposed algorithm needs a matrix containing the m-dimensional vectorized training images Y ∈ ℝ^{m×n}, the overcomplete dictionary Φ ∈ ℝ^{m×k} with k atoms, the stop criterion and the desired sparsity level s (the number of atoms used to obtain the discriminative representation of the images). Similar to the greedy methods explained in Chapter 2, the residue R is initialized as the input matrix Y. Then, the proposed algorithm selects the indices of the atoms that produce a discriminative sparse representation of the input matrix by:

(1) Solving the least squares problem, X = Φ^† R, to obtain the approximation matrix.
(2) Getting the set Λ of the s indices of the atoms that solve the optimization problem (3.6) using the approximation matrix from step (1).
(3) Pruning the approximation matrix by retaining only the rows produced by the atoms in the set of indices obtained in (2) and setting all other rows to zero.
(4) Updating the residue through the subtraction between the input matrix and the part of the signal that has been approximated.

These steps are repeated until the selected stopping criterion is met. A mathematical description of the proposed algorithm can be found in Figure 10.

Input: Matrix of images Y ∈ ℝ^{m×n}, dictionary Φ ∈ ℝ^{m×k}, sparsity level s
Output: Dictionary indices Λ

    R^0 = Y                                              (residue initialization)
    l = 1
    repeat
        X = Φ^† R^{l-1}                                  (least squares problem)
        Λ_l = arg max F(X)                               (indices of the s largest values)
        A_s = X|_{Λ_l}                                   (sparse approximation)
        R^l = Y − Φ A_s                                  (residue update)
        l = l + 1
    until stop criterion

Figure 10. Discriminative Sparse Representations Algorithm

In the literature, several methods are described to solve the least squares problem. In [32], the use of Richardson's iteration or the conjugate gradient method is recommended to solve this problem. In this thesis, the biconjugate gradient stabilized method (BiCGStab) was used, which is a variant of the conjugate gradient method with smoother and faster convergence. As the stopping criterion, given that Λ_l is the set of indices identified at the l-th iteration and Λ_{l-1} is the set of indices identified at the (l-1)-th iteration, the cardinality of the intersection between Λ_l and Λ_{l-1} was used. If the cardinality of the intersection is less than the predetermined sparsity value s, the algorithm continues to the next iteration, until the following criterion is met:

    |\Lambda_{l-1} \cap \Lambda_l| = s    (3.7)

3.2 EXPERIMENTS AND RESULTS

In this section, experimental results on two different databases are presented to evaluate the performance of the algorithm. The algorithm is applied to noiseless images and to images with different levels of noise and occlusion to demonstrate its robustness. The experiment in Section 3.2.3 uses a subset of the COIL database to evaluate the performance of the proposed algorithm on images with low intra-class variability.
Also, a comparison between the results of the proposed algorithm with those of a modified version of OMP will be presented to evaluate if the number of atoms selected at each iteration affects on the classification results. The classification results were obtained using Support Vector Machine (SVM) as the classifier, where the accuracy of SVM depends on the model parameters chosen to classify the images. In the second experiment, to avoid the dependency of the classification results on the classifier parameters the performance of the proposed algorithm was evaluated 32 using a similarity metric based on correlation. The database used in this experiment contains images with high intra-class variability to prove that the proposed algorithm can handle class variability and the results compared with those of LDA. 3.2.1 Databases and Experimental Setup The first database used to evaluate the algorithm is the Columbia Object Image Library (COIL-20) dataset. This database consists of 1440 grayscale images of 20 different objects as shown in Figure 11. Each object was placed in a motorized turntable against a black background and it was rotated through 360 degrees to obtain images every 5 degrees, for a total of 72 images per object [53]. Each image in the database was resized from 128 128 pixels to 16  16 pixels and rearranged into a vector of length 256. These vectors were divided into training and testing sets and the vectors of each set were concatenated in a matrix where the training matrix was used to obtain the atoms that simultaneously represent the images and the test matrix to obtain the classification accuracies. From this database only a subset of 423 images containing the first six objects was used, which is going to be called Coil-1. 33 Object 1 Object 2 Object 3 Object 4 Object 5 Object 6 Object 7 Object 8 Object 9 Object 10 Object 11 Object 12 Object 13 Object 14 Object 15 Object 16 Object 17 Object 18 Object 19 Object 20 Figure 11. Sample images from COIL-20 database (The images in this figure are for visual reference only, the text in the images are not meant to be readable) The second database is a more challenging database whose objects are against real world backgrounds, under different lightning conditions, with high intra-class variability and can even contain some occlusions. The database is the TU Darmstadt Database from the PASCAL Object Recognition Database Collection, formerly the ETHZ database [54]. This database consists of images from three different classes: side views of motorbikes, cows and cars. Samples of the images from this database can be seen in Figure 12. From this database a subset of 50 images per class was extracted for a total of 150 images. The objects in the images were extracted with the 34 segmentation mask provided and resized into 128 128 images, whose pixels were rearranged into a vector of length 16,384. Similar to the previous database, the vectors were divided into training and testing sets and the vectors of each set were concatenated in a matrix where the training matrix was used to obtain the atoms that simultaneously represent the images and the test matrix to obtain the classification accuracies. Figure 12. Sample images from the ETHZ database The robustness of the algorithms was evaluated by applying different levels of noise and occlusion to the images. For both databases, the noisy images were generated using random Gaussian noise with signal-to-noise ratios (SNR) of 20dB, 15dB and 10dB. 
In both cases, the features of the images were extracted using an overcomplete dictionary formed by a combination of Haar atoms and Gabor functions. Haar wavelets have been used to detect and model discontinuities in images [55, 56]. On the other hand, Gabor functions are well suited to modeling continuous elements and directionality in images. Some advantages of the Gabor functions are their ability to preserve the neighborhood relation between pixels, their robustness against illumination changes and noise, and their ability to represent images in a way that resembles human visual perception [57]. The ability of these bases to model both continuities and discontinuities makes them suitable for classifying natural images, which consist of continuous and discontinuous elements. The overcomplete dictionary used in the experiments contains 256 Haar atoms and 781 Gabor atoms, for a total of 1037 atoms. The Haar atoms were generated using the scaling function corresponding to the length of the vectorized images and the wavelet functions for all shifts and eight scales. The Gabor functions were generated as:

    G(x) = g(x) cos(w_0 (x − x_0))                                   (3.8)

where w_0 = 5 is the center of support in the frequency domain and

    g(x) = (1 / (√(2π) σ)) exp( −(x − x_0)² / (2σ²) )                (3.9)

is a Gaussian function with scale σ = 0.05 and center x_0.

3.2.2 Reconstructive vs. Discriminative Representation

The first experiment evaluates the performance of RecCoSaMP and DiscCoSaMP to illustrate how the inclusion of a cost function that quantifies separability between the image classes can improve the classification accuracy. This experiment was performed using Coil-1, which contains 432 images from 6 different classes, to evaluate the Reconstructive and Discriminative CoSaMP algorithms presented in Section 3.1.2. The experiment used 396 images for training and 36 for testing. For the classification stage, SVM was used as the classifier, and the results are an average over a 12-fold cross-validation. SVM is a method that uses a kernel mapping function to transform the input data to a higher dimensional space, seeking the hyperplane that best separates the feature vectors from different classes. Some of the most commonly used kernel functions are linear, polynomial, radial basis functions and sigmoids. SVM was implemented using the library LibSVM [58] with the radial basis function as the kernel function. The accuracy of the algorithm depends strongly on the model parameters; if these parameters are not selected correctly, the accuracy results may not be optimal. In [59], it is recommended to use a grid search to find the optimal values of the parameters. This is a method that evaluates the performance of the algorithm for different values of each parameter across a desired range. Then, the set of parameters with the "best" accuracy is picked.
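As an illustration of the grid search recommended in [59], the following hedged sketch tunes the RBF-SVM parameters (cost C and kernel width gamma) by cross-validation. It uses scikit-learn's interface to LIBSVM rather than the LIBSVM command-line tools, and the parameter ranges shown are illustrative assumptions, not the values used in the thesis.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def tune_rbf_svm(features, labels, n_folds=12):
    """Grid search over (C, gamma) for an RBF-kernel SVM, in the spirit of [59].
    `features` holds one sparse-coefficient vector per training image."""
    param_grid = {
        "C": 2.0 ** np.arange(-5, 16, 2),       # illustrative exponential grids
        "gamma": 2.0 ** np.arange(-15, 4, 2),
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=n_folds)
    search.fit(features, labels)
    return search.best_estimator_, search.best_params_
```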
The goal of this work is to obtain the maximum classification accuracy with the smallest number of atoms. Therefore, the results are evaluated in terms of the optimal and maximum results using the following cost function:

    max( α A − β S_p )                                               (3.10)

where A is the classification accuracy (in percent), S_p is the sparsity (the number of atoms selected to obtain that accuracy divided by the total number of atoms in the dictionary, in percent) and α, β are constants. When α = 1 and β = 1, the optimal results are obtained, corresponding to the point where there is an optimal tradeoff between the accuracy and the number of atoms used to classify the images. The maximum accuracy results correspond to α = 1 and β = 0, i.e., the maximum accuracy reached with the smallest number of atoms.

From Table 1, it can be seen that there is no large difference between the optimal accuracies of the two algorithms, but the optimal sparsity of the Discriminative CoSaMP (DiscCoSaMP) is much smaller than that of the Reconstructive CoSaMP (RecCoSaMP). This is because the first atoms selected by RecCoSaMP are suitable for reconstructing the images, not necessarily for classifying them, so more atoms are needed to obtain high classification results. Comparing the two algorithms, there is a difference of 6.26% in sparsity, which corresponds to 65 more atoms needed to obtain similar results. These results show that, using the cost function (3.6), classification accuracies near the maximum results can be obtained with far fewer atoms than with a reconstructive cost function.

  Algorithm                Opt. Spar. (%)   Opt. Acc. (%)   Max. Spar. (%)   Max. Acc. (%)
  Reconstructive CoSaMP        13.91            89.35           21.33            93.52
  Discriminative CoSaMP         7.65            90.28           21.48            93.98

Table 1. Classification results using DiscCoSaMP and RecCoSaMP from Coil-1 (noiseless)

3.2.3 Robustness of the Discriminative Sparse Representations Algorithm

In this section, an evaluation of the proposed algorithm using images with low intra-class variability is performed. The robustness and sparsity of the algorithm are evaluated on images under different levels of Gaussian noise and occlusion. Also, a comparison between the classification results obtained using the atoms selected with OMP and with the proposed algorithm is presented. The image database used was Coil-1, which contains 432 images from 6 classes with low intra-class variability. From the 432 images in the database, 396 were used as training images to obtain the indices of the atoms that best discriminate between classes, and the other 36 were used as testing images to evaluate the performance of the algorithm using SVM as the classifier. The results are averaged over a 12-fold cross-validation. To be able to compare the results of the proposed algorithm with OMP, it is necessary to modify OMP so that it selects the atoms that produce the most discriminative representation instead of a reconstructive one. The modification was done by replacing the identification step in the OMP algorithm with the cost function in equation (3.6). The original step selected the atom most correlated with the residue; with the introduction of the cost function (3.6), the algorithm selects the atom that produces the most sparse and discriminative representation. The mathematical description of the modified OMP algorithm can be seen in Figure 13.

Input: signal Y, dictionary Φ
Output: set of indices Λ

    R_0 = Y                                    (residue initialization)
    A_0 = 0                                    (initial approximation)
    Λ_0 = ∅
    n = 1
    repeat
        C = ⟨R_{n−1}, Φ⟩
        λ_n = arg max F(C)                     (atom selection)
        Λ_n = Λ_{n−1} ∪ {λ_n}                  (merging the indices of the selected atoms)
        A_n = arg min_X ‖ Y − Φ|_{Λ_n} X ‖_2   (least squares minimization)
        R_n = Y − Φ|_{Λ_n} A_n                 (residue update)
        n = n + 1
    until stop criterion

Figure 13. A modified version of OMP obtained by adding a cost function that quantifies discrimination power and level of sparsity
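A hedged sketch of the modified OMP of Figure 13 is given below. It reuses the hypothetical fisher_scores helper from the sketch at the end of Section 3.1.3 and selects a single atom per iteration (in contrast to the s atoms per iteration selected by the proposed algorithm); the fixed number of iterations used as the stop criterion is an assumption for illustration.

```python
import numpy as np

def discriminative_omp(Y, Phi, labels, n_atoms):
    """Sketch of Figure 13: OMP with its correlation-based atom selection
    replaced by a discriminative score, one atom per iteration."""
    R = Y.copy()                               # residue initialization
    selected = []
    for _ in range(n_atoms):
        C = Phi.T @ R                          # correlations of every atom with the residue
        scores = fisher_scores(C, labels)      # hypothetical per-atom discrimination scores
        scores[selected] = -np.inf             # do not re-select an atom
        selected.append(int(np.argmax(scores)))
        Phi_sel = Phi[:, selected]
        A, *_ = np.linalg.lstsq(Phi_sel, Y, rcond=None)   # least squares on the selected atoms
        R = Y - Phi_sel @ A                    # residue update
    return selected
```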
Table 2 shows the optimal and maximum accuracy results obtained with the modified OMP and the proposed algorithm for the case of Gaussian noise. It can be seen that, when the noise level is low (SNR of 20dB and 15dB), the optimal accuracy results of the proposed method are slightly higher than those of the modified OMP, with a maximum difference in the optimal sparsity values of around 0.58% (which corresponds to a difference of 6 atoms). When the noise level is higher, in this case an SNR of 10dB, the optimal accuracy of the proposed method is higher than that of the modified OMP, with a difference of 0.16% in the optimal sparsity values (which corresponds to a difference of 2 atoms). For the case of images with occlusion, from Table 3, it can be seen that the proposed method has better optimal and maximum accuracy results than the modified OMP. For images with occlusions of size 7×7, the classification accuracies of the proposed method are achieved with a smaller number of atoms than the modified OMP. In conclusion, even though the sparsity values of the two algorithms are not very different, the proposed algorithm still obtains better accuracy results than the modified OMP. These results show the robustness of the algorithm under noisy conditions and missing data. In real problems, since the amount of noise or missing data in the images is unknown, it is better to use the proposed algorithm, because higher classification results can be obtained at the cost of using only a few more atoms than the modified OMP. Also, the ability of the proposed algorithm to select multiple atoms at each iteration makes its computational time much shorter than that of the modified OMP.

  Method               Opt./Max. Results   Noiseless   20dB    15dB    10dB
  Modified OMP         Opt. Sparsity         3.52      3.55    3.53    4.32
                       Opt. Accuracy        90.74     86.40   84.55   77.96
                       Max. Sparsity         5.95      7.53    8.79   11.35
                       Max. Accuracy        92.59     88.24   86.65   81.50
  Proposed Algorithm   Opt. Sparsity         3.42      3.94    4.11    4.48
                       Opt. Accuracy        90.74     86.73   84.80   80.71
                       Max. Sparsity         4.97      4.42    5.25    6.31
                       Max. Accuracy        91.44     87.05   85.20   81.15

Table 2. Classification results using the modified OMP algorithm and the proposed greedy algorithm from Coil-1 with Gaussian noise (all values in %)

  Method               Opt./Max. Results   Noiseless   3×3     5×5     7×7
  Modified OMP         Opt. Sparsity         3.52      3.79    3.37    3.53
                       Opt. Accuracy        90.74     81.59   73.72   62.70
                       Max. Sparsity         5.95     10.57   11.67   13.81
                       Max. Accuracy        92.59     84.24   77.02   66.65
  Proposed Algorithm   Opt. Sparsity         3.42      3.70    3.51    2.66
                       Opt. Accuracy        90.74     84.31   76.87   68.75
                       Max. Sparsity         4.97      5.49    5.72    4.11
                       Max. Accuracy        91.44     84.92   77.58   68.12

Table 3. Classification results using the modified OMP algorithm and the proposed greedy algorithm from Coil-1 with occlusion (all values in %)

3.2.4 Comparison with LDA

Since LDA is one of the most commonly used methods to obtain an image representation in a lower dimensional space, the results obtained using this method were compared with those of the proposed method. The comparison was done using the ETHZ database and 10-fold cross-validation.
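The thesis does not specify how LDA was implemented; for reference, a minimal baseline of this kind could be set up with scikit-learn (an illustrative assumption, not the author's code), operating on the same patch-projection feature vectors described below.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_baseline(train_feats, train_labels, test_feats):
    """Fit LDA on the training feature vectors and predict labels for the test vectors."""
    lda = LinearDiscriminantAnalysis()
    return lda.fit(train_feats, train_labels).predict(test_feats)
```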
From the 150 images, 135 were used as training images to obtain the indices of the atoms that best discriminate between classes, and the other 15 were used as test images to evaluate the performance of the algorithm. Given that the dimensionality of the images in this database is 16,384 (128×128), obtaining a representation of the images would require constructing a dictionary of the same dimensionality. To avoid building such a high dimensional dictionary, which would increase the computational complexity of the algorithm, nonoverlapping patches of 16×16 pixels were extracted from the images. This produces a total of 64 nonoverlapping patches per image, from which only 10, selected randomly, were used to extract the features. To have a fair comparison, the patches were projected onto the same dictionary used to obtain the representation of the images in the proposed algorithm, so that the feature vectors are extracted from the same feature space.

Because the search for the SVM parameters can be computationally expensive, and the chosen range may not contain the optimal values, the evaluation for this experiment was performed using a metric based on correlation, which is independent of the classifier. First, the features were extracted by projecting the extracted patches onto the selected atoms. Then, the l-th test patch is assigned to the class of the training patch that maximizes the following cost function:

    î = arg max_i  (1/q) Σ_{t=1}^{q} [ (b_ji(t) − b̄_ji)(d_l(t) − d̄_l) ] / ( σ_{b_ji} σ_{d_l} )        (3.11)

where

  - b_ji is the feature vector extracted from the j-th patch of the i-th training image
  - b̄_ji is the mean of b_ji and σ_{b_ji} its standard deviation
  - d_l is the feature vector extracted from the l-th test patch of the test image
  - d̄_l is the mean of d_l and σ_{d_l} its standard deviation
  - q is the dimension of the feature vectors

After classifying all the patches, the test image is assigned to the class that has the maximum number of patches assigned to it:

    î = arg max_{i ∈ C} p(i)                                         (3.12)

where p(i) is the number of test patches assigned to the i-th class and î is the label assigned to the test image. Finally, the classification accuracy is defined as:

    Accuracy = ( 1 − ( Σ_{l=1}^{z} e_l ) / z ) × 100                  (3.13)

where z is the total number of test images, e_l = 1 if the test image is misclassified, and e_l = 0 otherwise. To avoid overfitting of the results, an n-fold cross-validation was performed for all the experiments, and for the case of noisy images, the process of adding Gaussian noise to the test images in each cross-validation fold was repeated 15 times.
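A compact sketch of the patch-voting scheme in (3.11)-(3.13) is given below. It is illustrative only: it assumes the patch features have already been extracted (for example, by projecting each 16×16 patch onto the selected atoms) and uses NumPy's corrcoef for the normalized correlation of (3.11).

```python
import numpy as np
from collections import Counter

def correlation(a, b):
    """Normalized correlation between two feature vectors, as in (3.11)."""
    return np.corrcoef(a, b)[0, 1]

def classify_image(test_patch_feats, train_patch_feats, train_patch_labels):
    """Assign each test patch to the class of its most correlated training patch,
    then label the image by majority vote over its patches, as in (3.12)."""
    votes = []
    for d in test_patch_feats:                        # one feature vector per test patch
        corrs = [correlation(b, d) for b in train_patch_feats]
        votes.append(train_patch_labels[int(np.argmax(corrs))])
    return Counter(votes).most_common(1)[0][0]

def accuracy(predicted, true_labels):
    """Classification accuracy in percent, as in (3.13)."""
    errors = sum(p != t for p, t in zip(predicted, true_labels))
    return (1 - errors / len(true_labels)) * 100
```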
The optimal and maximum accuracy results of LDA and the proposed algorithm can be found in Table 4 and Table 5. These results show that the proposed algorithm always surpasses LDA. For the case where the images have Gaussian noise, as expected, LDA has low performance due to its sensitivity to noise. On the other hand, the proposed method is robust to noise: even when the images have SNR = 10 dB, the optimal classification accuracy does not drop below 88%, with a sparsity level of at most 4.95%, which corresponds to around 52 atoms. This is a reduction of 80% in the dimensionality of the patches, which corresponds to a reduction of 96.82% in the dimensionality of the images (520 features per image out of 16,384 pixels). For the case of images with occlusion, the optimal accuracies do not drop below 89%, using at most 42 atoms. These results show that, even with images with high intra-class variability, the algorithm is robust to noise, and the sparsity level needed to obtain the optimal accuracy results does not exceed 5%.

  Method               Opt./Max. Results   Noiseless   20dB    15dB    10dB
  LDA                  Opt. Sparsity         4.61      6.38    6.01    6.37
                       Opt. Accuracy        81.33     65.80   63.07   62.43
                       Max. Sparsity         7.73      9.01    9.79    9.20
                       Max. Accuracy        82.67     67.60   65.40   64.00
  Proposed Algorithm   Opt. Sparsity         2.87      4.12    4.34    4.95
                       Opt. Accuracy        90.67     89.27   89.20   88.00
                       Max. Sparsity         7.40     15.14   13.87   14.94
                       Max. Accuracy        91.33     91.33   92.40   91.13

Table 4. Classification results using LDA and the proposed greedy algorithm from the ETHZ database with Gaussian noise (all values in %)

  Method               Opt./Max. Results   Noiseless   15×15   19×19   31×31
  LDA                  Opt. Sparsity         4.61      5.18    4.71    4.67
                       Opt. Accuracy        81.33     79.87   78.87   78.13
                       Max. Sparsity         7.73      7.70    7.11    7.05
                       Max. Accuracy        82.67     81.33   80.27   79.47
  Proposed Algorithm   Opt. Sparsity         2.87      3.16    3.49    4.05
                       Opt. Accuracy        90.67     90.13   90.13   89.33
                       Max. Sparsity         7.40     11.07   12.61   14.46
                       Max. Accuracy        91.33     92.40   92.93   92.53

Table 5. Classification results using LDA and the proposed greedy algorithm from the ETHZ database with occlusion (all values in %)

3.3 CONCLUSIONS

In this chapter, a modified cost function that achieves a tradeoff between discrimination power and sparsity was proposed, and a corresponding greedy algorithm based on CoSaMP was developed to obtain the atoms that produce a discriminative sparse representation of a set of images. The atoms selected from the overcomplete dictionary based on the training images were used to extract features and classify images from the testing set. It was shown that using a cost function that selects atoms producing discriminative representations rather than reconstructive ones can reduce the number of atoms needed to obtain high classification accuracies. Two different experiments were performed to evaluate the performance and robustness of the proposed algorithm. To evaluate its robustness, different levels of Gaussian noise and occlusion were added to the test images. The experiments showed that the proposed algorithm can work under conditions where the images contain high levels of noise and occlusion while still maintaining a low sparsity level. Also, it was shown that the algorithm can handle image datasets with low intra-class variability as well as high intra-class variability. In addition, it was shown that the proposed algorithm can work with reduced dimensionality data, such as nonoverlapping random patches, instead of whole images, thereby avoiding the construction of high dimensional dictionaries.

CHAPTER 4
DISCRIMINATIVE FEATURE SELECTION FROM COMPRESSED SENSING MEASUREMENTS FOR IMAGE CLASSIFICATION

In recent years, there has been a move in signal processing toward sensing signals with fewer samples than the Nyquist rate requires, which is known as compressed sensing. Recently, this idea has been explored in the area of signal classification and detection, where the high dimensionality of the data can increase the complexity of such applications. For example, the current methods to obtain images for MRI applications can be time-consuming due to the high dimensionality of the images. To solve this problem, compressed sensing has been used to improve the acquisition time of the images without degrading the image quality [60].
In [61], the use of random measurements from local patches is proposed to classify texture images with a single nearest neighbor classifier. First, a texton dictionary is learned from the compressed sensing measurements of the patches using K-means clustering, and a histogram per class is learned by assigning each patch to the closest texton in the dictionary. Then, a new texture image is classified by finding the class histogram closest to the new image's histogram. In [62], compressed sensing has been used for signal detection by applying the MP algorithm to select fewer incoherent measurements than are necessary for reconstruction purposes. Compressed sensing has also been used for signal classification [63] by projecting the training signals and the test signal onto random vectors and assigning the test signal to the class of the training signal whose compressed measurements are closest to those of the test signal.

In this thesis, the work of Haupt et al. [63] is used as motivation for the work presented in this chapter. Given a test signal f_T ∈ ℝ^n and a sampling matrix A ∈ ℝ^(k×n), where A_j is the j-th row of the matrix, Haupt et al. have shown that the compressed measurements of the signal f_T, besides being used for reconstruction purposes, can be used as features for classification purposes. The compressed measurements of the signal f_T are obtained through the following inner product:

    y(j) = ⟨A_j, f_T⟩ + w(j)   for j = 1, 2, ..., k                   (4.1)

where w(j) models the measurement noise, with entries that are i.i.d. N(0, σ²) random variables, and k ≪ n. Given a set F = {f_1, f_2, ..., f_m} of m training signals, the classification is performed by assigning to the test signal the class of the training signal that solves the following minimization problem:

    f̂_k = arg min_{f ∈ F} ‖ y − A f ‖²                               (4.2)

Haupt et al. performed several simulations with this method to validate the following theoretical classification error bound:

    P( f̂_k ≠ f_T ) ≤ (m − 1) ( 1 + ε d_min² / (4 n σ²) )^(−k/2)      (4.3)

where

    d_min = min_{f_i, f_j ∈ F, i ≠ j} ‖ f_i − f_j ‖_2                 (4.4)

is the minimum Euclidean distance between every pair of training signals and ε accounts for the cases where the training signals in F do not have unit norm. This method uses all of the measurements obtained from compressed sensing without taking into account the presence of possible redundancy or irrelevant features in the measurements, which may reduce the classification accuracy. This problem can be addressed by applying a feature selection method in conjunction with compressed sensing to keep only the features relevant for classification. Feature selection is an important step in image classification: it selects a subset of relevant features from an input set of images to construct a robust model for classification purposes. Given a set of features, eliminating the irrelevant ones helps to improve the classification accuracy compared with using the whole set. In [64], it is shown that classification accuracy can drop significantly in the presence of irrelevant features if the feature selection method used fails to identify them. This chapter proposes to include a cost function to select the most relevant features from the compressed sensing measurements.
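A brief sketch of classification directly from compressive measurements, in the spirit of (4.1)-(4.2), is shown below (NumPy, illustrative only; the noise level and matrix sizes are arbitrary assumptions rather than values taken from [63]).

```python
import numpy as np

def classify_from_measurements(A, train_signals, train_labels, f_test, noise_std=0.0):
    """Project the test signal onto the rows of A as in (4.1), then assign it to the
    class of the training signal whose projections are closest, as in (4.2)."""
    y = A @ f_test + noise_std * np.random.randn(A.shape[0])   # compressive measurements
    dists = [np.linalg.norm(y - A @ f) for f in train_signals]
    return train_labels[int(np.argmin(dists))]

# Illustrative sizes: n-dimensional signals, k << n random measurements
# n, k = 16384, 100
# A = np.random.randn(k, n)            # Gaussian sampling matrix
# label = classify_from_measurements(A, train_signals, train_labels, f_test)
```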
4.1 DISCRIMINATIVE COMPRESSED MEASUREMENTS

To select the set of measurements/features that are the most relevant for classification purposes, a metric that measures the discriminative power of each feature vector is proposed. This measure is used to rank the feature vectors obtained from the random measurements and to select the most discriminative ones for classification. With this method, better classification accuracy is expected using fewer feature vectors than the whole set obtained with compressed sensing. Given a set F = {f_1, f_2, ..., f_m} of m training signals, where f_i ∈ ℝ^n, and a sampling matrix A ∈ ℝ^(k×n) with k random vectors, the random measurements are obtained as:

    M(j, i) = Σ_{l=1}^{n} A(j, l) F(l, i)   for j = 1, 2, ..., k and i = 1, 2, ..., m     (4.5)

The discrimination power of the measurements produced by each random vector is obtained using the Fisher discriminant ratio (explained in Section 3.1.1) as:

    F(M) = trace(S_b) / trace(S_w)                                    (4.6)

Based on this measure, the features are rank ordered and the first p dimensions are selected and stored in M̃ ∈ ℝ^(p×m), discarding the k − p dimensions with smaller discrimination values. The indices of the p selected feature vectors are stored in Ω and used to restrict the rows of the sampling matrix A when obtaining the measurements of the test image and the training images. The test image y is classified to the l-th class if the training image that minimizes

    arg min_{f ∈ F} ‖ y − A|_Ω f ‖                                    (4.7)

also belongs to that class. As before, the classification accuracy is defined as:

    Accuracy = ( 1 − ( Σ_{i=1}^{z} e_i ) / z ) × 100                  (4.8)

where z is the total number of test images, e_i = 1 if the test image is misclassified, and e_i = 0 otherwise. The mathematical description of this method can be found in Figure 14.

Input: training signals F ∈ ℝ^(n×m), test signal f_T ∈ ℝ^n, sampling matrix A ∈ ℝ^(k×n)
Output: test image label l

Haupt et al. algorithm:
    M = ⟨A, F⟩                              (training measurements)
    y = ⟨A, f_T⟩                            (test measurements)
    l = arg min_{f ∈ F} ‖ y − A f ‖         (test labeling)

Proposed algorithm:
    M = ⟨A, F⟩                              (training measurements)
    Ω = arg max ( F(M) )                    (feature selection)
    y = ⟨A|_Ω, f_T⟩                         (test measurements)
    l = arg min_{f ∈ F} ‖ y − A|_Ω f ‖      (test labeling)

Figure 14. Haupt et al. algorithm (top) and the proposed algorithm (bottom) for image classification from compressive measurements
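The procedure of Figure 14 can be sketched as follows (a minimal NumPy illustration, not the thesis code; the per-measurement Fisher ratio computed from class means and variances is a stand-in for (4.6), and all variable names are assumptions).

```python
import numpy as np

def fisher_ratio_per_measurement(M, labels):
    """Fisher-style score for each row of M (one row per random measurement),
    in the spirit of (4.6): between-class scatter over within-class scatter."""
    classes = np.unique(labels)
    overall_mean = M.mean(axis=1, keepdims=True)
    Sb = np.zeros(M.shape[0])
    Sw = np.zeros(M.shape[0])
    for c in classes:
        Mc = M[:, labels == c]
        mc = Mc.mean(axis=1, keepdims=True)
        Sb += Mc.shape[1] * (mc - overall_mean).ravel() ** 2
        Sw += ((Mc - mc) ** 2).sum(axis=1)
    return Sb / (Sw + 1e-12)

def classify_with_selected_measurements(A, F, labels, f_test, p):
    """Proposed scheme of Figure 14: keep the p most discriminative measurements,
    then classify by minimum distance in the reduced measurement space, as in (4.7)."""
    M = A @ F                                          # training measurements, (4.5)
    omega = np.argsort(fisher_ratio_per_measurement(M, labels))[-p:]   # feature selection
    A_sel = A[omega]                                   # restrict the sampling matrix rows
    y = A_sel @ f_test                                 # test measurements
    dists = [np.linalg.norm(y - A_sel @ F[:, i]) for i in range(F.shape[1])]
    return labels[int(np.argmin(dists))]
```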
4.2 EXPERIMENTS AND RESULTS

In this section, the results obtained with the proposed algorithm are compared to those of the Haupt et al. algorithm. The database used in this experiment is the ETHZ database described in Section 3.2.1. In this case, to obtain the compressed sensing measurements, the original 128×128 images were used instead of 16×16 nonoverlapping patches. The elements of the sampling matrix were drawn from a Gaussian distribution with zero mean and unit standard deviation. The results in this section are given for 100 simulations with a 10-fold cross-validation. Figure 15 shows the results for (a) 20 measurements, (b) 40 measurements, (c) 60 measurements, (d) 80 measurements, (e) 100 measurements and (f) 256 measurements as the size of the compressed sample set from which the most relevant features are extracted. The solid red line in each graph corresponds to the result of the Haupt et al. algorithm, used as a reference, and the blue line with asterisks corresponds to the results obtained with the proposed feature selection method.

[Figure 15: six panels of classification accuracy versus number of selected measurements/features.]

Figure 15. Classification results of the proposed method using different numbers of measurements: (a) 20 measurements, (b) 40 measurements, (c) 60 measurements, (d) 80 measurements, (e) 100 measurements and (f) 256 measurements.

For all the cases presented in Figure 15, the proposed algorithm obtains similar or better results with fewer measurements/features than using the whole set. When the number of measurements is less than 20, removing features degrades the classification accuracy, because all of the available features are needed to discriminate between the images. Table 6 presents the classification results obtained using different numbers of measurements and the size of the smallest subset of those measurements needed to obtain similar or better accuracy with the proposed method. It can be seen that the achievable reduction in the size of the feature set increases as the number of compressed sensing measurements increases. Hence, when more features are available, there is more redundancy and a greater chance to improve classification by adding a feature selection step to compressed sensing.

  # of Measurements   Accuracy using the whole set   Smallest # of features to reach the   Maximum reduction
                      of measurements/features       same or better accuracy (proposed)    achievable
  20 Measurements            79.15%                             13                               35%
  40 Measurements            83.85%                             21                               47.5%
  60 Measurements            84.39%                             19                               68.33%
  80 Measurements            84.61%                             15                               81.25%
  100 Measurements           84.6%                              13                               87%
  256 Measurements           84.4%                              9                                96.48%

Table 6. Number of measurements needed with the proposed method to achieve results similar to or better than using the whole set of measurements

4.3 CONCLUSIONS

In this chapter, the inclusion of a discrimination measure for the selection of features from compressed sensing measurements was proposed. The motivation for this modification to compressed sensing is to remove redundant or irrelevant features from the original compressed features in order to increase the classification accuracy. The performance of the proposed algorithm was evaluated with noiseless images from the ETHZ database. It was found that using the Fisher discriminant ratio to select a subset of the compressed sensing samples improves the classification accuracy and reduces the size of the feature set. As a result, the selection of a subset of the features produced similar or better results than using the whole set of features. With the proposed algorithm, the number of measurements could be reduced by 35% when starting from 20 measurements and by 96.48% when starting from 256 measurements. In conclusion, it is possible to achieve better classification results using a subset of the features than the whole set.

CHAPTER 5
CONCLUSIONS AND FUTURE WORK

5.1 CONCLUSIONS

This thesis discussed the problem of selecting a compact set of features with high discrimination power for image classification. The proposed methods build on sparse representation and compressed sensing, two closely related and well-developed techniques for signal compression and reconstruction, and extend the framework of sparse signal representation and sensing to image classification applications so that the representations are sparse and discriminative at the same time. Both methods help to reduce the dimensionality of the features and thus avoid handling high dimensional images for classification purposes.
The first method achieves this goal by projecting the sample images onto a small number of elements from an overcomplete dictionary that produce a discriminative sparse representation of the data. The second method uses a feature selection measure to obtain, from random compressed sensing measurements, the most relevant measurements (features), such that the classification accuracy is equal to or better than that obtained using all the measurements. In the first part of this thesis, a sparse representation method was proposed based on a greedy algorithm similar to CoSaMP to select a group of atoms from an overcomplete dictionary that produce a discriminative sparse representation of a set of images from different classes. This goal was achieved through the inclusion of a cost function that trades off discrimination power against sparsity. First, it was shown that discriminative representations perform better than reconstructive representations for classification purposes. Then, different experiments were performed using two different image databases with different levels of Gaussian noise and occlusion. The results of these experiments showed that the inclusion of the cost function helps to select atoms that produce discriminative features, and that the classification accuracy of the proposed method is better than the accuracy obtained with the modified OMP. The proposed algorithm was also compared with LDA using 128×128 images from the ETHZ database. In this case, random patches of size 16×16 were extracted to avoid the construction of a high dimensional dictionary. It was shown that classification accuracies higher than those of LDA can be obtained using low dimensional features extracted from a set of random patches. In the worst case, the dimensionality of the patches was reduced from 256 to 52, which corresponds to a reduction of 80%. In conclusion, it was shown that the proposed algorithm can handle images with high intra-class variability and high levels of noise and occlusion.

For the second part of this thesis, a feature selection method was proposed to select the most relevant features from the compressed sensing measurements for classification purposes. Using a cost function that measures the discrimination power of the set of random measurements, it was shown that a subset of the measurements produces better classification accuracy than the whole set. This is a result of eliminating irrelevant features present in the compressed sensing measurements. The greater the number of compressed sensing measurements available to choose from, the greater the number of irrelevant features that can be eliminated using the feature selection method proposed in Chapter 4.

5.2 FUTURE WORK

As future work, different methods could be evaluated to select the patches, instead of using random patches from high dimensional images. The selection of patches with high discrimination power would help to obtain higher classification accuracies. Some methods that have been used to extract discriminative patches for classification purposes are combinatorial and statistical methods [65] or methods that extract the patches around interest points detected using operators such as the Förstner operator [66] or the SIFT descriptor [67]. Also, future research could include determining the optimal size of the patches, so that there is a tradeoff between the dimensionality of the patch and the classification accuracy, in order to avoid excessive computational complexity in the implementation of the algorithm.
For the proposed method presented in Chapter 4, future work could consider sampling matrices other than the Gaussian random matrix used in this thesis. The sampling matrix used to obtain the measurements could be learned from the images such that only a small number of measurements carries high discrimination information. It was previously noted that constructing learned dictionaries can increase the computational complexity of the algorithm; to avoid this problem, a single low-dimensional sampling matrix could be learned using a subset of patches instead of the whole images.

REFERENCES

[1] I.T. Jolliffe, Principal Component Analysis, Series: Springer Series in Statistics, 2nd ed., Springer, NY, 2002.

[2] M. Vlachos, C. Domeniconi, D. Gunopulos, G. Kollios, and N. Koudas, “Non-Linear Dimensionality Techniques for Classification and Visualization,” Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 645-651, 2002.

[3] B. Schoelkopf, A. Smola, and K. R. Mueller, “Nonlinear Component Analysis as a Kernel Eigenvalue Problem,” Neural Computation, vol. 10, pp. 1299-1319, July 1998.

[4] W. Pan, T. D. Bui, and C. Y. Suen, “Text Detection from Scene Images using Sparse Representation,” 19th International Conference on Pattern Recognition, pp. 1-5, December 2008.

[5] M. Zhao and S. Li, “Sparse Representation Classification for Text Detection,” Second International Symposium on Computational Intelligence and Design, vol. 1, pp. 76-79, December 2009.

[6] G. K. Vinay, S. M. Haque, R. V. Babu, and K. R. Ramakrishnan, “Human Detection using Sparse Representation,” IEEE International Conference on Acoustics, Speech, and Signal Processing, March 2012.

[7] V. Cevher, A. Sankaranarayanan, M. F. Duarte, D. Reddy, R. G. Baraniuk, and R. Chellappa, “Compressive Sensing for Background Subtraction,” Lecture Notes in Computer Science, vol. 5303, pp. 155-168, 2008.

[8] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way, Academic Press, an imprint of Elsevier, 3rd ed., Burlington, MA, 2009.

[9] B. Vidakovic and P. Mueller, “Wavelets for Kids: A Tutorial Introduction,” Institute of Statistics and Decision Science, Duke University, pp. 1-28, 1994.

[10] S. M. Joseph, R. Sebastian, and B. Anto, “The Effect of Different Wavelets on Speech Compression,” ACM Proceedings of the 2011 International Conference on Communication, Computing & Security, pp. 265-268, 2011.

[11] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, “Image Coding Using Wavelet Transform,” IEEE Transactions on Image Processing, vol. 1, April 1992.

[12] A. S. Lewis and G. Knowles, “Image compression using the 2-D Wavelet Transform,” IEEE Transactions on Image Processing, vol. 1, April 1992.

[13] J. Reichel, G. Menegaz, M. J. Nadenau, and M. Kunt, “Integer Wavelet Transform for Embedded Lossy to Lossless Image Compression,” IEEE Transactions on Image Processing, vol. 10, pp. 383-392, March 2001.

[14] X. G. Xia, C. Boncelet, and G. Arce, “Wavelet Transform based watermark for digital images,” Optics Express, vol. 3, pp. 497-511, December 1998.

[15] P. Bao and X. Ma, “Image Adaptive Watermarking Using Wavelet Domain Singular Value Decomposition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, pp. 96-102, January 2005.

[16] J. L. Starck, E. J. Candes, and D. L. Donoho, “The Curvelet Transform for Image Denoising,” IEEE Transactions on Image Processing, vol. 11, pp. 670-684, June 2002.

[17] K. Huang, Z. Wu, G. S. K.
Fung, and F. H. Y. Chang, “Color image denoising with wavelet thresholding based on human visual system model,” Signal Processing: Image Communication, vol. 20, pp. 115-127, February 2005.

[18] A. Apatean, A. Rogozan, S. Emerich, and A. Bensrhair, “Wavelets as features for objects recognition,” Acta Tehnica Napocensis, vol. 49, pp. 23-36, February 2008.

[19] S. Rastegar, A. Babaeian, M. Bandarabadi, and Y. Toopchi, “Airplane Detection and Tracking Using Wavelet Features and SVM Classifier,” 41st Southeastern Symposium on System Theory, pp. 64-67, March 2009.

[20] K. M. Rajpoot and N. M. Rajpoot, “Wavelets and Support Vector Machines for Texture Classification,” 8th IEEE International Multitopic Conference (INMIC'04), pp. 328-333, 2004.

[21] S. Arivazhagan and L. Ganesan, “Texture Classification using Wavelet Transform,” Pattern Recognition Letters, vol. 24, pp. 1513-1521, June 2003.

[22] K. Huang and S. Aviyente, “Wavelet Feature Selection for Image Classification,” IEEE Transactions on Image Processing, vol. 17, pp. 1709-1720, September 2008.

[23] X. Wang, “Image Edge Detection Based on Lifting Wavelet,” Intelligent Human-Machine Systems and Cybernetics, vol. 1, pp. 25-27, August 2009.

[24] M. Rizzi, M. D’Aloia, and B. Castagnolo, “Computer Aided Detection of Microcalcifications in Digital Mammograms Adopting a Wavelet Decomposition,” Integrated Computer-Aided Engineering, vol. 16, pp. 91-103, 2009.

[25] M. N. Do and M. Vetterli, “The Contourlet Transform: An Efficient Directional Multiresolution Image Representation,” IEEE Transactions on Image Processing, vol. 14, pp. 2091-2106, December 2005.

[26] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation,” IEEE Transactions on Signal Processing, vol. 54, pp. 4311-4322, November 2006.

[27] S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM Journal on Scientific Computing, vol. 20, pp. 33-61, 1998.

[28] D. Wipf and B. Rao, “Sparse Bayesian learning for basis selection,” IEEE Transactions on Signal Processing, vol. 52, pp. 2153-2164, August 2004.

[29] J. Shi, X. Ren, G. Dai, J. Wang, and Z. Zhang, “A non-convex relaxation approach to sparse dictionary learning,” Proceedings of the IEEE International Conference on Computer Vision, pp. 1809-1816, June 2011.

[30] S. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Transactions on Signal Processing, vol. 41, pp. 3397-3415, December 1993.

[31] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal Matching Pursuit: Recursive Function Approximation with Applications to Wavelet Decomposition,” Asilomar Conference on Signals, Systems and Computers, pp. 1-5, November 1993.

[32] D. Needell and J. Tropp, “CoSaMP: Iterative signal recovery from incomplete and inaccurate samples,” Applied and Computational Harmonic Analysis, vol. 26, pp. 301-321, May 2009.

[33] M. A. Hameed, “Comparative Analysis of Orthogonal Matching Pursuit and Least Angle Regression,” MS Thesis, Michigan State University, 2012.

[34] R. Tibshirani, “Regression shrinkage and selection via the Lasso,” Journal of the Royal Statistical Society, vol. 58, pp. 267-288, 1996.

[35] D. L. Donoho, “Compressed Sensing,” IEEE Transactions on Information Theory, vol. 52, pp. 1289-1306, April 2006.

[36] E. J. Candès, “The restricted isometry property and its implications for compressed sensing,” Comptes Rendus Mathematique, vol. 346, pp. 589-592, May 2008.

[37] M. A. Davenport, M. F.
Duarte, Y. C. Eldar, and G. Kutyniok, “Introduction to Compressed Sensing,” book chapter in Compressed Sensing: Theory and Applications, edited by Y. C. Eldar and G. Kutyniok, Cambridge University Press, 2011.

[38] M. A. Davenport, P. T. Boufounos, M. B. Wakin, and R. G. Baraniuk, “Signal Processing with Compressive Measurements,” IEEE Journal of Selected Topics in Signal Processing, vol. 4, pp. 445-460, April 2010.

[39] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, “A simple proof of the restricted isometry property for random matrices,” Constructive Approximation, vol. 28, pp. 253-263, 2008.

[40] E. J. Candes and M. B. Wakin, “An Introduction to Compressive Sampling,” IEEE Signal Processing Magazine, vol. 25, pp. 21-30, March 2008.

[41] M. Lustig, D. L. Donoho, J. M. Santos, and J. M. Pauly, “Compressed Sensing MRI,” IEEE Signal Processing Magazine, vol. 25, pp. 72-82, March 2008.

[42] R. Baraniuk and P. Steeghs, “Compressive Radar Imaging,” 2007 IEEE Radar Conference, pp. 128-133, April 2007.

[43] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk, “Single-Pixel Imaging via Compressive Sampling,” IEEE Signal Processing Magazine, vol. 25, pp. 83-91, March 2008.

[44] M. Elad and M. Aharon, “Image denoising via sparse and redundant representation over learned dictionaries,” IEEE Transactions on Image Processing, vol. 15, pp. 3736-3745, December 2006.

[45] M. J. Fadili, J.-L. Starck, and F. Murtagh, “Inpainting and Zooming using Sparse Representations,” The Computer Journal, vol. 52, pp. 64-79, 2009.

[46] J. Xu, Y. Pi, and R. Ming, “SAR Image Compression Based on Sparse Representation,” 11th International Radar Symposium, pp. 1-4, June 2010.

[47] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Robust Face Recognition via Sparse Representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 210-227, February 2009.

[48] K. Huang and S. Aviyente, “Sparse Representation for Signal Classification,” Advances in Neural Information Processing Systems, pp. 609-616, 2006.

[49] F. Rodriguez and G. Sapiro, “Sparse Representations for Image Classification: Learning Discriminative and Reconstructive Nonparametric Dictionaries,” Technical Report, University of Minnesota, December 2007.

[50] M. Strauss, J. Tropp, and A. Gilbert, “Algorithms for simultaneous sparse approximation part I: Greedy pursuit,” Signal Processing, vol. 86, pp. 572-588, April 2006.

[51] M. Strauss, J. Tropp, and A. Gilbert, “Algorithms for simultaneous sparse approximation part II: Convex Relaxation,” Signal Processing, vol. 86, pp. 589-602, April 2006.

[52] S. Cardona-Romero and S. Aviyente, “Discriminative Sparse Image Representation for Classification Based on a Greedy Algorithm,” IEEE Statistical Signal Processing Workshop, August 2012.

[53] S. A. Nene, S. K. Nayar, and H. Murase, “Columbia Object Image Library (COIL-20),” Technical Report CUCS-005-96, February 1996. Database available at http://www1.cs.columbia.edu/CAVE/software/softlib/coil-20.php.

[54] B. Leibe and B. Schiele, The PASCAL Object Recognition Database Collection: The TU Darmstadt Database (formerly the ETHZ Database). Available at http://pascallin.ecs.soton.ac.uk/challenges/VOC/databases.html#TUD.

[55] H. Yi, “Robust Wavelet Transform-based Correlation Edge Detectors using Correlation of Wavelet Coefficients,” International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 4, pp. 77-88, December 2011.

[56] P. L. Dragotti and M.
Vetterli, “Wavelet Footprints: Theory, Algorithms, and Applications,” IEEE Transactions on Signal Processing, vol. 51, pp. 1306-1323, May 2003.

[57] V. Kumar, “Face Recognition using Gabor Wavelets,” Technical Report, Global Academy of Technology, 2006.

[58] C.-C. Chang and C.-J. Lin, “LIBSVM -- A Library for Support Vector Machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[59] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A Practical Guide to Support Vector Classification,” Technical Report, National Taiwan University, 2010. Available at http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

[60] M. Lustig, D. Donoho, and J. M. Pauly, “Sparse MRI: The Application of Compressed Sensing for Rapid MR Imaging,” Magnetic Resonance in Medicine, vol. 58, pp. 1182-1195, December 2007.

[61] L. Liu and P. W. Fieguth, “Texture Classification from Random Features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, pp. 574-586, March 2012.

[62] M. F. Duarte, M. A. Davenport, M. B. Wakin, and R. G. Baraniuk, “Sparse Signal Detection from Incoherent Projections,” IEEE International Conference on Acoustics, Speech, and Signal Processing, Toulouse, France, May 2006.

[63] J. Haupt, R. Castro, R. Nowak, G. Fudge, and A. Yeh, “Compressive Sampling for Signal Classification,” Proc. 40th Asilomar Conference on Signals, Systems and Computers, November 2006.

[64] H. Almuallim and T. G. Dietterich, “Learning with many irrelevant features,” Ninth National Conference on Artificial Intelligence, pp. 547-552, 1991.

[65] A. Vashist, Z. Zhao, A. Elgammal, I. Muchnik, and C. Kulikowski, “Discriminative Patch Selection using Combinatorial and Statistical Models for Patch-Based Object Recognition,” Conference on Computer Vision and Pattern Recognition Workshop, pp. 1-12, June 2006.

[66] S. Agarwal and D. Roth, “Learning a Sparse Representation for Object Detection,” Proceedings of the Seventh European Conference on Computer Vision, vol. 4, pp. 113-130, 2002.

[67] F. Xu and Y.-J. Zhang, “Feature Selection for Image Categorization,” Computer Vision - ACCV 2006, vol. 3852, pp. 653-662, 2006.

[68] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 711-720, July 1997.

[69] A. M. Martinez and A. C. Kak, “PCA versus LDA,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, pp. 228-233, February 2001.