This is to certify that the dissertation entitled LARGE DIMENSION AND SMALL SAMPLE SIZE PROBLEMS: CLASSIFICATION, GENE SELECTION AND ASYMPTOTICS presented by JUN LUO has been accepted towards fulfillment of the requirements for the Ph.D. degree in Statistics and Probability at Michigan State University.

LARGE DIMENSION AND SMALL SAMPLE SIZE PROBLEMS: CLASSIFICATION, GENE SELECTION AND ASYMPTOTICS

By

Jun Luo

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Statistics and Probability

2006

ABSTRACT

LARGE DIMENSION AND SMALL SAMPLE SIZE PROBLEMS: CLASSIFICATION, GENE SELECTION AND ASYMPTOTICS

By Jun Luo

Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) and penalized logistic regression (PLR) have been successfully applied to microarray cancer diagnosis problems. Both methods place an equal penalty on the loss of each sample, which may lead to misclassification on unbalanced data. We therefore propose ν-ridge regression (ν-RR), which places a generalized weight on the loss of each sample and optimizes the weight vector within the model itself, as an alternative to the SVM and PLR for classification in microarray cancer diagnosis. A primary goal in microarray analysis is often to identify the genes most responsible for the classification. Two gene selection methods are considered: univariate ranking (UR) and recursive feature elimination (RFE). Simulations on the well-known leukemia data and breast cancer prognosis data indicate that ν-RR combined with either UR or RFE tends to select fewer significant genes than other methods. Meanwhile, ν-RR outperforms the SVM and PLR, with lower cross-validation and test error rates.

One weakness of the SVM is that, given a tumor sample, it only predicts a cancer class label and does not provide any estimate of the underlying probability. Penalized logistic regression has the advantage of additionally providing an estimate of the underlying probability of being assigned to each class, but it does not offer any estimate of the probability of the outcome class conditional on an individual gene variable. We propose the conditional logistic regression (CLR) model, an alternative for microarray cancer diagnosis classification, for the underlying probability of the response given any gene variable. In addition, since a primary goal in microarray cancer diagnosis is gene selection, we propose a new method called modified univariate ranking (MUR) as a new choice for dimension reduction. We show that, when applied to microarray data for classification, CLR performs similarly to SVM, PLR and BMA, but CLR has the advantage of providing the probability of the outcome class conditional on any individual gene variable. Empirical results on the leukemia and breast cancer data indicate that the CLR method combined with one gene selection method (MUR, BSS/WSS or RFE) tends to perform better on both the CV error and the test error rate.
Microarray data typically have very high dimension p and much smaller sample size n. Classical asymptotic theory deals with p fixed and n going to infinity, which is no longer appropriate for microarray data analysis. There are discussions in the literature about the behavior of estimators when both p and n tend to infinity, but very few deal with n fixed and p tending to infinity. The latter situation seems more relevant to microarray data in practice. Here we outline and describe the asymptotic behavior of ridge regression estimators when the sample size n is fixed and the dimension p tends to infinity. Given certain data, mean squared error consistency is established under certain regularity conditions. When only a finite number of important genes are actually related to the outcome, we propose a variable screening method to eliminate genes that are unrelated to the outcome and prove the asymptotic consistency of the procedure. After screening, the dimension-reduced microarray data can be further analyzed via a well-known variable selection method such as AIC or BIC. Some simulation results testing the performance of the screening method are also presented.

Copyright by
Jun Luo
2006

To my deeply loved mother Laxiang Liu, my lenient father Youxin Luo and my supportive sister Yan Luo. To all my friends at Michigan State University. They have kept me company over the past four years.

ACKNOWLEDGMENTS

First of all, I sincerely appreciate my Ph.D. adviser, Professor Yijun Zuo, for what he has done for me in the program. His excellent guidance and thoughtful lectures sparked my passion in statistics and life. My dissertation would be an illusion without Professor Yijun Zuo's effort and the time he spent on me. I also want to give my special thanks to Professor Hira Koul, who has been very strict and very careful in serving as a substitute for Professor Yijun Zuo. I am happy to have him as my committee chair. I value his advice on the rehearsal and writing of the thesis more than he could imagine. I would like to thank Professors Sarat Dass, Robert Tempelman and Lijian Yang for serving on my thesis committee and for their valuable advice and kind cooperation. I am also grateful to Professor James Stapleton for his generous help and spiritual support. I have made some lifetime friends at Michigan State University, such as Winny Chiang, Kirk D. Dolan, Hongwen Guo, Wenmei Huang, Fang Li, Rong Liu, Aaron Thomas Porter, Yu Sun, Weixing Song, Tony Wang, Lily Wang, Jing Wang, Hui Zhang, Yongfang Zhu and Yanwei Zhang. I thank them for their moral support and friendship. Lastly, I am grateful to my family in China. My loved mother, father and sister have been supportive of me all the way. They tried every possible means to make sure I am happy and healthy. No words can describe my feelings at this moment. I will just say "I Love You All, My Dear Family!"

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Standard SVM for 2-class classification
1.2 ν-SVM formulation
1.3 Equivalence proof
1.4 L1-norm Support Vector Machine
2 Classification and Gene Selection of Cancer Microarrays by ν-Ridge Regression
2.1 Approach
2.1.1 ν-Ridge Regression
2.1.2 Quadratic Programming
2.1.3 Computation of w and b
2.1.4 Feature selection
2.2 Results
2.2.1 Choosing λ and ν
2.2.2 Leukemia data
2.2.3 Breast cancer prognosis data
2.3 Discussion
2.4 Conclusion
3 Density Curve Estimation and Classification of Microarray by Conditional Logistic Regression Model
3.1 Approach
3.1.1 Conditional Logistic Regression Formulation
3.1.2 Algorithm
3.1.3 Feature Selection
3.1.4 Cut-off statistic a_m
3.2 Results
3.2.1 Breast cancer prognosis data
3.2.2 Leukemia data
3.3 Conclusion
4 Asymptotics and Variable Screening for Microarrays with Fixed Small Sample Size and Large Dimension
4.1 Mean Squared Error Consistency
4.2 Variable Screening Procedure
4.3 Simulation Result
4.4 Conclusions and Discussions
5 Gene Selection Methods for Microarray Data
5.1 Univariate Ranking
5.2 Recursive Feature Elimination
5.3 Ratio of Between-group sum of squares to Within-group sum of squares
5.4 Clustering method
5.5 Pairwise Ranking Method
BIBLIOGRAPHY

LIST OF TABLES

2.1 Parameter selection.
2.2 Comparison of leukemia classification methods.
2.3 Published results of leukemia classification methods.
2.4 Comparison of breast cancer classification methods.
3.1 Comparison of classification methods for breast cancer prognosis data.
3.2 Output of CLR on breast cancer data.
3.3 Comparison of CLR on leukemia data.
3.4 Comparison of leukemia classification methods.
3.5 Parameters for the chosen genes in the final classifier from CLR MUR.
3.6 Parameters for the selected genes in the final classifier from CLR RFE.
4.1 Simulation results for MSE consistency based on 1,000 runs (x fixed), h_p = p^{0.25}, n = 60, X is from the flowchart.
4.2 Simulation results for close-to-0 nonzero β_i values based on 1,000 runs (x fixed).
4.3 Simulation results for distinguished nonzero β_i values based on 1,000 runs (x fixed).

LIST OF FIGURES

1.1 Two class classification
1.2 Support vector machine
4.1 Flowchart of data generation

CHAPTER 1

Introduction

The aim of this dissertation is to provide an up-to-date review of several approaches to classification (mainly machine learning methods), to propose two new classification methods, to compare their performance on some challenging data sets, and to draw conclusions on their applicability to realistic microarray problems. The task of classification occurs in a wide range of human activity.
At its broadest, the term could cover any context in which some decision or forecast is made on the basis of currently available information, and a classification procedure is then some formal method for repeatedly making such judgements in new situations. In this dissertation we shall consider a more restricted interpretation. We shall assume that the problem concerns the construction of a procedure that will be applied to a continuing sequence of cases, in which each new case must be assigned to one of a set of predefined classes on the basis of observed attributes or features. The construction of a classification procedure from a set of data for which the true classes are known has also been variously termed pattern recognition, discrimination, or supervised learning (in order to distinguish it from unsupervised learning or clustering, in which the classes are inferred from the data).

Microarray experiments raise numerous statistical questions in image analysis, experimental design, cluster and discriminant analysis, and multiple hypothesis testing. Here we focus on the classification of tumors using gene expression data. Three main types of statistical problems are associated with tumor classification: (1) identification of new tumor classes using gene expression profiles (cluster analysis/unsupervised learning); (2) classification of malignancies into known classes (discriminant analysis/supervised learning); and (3) identification of "marker" genes that characterize the different tumor classes, called variable selection.

In a two-class microarray classification problem, we are given a training data set (x_1, y_1), ..., (x_n, y_n), where the input x_i is a p-vector corresponding to the gene expression values of the ith experiment (or sample), x_i = (x_{i1}, ..., x_{ip}); see Eisen (1998), Getz (2000), Slonim (2000), Yang (2001), van't Veer (2002) and Chen and Chen (2003). The output y_i is a binary class label assumed to take values in {−1, +1}; see Xiong (2000) and Yeung (2001). The problem of interest is to find a classification rule from the training data, so that we can assign a class label from {−1, +1} to a new sample x with p gene expression measurements.

Data from these new types of experiments present a "large p, small n" problem; that is, a very large number of variables (genes) relative to the number of observations (tumor samples). In statistics, when the number of variables is much larger than the number of samples, one is said to be facing the so-called "curse of dimensionality", and the estimated function (here, the classifier of the microarray data) may be over-fitting (i.e., very high accuracy in fitting the training samples but very low accuracy in assigning labels to the test data). This problem is mitigated by using gene selection methods and by making sure the classifier is smooth, so that new samples similar to those in the training set will be labeled similarly.

Machine learning is a scientific field that addresses the question of how to program systems to automatically learn and improve with experience. Vapnik (1995) introduced the application of machine learning methods (the support vector machine, or SVM) to two-class classification problems. As a consequence, the SVM has been applied for classification in cancer microarray data.
Besides the SVM, past publications on cancer classification using gene expression data have focused mainly on cluster analysis of both tumor samples and genes, including applications of hierarchical clustering (Alon et al. (1999) and Perou et al. (1999)) and partitioning methods such as self-organizing maps (Golub (1999)). Dudoit et al. (2002) compared the performance of several different discrimination methods for the classification of tumors based on gene expression profiles. These methods include traditional ones, such as nearest-neighbor and linear discriminant analysis, as well as more modern ones, such as classification trees, bagging and boosting.

Statistical approaches are generally characterized by having an explicit underlying probability model, which provides a probability of being in each class. In addition, it is usually assumed that the techniques will be used by statisticians, and hence some human intervention is assumed with regard to variable selection and transformation, and the overall structuring of the problem. Unfortunately, the SVM does not offer any underlying probability in the classification. Zhu (2004) proposed penalized logistic regression (PLR) for classification in cancer microarrays together with an estimator of the underlying probability. But neither the SVM nor PLR can avoid misclassification on an unbalanced data set: the larger the training sample size for one class, the smaller its corresponding classification error rate. The main cause is that the penalty for misclassification is taken to be the same for each training sample.

Wang and Yang (2004) used a weighted support vector machine for the prediction of membrane protein types. Inspired by this, we propose a new non-parametric classification method called ν-ridge regression (ν-RR), where each sample contributes differently. Since the SVM, PLR and discriminant methods do not study the probability of the outcome conditional on each gene expression level, our conditional logistic regression is very useful both for classification and for providing the underlying distribution. The detailed work is in Chapters 2 and 3.

DNA microarrays now permit scientists to screen thousands of genes simultaneously and determine whether these genes are active, hyperactive or silent in normal or cancerous tissue. Because these new microarray devices generate bewildering amounts of raw data, new analytical methods must be developed to determine whether cancer tissues have distinctive features. In this dissertation, we address the problem of selecting a small subset of genes from broad patterns of gene expression data recorded on DNA microarrays. Using available training examples from cancer and normal patients, we build a classifier suitable for genetic diagnosis as well as drug discovery. Previous attempts to address gene selection in cancer microarrays are listed in Chapter 5, including correlation techniques and pairwise selection. A new method for gene selection based on the univariate ranking method is also proposed. We demonstrate experimentally that the genes selected by this technique yield better classification performance and are biologically relevant to cancer. These findings are consistent with published results in the literature. In contrast with the published methods, our method eliminates gene redundancy automatically and yields better and more compact gene subsets. Simulations are done with the well-known gold-standard leukemia and breast cancer prognosis data sets.
For gene selection in the asymptotic case, existing statistical methods deal with a single gene at a time (see Chen et al. (2003)) or deal with the case in which the sample size n → ∞ (see Shao (2006)). Chapter 4 provides insight into the analysis of microarray data in the more realistic case where the sample size is fixed and the dimension of variables p → ∞.

1.1 Standard SVM for 2-class classification

The original SVM was motivated by the idea of maximizing the distance between the separating hyperplane and the closest point in the training samples, and also by the idea of making the prediction function as smooth as possible. See Vapnik (1995, 1998, 2001 and 2002), Ripley (1996), Bradley (1998), Allwein (2000), Mukherjee (2000), Scholkopf (2000) and Evgeniou (2000). Discriminant methods have been used for classification in cancer microarray data; see Dudoit (2002). Based on the n training data (x_1, y_1), ..., (x_n, y_n), where x_i is a p-vector in the dot product space (throughout the chapter we focus on the linear separating hyperplane and use x_i as the basis function) and y_i is the label from {+1, −1} for the ith sample, the classification rule is given by

    f(x) = \mathrm{sign}(\langle w, x\rangle + b).   (1.1)

To obtain a classifier for the microarray data with maximal margin, several statistical methods have been applied. Throughout the dissertation, ⟨x, y⟩ denotes the inner product of the two vectors x and y. For a two-class classification problem, the linear SVM fits a model f(x) = ⟨w, x⟩ + b, where w and b solve

    \min_{w,b,\xi}\Big[\frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n}\xi_i\Big],   (1.2)

subject to the conditions

    y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \text{for all } i = 1,2,\dots,n,   (1.3)

where ξ_i is the soft margin variable which allows the distance constraint to be violated for non-separable data (i.e., when ξ_i > 1 for some i = 1, 2, ..., n). This new formulation trades off two goals: finding a hyperplane with a large margin (minimizing ||w||) and finding a hyperplane that separates the data well (minimizing the loss). The constant C > 0 controls the trade-off. In practice, C is determined by cross-validation. The classifier is given by f(x) = sign(⟨w, x⟩ + b). The above set-up is called the C-SVM.

[Figure 1.1. Two class classification: a scatter plot of two labeled classes ("How would you classify this data?")]

[Figure 1.2. Support vector machine: several linear classifiers f(x, w, b) = sign(w·x + b) that separate the data ("Any of these would be fine, but which is best?")]

Ideally, we are interested in the number of nonzero ξ_i, as that is the count of errors made by our classifier on the set of training examples. This count is the L0 norm, which is discontinuous and non-convex. We would like to stay close to L0 while maintaining convexity. We take the Lagrangian in the usual manner:

    L(w,\xi,b,\alpha,\beta) = \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n}\xi_i + \sum_{i=1}^{n}\alpha_i\big[1 - \xi_i - y_i(x_i'w + b)\big] - \sum_{i=1}^{n}\beta_i\xi_i.   (1.4)

To find the dual form of the problem, we first minimize L(w, ξ, b, α, β) with respect to w, ξ and b (for fixed α and β), i.e., min_{w,ξ,b} L(w, ξ, b, α, β). Since the Lagrangian is linear in α and β, we cannot set the gradient with respect to α and β to zero. We obtain the following dual optimization problem:

    \max_{\alpha}\Big[\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle\Big],   (1.5)

subject to the constraints 0 ≤ α_i ≤ C and \sum_{i=1}^{n}\alpha_i y_i = 0. The introduction of the box constraint on α_i is necessary to ensure that the Lagrangian is bounded (i.e., that we cannot drive the cost to −∞). Such a case would occur if (C − α_i) were negative and ξ_i went to ∞, so that the summed expression \sum_{i=1}^{n}(C − α_i)ξ_i goes to −∞. The box constraint prevents this from happening. The dual still has a quadratic objective and differs from the optimal margin classifier only in the introduction of the box constraint. Indeed, we can still use w = \sum_{i=1}^{n}\alpha_i y_i x_i to obtain the optimal value of w in terms of the optimal value of α. We must also verify that the KKT dual-complementarity conditions are still satisfied in this optimization problem:

    \alpha_i = 0 \;\Rightarrow\; y_i[x_i'w + b] \ge 1, \qquad
    \alpha_i = C \;\Rightarrow\; y_i[x_i'w + b] \le 1, \qquad
    0 < \alpha_i < C \;\Rightarrow\; y_i[x_i'w + b] = 1.

As before, α_i will be nonzero only for the support vectors, where the set of support vectors now includes all data points on the margin boundary as well as those on the wrong side of the margin boundary.
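To make the dual concrete, the following sketch solves (1.5) with a generic quadratic programming solver. It is an illustration only, assuming the Python package cvxopt is available; X (the n × p data matrix), y (the ±1 labels) and C are illustrative names, and the recovery of w and b simply restates w = Σ α_i y_i x_i and the KKT condition on the margin boundary.

```python
import numpy as np
from cvxopt import matrix, solvers


def c_svm_dual(X, y, C):
    """Solve the C-SVM dual (1.5): maximize sum(alpha) - 0.5 * sum y_i y_j alpha_i alpha_j <x_i, x_j>
    subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0 (cvxopt minimizes, so signs are flipped)."""
    n = X.shape[0]
    K = X @ X.T                                           # Gram matrix of inner products
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))        # box constraint 0 <= alpha <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))            # equality constraint sum_i alpha_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                                   # w = sum_i alpha_i y_i x_i
    margin = (alpha > 1e-6) & (alpha < C - 1e-6)          # points with y_i (x_i'w + b) = 1
    b0 = float(np.mean(y[margin] - X[margin] @ w))        # average over margin support vectors
    return w, b0, alpha
```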
Most often, one uses the regularized optimization problem known as the standard L2-norm SVM:

    \min_{w,b}\Big[\sum_{i=1}^{n}\big(1 - y_i(\langle w, x_i\rangle + b)\big)_+ + \lambda\|w\|_2^2\Big],   (1.6)

where λ ≥ 0 is a tuning parameter, and the classifier is defined in (1.1). Model (1.2)–(1.3) is basically equivalent to (1.6) in the sense that for any given C > 0 there exists λ = 1/(2C) such that the two models share the same optimal values of w and b, i.e., we get exactly the same classification rule. Note that (1.6) has the loss + penalty form, and λ is the tuning parameter that controls the trade-off between the loss and the penalty. The loss (1 − yf)_+ is called the hinge loss, and the L2-norm penalty is called the ridge penalty.

1.2 ν-SVM formulation

Motivated by the idea of maximizing the margin, another approach to obtaining the classifier considers

    \min_{w,b,\xi,\rho}\Big[\frac{1}{2}\|w\|_2^2 + C\Big(-\nu\rho + \frac{1}{n}\sum_{i=1}^{n}\xi_i\Big)\Big],   (1.7)

subject to the constraints

    y_i(\langle w, x_i\rangle + b) \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad \rho \ge 0, \quad \text{for all } i = 1,2,\dots,n,   (1.8)

where C > 0 and ν ∈ [0, 1] are parameters. To understand the role of ρ, note that for ξ = 0_{n×1} the constraint simply means the two classes are separated by the margin ρ/||w||. The classifier for (1.7) is defined in (1.1). Applying the Lagrangian dual variable method to model (1.8), it turns out that C in (1.7) does not matter at all, and (1.7) yields the same decision function as

    \min_{w,b,\xi,\rho}\Big[\frac{1}{2}\|w\|_2^2 - \nu\rho + \frac{1}{n}\sum_{i=1}^{n}\xi_i\Big]   (1.9)

under the constraints (1.8), where ν is a parameter in [0, 1]. Model (1.9) is known as the ν-SVM, which was proposed by Scholkopf, Smola, Williamson, and Bartlett (2000) for classification. The relationship between the SVM and the ν-SVM with the 2-norm penalty is analyzed below; we leave the comparison with the 1-norm to the feature selection section. The ν-SVM possesses some additional properties beyond the C-SVM, besides the fact that both methods provide a classifier from the training data. For example, the ν-SVM has the advantage of using a parameter ν to control the number of support vectors and the fraction of training errors.

1.3 Equivalence proof

Generally, (1.6) and (1.8)–(1.9) are two different problems with the same optimal solution set.

Theorem 1. In model (1.9) with constraint (1.8), for any given ν ∈ [0, 1] with optimal solution ρ̃ > 0, (1.8)–(1.9) has the same classifier as (1.6) with λ = nρ̃/2.

Proof. Focus on model (1.9): suppose there are w̃, b̃, ξ̃, ρ̃ such that

    (\tilde w, \tilde b) = \arg\min_{w,b,\xi,\rho}\Big[\frac{1}{2}\|w\|_2^2 - \nu\rho + \frac{1}{n}\sum_{i=1}^{n}\xi_i\Big],

subject to the conditions y_i(⟨w, x_i⟩ + b) ≥ ρ − ξ_i, ξ_i ≥ 0, for all i = 1, 2, ..., n.
Now we make the following change of variables: w' = w̃/ρ̃, b' = b̃/ρ̃, ξ'_i = ξ̃_i/ρ̃. Then the classifier for (1.8)–(1.9) becomes

    f(x) = \mathrm{sign}(\langle \tilde\rho\, w', x\rangle + \tilde\rho\, b') = \mathrm{sign}(\langle w', x\rangle + b').

Recall that (w', b') satisfies

    (w', b') = \arg\min_{w,b,\xi}\Big[\frac{\tilde\rho^{\,2}}{2}\|w\|_2^2 - \nu\tilde\rho + \frac{\tilde\rho}{n}\sum_{i=1}^{n}\xi_i\Big],

subject to the conditions y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0, for all i = 1, 2, ..., n. The above optimization problem is the same as

    \min_{w,b,\xi}\Big[\sum_{i=1}^{n}\xi_i + \frac{n\tilde\rho}{2}\|w\|_2^2\Big],

subject to the conditions y_i(⟨w', x_i⟩ + b') ≥ 1 − ξ'_i, ξ'_i ≥ 0, for all i = 1, 2, ..., n. That is model (1.6) with λ = nρ̃/2, so the classifier is f(x) = sign(⟨w', x⟩ + b'). Obviously, we get the same decision function from the two models under the condition ρ̃ > 0.

1.4 L1-norm Support Vector Machine

The standard 2-norm SVM is known for its good performance in two-class classification. Hastie (2004) studied the regularization path for SVMs. In this section, we discuss the 1-norm SVM. The 1-norm SVM has some advantages over the standard 2-norm SVM; in particular, it functions as both a variable selection method and a classification method. In standard 2-class classification problems, we are given a set of training data (x_1, y_1), ..., (x_n, y_n), where the input x_i ∈ R^p and the output y_i ∈ {1, −1} is binary. We wish to find a classification rule from the training data, so that when given a new input x, we can assign a class y from {1, −1} to it. To handle this problem, the 1-norm support vector machine (1-norm SVM) was considered:

    \min_{\beta_0,\beta}\sum_{i=1}^{n}\Big(1 - y_i\Big(\beta_0 + \sum_{j=1}^{q}\beta_j h_j(x_i)\Big)\Big)_+,   (1.10)

subject to the condition ||β||_1 = |β_1| + ... + |β_q| ≤ s, where D = {h_1(x), ..., h_q(x)} is a dictionary of basis functions and s is a tuning parameter. The solution is denoted by β̂_0(s) and β̂(s); the fitted model is

    \hat f(x) = \hat\beta_0 + \sum_{j=1}^{q}\hat\beta_j h_j(x).

The classification rule is given by sign[f̂(x)]. The 1-norm SVM has been successfully used in classification problems for cancer microarray data. To obtain a good fitted model f̂(x) that performs well on future data, we also need to select an appropriate tuning parameter s. In practice, people usually pre-specify a finite set of values of s that covers a wide range, then either use a separate validation data set or use cross-validation to select the value of s that gives the best performance among the given set. Zhu (2004) illustrates that the solution path β̂(s) is piecewise linear as a function of s (in the R^q space) and proposes an efficient algorithm to compute the exact whole solution path {β̂(s), 0 ≤ s ≤ ∞}, which helps us understand how the solution changes with s and facilitates the adaptive selection of the tuning parameter s. Under some mild assumptions, Zhu showed that the computational cost of computing the whole solution path β̂(s) is O(nq min(n, q)^2) in the worst case and O(nq) in the best case.

The 1-norm SVM replaces the ridge penalty in the 2-norm SVM with the L1-norm penalty on β, i.e., the lasso penalty (see Tibshirani 1996), and considers the optimization below:

    \min_{\beta_0,\beta}\Big[\sum_{i=1}^{n}\Big(1 - y_i\Big(\beta_0 + \sum_{j=1}^{q}\beta_j h_j(x_i)\Big)\Big)_+ + \lambda\|\beta\|_1\Big],   (1.11)

which is the equivalent Lagrangian version of the constrained optimization problem. The lasso penalty was first proposed by Tibshirani (1996) for regression problems, where the response y is continuous rather than categorical. It has also been used in Bradley (1998) for classification problems under the framework of SVMs. Knight and Fu (2000) studied the asymptotics for lasso-type estimators. Similar to the ridge penalty, the lasso penalty also shrinks the fitted coefficients β towards zero, hence the 1-norm SVM also benefits from the reduction in the variances of the fitted coefficients. Another property of the lasso penalty is that, because of its L1 nature, making λ sufficiently large, or equivalently s sufficiently small, will cause some of the coefficients β_j to be exactly zero. Thus the lasso penalty performs a kind of continuous feature selection, whereas this is not the case for the ridge penalty in the 2-norm SVM, where none of the β_j's will be equal to zero.
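As an illustration of the continuous feature selection produced by the L1 penalty, the following sketch fits an L1-penalized linear classifier on synthetic data. It is only a rough stand-in for (1.11), assuming scikit-learn is installed; note that LinearSVC with an L1 penalty uses the squared hinge loss rather than the exact hinge loss of (1.11), and all variable names are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data standing in for a microarray matrix: n samples, q genes/basis functions.
rng = np.random.default_rng(0)
n, q = 40, 200
X = rng.normal(size=(n, q))
y = np.where(X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n) > 0, 1, -1)

# L1-penalized linear classifier; C plays the role of 1/lambda in (1.11).
# With penalty='l1', LinearSVC requires the squared hinge loss and dual=False.
clf = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=0.1, max_iter=10000)
clf.fit(X, y)

# The L1 penalty drives many coefficients exactly to zero: continuous feature selection.
selected = np.flatnonzero(clf.coef_.ravel())
print("number of nonzero coefficients:", selected.size)
```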
CHAPTER 2

Classification and Gene Selection of Cancer Microarrays by ν-Ridge Regression

There has been a recent explosion in the use of microarray data for classification in a variety of diagnostic areas; see Golub (1999) and Hastie (1998). The prediction of the diagnostic category of a tissue sample from its expression array phenotype, given the availability of similar data from tissues in classified categories, is known as classification or supervised learning. In the context of gene expression data, examples include different tumor types (Golub et al. 1999; Ramaswamy et al. 2001) and response to therapy (van't Veer et al. 2003). A challenge in predicting the diagnostic categories using microarray data is that the number of genes is much greater than the number of tissue samples available, and we assume only a subset of genes is relevant in distinguishing different classes. Selection of relevant genes for classification is known as feature selection, which is a primary goal in microarray data analysis. A small set of relevant genes is essential for the development of inexpensive diagnostic tests.

The support vector machine (SVM) is one of the leading methods that have been successfully applied to the classification of cancer diagnoses (Lee & Lee 2002, Mukherjee et al. 1999, and Ramaswamy et al. 2001). In two-class classification, the linear SVM fits a model f(x) = b_0 + ⟨w, x⟩ in which the difference in fitted values between similar tissue samples will be as small as possible. Roughly speaking, we may control ||w||_2^2 so that samples in the test data will be assigned to the same group as similar tissue samples in the training data.

2.1 Approach

2.1.1 ν-Ridge Regression

The standard SVM puts the same weight on the loss of each training sample. Here, we put a generalized weight on the loss of each sample and regard the weights as unknown parameters. Consider the optimization problem below:

    \max_{t}\min_{w}\Big[\sum_{i=1}^{n} t_i\big(\langle w, x_i\rangle - y_i\big) + \lambda\|w\|_2^2\Big],   (2.1)

subject to the condition

    \sum_{i=1}^{n} t_i = 0, \quad |t_i| \le \nu, \quad \forall\, i = 1,2,\dots,n.   (2.2)

Each t_i can be understood as the generalized weight for the loss function of the ith sample. The max–min forces each loss to approach 0. Since our optimization is actually a modification of the dual form of ridge regression, we name (2.1) and (2.2) the ν-ridge regression model (ν-RR). The constraint Σ_{i=1}^{n} t_i = 0 is a regularization condition. The parameters λ and ν are to be chosen; in practice, they are determined by cross-validation. We clearly state the selection process in Section 2.2. The classification function for ν-RR is given by f(x) = sign(⟨w, x⟩ + b) for the chosen λ and ν, where b is a constant.

2.1.2 Quadratic Programming

As is well known, microarray data typically have a small sample size n but a large dimension p of gene variables (in the thousands). In the model proposed above, one of the important roles of the vector t is to reduce the number of parameters from p + 1 to n + 1, which makes the optimization feasible. The objective function L(w, b, t) = \sum_{i=1}^{n} t_i(\langle w, x_i\rangle + b - y_i) + \lambda\|w\|_2^2 is convex in (w, b) when the vector t is fixed. For any given t_{1×n} satisfying the constraints, notice that the value of b is not a function of t; in fact, the value of b does not change the optimal solution of the whole optimization problem (2.1)–(2.2). We therefore propose a rule for obtaining the value of b in Section 2.1.3. To reduce the computation, we simplify the part min_{w,b} L(w, b, t) by solving the score equation for w,

    \frac{\partial L(w, b, t)}{\partial w} = 0,

which leads to

    2\lambda w + \sum_{i=1}^{n} t_i x_i = 0.   (2.3)

Therefore, the optimization problem becomes a search for the weight vector t:

    \max_{t_{1\times n}}\Big[-\lambda\|w\|_2^2 - \sum_{i=1}^{n} t_i y_i\Big],   (2.4)

subject to the condition

    \sum_{i=1}^{n} t_i = 0, \quad |t_i| \le \nu, \quad \forall\, i = 1,2,\dots,n.

Plugging (2.3) into (2.4), we have

    \min_{t}\Big[\frac{1}{4\lambda}\sum_{i=1}^{n}\sum_{k=1}^{n} t_i t_k \langle x_i, x_k\rangle + \sum_{i=1}^{n} t_i y_i\Big],   (2.5)

subject to the condition

    \sum_{i=1}^{n} t_i = 0, \quad |t_i| \le \nu, \quad \forall\, i = 1,2,\dots,n.   (2.6)

Once λ and ν are determined, this is a standard quadratic programming problem, where t is a 1 × n vector with a typically small value of n. Recently, a contributed R package, kernlab, has made the computation even easier.
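Although the dissertation's computations use R, the same quadratic program can be illustrated with any off-the-shelf QP solver. The sketch below is a rough illustration only, assuming the Python package cvxopt and illustrative names X, y, lam and nu; it solves (2.5)–(2.6) and recovers w via (2.3) and b via the averaging rule (2.7) given in Section 2.1.3.

```python
import numpy as np
from cvxopt import matrix, solvers


def nu_ridge_regression(X, y, lam, nu):
    """Solve the nu-RR quadratic program (2.5)-(2.6) in t, then recover w from (2.3)
    and b from the averaging rule (2.7)."""
    n = X.shape[0]
    K = X @ X.T                                        # inner products <x_i, x_k>
    P = matrix(K / (2.0 * lam) + 1e-8 * np.eye(n))     # (1/2) t'P t = (1/(4 lam)) t'K t, tiny jitter for stability
    q = matrix(y.astype(float))                        # linear term sum_i t_i y_i
    G = matrix(np.vstack([np.eye(n), -np.eye(n)]))     # box constraint |t_i| <= nu
    h = matrix(nu * np.ones(2 * n))
    A = matrix(np.ones((1, n)))                        # equality constraint sum_i t_i = 0
    b = matrix(0.0)
    t = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = -(t @ X) / (2.0 * lam)                         # (2.3): 2 lam w + sum_i t_i x_i = 0
    scores = X @ w
    b0 = -(scores[y == 1].min() + scores[y == -1].max()) / 2.0   # (2.7)
    return w, b0, t
```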
2.1.3 Computation of w and b

For any chosen values of λ and ν, w is given by the expression w = -t_{1\times n} X_{n\times p}/(2\lambda). Obviously, the optimization (2.5)–(2.6) does not provide the optimal solution for b. Bearing in mind that b plays a role for samples with y_i = +1 and ⟨w, x_i⟩ less than 1, or for samples with y_i = −1 and ⟨w, x_i⟩ greater than −1, we consider two groups of samples,

    S_1 = \{i \in 1,2,\dots,n \mid y_i = +1\}, \qquad S_2 = \{i \in 1,2,\dots,n \mid y_i = -1\}.

There exists a constant just large enough for all samples from group S_1 to be correctly classified, and another constant just small enough for all samples from group S_2 to be correctly classified. For a new sample to be fairly classified, we define b as the average of the two constants. That is,

    b = \Big(-\min_{i\in S_1}\langle w, x_i\rangle - \max_{i\in S_2}\langle w, x_i\rangle\Big)/2.   (2.7)

This b can avoid over-fitting on the training data. In fact, it works very well on two real gene expression data sets.

2.1.4 Feature selection

Besides predicting the correct cancer class for a given tumor sample, another primary challenge in microarray cancer diagnosis is to identify the relevant genes which contribute the most to the classification. Among all external gene selection methods, univariate ranking (UR) and recursive feature elimination (RFE) are the most often used. Golub et al. (1999) first introduced UR for each gene in the two-class classification problem. The criterion is defined as

    s_j = \frac{\bar x_j^{+} - \bar x_j^{-}}{\sigma_j^{+} + \sigma_j^{-}},   (2.8)

where \bar x_j^{+} and \sigma_j^{+} denote the average and standard deviation of the expression values of gene j over all samples from class {+1}. Similarly, \bar x_j^{-} and \sigma_j^{-} denote the average and standard deviation of the expression values of gene j over all samples from class {−1}. Genes with the most positive values are taken to be most correlated with class {+1}, and genes with the most negative values are taken to be most correlated with class {−1}. This ranking criterion implicitly assumes orthogonality among the genes, because each s_j is computed from information about a single gene and does not take into account mutual information between genes. Later, Dudoit et al. (2002) used the ratio of between-group to within-group sum of squares (BSS/WSS) to determine the initial gene order in the multi-class case.
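For concreteness, a minimal sketch of the UR score (2.8) and the resulting gene ordering is given below; it assumes a numpy expression matrix X with samples in rows and genes in columns and labels y in {+1, −1}, and it is not part of the dissertation's own software.

```python
import numpy as np


def univariate_ranking(X, y):
    """Signed UR score (2.8) for every gene: (mean over class +1 minus mean over class -1)
    divided by the sum of the two within-class standard deviations."""
    pos, neg = X[y == 1], X[y == -1]
    num = pos.mean(axis=0) - neg.mean(axis=0)
    den = pos.std(axis=0, ddof=1) + neg.std(axis=0, ddof=1)
    return num / den


def rank_genes(X, y):
    """Gene indices ordered by decreasing |s_j|, the order used before iterative elimination."""
    return np.argsort(-np.abs(univariate_ranking(X, y)))
```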
In contrast to the ranking method, RFE recursively removes genes based upon the absolute magnitude of the hyperplane elements. For the linear kernel, given microarray data with p genes per sample, a classification method outputs w, a vector with p components, each corresponding to a particular gene. The absolute magnitude of each element |w_j| determines its importance in classifying a sample. The idea behind RFE is to eliminate the elements of w that have small magnitude. This screening procedure is iterated until a desired number of genes is obtained.

2.2 Results

In this section, we fit ν-RR to the leukemia data set (Golub et al. (1999)) and the breast cancer prognosis data set (van't Veer et al. (2002)). UR and RFE are applied to reduce the number of genes at each step. For the UR method, we first use (2.8) to compute s_j for all genes and rank the genes in descending order of |s_j|. Then we apply an iterative procedure that goes as follows: start by fitting ν-RR using all current genes, next remove the 10% of the genes in the model that are at the bottom of the ranking, then refit the model with the remaining genes, and iterate. For RFE applied to ν-RR, according to the recursive procedure, at the kth step of the iteration we fit the model with the remaining p_k genes,

    f_k(x) = \langle w_k, x\rangle + b_k = b_k + \sum_{j=1}^{p_k} w_{kj}\, x_j,

and eliminate the genes with the overall smallest 10% of |w_{kj}| values. So at each step of RFE we may eliminate a different number of genes. Notice that once a gene is removed from the model, it never comes back into the model. So the parameters used in RFE at the very beginning have a significant influence on the whole procedure of keeping genes at each step. The number of genes in the final model is selected by cross-validation, and the performance of the final model is evaluated on the test samples. Before applying ν-RR, we standardize each sample, as is usually done in microarray studies (see Dudoit et al. (2002) and Guyon et al. (2002)), so that the mean and standard deviation of the expression levels are 0 and 1, respectively.

2.2.1 Choosing λ and ν

We randomly divide the training data into 10 groups. At each step we use one group as the cross-validation data set, and the other 9 groups are used as the training data from which the classifier is built. So each sample in the training data is used as cross-validation data exactly once and as training data 9 times. The RFE method is more sensitive to the parameters, so we determine (λ, ν) by applying RFE and use the same parameter setting for UR. When searching for the λ and ν used in the final ν-RR model, we follow the rule below: among the λ and ν values which result in the minimal 10-fold CV error on the whole training data without gene selection, we want the parameters with the minimal summation of the CV errors across all iterations of gene elimination,

    (\lambda, \nu) = \arg\min_{(\lambda,\nu)\,\in\,\arg\min_{\lambda\ge 0,\;0\le\nu\le 1} CV_1}\;\sum_{k=1}^{l} Misc_k,

where l is the total number of iterations, CV_1 is the 10-fold CV error using all genes in the training data, and Misc_k is the number of misclassifications based on cross-validation at the kth iteration. We care about the CV error without gene selection because of a disadvantage of the RFE method: if the parameters do not perform well at the beginning, there is a larger chance that they will not do a good job in the gene elimination procedure.
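The two-stage rule above can be summarized in a few lines of code. The skeleton below is only an illustration of the selection logic; cv_error_all_genes and rfe_cv_misclassifications are hypothetical helpers standing in for a 10-fold cross-validated ν-RR fit on all genes and for the per-iteration CV misclassification counts of the RFE procedure.

```python
def choose_lambda_nu(X, y, lambda_grid, nu_grid,
                     cv_error_all_genes, rfe_cv_misclassifications):
    """Two-stage selection of (lambda, nu): first keep the pairs minimizing the 10-fold
    CV error with all genes (CV1), then among those pick the pair minimizing the summed
    misclassifications Misc_1 + ... + Misc_l across the RFE iterations."""
    cv1 = {(lam, nu): cv_error_all_genes(X, y, lam, nu)
           for lam in lambda_grid for nu in nu_grid}
    best_cv1 = min(cv1.values())
    candidates = [pair for pair, err in cv1.items() if err == best_cv1]

    def total_misclassifications(pair):
        lam, nu = pair
        return sum(rfe_cv_misclassifications(X, y, lam, nu))   # sum over k = 1, ..., l

    return min(candidates, key=total_misclassifications)
```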
In each iteration, we can update the ν value to fit the data with the remaining genes. Through this parameter selection, we can confidently claim that the parameters λ and ν work generally well at each step. In practice, however, we cannot try all possible values of λ and ν. Table 2.1 shows the numerical parameter selection process for the leukemia data based on 12 results. The minimal summation of the CV errors occurs at initial values λ = 25 and ν = 0.9, hence (25, 0.9) are chosen as the regularization parameters in the ν-RR model used to fit the entire training data. Recall that we can adjust the ν value later on. In Table 2.1, the RFE column is the total number of 10-fold misclassifications across all iterations of gene elimination using RFE. Similarly, the UR column is the total number of 10-fold misclassifications across all iterations of gene elimination using UR.

Table 2.1. Parameter selection.

    λ      ν     CV1     RFE    UR
    25     0.1   2/38    37     41
           0.3   3/38
           0.5   4/38
           0.7   3/38
           0.9   2/38    20     42
    100    0.1   2/38    43     35
           0.3   2/38    30     40
           0.5   3/38
           0.7   2/38    26     40
           0.9   3/38
    200    0.1   4/38
           0.3   2/38    47     34
           0.5   2/38    37     41
           0.7   4/38
           0.9   3/38

2.2.2 Leukemia data

The leukemia data (Golub et al. 1999) consist of 38 training samples and 34 test samples of two types of acute leukemia, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). Each sample contains 7129 gene expression levels. 10-fold cross-validation is used as the criterion to determine the parameters λ and ν. As in Table 2.1, λ = 25 and ν = 0.9 are used in the ν-RR model, and the performance of the classifier is evaluated on the test data. For the classification methods SVM and PLR, we follow the results from Zhu (2004). We can see that when using the same set of genes (i.e., using UR), ν-RR yields fewer significant genes with reduced CV error and test error compared with SVM and PLR. When using RFE, ν-RR performs the best, with the fewest relevant genes and the smallest CV error and test error among the three listed methods. ν-RR with UR obtained a more manageable set of genes than ν-RR with RFE, at the cost of more test errors (the test error increases from 0 to 2). The minimal cross-validation error for ν-RR occurs at 7 genes and 11 genes for UR and RFE, respectively. The 7 genes chosen by ν-RR UR are identified as 1745, 1834, 2020, 2288, 3320, 4847 and 5772. The 11 genes chosen by ν-RR RFE are identified as 1249, 1779, 1796, 1834, 1846, 1882, 2288, 2402, 4847, 5039 and 5950. The results of Golub et al. (1999), Tibshirani et al. (2002) and Zhu (2003) are summarized in Table 2.3; m is the number of selected genes. In Table 2.2, ν-RR sacrifices two test errors by eliminating 4 genes. The 7 genes selected by ν-RR with UR are not necessarily all contained in the set of 11 genes selected by ν-RR with RFE.

Table 2.2. Comparison of leukemia classification methods.

    Method      10-fold CV error   Test error   m
    SVM UR      2/38               3/34         22
    PLR UR      2/38               3/34         16
    ν-RR UR     0/38               2/34         7
    SVM RFE     2/38               1/34         31
    PLR RFE     2/38               1/34         26
    ν-RR RFE    0/38               0/34         11

Table 2.3. Published results of leukemia classification methods.

    Method              10-fold CV error   Test error   m
    Golub et al.        3/38               4/34         50
    Tibshirani et al.   2/38               2/34         21
    L1-norm SVM         2/38               2/34         17

2.2.3 Breast cancer prognosis data

The breast cancer prognosis data (van't Veer et al. (2002) and Shieh (2004)) consist of 97 primary breast tumor samples hybridized to cDNA arrays with 24481 gene expression levels.
The two categories are the good prognosis group (patients who remained disease free for at least 5 years) and the poor prognosis group (patients who developed distant metastases within 5 years). We picked 4918 significantly regulated genes (at least a 2-fold difference and p-value < 0.01 in more than 3 samples) among the 24481 genes. We further removed two samples with missing values in the 4918 gene expression levels. Therefore, the breast cancer prognosis data in this chapter consist of 76 training samples and 19 test samples across 4918 genes.

When we apply ν-RR with RFE to the 76 training samples, each with 4918 genes, a cross-validation set formed by 4 samples from the good group and 3 samples from the poor group is randomly taken from the training data to determine the tuning parameters. For each λ value, we choose the ν ∈ (0, 1) which leads to the smallest CV error rate and evaluate the performance of the classifier on the test samples. Due to the randomness of the cross-validation set, we repeat the procedure 10 times for each pair of λ and ν, and take the average number of misclassifications across the 10 simulations. The average number of misclassifications at the kth iteration of gene selection is used as Misc_k in the parameter selection criterion. Then our ν-RR method is trained on all 76 training samples to obtain the classifier. When applying RFE for gene selection, we found that any λ > 5 provides the same gene selection result: the genes identified as 1865 and 2294 are selected, with 2 classification errors on the test data. The result varies as λ decreases from 5. We compare the published results of van't Veer et al. (2002) and BMA (Yeung (2004)) with our selected genes and the corresponding classification errors. The comparison is summarized in Table 2.4. Only one gene expression level, identified as AL080059, appears in all of the above four sets of selected genes.

Table 2.4. Comparison of breast cancer classification methods.

    Method              Test error   m
    van't Veer et al.   2/19         70
    Yeung               3/19         6
    ν-RR RFE            2/19         2
    ν-RR UR             4/19         2

2.3 Discussion

Among all the above classification methods with UR, ν-RR found 7 genes that distinguish acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL) with a lower error rate than the SVM and PLR employed in Zhu (2004). That is to say, ν-RR can reduce the test error with an even smaller set of genes. This may imply that the weights used in ν-RR capture the interactions among genes better than the SVM and PLR models. Comparing the results of the classification methods with RFE, we found that ν-RR dramatically improves all three criteria (CV error, test error and number of genes) relative to the existing gene selection results on the leukemia and breast cancer prognosis data. Combined with either UR or RFE, ν-RR obtained a smaller set of related genes without any cost in CV error or test error. We can tell from Table 2.2 that ν-RR with UR tends to select fewer (or the same number of) genes than ν-RR with RFE, but UR leads to a higher test error; it is a trade-off. Through the simulations, we can tell that RFE is more sensitive to the values of the regularization parameters than UR, so more attention should be paid to choosing the parameters in RFE than in UR. The set of genes chosen by ν-RR with UR for the leukemia data has 3 genes in common with the set of genes chosen by ν-RR with RFE. For the breast cancer prognosis data, they have one common selected gene, named AL080059, which is also in the lists of van't Veer et al. (2002) and Yeung (2004).
It is also worth noting that in the simulations we can update the regularization parameters to obtain a picture of the overall performance across all iteration steps, rather than keeping the best-performing parameters at an individual iteration step, as we did for the breast cancer prognosis data. However, we cannot try out all λ values. In our approach, the summation of the misclassifications over all steps is one criterion for the parameters, so the model with the chosen parameter λ guarantees a good overall performance on the cross-validation data, which is consistent with our simulation results on the leukemia data and the breast cancer data.

Here we have proposed a new method inspired by ridge regression for two-class classification. In going from two-class to multi-class classification, the one-vs-rest scheme is often used: given K classes, the problem is divided into a series of K one-vs-rest problems, and each one-vs-rest problem is addressed by a different class-specific ν-RR classifier; then a new sample takes the class corresponding to the classifier with the largest real-valued output. This is one approach to multi-class classification. How to extend ν-RR directly to the multi-class case with less work is still called for, and I am working on it. For general information about multi-class classification, see Dietterich (1991), Ramaswamy (1998) and Lee and Lee (2002).

2.4 Conclusion

We have proposed a ν-ridge regression (ν-RR) model for two-class microarray cancer diagnosis classification. The simulation results on the real leukemia data and breast cancer data show that, when using the same set of genes (using UR), ν-RR identifies fewer significant genes with smaller CV error and test error than SVM and PLR. When using the recursive feature elimination method to select relevant genes, ν-RR works perfectly on the leukemia data, with 0 CV error and 0 test error, and it attains a comparable test error on the breast cancer data. What is more, ν-RR with RFE selects fewer genes than SVM, PLR, and other published results. This good property makes the set of chosen genes more manageable. Therefore, we claim that ν-RR is a good classification method for two-class microarray data.

CHAPTER 3

Density Curve Estimation and Classification of Microarray by Conditional Logistic Regression Model

The support vector machine (SVM) has been successfully applied to microarray cancer diagnosis problems. Khan (2001) proposed a classification method using gene expression profiling and artificial neural networks. However, one weakness of the SVM and of Khan's method is that, given a tumor sample, it only predicts a cancer class label and does not provide any estimate of the underlying probability. Penalized logistic regression has the advantage of additionally providing an estimate of the underlying probability of being assigned to a class, but it does not offer any estimate of the probability of the class y conditional on an individual gene variable. We propose the conditional logistic regression (CLR) model, an alternative for microarray cancer diagnosis classification, for the underlying probability of the response given any gene variable. Meanwhile, since gene selection is a primary goal in microarray cancer diagnosis, we propose a new method called modified univariate ranking (MUR) as a new choice for dimension reduction. We show that, when applied to microarray data for classification, CLR performs comparably to classification methods such as
SVM, PLR and BMA, but CLR has the advantage of providing the probability of the class y conditional on any individual gene variable. Empirical results on the leukemia and breast cancer data indicate that CLR combined with one of the gene selection methods (MUR, BSS/WSS or RFE) tends to perform better than SVM and PLR on both the CV error and the test error rate.

The support vector machine (SVM) is one of the leading methods that have been successfully applied to the classification of cancer diagnoses; see Lee & Lee (2002), Mukherjee et al. (1999), and Ramaswamy et al. (2001). In two-class classification, the linear SVM fits a model f(x) = b_0 + x'_{1×p}β_{p×1} that minimizes

    \sum_{i=1}^{n}\big(1 - y_i f(x_i)\big)_+ + \lambda\|\beta\|^2.

The classification decision is then made according to sign[f(x)]. However, one weakness of the SVM is that it only estimates sign[p(x) − 0.5], but does not offer the probabilities of interest p(x) or p_i(x), where p(x) = P(Class = 1 | X = x) is the conditional probability of a point being in class 1 given the gene measurements x, and p_i(x) = P(Class = 1 | ith gene expression level). Recently, Zhu (2005) proposed a penalized logistic regression (PLR) classifier and an estimate of the underlying probability p(x), but it still lacks an investigation of the estimation of p_i(x). In this chapter, we use the CLR model to gain insight into the p_i(x) and take the p_i(x) into consideration for the classifier. The classification rule is given by sign[η(x) − a], where η(x) is the ratio statistic defined in Section 3.1.4 and a is a cut-off value depending on the gene variables used in CLR. The CLR not only performs as well as SVM, PLR and BMA in two-class classification, but also provides an estimate of the probability of interest p_i(x).

Maximum likelihood with the Newton–Raphson algorithm is the traditional way to solve CLR numerically. However, this computation is tedious for microarray data. Instead, we use a sequential minimal optimization (SMO) algorithm (Platt (1998)) to solve CLR in this chapter. An SMO algorithm for the PLR model in two-class classification was first proposed in Keerthi et al. (2002); we modify it to be applicable to the CLR model.

Besides predicting the correct cancer class for a given tumor sample, a primary challenge in microarray cancer diagnosis is to identify the relevant genes that contribute most to the outcome. We apply three gene selection methods here: univariate ranking (UR) (Dudoit et al. (2002), Golub et al. (1999) and Zhu (2004)), recursive feature elimination (RFE) (Guyon et al. (2002)), and the ratio of between-group to within-group sum of squares (BSS/WSS) (Dudoit et al. (2002)). Furthermore, we propose a modified univariate ranking (MUR) to improve the classification performance and eliminate genes. The comparison of SVM, PLR and CLR with external gene selection methods is done on two frequently studied microarray data sets. Our simulation results on the real data sets indicate that CLR with MUR tends to select fewer genes, with exceptional error rates, than SVM, PLR and BMA combined with a gene selection method. A more detailed report of the simulations is in Section 3.2. The formulation of CLR, the description of UR, RFE, BSS/WSS and MUR, and the cut-off value are described in Section 3.1.

3.1 Approach

In standard two-class classification problems, we are given a set of training data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where the input is x_i = (x_{i1}, x_{i2}, ..., x_{ip}) and the output y_i is qualitative and takes values in {+1, −1}.
We wish to find a classification rule from the training data, so that when given a new input x_·, we can assign a class label from {+1, −1} to it. Meanwhile, we wish to understand more about the relationship between the binary response and each gene variable. The relationship is described by the probability of y conditional on x_{·j} under a logistic regression model. A statistic based on these conditional probabilities accompanies each tissue sample and is compared with a cut-off value for classification.

3.1.1 Conditional Logistic Regression Formulation

Usually it is assumed that the training data are an independently and identically distributed sample from an unknown probability distribution. To estimate the curves of the conditional probabilities, we adopt a parametric approach and use the most prominent logistic models, which have the form

    P(y \mid x_{\cdot 1}) = \frac{1}{1 + \exp(-y f_1(x_{\cdot 1}))}, \quad
    P(y \mid x_{\cdot 2}) = \frac{1}{1 + \exp(-y f_2(x_{\cdot 2}))}, \quad \dots, \quad
    P(y \mid x_{\cdot p}) = \frac{1}{1 + \exp(-y f_p(x_{\cdot p}))},

where x_{·j} represents the jth gene expression level of the tissue sample x_·, and f_j(x) = b_{j0} + b_{j1}x, j = 1, 2, ..., p. Logistic regression models are usually fit by maximum likelihood. Given the training set, the negative log-likelihood is

    -\sum_{i=1}^{n}\sum_{j=1}^{p}\log P(y = y_i \mid x_{ij}) = \sum_{i=1}^{n}\sum_{j=1}^{p}\log\big(1 + \exp(-y_i f_j(x_{ij}))\big).

We hope that two similar gene expression values measured on the same gene variable lead to very close conditional probabilities. Therefore, the coefficients b_{j1} for each gene variable should be reasonably small. Additionally, to avoid the special case in which some optimal b_{j1} would be infinite, we consider the negative log-likelihood with an L2 penalty term:

    \min_{b_{j0},\,b_{j1}}\Big[C\sum_{i=1}^{n}\sum_{j=1}^{p}\log\big(1 + \exp(-y_i f_j(x_{ij}))\big) + \frac{1}{2}\sum_{j=1}^{p} b_{j1}^2\Big].   (3.1)

With gene expression arrays, typically p ≫ n. Notice that we have 2p parameters in the optimization. Since p is often in the thousands, and motivated by the logistic assumption, we turn to an SMO algorithm.

3.1.2 Algorithm

The spirit of the conditional logistic regression SMO algorithm follows the two-class SMO algorithm for penalized logistic regression by Keerthi et al. (2002); we extend the algorithm to our case. To minimize (3.1), we rewrite it as

    \min_{b_{j0},\,b_{j1}}\Big[\frac{1}{2}\sum_{j=1}^{p} b_{j1}^2 + C\sum_{i=1}^{n}\sum_{j=1}^{p} g(\xi_{ij})\Big],   (3.2)

subject to the conditions

    \xi_{ij} = -y_i\,(b_{j0} + b_{j1} x_{ij}),   (3.3)

    g(\xi_{ij}) = \log\big(1 + e^{\xi_{ij}}\big), \quad \forall\, i, j,   (3.4)

where C is the regularization parameter. The Lagrangian dual form of the problem is

    L = \frac{1}{2}\sum_{j=1}^{p} b_{j1}^2 + C\sum_{i=1}^{n}\sum_{j=1}^{p} g(\xi_{ij}) + \sum_{i=1}^{n}\sum_{j=1}^{p}\alpha_{ij}\big[-\xi_{ij} - y_i(b_{j0} + b_{j1} x_{ij})\big],

where the α_{ij} are the Lagrangian dual multipliers. By the KKT optimality conditions, we have

    \frac{\partial L}{\partial b_{j1}} = b_{j1} - \sum_{i=1}^{n}\alpha_{ij} y_i x_{ij} = 0, \quad \forall\, j,

    \frac{\partial L}{\partial b_{j0}} = \sum_{i=1}^{n}\alpha_{ij}(-y_i) = 0, \quad \forall\, j,

    \frac{\partial L}{\partial \xi_{ij}} = C\,g'(\xi_{ij}) - \alpha_{ij} = 0, \quad \forall\, i, j,

so that b_{j1} and ξ_{ij} can be expressed as functions of the α_{ij}:

    b_{j1} = \sum_{i=1}^{n}\alpha_{ij} y_i x_{ij}, \qquad \xi_{ij} = (g')^{-1}\Big(\frac{\alpha_{ij}}{C}\Big).

Define δ = α_{ij}/C and G(δ) = δ ξ_{ij} − g(ξ_{ij}); then G(δ) = δ log(δ) + (1 − δ) log(1 − δ). Therefore,

    \frac{\partial G(\delta)}{\partial \delta} = \xi_{ij} + \big(\delta - g'(\xi_{ij})\big)\frac{\partial \xi_{ij}}{\partial \delta} = \xi_{ij} = (g')^{-1}(\delta).

So the dual form of the optimization problem (3.2)–(3.4) becomes

    \max_{\alpha}\Big[-\frac{1}{2}\sum_{j=1}^{p} b_{j1}^2 - C\sum_{i=1}^{n}\sum_{j=1}^{p} G\Big(\frac{\alpha_{ij}}{C}\Big)\Big],   (3.5)

subject to the condition

    \sum_{i=1}^{n} y_i\,\alpha_{ij} = 0, \quad \forall\, j = 1, 2, \dots, p,   (3.6)

where b_{j1} = \sum_{i=1}^{n}\alpha_{ij} y_i x_{ij}, ∀ j = 1, 2, ..., p. Rewriting the above in minimization form and introducing multipliers φ_j for the equality constraints (3.6), we minimize

    L_D = \frac{1}{2}\sum_{j=1}^{p} b_{j1}^2 + C\sum_{i=1}^{n}\sum_{j=1}^{p} G\Big(\frac{\alpha_{ij}}{C}\Big) - \sum_{j=1}^{p}\phi_j\sum_{i=1}^{n} y_i\,\alpha_{ij}.
Denote

    F_{ij} = x_{ij}\, b_{j1} = x_{ij}\sum_{k=1}^{n}\alpha_{kj}\, y_k\, x_{kj}, \qquad
    H_{ij} = F_{ij} + y_i\, G'\Big(\frac{\alpha_{ij}}{C}\Big).

By the KKT conditions, we have

    \frac{\partial L_D}{\partial \alpha_{ij}} = y_i F_{ij} + G'\Big(\frac{\alpha_{ij}}{C}\Big) - \phi_j y_i = (H_{ij} - \phi_j)\, y_i = 0, \quad \forall\, i, j.

So the optimal solution vector α = (α_{11}, ..., α_{n1}, ..., α_{1p}, ..., α_{np})' has to satisfy H_{ij} = φ_j for all i = 1, 2, ..., n and for each j = 1, 2, ..., p. That is to say, if we denote

    i_{up}(j) = \arg\max_i H_{ij}, \qquad i_{low}(j) = \arg\min_i H_{ij}, \quad \forall\, j,

we must have

    H_{i_{up}(j)\,j} = H_{i_{low}(j)\,j}, \quad \forall\, j.   (3.7)

For some j ∈ {1, 2, ..., p}, suppose (i_1, i_2) satisfies H_{i_1 j} ≠ H_{i_2 j}. We define

    \tilde\alpha_{i_1 j} = \alpha_{i_1 j} + \frac{t_j}{y_{i_1}}, \qquad
    \tilde\alpha_{i_2 j} = \alpha_{i_2 j} - \frac{t_j}{y_{i_2}}, \qquad
    \tilde\alpha_{ij} = \alpha_{ij}, \;\; \forall\, i \ne i_1, i_2,\ \forall\, j,   (3.8)

and let ψ_j(t_j) be the part of the dual objective involving gene j, evaluated at α̃, that is, ψ_j(t_j) = \frac{1}{2} b_{j1}^2 + C\sum_{i=1}^{n} G(\tilde\alpha_{ij}/C). Therefore,

    \psi_j'(t_j) = F_{i_1 j} + \frac{1}{y_{i_1}}\, G'\Big(\frac{\tilde\alpha_{i_1 j}}{C}\Big) - F_{i_2 j} - \frac{1}{y_{i_2}}\, G'\Big(\frac{\tilde\alpha_{i_2 j}}{C}\Big) = H_{i_1 j} - H_{i_2 j},

where H_{i_1 j} and H_{i_2 j} are evaluated at α̃. Since H_{i_1 j} − H_{i_2 j} ≠ 0 at t_j = 0 for some j, a decrease in ψ_j is possible by choosing t_j suitably away from 0. Through this procedure, we can tell that b_{j0} = −φ_j, ∀ j = 1, 2, ..., p. Therefore, at the optimal solution α, we have

    -b_{j0} = H_{i_{up}(j)\,j} = H_{i_{low}(j)\,j}, \quad \forall\, j = 1, 2, \dots, p.

Since in a numerical solution it is usually not possible to achieve optimality exactly, we need to define approximate optimality conditions. We denote by τ an upper bound on how much violation we can tolerate. The exact KKT condition can be replaced by

    H_{i_{up}(j)\,j} - H_{i_{low}(j)\,j} \le \tau, \quad \forall\, j.

As a consequence, we may define

    -b_{j0} = \frac{H_{i_{up}(j)\,j} + H_{i_{low}(j)\,j}}{2}.

The SMO algorithm can now be described as follows:

1. Choose α^0 = (α_{11}, α_{12}, ..., α_{np})' satisfying condition (3.6). One possible initial vector is

    \alpha_{ij} = \frac{C}{m_1} \text{ for } i \text{ with } y_i = +1,\ \forall\, j, \qquad
    \alpha_{ij} = \frac{C}{m_2} \text{ for } i \text{ with } y_i = -1,\ \forall\, j,

where m_1 and m_2 denote the number of training examples in class {+1} and {−1}, respectively. Set l = 1.

2. If α^l satisfies (3.7), within the tolerance τ, for every j ∈ {1, 2, ..., p}, stop. If (3.7) does not hold for some j ∈ {1, 2, ..., p}, let

    t_j^{*} = \arg\min_{t_j}\,\psi_j(t_j), \qquad
    \alpha^{l+1}_{i_{low}(j)\,j} = \alpha^{l}_{i_{low}(j)\,j} - \frac{t_j^{*}}{y_{i_{low}(j)}}, \qquad
    \alpha^{l+1}_{i_{up}(j)\,j} = \alpha^{l}_{i_{up}(j)\,j} + \frac{t_j^{*}}{y_{i_{up}(j)}}.   (3.9)

3. Update α^{l+1} according to the above formulas and, if necessary, go back to step 2.

As stated in Keerthi (2002), the value of C should not be large. For our simulated microarray data sets, when C > 0.5 the initial value of α is outside the domain of the log function. Recall that

    b_{j0} = -\frac{H_{i_{up}(j)\,j} + H_{i_{low}(j)\,j}}{2} \quad \text{and} \quad b_{j1} = \sum_{i=1}^{n}\alpha_{ij}\, y_i\, x_{ij},

so the probability path of class y conditional on the jth gene expression level x_{·j} is

    P(y \mid x_{\cdot j}) = \frac{1}{1 + \exp\big(-y\,(b_{j0} + b_{j1} x_{\cdot j})\big)}.

3.1.3 Feature Selection

Besides predicting the correct cancer class for a new tumor sample, another challenge in microarray cancer diagnosis is to identify the relevant genes which contribute the most to the classification. In our study, we used the ratio of between-group to within-group sum of squares (BSS/WSS) (Dudoit et al. (2002)) and recursive feature elimination (RFE) to determine the initial gene order. Regarding BSS/WSS: intuitively, genes with relatively large variation between classes and relatively small variation within classes are the most likely candidates for relevant genes. BSS/WSS is a univariate gene selection method in which genes with large BSS/WSS ratios are good candidate relevant genes. For a gene j, let D_{ij} denote the expression level of gene j for sample i, \bar D_{kj} denote the average expression level of gene j over samples in class k ∈ {+1, −1}, and \bar D_{j} denote the average expression level of gene j over all samples. The BSS/WSS ratio for gene j is defined as

    \frac{BSS(j)}{WSS(j)} = \frac{\sum_{i}\sum_{k} 1(y_i = k)\,(\bar D_{kj} - \bar D_{j})^2}{\sum_{i}\sum_{k} 1(y_i = k)\,(D_{ij} - \bar D_{kj})^2}.   (3.10)

We compute the BSS/WSS ratio for each of the p genes and order the genes in descending order of the BSS/WSS ratio.
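A small sketch of (3.10) is given below, computing the ratio for all genes at once; it assumes a numpy matrix D of expression levels with samples in rows and genes in columns and labels y in {+1, −1}, and the function name is ours rather than the dissertation's.

```python
import numpy as np


def bss_wss_ratio(D, y):
    """BSS(j)/WSS(j) from (3.10), computed for every gene j at once."""
    overall = D.mean(axis=0)
    bss = np.zeros(D.shape[1])
    wss = np.zeros(D.shape[1])
    for k in (+1, -1):
        Dk = D[y == k]
        class_mean = Dk.mean(axis=0)
        bss += Dk.shape[0] * (class_mean - overall) ** 2   # sum_i 1(y_i = k)(Dbar_kj - Dbar_j)^2
        wss += ((Dk - class_mean) ** 2).sum(axis=0)        # sum_i 1(y_i = k)(D_ij - Dbar_kj)^2
    return bss / wss
```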
The gene selection method RFE depends on the classification method: we rank the genes in descending order of the absolute value of b_{j1}. The coefficient b_{j1} reflects how a change in the expression level of gene j affects the conditional probability; the larger |b_{j1}| is, the more strongly gene j influences the assignment of class, and therefore the more likely gene j is to be a relevant gene for the classification.

Aiming at gene selection, we apply BSS/WSS and RFE. As another choice, we propose a new univariate gene ranking method, judged in terms of classification performance, called modified univariate ranking (MUR). The criterion is defined as

\rho_j = \frac{\delta\,|\bar x_j^{+} - \bar x_j^{-}| + (1-\delta)\,|\sigma_j^{+} - \sigma_j^{-}|}{\sigma_j^{+} + \sigma_j^{-}},   (3.11)

where \delta \in [0, 1], and \bar x_j^{k} and \sigma_j^{k} denote the average and standard deviation of the expression levels of gene j over all samples of class k \in \{+1, -1\}. Our ranking statistic is motivated by the univariate ranking statistic

s_j = \frac{|\bar x_j^{+} - \bar x_j^{-}|}{\sigma_j^{+} + \sigma_j^{-}},

which was introduced by Dudoit et al. (2002). The statistic s_j mainly captures the location difference between the two classes for any gene j and cannot detect any difference when the means of the two classes are equal but the standard deviations differ substantially. In contrast, \rho_j considers both the location and the scale difference, even when one of them is hard to detect. Through our simulation study, we conclude that \rho_j fits some data sets, including the leukemia data, better than UR and BSS/WSS in selecting groups of genes for classification.

3.1.4 Cut-off statistic a_m

The probability path of class y conditional on the jth gene expression value x_{.j} is

P(y | x_{.j}) = \frac{1}{1 + \exp(-y (b_{j0} + b_{j1} x_{.j}))}.

Outline of the CLR procedure with iterative RFE for gene selection (a short computational sketch of the MUR ranking and the \eta statistic follows this outline):

- Input: training set D with p genes and n samples.
- Pre-processing step: use all genes for classification and rank the p genes by the value of |b_{j1}|.
- Determine a_m: at the jth iteration step, let q_j denote the ordered list consisting of the m = p \times (0.9)^{j-1} top-ranked genes. Compute the statistic

\eta_i = \frac{\prod_{j \in q_j} P(y = +1 | x_{ij})}{\prod_{j \in q_j} P(y = -1 | x_{ij})}

for each tissue sample in the training set, i = 1, 2, ..., n. The classification rule is based on comparing the \eta value with a cut-off value a_m: assign a new sample x with gene expression values x_j, j = 1, 2, ..., p, to class {+1} if \eta > a_m, and assign x to {-1} otherwise. Consider the two groups of indices C_1 = \{i \in 1, ..., n : y_i = +1\} and C_2 = \{i \in 1, ..., n : y_i = -1\}. Ideally we wish \max_{i \in C_2} \eta_i < \min_{i \in C_1} \eta_i, so that a cut-off value a_m which classifies all training samples correctly exists. In practice, we find a value a_m giving the least misclassification on the training set. The classifier \eta > a_m is applied to the test samples, and the performance of the selected m genes is evaluated by the test error.
- Iterate the previous steps until m = 1.
- Output: the probability distribution of the class y conditional on each gene variable, the selected genes at each iteration (q_m) with the associated cut-off value a_m, the training error (CV error if cross-validation is done), and the test error.

When applying BSS/WSS to select a_m, we rank all p genes by BSS(j)/WSS(j), so the gene set q_m changes at each iteration; everything else is exactly the same as the steps for RFE.
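The following sketch illustrates the MUR criterion (3.11) and the \eta statistic used with the cut-off a_m. It assumes the per-gene coefficients b_{j0} and b_{j1} have already been fitted and are stored in arrays b0 and b1; those names and the default \delta = 0.8 (the value used in Table 3.1) are assumptions made for illustration only.

```python
import numpy as np


def mur_scores(D, y, delta=0.8):
    """Modified univariate ranking rho_j of (3.11); larger means more relevant."""
    mp, mn = D[y == +1].mean(axis=0), D[y == -1].mean(axis=0)
    sp, sn = D[y == +1].std(axis=0, ddof=1), D[y == -1].std(axis=0, ddof=1)
    return (delta * np.abs(mp - mn) + (1 - delta) * np.abs(sp - sn)) / (sp + sn)


def eta_statistic(D, genes, b0, b1):
    """eta_i = prod_j P(+1|x_ij) / prod_j P(-1|x_ij) over the selected genes."""
    f = b0[genes] + b1[genes] * D[:, genes]   # f_j(x_ij) for each selected gene
    # for the logistic model, log[P(+1|x)/P(-1|x)] = f, so eta is exp of the sum
    return np.exp(f.sum(axis=1))


# classify sample i to class +1 if eta_i > a_m and to -1 otherwise, where a_m is
# chosen to minimize the misclassification rate on the training samples.
```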
Table 3.1. Comparison of classification methods for breast cancer prognosis data.

Method                    Training error   Test error   No. of genes
van't Veer et al.             15/76           2/19           70
BMA                           16/76           3/19            6
CLR MUR (delta = 0.8)         16/76           3/19            1
CLR UR                        16/76           3/19            1
CLR BSS/WSS                   16/76           3/19            1
CLR RFE                       16/76           3/19            1

Table 3.2. Output of CLR on breast cancer data.

Gene         b_{j0}    b_{j1}
AL080059     0.613    -1.193

3.2 Results

3.2.1 Breast cancer prognosis data

Refer to the description of the breast cancer prognosis data in Section 2.2.3. Using all 4918 genes and C = 0.3, our CLR algorithm provides the probability distribution of the class y conditional on each individual gene variable. For the breast cancer data, we use all 76 training samples and obtain the ratio statistic corresponding to each sample. The cut-off a_m used to define the classifier for a given set of selected genes is the one which separates the training samples with the lowest error rate; if several such a_m values exist, we take their median. Our iterative CLR algorithm combined with BSS/WSS or RFE produced 3 classification errors out of 19 test samples with 1 selected gene, AL080059. van't Veer et al. (2002) reported 2 classification errors on the test set using 70 relevant genes, and Yeung et al. (2005) reported 3 classification errors out of 19 test samples using 6 selected genes. Furthermore, our selected gene, AL080059, is the only gene common to the 70-gene set of van't Veer et al. and the 6-gene set of Yeung et al. Under our CLR classification method, the single gene AL080059 performs similarly to those two sets of genes. The gene AL080059 is ranked first by BSS/WSS and it has the largest absolute value of the coefficient b_{j1}; to some degree this confirms the internal consistency of CLR, in the sense that the top-ranked BSS/WSS gene also has a top-ranked coefficient, as we would wish. The results in Table 3.1 and Table 3.2 are obtained with C = 0.3 and a_1 = 1.29. The gene selected by CLR with MUR, UR, BSS/WSS or RFE is AL080059, which is also one of the 70 genes selected by van't Veer et al. (2002) and one of the 6 genes chosen by Yeung et al. (2005).

3.2.2 Leukemia data

This data set consists of 38 training samples and 34 test samples of the two types of acute leukemia, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) (Golub et al. (1999)). Each sample is a vector of 7129 genes. The value of C is limited to at most 0.2 because of the domain of the log function. The performance of the classifier is evaluated on the test tissue samples. C = 0.1 is used in the simulation for the results in Table 3.3. The results of Golub et al. (1999), Tibshirani et al. (2002), SVM with RFE and PLR with RFE are summarized in Table 3.4; the CV error in Table 3.4 is the 10-fold cross-validation error. The result in Table 3.5 is obtained with C = 0.1 and a_m = 0.37 x 3^6; the result in Table 3.6 is obtained with C = 0.1 and a_m = 0.34 x 3^9.

Table 3.3. Comparison of CLR on leukemia data.

Method          Training error   Test error   No. of genes
CLR MUR             0/38            1/34            6
CLR UR              0/38            2/34            8
CLR BSS/WSS         0/38            1/34            9
CLR RFE             0/38            1/34            9

Table 3.4. Comparison of leukemia classification methods.

Method               CV error   Test error   No. of genes
Golub et al.           3/38        4/34           50
Tibshirani et al.      2/38        2/34           21
SVM RFE                2/38        1/34           31
PLR RFE                2/38        1/34           26

Table 3.5. Parameters for the chosen genes in the final classifier from CLR MUR.

Gene index    1834    1882    2288    2642    4847    5772
b_{j0}        0.96    0.95    0.96    0.97    0.97    0.99
b_{j1}       -0.72   -0.67   -0.69    0.56   -0.78    0.68
Table 3.6. Parameters for the selected genes in the final classifier from CLR RFE.

Gene index     461     745    1834    2020    3320    3847    4196    4847    5039
b_{j0}        0.99    0.97    0.96    1.00    0.99    0.99    0.98    0.97    0.97
b_{j1}       -0.73   -0.73   -0.72   -0.79   -0.79   -0.73   -0.71   -0.78   -0.75

3.3 Conclusion

We have proposed CLR for the 2-class microarray cancer diagnosis classification problem. Besides performing classification, CLR offers an estimate of the conditional probability P(y | x_{.j}) for each gene variable. The simulation results on the real leukemia data and breast cancer prognosis data show that, when using the same set of genes, CLR identifies fewer significant genes with a smaller error rate than SVM and PLR. When the modified univariate ranking method is used to select relevant genes for leukemia classification, CLR attains a 10-fold CV error of 0 and a single test error. Moreover, it removes more genes than SVM and PLR, which makes the set of chosen genes more manageable. We therefore consider CLR a good classification method for two-class microarray cancer data, with the added benefit that it can estimate the underlying distribution.

CHAPTER 4

Asymptotics and Variable Screening for Microarrays with Fixed Small Sample Size and Large Dimension

DNA microarray is a new and promising biotechnology which allows the monitoring of expression levels for thousands of genes simultaneously. Microarrays are being applied more and more often in biological and medical research to address a wide range of problems, including identifying a set of candidate genes that are most likely related to the outcome of the experiment. Statistical considerations are frequently to the fore in the analysis of microarray data, as researchers sift through massive amounts of data and adjust for various sources of variability in order to identify the most important genes among the many which are measured. However, in almost all studies there are many more candidate genes in a microarray than available samples, which makes the application of traditional statistical techniques improper. Some existing statistical methods deal with a single gene at a time (see Chen et al. (2003)), or deal with the case in which the sample size n tends to infinity (see Shao (2006)). The present chapter provides insight into the analysis of microarray data in the more realistic case where the sample size is fixed and the dimension of the variables p tends to infinity.

By applying some type of shrinkage estimation to a linear model, we gain more clues for interpreting the effects of the predictors. For example, the best subset selection of size k shrinks the coefficients by setting some of them exactly to zero, which makes the data much easier to interpret; in other words, it identifies the important variables for the outcome. This process is discrete, since each gene is either retained or dropped, so the result obtained after variable selection can be extremely unstable. Another type of shrinkage estimation, the lasso (see Zhu (2003, 2004) and Tibshirani (1996)), could be a very good choice. Knight and Fu (2000) proved that the lasso estimator is consistent when the regularization parameter divided by the sample size tends to 0 as the sample size increases to infinity and the number of predictors p is fixed. But this condition does not hold for real microarray data, and it is hardly appropriate to rely on asymptotics for n tending to infinity when in reality the sample size n is fixed. In practice, it is often true that a given outcome of interest is affected significantly by only a few genes among the large number of candidate genes, and the rest of the genes are approximately irrelevant to the outcome.
Under such an idea, our task is to identify these important genes based on the sampled data. This is actually a variable selection problem. In the current statistical literature, however, there is no established variable selection procedure that can deal with the number of variables p tending to infinity while the sample size n is fixed. Zheng and Loh (1997) considered linear model selection with high-dimensional covariates, but they assumed that the dimension of the covariates over the sample size tends to 0 as the sample size increases to infinity. In this chapter, we use the ridge regression estimator (see Hoerl and Kennard (1970)), which shrinks the coefficients but does not set any of them exactly to 0. The problem considered concerns the asymptotic properties of microarray data analysis with fixed sample size n and very large dimension p of gene variables. Furthermore, we propose a variable screening method to eliminate the insignificant candidate genes and prove its consistency. The shrinkage of ridge regression is continuous, and therefore quite stable. After screening out genes, if necessary, we may apply an established variable selection method, such as AIC or BIC, to select a final set of genes and fit a linear model which interprets the relationship between the predictors and the outcome in a concise way. Mean squared error (MSE) consistency of the ridge regression estimators, as the dimension p tends to infinity and the sample size is fixed, is proved in Section 4.1. The variable screening method is described in Section 4.2. Finally, a simulation study is presented in Section 4.3 to investigate the performance of the variable screening method.

4.1 Mean Squared Error Consistency

Consider the model Y = X\beta + \varepsilon, where E(\varepsilon) = 0 and Var(\varepsilon) = \sigma_p^2 I_n, X is an n x p matrix and \beta is a p x 1 vector. Apply the ridge regression method to obtain an estimator of \beta, i.e.,

\min_{\beta} (Y - X\beta)'(Y - X\beta),   (4.1)

subject to the condition \sum_{j=1}^{p} \beta_j^2 \le t_p. According to Fu (2004), this is equivalent to the L2-norm SVM model

\min_{\beta} (Y - X\beta)'(Y - X\beta) + h_p \sum_{j=1}^{p} \beta_j^2,   (4.2)

where h_p is a regularization parameter. The equivalence is in the sense that for any positive constant t_p there exists a positive constant h_p such that model (4.1) and model (4.2) share the same optimal solution \hat\beta, so we can stay focused on model (4.2). The convex objective function makes it feasible to obtain the optimal solution in closed form,

\hat\beta = (X'X + h_p I_p)^{-1} X'Y.   (4.3)

Throughout the chapter, X is the gene expression data and Y is the outcome vector for the n samples.
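As an aside, the ridge estimator (4.3) can be computed without forming the p x p matrix X'X by using the identity (X'X + h_p I_p)^{-1} X' = X'(XX' + h_p I_n)^{-1}, which is the dual-variable form of ridge regression in the spirit of Saunders and Gammerman (1998). The sketch below, with hypothetical names, only assumes that p >> n.

```python
import numpy as np


def ridge_estimator(X, Y, h_p):
    """beta_hat = (X'X + h_p I_p)^{-1} X'Y, computed through the n x n dual
    system, which avoids the p x p matrix when p >> n."""
    n = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + h_p * np.eye(n), Y)   # n x n linear system
    return X.T @ alpha                                       # p-dimensional estimate
```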
In this section, we discuss the MSE consistency of \hat\beta when n is fixed while p tends to infinity. Because p enters through the dimension of X, we state the conditions on the components of X as the dimension p tends to infinity.

Assumption A. There exists a constant 0 \le \delta < 0.5 such that each component of X is O(p^{\delta}); moreover, \sigma_p = O(1), \sigma_p^2/h_p = o(1) and h_p/p^{1-2\delta} = O(1) as p \to \infty.

Theorem 1. Under Assumption A, \max_{1 \le i \le p} var(\hat\beta_i) \to 0 as p \to \infty.

Proof.

var(\hat\beta) = (X'X + h_p I_p)^{-1} X' (\sigma_p^2 I_n) X (X'X + h_p I_p)^{-1}
            = \frac{\sigma_p^2}{h_p} \Big(\frac{X'X}{h_p} + I_p\Big)^{-1} \frac{X'X}{h_p} \Big(\frac{X'X}{h_p} + I_p\Big)^{-1}
            = \frac{\sigma_p^2}{h_p} \Big[ \Big(\frac{X'X}{h_p} + I_p\Big)^{-1} - \Big(\frac{X'X\,X'X}{h_p^2} + \frac{2\,X'X}{h_p} + I_p\Big)^{-1} \Big].

So we have

\frac{h_p}{\sigma_p^2}\, var(\hat\beta) = A^{-1} - B^{-1},

where A = X'X/h_p + I_p and B = X'X\,X'X/h_p^2 + 2\,X'X/h_p + I_p = A^2. Because each component of X is O(p^{\delta}) and h_p/p^{1-2\delta} = O(1) under Assumption A, both A^{-1} and B^{-1} are O(1)_{p \times p}; indeed, since A \succeq I_p,

\sup_{\|a\|=1} |a' A^{-1} a - a' B^{-1} a| \le 1/4.

Hence the diagonal elements of (h_p/\sigma_p^2)\, var(\hat\beta) are uniformly bounded, and since \sigma_p^2/h_p = o(1),

\max_{1 \le i \le p} var(\hat\beta_i) \to 0, \quad \text{as } p \to \infty.

For the analysis of the bias vector, we need the following two assumptions.

Assumption B. Only p_0 components of \beta are nonzero (p_0 does not depend on p). Furthermore, \beta lies in the linear space generated by the rows of X'X for sufficiently large p, i.e., there is a p x 1 vector b such that \beta = X'X b when p is large enough.

As p \to \infty, n is fixed throughout the chapter, and X'X has at most n positive eigenvalues. Let \lambda_{ip} be the ith nonzero eigenvalue of X'X, i = 1, 2, ..., n; without loss of generality we assume \lambda_{ip} > 0.

Assumption C. There is a sequence of positive numbers \xi_p \to \infty such that h_p = o(\xi_p). Moreover, there exists a finite positive constant c, not depending on p, such that \lambda_{ip} \ge c\,\xi_p for all i = 1, 2, ..., n. (Normally we can take \xi_p = p.)

Theorem 2. Under Assumptions B and C, \max_{1 \le i \le p} |bias(\hat\beta_i)| \to 0 as p \to \infty.

Proof. When p is large enough, let \Gamma be an orthogonal matrix such that

\Gamma' X'X \Gamma = \begin{pmatrix} \Lambda_{n \times n} & 0_{n \times (p-n)} \\ 0_{(p-n) \times n} & 0_{(p-n) \times (p-n)} \end{pmatrix},

where \Lambda_{n \times n} is a diagonal matrix with elements \lambda_{ip}, i = 1, 2, ..., n. It then follows that

bias(\hat\beta) = E(\hat\beta) - \beta = (X'X + h_p I_p)^{-1} X'X \beta - \beta = -h_p (X'X + h_p I_p)^{-1} \beta
             = -\Gamma \Big(\frac{\Gamma' X'X \Gamma}{h_p} + I_p\Big)^{-1} \Gamma' \beta = -\Gamma\, C\, \Gamma' \beta,

where C = (\Gamma' X'X \Gamma / h_p + I_p)^{-1} is a diagonal matrix whose first n diagonal elements are h_p/(h_p + \lambda_{ip}), i = 1, 2, ..., n, and whose remaining p - n diagonal elements all equal 1. Under Assumption B,

\Gamma' \beta = \Gamma' X'X b = (\Gamma' X'X \Gamma)\, \Gamma' b = \begin{pmatrix} \Lambda_{n \times n} & 0_{n \times (p-n)} \\ 0_{(p-n) \times n} & 0_{(p-n) \times (p-n)} \end{pmatrix} \Gamma' b.

Notice \Gamma'\beta is a p x 1 vector whose only possibly nonzero components are among its first n components, and by the assumption that only p_0 components of \beta are nonzero, the first n components of \Gamma'\beta are bounded. Since \lambda_{ip} \ge c\,\xi_p for all i = 1, 2, ..., n, we obtain \max_{1 \le i \le n} h_p/(h_p + \lambda_{ip}) = O(h_p/\xi_p) = o(1). So \max_{1 \le i \le p} |bias(\hat\beta_i)| \to 0.

From the above facts the following result is also of interest.

Lemma 1. Under Assumptions A, B and C, the estimator \hat\beta_i is MSE consistent for \beta_i, for all i = 1, 2, ..., p, as p \to \infty.

Proof. MSE(\hat\beta_i) = E(\hat\beta_i - \beta_i)^2 = var(\hat\beta_i) + bias^2(\hat\beta_i) \to 0, since both terms approach 0 as p \to \infty.

4.2 Variable Screening Procedure

Let a_p be a sequence of positive numbers satisfying a_p \to 0 as p \to \infty. For each value of p, we screen out the ith gene if and only if |\hat\beta_i| \le a_p. Therefore, after the screening procedure, only genes with |\hat\beta_i| > a_p remain in the model as predictors. The sequence a_p acts as a filter in the process and eliminates genes with relatively small coefficients (a short computational sketch of the procedure follows below). We will prove the consistency of the procedure when only finitely many components of \beta are nonzero.
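A minimal sketch of the screening step is given below. The default tuning constants follow the simulation settings of Section 4.3 (h_p = p^{0.25}, a_p = p^{-1/11}); they are illustrative choices only, and the function name is hypothetical.

```python
import numpy as np


def screen_genes(X, Y, h_p=None, a_p=None):
    """Keep gene i iff |beta_hat_i| > a_p, with beta_hat the ridge estimate (4.3)."""
    n, p = X.shape
    h_p = p ** 0.25 if h_p is None else h_p            # illustrative default
    a_p = p ** (-1.0 / 11.0) if a_p is None else a_p   # illustrative default
    beta_hat = X.T @ np.linalg.solve(X @ X.T + h_p * np.eye(n), Y)
    keep = np.flatnonzero(np.abs(beta_hat) > a_p)      # indices surviving the filter
    return keep, beta_hat
```

After screening, the retained columns of X can be passed to an ordinary variable selection step such as AIC or BIC, as described above.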
Theorem 3. Suppose that Assumptions A, B and C hold and that only finitely many components of \beta are nonzero. Assume E|\varepsilon_i|^{2k} < \infty for some positive integer k, and that a_p, h_p and \xi_p are chosen so that h_p/\xi_p = o(a_p) and

p \Big(\frac{h_p}{\xi_p a_p}\Big)^{2k} \to 0, \qquad \frac{p^{1+2k\delta}}{(h_p a_p)^{2k}} \to 0, \qquad \text{as } p \to \infty.

Then

\lim_{p \to \infty} P\big( |\hat\beta_i| > a_p \ \text{for all } i \text{ with } \beta_i \neq 0 \ \text{and} \ |\hat\beta_i| \le a_p \ \text{for all } i \text{ with } \beta_i = 0 \big) = 1.   (4.4)

Proof. Result (4.4) is equivalent to the pair of statements

\lim_{p \to \infty} P\big( |\hat\beta_i| > a_p \ \text{for all } i \text{ with } \beta_i \neq 0 \big) = 1,   (4.5)

\lim_{p \to \infty} P\big( |\hat\beta_i| \le a_p \ \text{for all } i \text{ with } \beta_i = 0 \big) = 1.   (4.6)

For (4.5),

P(|\hat\beta_i| > a_p \ \text{for all } i \text{ with } \beta_i \neq 0)
  = 1 - P(\text{there exists at least one } i \text{ with } \beta_i \neq 0 \text{ satisfying } |\hat\beta_i| \le a_p)
  \ge 1 - \sum_{\{i:\ \beta_i \neq 0\}} P(|\hat\beta_i| \le a_p).

By Lemma 1, each \hat\beta_i with \beta_i \neq 0 is MSE consistent for \beta_i, and a_p \to 0, so each term P(|\hat\beta_i| \le a_p) \to 0. Since only finitely many components of \beta are nonzero, this finishes the proof of (4.5).

If \beta_i = 0, then

\hat\beta - E(\hat\beta) = (X'X + h_p I_p)^{-1} X'(X\beta + \varepsilon) - (X'X + h_p I_p)^{-1} X'X\beta = (X'X + h_p I_p)^{-1} X'\varepsilon = B_{p \times n}\, \varepsilon_{n \times 1},

where

B = (X'X + h_p I_p)^{-1} X' = \frac{1}{h_p}\Big(\frac{X'X}{h_p} + I_p\Big)^{-1} X' = \big( O(p^{\delta}/h_p) \big)_{p \times n}.

Let m_i denote the ith component of B\varepsilon, so that \hat\beta_i = bias(\hat\beta_i) + m_i when \beta_i = 0. By Markov's inequality,

P(|\hat\beta_i| > a_p) \le \frac{E|\hat\beta_i|^{2k}}{a_p^{2k}} = \frac{E|bias(\hat\beta_i) + m_i|^{2k}}{a_p^{2k}} \le \frac{2^{2k-1}\big( |bias(\hat\beta_i)|^{2k} + E|m_i|^{2k} \big)}{a_p^{2k}}.

From the discussion in the previous section, bias(\hat\beta_i) = O(h_p/\xi_p); from the analysis of the matrix B, together with E|\varepsilon_l|^{2k} < \infty and n fixed, E|m_i|^{2k} = O\big( (p^{\delta}/h_p)^{2k} \big). Consequently,

P(|\hat\beta_i| > a_p \ \text{for at least one } i \le p \text{ with } \beta_i = 0)
  \le \sum_{\{i:\ \beta_i = 0\}} P(|\hat\beta_i| > a_p)
  \le p \max_i P(|\hat\beta_i| > a_p)
  = O\Big( p \Big(\frac{h_p}{\xi_p a_p}\Big)^{2k} \Big) + O\Big( \frac{p^{1+2k\delta}}{(h_p a_p)^{2k}} \Big).

For the choices of (h_p, \xi_p, a_p) given in Theorem 3, both O(\cdot) terms in the above expression are o(1), which establishes result (4.6) and completes the proof.

The next result establishes the consistency of the variable screening method when the moment condition on \varepsilon and the constraints on X and h_p are replaced by a normality assumption on the residuals.

Theorem 4. Suppose Assumptions A-C hold and \varepsilon is normally distributed. Additionally, a_p, h_p and \xi_p are chosen so that a_p = o(1), h_p/\xi_p = o(a_p) and h_p a_p^2/(\sigma_p^2 \log \xi_p) \to \infty. Then result (4.4) holds.

Proof. Again, (4.4) is equivalent to (4.5) and (4.6), and the proof of (4.5) is exactly the same as in the proof of Theorem 3; it remains to show (4.6). If \beta_i = 0, then under the normality assumption on \varepsilon, \hat\beta_i is normally distributed for each i, so

P(|\hat\beta_i| > a_p) = P(\hat\beta_i > a_p) + P(\hat\beta_i < -a_p)
  = \Phi\Big(\frac{bias(\hat\beta_i) - a_p}{sd(\hat\beta_i)}\Big) + \Phi\Big(\frac{-bias(\hat\beta_i) - a_p}{sd(\hat\beta_i)}\Big),

where \Phi is the standard normal distribution function. In Theorem 2 we proved that bias(\hat\beta_i) = O(h_p/\xi_p) for all i = 1, 2, ..., p, so by the condition h_p/\xi_p = o(a_p) the bias part is o(a_p). From Theorem 1, sd(\hat\beta_i) is bounded by a constant multiple of \sigma_p/\sqrt{h_p} for all i = 1, 2, ..., p. Hence, for all large enough p there exists a constant t \in (0, 1) such that

P(|\hat\beta_i| > a_p) \le 2\,\Phi\big( -t \sqrt{h_p}\, a_p / \sigma_p \big).

By the assumption h_p a_p^2/(\sigma_p^2 \log \xi_p) \to \infty, we have t \sqrt{h_p}\, a_p/\sigma_p \ge \sqrt{2 q \log \xi_p} for all large enough p, where q is a constant such that p/\xi_p^{q} \to 0. Now apply the inequality \Phi(-x) \le e^{-x^2/2} for any x \ge 1; the probability of interest becomes

P(|\hat\beta_i| > a_p) \le 2\,\Phi\big( -\sqrt{2 q \log \xi_p} \big) \le 2\, e^{-q \log \xi_p} = 2\, \xi_p^{-q}.

It follows that

P(|\hat\beta_i| > a_p \ \text{for at least one } i \text{ with } \beta_i = 0) \le \sum_{\{i:\ \beta_i = 0\}} P(|\hat\beta_i| > a_p) \le 2p/\xi_p^{q} \to 0,

so \lim_{p \to \infty} P(|\hat\beta_i| \le a_p \ \text{for all } i \text{ with } \beta_i = 0) = 1. That finishes the proof of (4.6).

4.3 Simulation Result

A simulation study was carried out to investigate the performance of the proposed variable screening method with fixed n and increasing p, in order to study the asymptotic effect. We consider two values of p, namely p = 360 and p = 600, with the same n = 60 to evaluate the MSE consistency; for the variable selection part we consider p = 360 and p = 1200. Since the convergence rate can also be influenced by the variability of the residual \varepsilon, as in Theorem 1, we consider two values of the error standard deviation, \sigma = 1 and \sigma = p^{-1/9}, combined with each p. We assume that only the first 5 of all candidate genes are related to Y (p_0 = 5). A flowchart for generating the microarray data X is shown in Figure 4.1. Our n samples are drawn from the multivariate normal distribution N(p^{0.2} 1_n, I_n). In each of 1,000 simulation runs, y_i was generated according to y_i = x_i'\beta + \varepsilon_i, i = 1, 2, ..., n, where the \varepsilon_i were independently generated from N(0, \sigma^2). In Table 4.1 we report the MSE for the first 10 genes. The convergence tendency is quite clear. For distinguished nonzero \beta_i values, the MSE convergence rate is slower than that of the \beta's which are zero. In the simulation, the variability of the error does not significantly affect the MSE convergence rate.
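The sketch below reproduces the flavor of one simulation run. For simplicity it draws the entries of X as i.i.d. N(0,1), which is a simplification of the generation scheme in Figure 4.1; the seed, the function name and the returned summaries are likewise assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate_run(n=60, p=360, sigma=1.0, p0=5):
    """One run in the spirit of Section 4.3: only the first p0 coefficients of
    beta are nonzero.  X is drawn i.i.d. N(0,1) here, a simplification of the
    flowchart in Figure 4.1."""
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:p0] = [1.4, -2.5, 1.8, -1.7, 1.2]
    Y = X @ beta + sigma * rng.standard_normal(n)
    # ridge estimate and screening with h_p = p^0.25, a_p = p^(-1/11)
    h_p, a_p = p ** 0.25, p ** (-1.0 / 11.0)
    beta_hat = X.T @ np.linalg.solve(X @ X.T + h_p * np.eye(n), Y)
    keep = np.flatnonzero(np.abs(beta_hat) > a_p)
    return len(keep), set(range(p0)).issubset(keep)


# tally (number of remaining variables, whether all 5 relevant genes survive):
# results = [simulate_run() for _ in range(1000)]
```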
Figure 4.1. Flowchart of data generation: n samples are first drawn from a multivariate normal distribution; at each step i, a block of N(0,1) entries (or a block of zeros) is generated and column-bound to the previous matrix; finally a diagonal matrix with random diagonal elements is column-bound to produce the simulation data X.

In Table 4.2, each simulation run was applied to the generated data X with a true \beta vector whose nonzero components are quite close to 0. We carry out the simulation with h_p = p^{0.25} and a_p = p^{-1/11}; these two parameters are chosen to satisfy the conditions of Theorem 4. The results in Table 4.2 for small coefficients indicate that the proposed variable screening method gradually leaves out the unrelated gene variables. The consistency of the screening method is strongly supported by the simulation study. To show that the screening method is stable no matter how big or how small the nonzero components of \beta are, we also provide simulation results for large nonzero \beta_i in Table 4.3, where each simulation run was applied to the generated data X with h_p = p^{0.25} and a_p = 11 x p^{-1/11}. The results in Table 4.3 are essentially perfect even for small p and large variability of the residual; the reason lies in the MSE consistency of the estimators. Tables 4.2 and 4.3 report the performance of the variable screening method in terms of two measures, with results for f1, f2, g1, g2, h1 and h2. Note that it is possible that, after screening, some of the 5 variables related to Y are not selected even though the number of remaining variables is 5 or larger.

Table 4.1. Simulation results for MSE consistency based on 1,000 runs (X fixed), with h_p = p^{0.25}, n = 60, and X generated as in the flowchart.

                               gene:      1       2       3       4       5       6       7       8       9      10
sigma = 1         p = 360   m1         0.281   0.747   0.129   0.317   0.204   0.022   0.019   0.103   0.025   0.084
                  p = 600              0.095   0.263   0.116   0.168   0.149   0.048   0.027   0.016   0.017   0.024
sigma = p^{-1/9}  p = 360              0.242   0.466   0.354   0.265   0.084   0.006   0.032   0.017   0.076   0.008
                  p = 600              0.146   0.272   0.251   0.181   0.019   0.022   0.008   0.016   0.015   0.003
sigma = 1         p = 360   m2         59.04   43.49   85.82   31.96   97.82   0.419   0.164   23.58   0.032   0.144
                  p = 600              17.11   27.75   40.77   12.91   34.21   0.013   0.424   5.763   0.097   3.638
sigma = p^{-1/9}  p = 360              38.89   84.19   63.94   81.19   40.76   15.23   1.782   2.832   1.046   2.802
                  p = 600              24.11   65.20   38.55   53.41   27.65   11.90   0.003   0.890   0.376   1.642

m1 = MSE of the first 10 gene variables with \beta = (1.4, -2.5, 1.8, -1.7, 1.2, 0, 0, ..., 0)'
m2 = MSE of the first 10 gene variables with \beta = (21.4, -22.5, 21.8, -21.7, 21.2, 0, 0, ..., 0)'
Table 4.2. Simulation results for close-to-0 nonzero \beta_i values based on 1,000 runs (X fixed). Columns give the frequency of each number of remaining variables after screening.

sigma      n    p      statistic   <=3     4     5     6     7     8     9   >=10
1          60   360    f1            0    14   833   137    16     0     0     0
                       f2            0    14   833   137    16     0     0     0
                       g1            0    15   881    97     7     0     0     0
                       g2            0    15   880    97     7     0     0     0
                       h1            0    15   921    62     2     0     0     0
                       h2            0    15   920    62     2     0     0     0
                1200   f1            0    73   891    36     0     0     0     0
                       f2            0    73   887    36     0     0     0     0
                       g1            0    75   896    29     0     0     0     0
                       g2            0    75   896    29     0     0     0     0
                       h1            0    76   904    20     0     0     0     0
                       h2            0    76   904    20     0     0     0     0
p^{-1/9}   60   360    f1            0     3   835   162     0     0     0     0
                       f2            0     3   833   162     0     0     0     0
                       g1            0     4   944    52     0     0     0     0
                       g2            0     4   944    52     0     0     0     0
                       h1            0     5   982    13     0     0     0     0
                       h2            0     5   982    13     0     0     0     0
                1200   f1-h2         0     0  1000     0     0     0     0     0

Table 4.2 is simulated with \beta = (1.4, -2.5, 1.8, -1.7, 1.2, 0, 0, ..., 0)', h_p = p^{0.25}, a_p = p^{-1/11}, p_0 = 5.
f1 = frequencies of the number of remaining variables after screening
f2 = frequencies of including all 5 relevant variables after screening
g1 = frequencies of the number of remaining variables after screening and AIC
g2 = frequencies of including all 5 relevant variables after screening and AIC
h1 = frequencies of the number of remaining variables after screening and BIC
h2 = frequencies of including all 5 relevant variables after screening and BIC
When the number of remaining variables is less than 5, f1, f2, g1, g2, h1 and h2 count only the runs in which all remaining variables are related to Y.

Table 4.3. Simulation results for distinguished nonzero \beta_i values based on 1,000 runs (X fixed). For every combination of sigma in {1, p^{-1/9}}, n = 60 and p in {360, 1200}, each of the six statistics f1, f2, g1, g2, h1 and h2 places all 1,000 runs in the "5" column and 0 in every other column; that is, the screening retained exactly the 5 relevant variables in every run. The table is obtained with h_p = p^{0.25}, a_p = 11 x p^{-1/11}, p_0 = 5 and \beta = (21.4, -22.5, 21.8, -21.7, 21.2, 0, 0, ..., 0)'. f1, f2, g1, g2, h1 and h2 are the same as those in Table 4.2.

4.4 Conclusions and Discussions

The established asymptotic results show that, for a fixed sample size n, the proposed variable screening method has good properties when the dimension p is sufficiently large. It performs essentially perfectly when the genes related to the outcome have a significant influence on the outcome. For applications, we need to carry out some sensitivity analysis on the gene expression data X to check the regularity conditions, and to determine a_p and h_p in the variable screening procedure. More research on how to reduce the requirements on X is called for.

The use of ridge regression is crucial both for the MSE consistency of the estimators of the true parameters and for the proposed variable screening method. Since ridge regression is a type of shrinkage estimation, the use of other SVM estimators might also produce a good screening method, for example the L1-norm SVM (lasso) estimator. However, the lasso solution path does not have a closed-form expression (Tibshirani (1996)), whereas the properties of the ridge regression estimator are much easier to obtain in practice because of its known solution (Saunders and Gammerman (1998)).

Penalized linear models are considered for Y and X in this chapter. In general, the regression function between Y and X may not be strictly linear.
When there exists a kernel transformation K(.) such that the regression of Y on K(X) is linear in the feature space, our results on MSE consistency and on the consistency of the screening method can both be extended to the feature space for the transformed variables. In the present chapter, our proofs and simulations are for a fixed design matrix; more theoretical and numerical work for random design matrices is necessary.

CHAPTER 5

Gene Selection Methods for Microarray Data

It is important to know which genes are most relevant to the binary classification task and to select these genes, for a variety of reasons: removing noisy or irrelevant genes may improve the performance of the classifier; a candidate list of important genes can be used to further understand the biology of the disease and to design further experiments; and a clinical device recording on the order of tens of genes is much more economical and practical than one requiring thousands of genes. The gene selection problem is an example of what is called feature selection in machine learning. Gibbs sampling (George (1993)) has been used as one variable selection method. In the context of classification, feature selection methods fall into two categories, filter methods and wrapper methods. Filter methods select features according to criteria that are independent of those that the classifier optimizes; wrapper methods, on the other hand, use the same or similar criteria as the classifier. We will discuss five feature selection approaches: univariate ranking (UR, Golub et al., 2000), recursive feature elimination (RFE, Guyon et al., 2002), the ratio of between-group sum of squares to within-group sum of squares (BSS/WSS, Dudoit et al., 2002), clustering (Sengupta (2003)), and pairwise ranking (Jonassen (2002)). They are either filter methods or wrapper methods. There are other, more biologically driven methods for gene selection; see Dudoit (2000), Chow (2001), Kim (2002) and Jaeger (2003).

5.1 Univariate Ranking

The univariate ranking method is also called signal-to-noise or P-metric. It defines a statistic S_j for each gene variable j and assigns a rank according to the value of the statistic; S_j depends only on the gene expression data. For each gene, we compute the statistic

S_j = \frac{\mu_{+}(j) - \mu_{-}(j)}{\sigma_{+}(j) + \sigma_{-}(j)},

where \mu_{+}(j) and \mu_{-}(j) are the means of classes +1 and -1 for the jth gene, and \sigma_{+}(j) and \sigma_{-}(j) are the corresponding standard deviations of the two classes for the jth gene. Genes that give the most positive values are most correlated with class +1, and genes that give the most negative values are most correlated with class -1. One selects the m/2 most positive genes and the m/2 most negative genes, and then uses this reduced data set for classification.
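A minimal sketch of the signal-to-noise ranking described above; the function names are hypothetical.

```python
import numpy as np


def ur_scores(D, y):
    """Signal-to-noise statistic S_j = (mu_+(j) - mu_-(j)) / (sigma_+(j) + sigma_-(j))."""
    mp, mn = D[y == +1].mean(axis=0), D[y == -1].mean(axis=0)
    sp, sn = D[y == +1].std(axis=0, ddof=1), D[y == -1].std(axis=0, ddof=1)
    return (mp - mn) / (sp + sn)


def select_ur(D, y, m):
    """Keep the m/2 most positive and m/2 most negative genes."""
    order = np.argsort(ur_scores(D, y))              # ascending scores
    return np.concatenate([order[-(m // 2):], order[: m // 2]])
```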
A basic question that arises for any feature selection algorithm is how many genes the classifier should use. One approach to answering this question is hypothesis and permutation testing (Golub et al. (1999)). The null hypothesis is that the UR statistic for each gene computed on the training set comes from the same distribution as that for a random data set, where a random data set is the training set with its labels randomly permuted. In detail, the permutation test procedure for the UR statistic is as follows:

(1) Generate the statistic for all genes using the actual class labels and sort the genes accordingly.

(2) Generate 100 or more random permutations of the class labels. For each set of randomized class labels, generate the statistics for all genes and sort the genes accordingly.

(3) Build a histogram from the randomly permuted statistics using various numbers of genes; call this number k. For each value of k, determine different percentiles (1%, 5%, 50%, etc.) of the corresponding histogram.

(4) Compare the actual signal-to-noise scores with the significance levels obtained from the histograms of permuted class labels for each value of k.

See the figure for an illustration. The solid curve is the rank-ordered UR statistic computed on the training set; the three dashed curves are the 5th, 50th and 95th percentiles of the same rank-ordered statistic computed from the random data. The number of significant genes is designated as the value of k at which the solid curve crosses the 5th percentile curve.

5.2 Recursive Feature Elimination

This method recursively removes features based upon the absolute magnitude of the hyperplane elements. We first outline the approach for linear SVMs. Given microarray data with p genes per sample, the SVM outputs the normal to the hyperplane, w, a vector with p components, each corresponding to the expression of a particular gene. Loosely speaking, assuming that the expression values of each gene have similar ranges, the absolute magnitude of each element of w determines its importance in classifying a sample, since the following equation holds:

f(x) = w \cdot x + b = \sum_{j=1}^{p} w_j x_j + b.

The idea behind RFE is to eliminate elements of w that have small magnitude, since they contribute little to the classification function. The SVM is trained with all genes; then we compute the statistic S(j) = |w_j| for each gene, where w_j is the jth element of w. We sort S from largest to smallest value and remove the genes corresponding to indices that fall in the bottom 10% of the sorted list S. The SVM is retrained on this smaller gene expression set, and the procedure is repeated until a desired number of genes, m, is obtained.

When a nonlinear SVM is used, the idea is to remove those features that affect the margin the least, since maximizing the margin is the objective of the SVM (Papageorgiou et al. (1998)). The nonlinear SVM has a solution of the form

f(x) = \sum_{i=1}^{l} c_i K(x_i, x) + b.

Let M denote the margin. Then we obtain

\frac{1}{M^2} = \sum_{s, r = 1}^{l} c_s c_r K(x_s, x_r) = \Big\langle \sum_{i} c_i \phi(x_i), \sum_{j} c_j \phi(x_j) \Big\rangle = \|w\|^2.

So for each gene j we compute to what extent the margin changes, using the statistic

S(j) = \Big| \frac{\partial (1/M^2)}{\partial x_j} \Big|,   (5.1)

where x_j is the jth element of the vector of expression values x. We then sort S from the largest to the smallest value and remove the genes corresponding to indices that fall in the bottom 10% of the sorted list S. The SVM is retrained and the procedure is repeated just as in the linear case.
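The linear SVM-RFE loop can be sketched as follows, here using scikit-learn's linear SVC purely as a convenient stand-in for the SVM training step; the function name and the default C are assumptions, while the 10% drop fraction follows the description above.

```python
import numpy as np
from sklearn.svm import SVC


def svm_rfe(X, y, n_keep, drop_frac=0.10, C=1.0):
    """Linear SVM-RFE: repeatedly drop the bottom fraction of surviving genes
    by |w_j| until n_keep genes remain; returns the surviving gene indices."""
    surviving = np.arange(X.shape[1])
    while surviving.size > n_keep:
        w = SVC(kernel="linear", C=C).fit(X[:, surviving], y).coef_.ravel()
        n_drop = max(1, int(drop_frac * surviving.size))
        n_drop = min(n_drop, surviving.size - n_keep)    # do not overshoot n_keep
        smallest = np.argsort(np.abs(w))[:n_drop]        # positions of smallest |w_j|
        surviving = np.delete(surviving, smallest)
    return surviving
```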
5.3 Ratio of Between-group Sum of Squares to Within-group Sum of Squares

The ratio of between-group to within-group sum of squares (BSS/WSS) was first introduced by Dudoit et al. (2002). Intuitively, genes with relatively large variation between classes and relatively small variation within classes are likely candidates for relevant genes; accordingly, BSS/WSS is a univariate gene selection method in which genes with large BSS/WSS ratios are good candidate relevant genes. For a gene j, let D_{ij} denote the expression level of gene j under sample i, \bar D_{kj} the average expression level of gene j over samples in class k \in \{+1, -1\}, and \bar D_{.j} the average expression level of gene j over all samples. The BSS/WSS ratio for gene j is defined as

\frac{BSS(j)}{WSS(j)} = \frac{\sum_{i} \sum_{k} 1(y_i = k)\,(\bar D_{kj} - \bar D_{.j})^2}{\sum_{i} \sum_{k} 1(y_i = k)\,(D_{ij} - \bar D_{kj})^2}.   (5.2)

We compute the BSS/WSS ratio for each of the p genes and order the genes in descending order of the BSS/WSS ratio.

5.4 Clustering Method

Given a series of microarray experiments for a specific tissue under different conditions, we want to find the genes most likely to be differentially expressed under these conditions; in other words, we want to find the genes that best explain the effects of these conditions. This task is also called feature selection, a commonly addressed problem in machine learning, where one has class-labeled data and wants to figure out which features best discriminate among the classes. If the genes are the features describing the cell, the problem is to select the features that have the biggest impact on describing the results and to drop the features with little or no effect. These features can then be used to classify unknown data. Noisy or irrelevant attributes make the classification task more complicated, as they can contain random correlation, so we want to filter out these features.

Typically, informative genes are selected according to a test statistic or a p-value rank from a statistical test such as the t-test. The problem here is that we may end up with many highly correlated genes. Besides being an additional computational burden, this can also skew the results and lead to misclassifications. Additionally, if there is a limit on the number of genes to choose, we might not be able to include all informative genes. Our approach, which appeared in Jaeger et al. (2003), is to first find similar genes, group them, and then select informative genes from these groups to avoid redundancy. In order to increase classification performance, we propose to use more uncorrelated genes instead of just the top genes. By using only the k best-ranking genes according to a test statistic, we would select highly correlated genes. Correlation can be a hint that two genes belong to the same pathway, are co-expressed, or come from the same chromosome; in general, we expect high correlation to have a meaningful biological explanation. If, for example, genes A and B are in the same pathway, they may have similar regulation and therefore similar expression profiles. If gene A has a good test score, it is highly likely that gene B will have one as well. Hence a typical feature selection scheme is likely to include both genes in a classifier, yet the pair of genes provides little additional information beyond either gene alone. Of course we could simply select more genes in order to capture all relevant genes, but more genes not only increase the computational complexity of classification, they can also skew the result if many of the genes come from one pathway. Furthermore, if several pathways are involved in the perturbation but one pathway has the main influence, we will probably select all genes from this pathway, and if we then have a limit on the number of genes, we might end up with genes only from this pathway. If many genes are highly correlated, we could describe this pathway with fewer genes and reach the same precision; additionally, we could replace correlated genes from this pathway with genes from other pathways and possibly increase the prediction accuracy. The same issue may arise when selecting many genes as well, but it is more compelling when we have a limited budget of genes and can select only a few.
Our method for gene selection is therefore to pre-filter the gene set and drop genes that are very similar; for the remaining genes we apply a common test statistic and pull out the highest-ranking genes. One way to find correlated genes is to calculate the correlation between all genes. Here we have two options (a computational sketch of option (2) is given at the end of this section):

(1) Select from the best genes (according to a test statistic) those that have pairwise correlation below a certain threshold.

(2) The kth selected gene is the gene with the best test-statistic p-value among all genes whose correlation with each of the first k - 1 selected genes is below the specified threshold.

Options (1) and (2) are called the "correlation method". Another approach, the "clustering method", is to cluster the genes and then select one or more representative genes from each cluster. The cluster quality is assessed by looking at the average membership probability of its elements, where an element belongs to the cluster to which it has the highest membership probability. A higher cluster quality means less dispersion, and the closer the quality gets to 0, the more scattered the cluster becomes. It is therefore favorable to take more genes from a cluster of bad quality than from a cluster of good quality. The drawback is that a cluster might represent a pathway that is totally unrelated to the discrimination we are looking for, so we mask out and exclude clusters that have a bad average test-statistic p-value.
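A sketch of option (2), the correlation method, under the assumption that a vector of test scores (larger meaning more informative) has already been computed for every gene; the threshold 0.8 and the function name are illustrative only.

```python
import numpy as np


def correlation_filtered_selection(D, scores, k, max_corr=0.8):
    """Option (2): walk down the score-ranked gene list and keep a gene only if
    its absolute correlation with every previously kept gene is below max_corr."""
    order = np.argsort(-scores)
    selected = []
    for j in order:
        if all(abs(np.corrcoef(D[:, j], D[:, s])[0, 1]) < max_corr for s in selected):
            selected.append(j)
        if len(selected) == k:
            break
    return np.array(selected)
```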
Mian, Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. Physiol Genomics, 5299-111, (2001). [6] C. Cortes and V.N. Vapnik. Support vector networks. Machine learning, 20:273-297, (1995). [7] J. Chen and C. Chen, Microarray gene expression. In Encyclopedia of Biopharrnaceutical Statistics, Ed. Chow, S.C., Marcel Dekker, Inc., New York, New York, 599-613, (2003). [8] T. Dietterich and G. Bakiri, Error—correcting output codes: A gen- eral method for improving multiclass inductive learning programs. Proc. of the Ninth National Conference on Artificial Intelligence, AAAI Press, 572- 577, (1991). [9] S. Dudoit, H. Yang, M. Callow and T. Speed, Statistical methods for identi- fying differentially expressed genes in replicated cDNA microarray experiments. Technical report 578, University of California at Berkeley, August (2000). [10] S. Dudoit, J. Fridlyand and T. Speed, Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. JASA, pp. 77-87(11), (2002). [11] M. Eisen, P. Spellman, P. Brown and D. Botstein, Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863—14868, (1998). [12] T. Evgeniou, M. Pontil and T. Poggio, Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1—50, (2000). 57 [13] K. Knight and W. Fu, Asymptotics for lasso—type estimators. The Annals of Statistics, Vol.28, No. 5, 1356—1378, (2000). [14] W. Fu, Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7 (3), 397-416, (2004). [15] E. George and R. McCulloch, Variable selection via Gibbs sampling. J. Am. Stat. Assoc, 88, 881-889, (1993). [16] T. Golub at al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531-536, (1999). [17] G. Getz, E. Levine and E. Domany, Coupled two-way clustering analysis of gene mi- croarray data. Proc Natl Acad Sci USA, 97:12097—12084, (2000). [18] I. Guyon, J. Weston, S. Barnhill and V. Vapnik, Gene selection for cancer classification using support vector machines. Machine Learning, 46:389-422, (2002). [19] A. Hoerl and R. Kennard, Ridge regression biased estimation for nonorthogonal prob- lems. Technometrics, 12, 55-67, (1970). [20] T. Hastie and R. Tibshirani, Classification by pairwise coupling. In Jordan M., Kearnsa M., Solla S., eds. Advances in Neural Information Processing Systems, MIT Press 10, (1998). [21] H. Hishigaki, K. Nakai, T. Ono, A. Tanigami and T. Takagi, Assessment of prediction accuracy of protein function from protein-protein interactiondata. Yeast, 18:523-531, (2001). [22] T. Hastie,S. Rosset,R. Tibshirani and J. Zhu, The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5, 1391-1415, (2004). [23] J. Jaeger, R. Sengupta and W.L. Ruzzo, Improved gene selection for classification of microarrays. Pac Symp Biocomput, (2003). [24] J. Khan, J. Wei, M. Ringner, L. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. Antinescu, C. Peterson and P. Meltzer, Classification and diagnostic predic- tion of cancers using gene expression profiling and artificial neural networks. Nat. M ed, 7, 673—679, (2001). [25] S. Kim, E. Dougherty, J. Barrera, Y. Chen, M. Bittner and J. 'Ifent, Strong feature sets from small samples C. J. Comput. Biol, 9, 127-146, (2002). [26] S. Keerthi, K. Duan, S. Shevade, A. 
Poo, A fast dual algorithm for kernel logistic regression. 19th International Conference on Machine Learning, (2002). [27] S. Li, Markov random field modeling in Computer vision. Springer-Verlag: Tokyo, (1995). 58 [28] Y. Lee and CK. Lee, Classification of multiple cancer types by multicategory support vector machines using gene expression data. Technical Report 1051, Department of Statistics, University of Wisconsin, Madison, WI 53706, (2002). [29] S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golyb, J. Mesirov and T. Poggio, Support vector machine classification of microarray data. Technical Report, AI Memo 1677, MIT, (2000). [30] J. Platt, Sequential minimal optimizationzA fast algorithm for training support vector machines. Microsoft Research, Technical Report MSR-TR-98-14, (1998). [31] J. Platt, Probabilistic outputs for support vector machines and comparisons to regu- larized likelihood methods. Advances in Large Margin Classifiers, MIT Press. [32] C. Peron, S. Jeffrey, M. van de Rijn, C. Rees, M. Eisen, D. Ross, A. Pergamenschikov, C. William, S. Zhu, J. Lee, D. Lashkari, D. Shalon, P. Brown and D. Botstein, Distinc- tive gene expression patterns in human mammary epithelial cells and breast cancers. Proceedings of the National Academy of Science, 96, 9212-9217, (1999). [33] B. Ripley. Pattern recognition and neural networks. Cambridge: Cambridge University Press, (1996). [34] S. Ramaswamy, P. Tamayo, R. Ritkin, S. Mukherjee, C. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J. Mesirov, T. Poggio, W. Gerald, M. Loda, E. Lander andT. Golub, Multicalss cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA, 26:15149-15154, (1998). [35] C. Saunders, A. Gammerman and V. Vovk, Ridge regression learning algorithm in dual variables. 15th International Conference on Machine Learning, ICML (1998). [36] D. Slonim, P. Tamayo, J. Mesirov, T. Golub and E. Lan- der, Class prediction and discovery using gene expression data. In Proc. of the 4th Annual International Conference on Computational Molecular Biology, Universal Academy Press, 263-272, (2000). [37] G. Shieh, C Bai, C. Lee, Identify Breast Cancer Subtypes by Gene Expression Profiles. Journal of Data Science, 2, 165-175, (2004). [38] B. Scholkopf, A.J. Smola, R.C. Williamson and D. Schuurmans, New support vector algorithms. Neural Computation, 12, 1207-1245, (2000). [39] J. Shao and SC Chow, Variable screening in predicting clinical outcome with high- dimensional microarrays. J Multivariate Analysis, (2006) To appear. [40] R. Tibshirani, Regression shrinkage and selection via the lasso. J.R.S.S.B., 58, 267-288, (1996). [41] V. Vapnik, Statistical Learning Theory. John Wiley & Sons, New York, (1998). [42] V. Vapnik, The nature of statistical learning theory. Springer Verlag, New York, (1995). 59 [43] L. van’t Veer, H. Dai, M. van de Vijver, Y. He, A. Hart, M. Mao, H. Peterse, K. van der Kooy, M. Marten, A. Witteveen et al, Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530-6, (2002). [44] M. Wang, J. Yang, G. Liu, Z. Xu, K. Chou, Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition. PubMed, 17(6):509-16, (2004). [45] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio and V. Vapnik, Feature se- lection for SVMs. Advances in Neural Information Processing Systems 13, MIT Press, (2001). [46] M. Xiong,L. J in, W. Li and E. Boerwinkle, Computational methods for gene expression- based tumor classification. 
BioTechniques, 29:1264-1270, (2000). [47] Y.H. Yang, M.J. Buckley, S. Dudoit and T.P. Speed, Comparison of methods for image analysis on cDN A microarray data. Technical report, (2000). [48] Y.H. Yang, S. Dudoit, P. Luu and T.P. Speed, Normalization of cDNA microarray data. SPIE Bios, (2001). [49] K.Y. Yeung, C. Fraley, A. Murua, A.E. Raftery and W.L. Ruzzo, Model-based clus- tering and data transformation for gene expression data. Bioinformatics, 17:977-987, (2001; [50] K. Yeung, Bayesian Model AveragingzDevelopment of an improved multi-class,gene selection and classification tool for microarray data. Bioinformatics, 21, 2394-2402, '(20051 [51] X. Zheng and W. Loh, A consistent variable selection criterion for linear models with high-dimensional covariates. Statistica Sinica, 7 (1997) 311-325. [52] J. Zhu. S. Rosset and T. Hastie, Margin maximizing loss funtions. Neural Information Processing Systems, 16, (2003). [53] J. Zhu, S. Rosset, T. Hastie and R. Tibshirani, 1-norm support vector machines. Neural Information Processing Systems 16, (2004). [54] J. Zhu and T. Hastie, Classification of gene microarrays by penalized logistic regression. Biostatistics, 5(3), 427-443, (2004). 60 [[[jigligijjj[in