This is to certify that the dissertation entitled LARGE DIMENSION AND SMALL SAMPLE SIZE PROBLEMS: CLASSIFICATION, GENE SELECTION AND ASYMPTOTICS presented by JUN LUO has been accepted towards fulfillment of the requirements for the Ph.D. degree in Statistics and Probability at Michigan State University.

LARGE DIMENSION AND SMALL SAMPLE SIZE PROBLEMS: CLASSIFICATION, GENE SELECTION AND ASYMPTOTICS

By

Jun Luo

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Statistics and Probability

2006

ABSTRACT

LARGE DIMENSION AND SMALL SAMPLE SIZE PROBLEMS: CLASSIFICATION, GENE SELECTION AND ASYMPTOTICS

By Jun Luo

Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) and penalized logistic regression (PLR) have been successfully applied to microarray cancer diagnosis problems. Both methods place an equal penalty on the loss of each sample, which may lead to misclassification on unbalanced data. We therefore propose ν-ridge regression (ν-RR), which places a generalized weight on the loss of each sample and optimizes the weight vector within the model itself, as an alternative to the SVM and PLR for classification in microarray cancer diagnosis. A primary goal in microarray analysis is often to identify the genes most responsible for the classification. Two gene selection methods are considered: univariate ranking (UR) and recursive feature elimination (RFE). Simulations on the well-known leukemia data and breast cancer prognosis data indicate that ν-RR combined with either UR or RFE tends to select fewer significant genes than other methods. Meanwhile, ν-RR outperforms the SVM and PLR, with lower cross-validation and test error rates.

One weakness of the SVM is that, given a tumor sample, it only predicts a cancer class label and does not provide any estimate of the underlying probability. Penalized logistic regression has the advantage of additionally providing an estimate of the underlying probability of being assigned to each class, but it does not offer any estimate of the probability of the outcome class conditional on an individual gene variable. We propose the conditional logistic regression (CLR) model, an alternative for microarray cancer diagnosis classification, for the underlying probability of the response given any gene variable. In addition, since a primary goal in microarray cancer diagnosis is gene selection, we propose a new method called modified univariate ranking (MUR) as a new choice for dimension reduction. We show that, when applied to microarray data for classification, CLR performs similarly to SVM, PLR and BMA, but CLR has the advantage of providing the probability of the outcome class conditional on any individual gene variable. Empirical results on the leukemia and breast cancer data indicate that the CLR method combined with one gene selection method (MUR, BSS/WSS or RFE) tends to perform better on both the CV error and the test error rate.
Microarray data typically have very high dimension p and much smaller sample size n. Classical asymptotic theory deals with p fixed and n going to infinity, which is no longer appropriate for microarray data analysis. There are discussions in the literature about the behavior of estimators when both p and n tend to infinity, but very few deal with n fixed and p tending to infinity. The latter situation seems more relevant to microarray data in practice. Here we outline and describe the asymptotic behavior of ridge regression estimators when the sample size n is fixed and the dimension p tends to infinity. Given certain data, mean squared error consistency is established under certain regularity conditions. When only a finite number of important genes are actually related to the outcome, we propose a variable screening method to eliminate genes that are unrelated to the outcome and prove the asymptotic consistency of the procedure. After screening, the dimension-reduced microarray data can be further analyzed via a well-known variable selection method such as AIC or BIC. Some simulation results testing the performance of the screening method are also presented.

Copyright by
Jun Luo
2006

To my deeply loved mother Laxiang Liu, my lenient father Youxin Luo and my supportive sister Yan Luo. To all my friends at Michigan State University. They have kept me company over the past four years.

ACKNOWLEDGMENTS

First of all, I sincerely appreciate my Ph.D. adviser, Professor Yijun Zuo, for what he has done for me in the program. His excellent guidance and thoughtful lectures sparked my passion in statistics and life. My dissertation would be an illusion without Professor Yijun Zuo's effort and the time he spent on me. I also want to give my special thanks to Professor Hira Koul, who has been very strict and very careful in serving as a substitute for Professor Yijun Zuo. I am happy to have him as my committee chair. I value his advice on the rehearsal and writing of the thesis more than he could imagine. I would like to thank Professors Sarat Dass, Robert Tempelman and Lijian Yang for serving on my thesis committee and for their valuable advice and kind cooperation. I am also grateful to Professor James Stapleton for his generous help and spiritual support. I have made some lifetime friends at Michigan State University, such as Winny Chiang, Kirk D. Dolan, Hongwen Guo, Wenmei Huang, Fang Li, Rong Liu, Aaron Thomas Porter, Yu Sun, Weixing Song, Tony Wang, Lily Wang, Jing Wang, Hui Zhang, Yongfang Zhu and Yanwei Zhang. I thank them for their moral support and friendship. Lastly, I am grateful to my family in China. My loved mother, father and sister have been supportive of me all the way. They tried every possible means to make sure I am happy and healthy. No words can describe my feelings at this moment. I will just say "I Love You All, My Dear Family!"

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Standard SVM for 2-class classification
1.2 ν-SVM formulation
1.3 Equivalence proof
1.4 L1-norm Support Vector Machine
2 Classification and Gene Selection of Cancer Microarrays by ν-Ridge Regression
2.1 Approach
2.1.1 ν-Ridge Regression
2.1.2 Quadratic Programming
2.1.3 Computation of w and b
2.1.4 Feature selection
2.2 Results
2.2.1 Choosing λ and ν
2.2.2 Leukemia data
2.2.3 Breast cancer prognosis data
2.3 Discussion
2.4 Conclusion
3 Density Curve Estimation and Classification of Microarray by Conditional Logistic Regression Model
3.1 Approach
3.1.1 Conditional Logistic Regression Formulation
3.1.2 Algorithm
3.1.3 Feature Selection
3.1.4 Cut-off statistic a_m
3.2 Results
3.2.1 Breast cancer prognosis data
3.2.2 Leukemia data
3.3 Conclusion
4 Asymptotics and Variable Screening for Microarrays with Fixed Small Sample Size and Large Dimension
4.1 Mean Squared Error Consistency
4.2 Variable Screening Procedure
4.3 Simulation Result
4.4 Conclusions and Discussions
5 Gene Selection Methods for Microarray Data
5.1 Univariate Ranking
5.2 Recursive Feature Elimination
5.3 Ratio of Between-group sum of squares to Within-group sum of squares
5.4 Clustering method
5.5 Pairwise Ranking Method
BIBLIOGRAPHY

LIST OF TABLES

2.1 Parameter selection.
2.2 Comparison of leukemia classification methods.
2.3 Published results of leukemia classification methods.
2.4 Comparison of breast cancer classification methods.
3.1 Comparison of classification methods for breast cancer prognosis data.
3.2 Output of CLR on breast cancer data.
3.3 Comparison of CLR on leukemia data.
3.4 Comparison of leukemia classification methods.
3.5 Parameters for the chosen genes in the final classifier from CLR MUR.
3.6 Parameters for the selected genes in the final classifier from CLR RFE.
4.1 Simulation results for MSE consistency based on 1,000 runs (x fixed), h_p = p^{0.25}, n = 60, X is from the flowchart.
4.2 Simulation results for close-to-0 nonzero β_i values based on 1,000 runs (x fixed).
4.3 Simulation results for distinguished nonzero β_i values based on 1,000 runs (x fixed).

LIST OF FIGURES

1.1 Two class classification
1.2 Support vector machine
4.1 Flowchart of data generation

CHAPTER 1

Introduction

The aim of this dissertation is to provide an up-to-date review of several approaches to classification (mainly machine learning methods), to propose two new classification methods, to compare their performance on some challenging data sets, and to draw conclusions on their applicability to realistic microarray problems. The task of classification occurs in a wide range of human activity.
At its broadest, the term could cover any context in which some decision or forecast is made on the basis of currently available information, and a classification procedure is then some formal method for repeatedly making such judgements in new situations. In this dissertation we shall consider a more restricted interpretation. We shall assume that the problem concerns the construction of a procedure that will be applied to a continuing sequence of cases, in which each new case must be assigned to one of a set of predefined classes on the basis of observed attributes or features. The construction of a classification procedure from a set of data for which the true classes are known has also been variously termed pattern recognition, discrimination, or supervised learning (in order to distinguish it from unsupervised learning or clustering, in which the classes are inferred from the data).

Microarray experiments raise numerous statistical questions in image analysis, experimental design, cluster and discriminant analysis, and multiple hypothesis testing. Here we focus on the classification of tumors using gene expression data. Three main types of statistical problems are associated with tumor classification: (1) identification of new tumor classes using gene expression profiles (cluster analysis/unsupervised learning); (2) classification of malignancies into known classes (discriminant analysis/supervised learning); and (3) identification of "marker" genes that characterize the different tumor classes, called variable selection.

In a two-class microarray classification problem, we are given a training data set (x_1, y_1), ..., (x_n, y_n), where the input x_i is a p-vector corresponding to the gene expression values of the ith experiment (or sample), x_i = (x_{i1}, ..., x_{ip}); see Eisen (1998), Getz (2000), Slonim (2000), Yang (2001), van't Veer (2002) and Chen and Chen (2003). The output y_i is a binary class label assumed to take values in {−1, +1}; see Xiong (2000) and Yeung (2001). The problem of interest is to find a classification rule from the training data, so that we can assign a class label from {−1, +1} to a new sample x with p gene expression measurements.

Data from these new types of experiments present a "large p, small n" problem; that is, a very large number of variables (genes) relative to the number of observations (tumor samples). In statistics, when the number of variables is much larger than the number of samples, one is said to be facing the so-called "curse of dimensionality", and the estimated function (here, the classifier of the microarray data) may be over-fitting (i.e., very high accuracy in fitting the training samples but very low accuracy in assigning labels to the test data). This problem is mitigated by using gene selection methods and by making sure the classifier is smooth, so that new samples similar to those in the training set will be labeled similarly.

Machine learning is a scientific field that addresses the question of how to program systems to automatically learn and improve with experience. Vapnik (1995) introduced the application of machine learning methods (the support vector machine, or SVM) to two-class classification problems. As a consequence, the SVM has been applied for classification in cancer microarray data.
Besides the SVM, past publications on cancer classification using gene expression data have focused mainly on cluster analysis of both tumor samples and genes, including applications of hierarchical clustering (Alon et al. (1999) and Perou et al. (1999)) and partitioning methods such as self-organizing maps (Golub (1999)). Dudoit et al. (2002) compared the performance of several different discrimination methods for the classification of tumors based on gene expression profiles. These methods include traditional ones, such as nearest-neighbor and linear discriminant analysis, as well as more modern ones, such as classification trees, bagging and boosting.

Statistical approaches are generally characterized by having an explicit underlying probability model, which provides a probability of being in each class. In addition, it is usually assumed that the techniques will be used by statisticians, and hence some human intervention is assumed with regard to variable selection and transformation, and the overall structuring of the problem. Unfortunately, the SVM does not offer any underlying probability in the classification. Zhu (2004) proposed penalized logistic regression (PLR) for classification in cancer microarrays together with an estimator of the underlying probability. But neither the SVM nor PLR can avoid misclassification on an unbalanced data set: the larger the training sample size for one class, the smaller its corresponding classification error rate. The main cause is that the penalty for misclassification is taken to be the same for each training sample.

Wang and Yang (2004) used a weighted support vector machine for the prediction of membrane protein types. Inspired by this, we propose a new non-parametric classification method called ν-ridge regression (ν-RR), where each sample contributes differently. Since the SVM, PLR and discriminant methods do not study the probability of the outcome conditional on each gene expression level, our conditional logistic regression is very useful both for classification and for providing the underlying distribution. The detailed work is in Chapters 2 and 3.

DNA microarrays now permit scientists to screen thousands of genes simultaneously and determine whether these genes are active, hyperactive or silent in normal or cancerous tissue. Because these new microarray devices generate bewildering amounts of raw data, new analytical methods must be developed to determine whether cancer tissues have distinctive features. In this dissertation, we address the problem of selecting a small subset of genes from broad patterns of gene expression data recorded on DNA microarrays. Using available training examples from cancer and normal patients, we build a classifier suitable for genetic diagnosis as well as drug discovery. Previous attempts to address gene selection in cancer microarrays are listed in Chapter 5, including correlation techniques and pairwise selection. A new method for gene selection based on the univariate ranking method is also proposed. We demonstrate experimentally that the genes selected by this technique yield better classification performance and are biologically relevant to cancer. These findings are consistent with published results in the literature. In contrast with the published methods, our method eliminates gene redundancy automatically and yields better and more compact gene subsets. Simulations are done with the well-known gold-standard leukemia and breast cancer prognosis data sets.
For gene selection in the asymptotic case, existing statistical methods deal with a single gene at a time (see Chen et al. (2003)) or deal with the case in which the sample size n → ∞ (see Shao (2006)). Chapter 4 provides insight into the analysis of microarray data in the more realistic case where the sample size is fixed and the dimension of variables p → ∞.

1.1 Standard SVM for 2-class classification

The original SVM was motivated by the idea of maximizing the distance between the separating hyperplane and the closest point in the training samples, and also by the idea of making the prediction function as smooth as possible. See Vapnik (1995, 1998, 2001 and 2002), Ripley (1996), Bradley (1998), Allwein (2000), Mukherjee (2000), Scholkopf (2000) and Evgeniou (2000). Discriminant methods have been used for classification in cancer microarray data; see Dudoit (2002). Based on the n training data (x_1, y_1), ..., (x_n, y_n), where x_i is a p-vector in the dot product space (throughout the chapter we focus on the linear separating hyperplane and use x_i as the basis function) and y_i is the label from {+1, −1} for the ith sample, the classification rule is given by

    f(x) = \mathrm{sign}(\langle w, x\rangle + b).   (1.1)

To obtain a classifier for the microarray data with maximal margin, several statistical methods have been applied. Throughout the dissertation, ⟨x, y⟩ denotes the inner product of the two vectors x and y. For a two-class classification problem, the linear SVM fits a model f(x) = ⟨w, x⟩ + b, where w and b solve

    \min_{w,b,\xi}\Big[\frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n}\xi_i\Big],   (1.2)

subject to the conditions

    y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \text{for all } i = 1,2,\dots,n,   (1.3)

where ξ_i is the soft margin variable which allows the distance constraint to be violated for non-separable data (i.e., when ξ_i > 1 for some i = 1, 2, ..., n). This new formulation trades off two goals: finding a hyperplane with a large margin (minimizing ||w||) and finding a hyperplane that separates the data well (minimizing the loss). The constant C > 0 controls the trade-off. In practice, C is determined by cross-validation. The classifier is given by f(x) = sign(⟨w, x⟩ + b). The above set-up is called the C-SVM.

[Figure 1.1. Two class classification: a scatter plot of two labeled classes ("How would you classify this data?")]

[Figure 1.2. Support vector machine: several linear classifiers f(x, w, b) = sign(w·x + b) that separate the data ("Any of these would be fine, but which is best?")]

Ideally, we are interested in the number of nonzero ξ_i, as that is the count of errors made by our classifier on the set of training examples. This count is the L0 norm, which is discontinuous and non-convex. We would like to stay close to L0 while maintaining convexity. We take the Lagrangian in the usual manner:

    L(w,\xi,b,\alpha,\beta) = \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n}\xi_i + \sum_{i=1}^{n}\alpha_i\big[1 - \xi_i - y_i(x_i'w + b)\big] - \sum_{i=1}^{n}\beta_i\xi_i.   (1.4)

To find the dual form of the problem, we first minimize L(w, ξ, b, α, β) with respect to w, ξ and b (for fixed α and β), i.e., min_{w,ξ,b} L(w, ξ, b, α, β). Since the Lagrangian is linear in α and β, we cannot set the gradient with respect to α and β to zero. We obtain the following dual optimization problem:

    \max_{\alpha}\Big[\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle\Big],   (1.5)

subject to the constraints 0 ≤ α_i ≤ C and \sum_{i=1}^{n}\alpha_i y_i = 0. The introduction of the box constraint on α_i is necessary to ensure that the Lagrangian is bounded (i.e., that we cannot drive the cost to −∞). Such a case would occur if (C − α_i) were negative and ξ_i went to ∞, so that the summed expression \sum_{i=1}^{n}(C − α_i)ξ_i goes to −∞. The box constraint prevents this from happening. The dual still has a quadratic objective and differs from the optimal margin classifier only in the introduction of the box constraint. Indeed, we can still use w = \sum_{i=1}^{n}\alpha_i y_i x_i to obtain the optimal value of w in terms of the optimal value of α. We must also verify that the KKT dual-complementarity conditions are still satisfied in this optimization problem:

    \alpha_i = 0 \;\Rightarrow\; y_i[x_i'w + b] \ge 1, \qquad
    \alpha_i = C \;\Rightarrow\; y_i[x_i'w + b] \le 1, \qquad
    0 < \alpha_i < C \;\Rightarrow\; y_i[x_i'w + b] = 1.

As before, α_i will be nonzero only for the support vectors, where the set of support vectors now includes all data points on the margin boundary as well as those on the wrong side of the margin boundary.
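To make the dual concrete, the following sketch solves (1.5) with a generic quadratic programming solver. It is an illustration only, assuming the Python package cvxopt is available; X (the n × p data matrix), y (the ±1 labels) and C are illustrative names, and the recovery of w and b simply restates w = Σ α_i y_i x_i and the KKT condition on the margin boundary.

```python
import numpy as np
from cvxopt import matrix, solvers


def c_svm_dual(X, y, C):
    """Solve the C-SVM dual (1.5): maximize sum(alpha) - 0.5 * sum y_i y_j alpha_i alpha_j <x_i, x_j>
    subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0 (cvxopt minimizes, so signs are flipped)."""
    n = X.shape[0]
    K = X @ X.T                                           # Gram matrix of inner products
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))        # box constraint 0 <= alpha <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))            # equality constraint sum_i alpha_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                                   # w = sum_i alpha_i y_i x_i
    margin = (alpha > 1e-6) & (alpha < C - 1e-6)          # points with y_i (x_i'w + b) = 1
    b0 = float(np.mean(y[margin] - X[margin] @ w))        # average over margin support vectors
    return w, b0, alpha
```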
Most often, one uses the regularized optimization problem known as the standard L2-norm SVM:

    \min_{w,b}\Big[\sum_{i=1}^{n}\big(1 - y_i(\langle w, x_i\rangle + b)\big)_+ + \lambda\|w\|_2^2\Big],   (1.6)

where λ ≥ 0 is a tuning parameter, and the classifier is defined in (1.1). Model (1.2)–(1.3) is basically equivalent to (1.6) in the sense that for any given C > 0 there exists λ = 1/(2C) such that the two models share the same optimal values of w and b, i.e., we get exactly the same classification rule. Note that (1.6) has the loss + penalty form, and λ is the tuning parameter that controls the trade-off between the loss and the penalty. The loss (1 − yf)_+ is called the hinge loss, and the L2-norm penalty is called the ridge penalty.

1.2 ν-SVM formulation

Motivated by the idea of maximizing the margin, another approach to obtaining the classifier considers

    \min_{w,b,\xi,\rho}\Big[\frac{1}{2}\|w\|_2^2 + C\Big(-\nu\rho + \frac{1}{n}\sum_{i=1}^{n}\xi_i\Big)\Big],   (1.7)

subject to the constraints

    y_i(\langle w, x_i\rangle + b) \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad \rho \ge 0, \quad \text{for all } i = 1,2,\dots,n,   (1.8)

where C > 0 and ν ∈ [0, 1] are parameters. To understand the role of ρ, note that for ξ = 0_{n×1} the constraint simply means the two classes are separated by the margin ρ/||w||. The classifier for (1.7) is defined in (1.1). Applying the Lagrangian dual variable method to model (1.8), it turns out that C in (1.7) does not matter at all, and (1.7) yields the same decision function as

    \min_{w,b,\xi,\rho}\Big[\frac{1}{2}\|w\|_2^2 - \nu\rho + \frac{1}{n}\sum_{i=1}^{n}\xi_i\Big]   (1.9)

under the constraints (1.8), where ν is a parameter in [0, 1]. Model (1.9) is known as the ν-SVM, which was proposed by Scholkopf, Smola, Williamson, and Bartlett (2000) for classification. The relationship between the SVM and the ν-SVM with the 2-norm penalty is analyzed below; we leave the comparison with the 1-norm to the feature selection section. The ν-SVM possesses some additional properties beyond the C-SVM, besides the fact that both methods provide a classifier from the training data. For example, the ν-SVM has the advantage of using a parameter ν to control the number of support vectors and the fraction of training errors.

1.3 Equivalence proof

Generally, (1.6) and (1.8)–(1.9) are two different problems with the same optimal solution set.

Theorem 1. In model (1.9) with constraint (1.8), for any given ν ∈ [0, 1] with optimal solution ρ̃ > 0, (1.8)–(1.9) has the same classifier as (1.6) with λ = nρ̃/2.

Proof. Focus on model (1.9): suppose there are w̃, b̃, ξ̃, ρ̃ such that

    (\tilde w, \tilde b) = \arg\min_{w,b,\xi,\rho}\Big[\frac{1}{2}\|w\|_2^2 - \nu\rho + \frac{1}{n}\sum_{i=1}^{n}\xi_i\Big],

subject to the conditions y_i(⟨w, x_i⟩ + b) ≥ ρ − ξ_i, ξ_i ≥ 0, for all i = 1, 2, ..., n.
Now we make the following change of variables: w' = w̃/ρ̃, b' = b̃/ρ̃, ξ'_i = ξ̃_i/ρ̃. Then the classifier for (1.8)–(1.9) becomes

    f(x) = \mathrm{sign}(\langle \tilde\rho\, w', x\rangle + \tilde\rho\, b') = \mathrm{sign}(\langle w', x\rangle + b').

Recall that (w', b') satisfies

    (w', b') = \arg\min_{w,b,\xi}\Big[\frac{\tilde\rho^{\,2}}{2}\|w\|_2^2 - \nu\tilde\rho + \frac{\tilde\rho}{n}\sum_{i=1}^{n}\xi_i\Big],

subject to the conditions y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0, for all i = 1, 2, ..., n. The above optimization problem is the same as

    \min_{w,b,\xi}\Big[\sum_{i=1}^{n}\xi_i + \frac{n\tilde\rho}{2}\|w\|_2^2\Big],

subject to the conditions y_i(⟨w', x_i⟩ + b') ≥ 1 − ξ'_i, ξ'_i ≥ 0, for all i = 1, 2, ..., n. That is model (1.6) with λ = nρ̃/2, so the classifier is f(x) = sign(⟨w', x⟩ + b'). Obviously, we get the same decision function from the two models under the condition ρ̃ > 0.

1.4 L1-norm Support Vector Machine

The standard 2-norm SVM is known for its good performance in two-class classification. Hastie (2004) studied the regularization path for SVMs. In this section, we discuss the 1-norm SVM. The 1-norm SVM has some advantages over the standard 2-norm SVM; in particular, it functions as both a variable selection method and a classification method. In standard 2-class classification problems, we are given a set of training data (x_1, y_1), ..., (x_n, y_n), where the input x_i ∈ R^p and the output y_i ∈ {1, −1} is binary. We wish to find a classification rule from the training data, so that when given a new input x, we can assign a class y from {1, −1} to it. To handle this problem, the 1-norm support vector machine (1-norm SVM) was considered:

    \min_{\beta_0,\beta}\sum_{i=1}^{n}\Big(1 - y_i\Big(\beta_0 + \sum_{j=1}^{q}\beta_j h_j(x_i)\Big)\Big)_+,   (1.10)

subject to the condition ||β||_1 = |β_1| + ... + |β_q| ≤ s, where D = {h_1(x), ..., h_q(x)} is a dictionary of basis functions and s is a tuning parameter. The solution is denoted by β̂_0(s) and β̂(s); the fitted model is

    \hat f(x) = \hat\beta_0 + \sum_{j=1}^{q}\hat\beta_j h_j(x).

The classification rule is given by sign[f̂(x)]. The 1-norm SVM has been successfully used in classification problems for cancer microarray data. To obtain a good fitted model f̂(x) that performs well on future data, we also need to select an appropriate tuning parameter s. In practice, people usually pre-specify a finite set of values of s that covers a wide range, then either use a separate validation data set or use cross-validation to select the value of s that gives the best performance among the given set. Zhu (2004) illustrates that the solution path β̂(s) is piecewise linear as a function of s (in the R^q space) and proposes an efficient algorithm to compute the exact whole solution path {β̂(s), 0 ≤ s ≤ ∞}, which helps us understand how the solution changes with s and facilitates the adaptive selection of the tuning parameter s. Under some mild assumptions, Zhu showed that the computational cost of computing the whole solution path β̂(s) is O(nq min(n, q)^2) in the worst case and O(nq) in the best case.

The 1-norm SVM replaces the ridge penalty in the 2-norm SVM with the L1-norm penalty on β, i.e., the lasso penalty (see Tibshirani 1996), and considers the optimization below:

    \min_{\beta_0,\beta}\Big[\sum_{i=1}^{n}\Big(1 - y_i\Big(\beta_0 + \sum_{j=1}^{q}\beta_j h_j(x_i)\Big)\Big)_+ + \lambda\|\beta\|_1\Big],   (1.11)

which is the equivalent Lagrangian version of the constrained optimization problem. The lasso penalty was first proposed by Tibshirani (1996) for regression problems, where the response y is continuous rather than categorical. It has also been used in Bradley (1998) for classification problems under the framework of SVMs. Knight and Fu (2000) studied the asymptotics for lasso-type estimators. Similar to the ridge penalty, the lasso penalty also shrinks the fitted coefficients β towards zero, hence the 1-norm SVM also benefits from the reduction in the variances of the fitted coefficients. Another property of the lasso penalty is that, because of its L1 nature, making λ sufficiently large, or equivalently s sufficiently small, will cause some of the coefficients β_j to be exactly zero. Thus the lasso penalty performs a kind of continuous feature selection, whereas this is not the case for the ridge penalty in the 2-norm SVM, where none of the β_j's will be equal to zero.
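As an illustration of the continuous feature selection produced by the L1 penalty, the following sketch fits an L1-penalized linear classifier on synthetic data. It is only a rough stand-in for (1.11), assuming scikit-learn is installed; note that LinearSVC with an L1 penalty uses the squared hinge loss rather than the exact hinge loss of (1.11), and all variable names are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data standing in for a microarray matrix: n samples, q genes/basis functions.
rng = np.random.default_rng(0)
n, q = 40, 200
X = rng.normal(size=(n, q))
y = np.where(X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n) > 0, 1, -1)

# L1-penalized linear classifier; C plays the role of 1/lambda in (1.11).
# With penalty='l1', LinearSVC requires the squared hinge loss and dual=False.
clf = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=0.1, max_iter=10000)
clf.fit(X, y)

# The L1 penalty drives many coefficients exactly to zero: continuous feature selection.
selected = np.flatnonzero(clf.coef_.ravel())
print("number of nonzero coefficients:", selected.size)
```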
CHAPTER 2

Classification and Gene Selection of Cancer Microarrays by ν-Ridge Regression

There has been a recent explosion in the use of microarray data for classification in a variety of diagnostic areas; see Golub (1999) and Hastie (1998). The prediction of the diagnostic category of a tissue sample from its expression array phenotype, given the availability of similar data from tissues in classified categories, is known as classification or supervised learning. In the context of gene expression data, examples include different tumor types (Golub et al. 1999; Ramaswamy et al. 2001) and response to therapy (van't Veer et al. 2003). A challenge in predicting the diagnostic categories using microarray data is that the number of genes is much greater than the number of tissue samples available, and we assume only a subset of genes is relevant in distinguishing different classes. Selection of relevant genes for classification is known as feature selection, which is a primary goal in microarray data analysis. A small set of relevant genes is essential for the development of inexpensive diagnostic tests.

The support vector machine (SVM) is one of the leading methods that have been successfully applied to the classification of cancer diagnoses (Lee & Lee 2002, Mukherjee et al. 1999, and Ramaswamy et al. 2001). In two-class classification, the linear SVM fits a model f(x) = b_0 + ⟨w, x⟩ in which the difference in fitted values between similar tissue samples will be as small as possible. Roughly speaking, we may control ||w||_2^2 so that samples in the test data will be assigned to the same group as similar tissue samples in the training data.

2.1 Approach

2.1.1 ν-Ridge Regression

The standard SVM puts the same weight on the loss of each training sample. Here, we put a generalized weight on the loss of each sample and regard the weights as unknown parameters. Consider the optimization problem below:

    \max_{t}\min_{w}\Big[\sum_{i=1}^{n} t_i\big(\langle w, x_i\rangle - y_i\big) + \lambda\|w\|_2^2\Big],   (2.1)

subject to the condition

    \sum_{i=1}^{n} t_i = 0, \quad |t_i| \le \nu, \quad \forall\, i = 1,2,\dots,n.   (2.2)

Each t_i can be understood as the generalized weight for the loss function of the ith sample. The max–min forces each loss to approach 0. Since our optimization is actually a modification of the dual form of ridge regression, we name (2.1) and (2.2) the ν-ridge regression model (ν-RR). The constraint Σ_{i=1}^{n} t_i = 0 is a regularization condition. The parameters λ and ν are to be chosen; in practice, they are determined by cross-validation. We clearly state the selection process in Section 2.2. The classification function for ν-RR is given by f(x) = sign(⟨w, x⟩ + b) for the chosen λ and ν, where b is a constant.

2.1.2 Quadratic Programming

As is well known, microarray data typically have a small sample size n but a large dimension p of gene variables (in the thousands). In the model proposed above, one of the important roles of the vector t is to reduce the number of parameters from p + 1 to n + 1, which makes the optimization feasible. The objective function L(w, b, t) = \sum_{i=1}^{n} t_i(\langle w, x_i\rangle + b - y_i) + \lambda\|w\|_2^2 is convex in (w, b) when the vector t is fixed. For any given t_{1×n} satisfying the constraints, notice that the value of b is not a function of t; in fact, the value of b does not change the optimal solution of the whole optimization problem (2.1)–(2.2). We therefore propose a rule for obtaining the value of b in Section 2.1.3. To reduce the computation, we simplify the part min_{w,b} L(w, b, t) by solving the score equation for w,

    \frac{\partial L(w, b, t)}{\partial w} = 0,

which leads to

    2\lambda w + \sum_{i=1}^{n} t_i x_i = 0.   (2.3)

Therefore, the optimization problem becomes a search for the weight vector t:

    \max_{t_{1\times n}}\Big[-\lambda\|w\|_2^2 - \sum_{i=1}^{n} t_i y_i\Big],   (2.4)

subject to the condition

    \sum_{i=1}^{n} t_i = 0, \quad |t_i| \le \nu, \quad \forall\, i = 1,2,\dots,n.

Plugging (2.3) into (2.4), we have

    \min_{t}\Big[\frac{1}{4\lambda}\sum_{i=1}^{n}\sum_{k=1}^{n} t_i t_k \langle x_i, x_k\rangle + \sum_{i=1}^{n} t_i y_i\Big],   (2.5)

subject to the condition

    \sum_{i=1}^{n} t_i = 0, \quad |t_i| \le \nu, \quad \forall\, i = 1,2,\dots,n.   (2.6)

Once λ and ν are determined, this is a standard quadratic programming problem, where t is a 1 × n vector with a typically small value of n. Recently, a contributed R package, kernlab, has made the computation even easier.
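Although the dissertation's computations use R, the same quadratic program can be illustrated with any off-the-shelf QP solver. The sketch below is a rough illustration only, assuming the Python package cvxopt and illustrative names X, y, lam and nu; it solves (2.5)–(2.6) and recovers w via (2.3) and b via the averaging rule (2.7) given in Section 2.1.3.

```python
import numpy as np
from cvxopt import matrix, solvers


def nu_ridge_regression(X, y, lam, nu):
    """Solve the nu-RR quadratic program (2.5)-(2.6) in t, then recover w from (2.3)
    and b from the averaging rule (2.7)."""
    n = X.shape[0]
    K = X @ X.T                                        # inner products <x_i, x_k>
    P = matrix(K / (2.0 * lam) + 1e-8 * np.eye(n))     # (1/2) t'P t = (1/(4 lam)) t'K t, tiny jitter for stability
    q = matrix(y.astype(float))                        # linear term sum_i t_i y_i
    G = matrix(np.vstack([np.eye(n), -np.eye(n)]))     # box constraint |t_i| <= nu
    h = matrix(nu * np.ones(2 * n))
    A = matrix(np.ones((1, n)))                        # equality constraint sum_i t_i = 0
    b = matrix(0.0)
    t = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = -(t @ X) / (2.0 * lam)                         # (2.3): 2 lam w + sum_i t_i x_i = 0
    scores = X @ w
    b0 = -(scores[y == 1].min() + scores[y == -1].max()) / 2.0   # (2.7)
    return w, b0, t
```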
2.1.3 Computation of w and b

For any chosen values of λ and ν, w is given by the expression w = -t_{1\times n} X_{n\times p}/(2\lambda). Obviously, the optimization (2.5)–(2.6) does not provide the optimal solution for b. Bearing in mind that b plays a role for samples with y_i = +1 and ⟨w, x_i⟩ less than 1, or for samples with y_i = −1 and ⟨w, x_i⟩ greater than −1, we consider two groups of samples,

    S_1 = \{i \in 1,2,\dots,n \mid y_i = +1\}, \qquad S_2 = \{i \in 1,2,\dots,n \mid y_i = -1\}.

There exists a constant just large enough for all samples from group S_1 to be correctly classified, and another constant just small enough for all samples from group S_2 to be correctly classified. For a new sample to be fairly classified, we define b as the average of the two constants. That is,

    b = \Big(-\min_{i\in S_1}\langle w, x_i\rangle - \max_{i\in S_2}\langle w, x_i\rangle\Big)/2.   (2.7)

This b can avoid over-fitting on the training data. In fact, it works very well on two real gene expression data sets.

2.1.4 Feature selection

Besides predicting the correct cancer class for a given tumor sample, another primary challenge in microarray cancer diagnosis is to identify the relevant genes which contribute the most to the classification. Among all external gene selection methods, univariate ranking (UR) and recursive feature elimination (RFE) are the most often used. Golub et al. (1999) first introduced UR for each gene in the two-class classification problem. The criterion is defined as

    s_j = \frac{\bar x_j^{+} - \bar x_j^{-}}{\sigma_j^{+} + \sigma_j^{-}},   (2.8)

where \bar x_j^{+} and \sigma_j^{+} denote the average and standard deviation of the expression values of gene j over all samples from class {+1}. Similarly, \bar x_j^{-} and \sigma_j^{-} denote the average and standard deviation of the expression values of gene j over all samples from class {−1}. Genes with the most positive values are taken to be most correlated with class {+1}, and genes with the most negative values are taken to be most correlated with class {−1}. This ranking criterion implicitly assumes orthogonality among the genes, because each s_j is computed from information about a single gene and does not take into account mutual information between genes. Later, Dudoit et al. (2002) used the ratio of between-group to within-group sum of squares (BSS/WSS) to determine the initial gene order in the multi-class case.
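For concreteness, a minimal sketch of the UR score (2.8) and the resulting gene ordering is given below; it assumes a numpy expression matrix X with samples in rows and genes in columns and labels y in {+1, −1}, and it is not part of the dissertation's own software.

```python
import numpy as np


def univariate_ranking(X, y):
    """Signed UR score (2.8) for every gene: (mean over class +1 minus mean over class -1)
    divided by the sum of the two within-class standard deviations."""
    pos, neg = X[y == 1], X[y == -1]
    num = pos.mean(axis=0) - neg.mean(axis=0)
    den = pos.std(axis=0, ddof=1) + neg.std(axis=0, ddof=1)
    return num / den


def rank_genes(X, y):
    """Gene indices ordered by decreasing |s_j|, the order used before iterative elimination."""
    return np.argsort(-np.abs(univariate_ranking(X, y)))
```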
In contrast to the ranking method, RFE recursively removes genes based upon the absolute magnitude of the hyperplane elements. For the linear kernel, given microarray data with p genes per sample, a classification method outputs w, a vector with p components, each corresponding to a particular gene. The absolute magnitude of each element |w_j| determines its importance in classifying a sample. The idea behind RFE is to eliminate the elements of w that have small magnitude. This screening procedure is iterated until a desired number of genes is obtained.

2.2 Results

In this section, we fit ν-RR to the leukemia data set (Golub et al. (1999)) and the breast cancer prognosis data set (van't Veer et al. (2002)). UR and RFE are applied to reduce the number of genes at each step. For the UR method, we first use (2.8) to compute s_j for all genes and rank the genes in descending order of |s_j|. Then we apply an iterative procedure that goes as follows: start by fitting ν-RR using all current genes, next remove the 10% of the genes in the model that are at the bottom of the ranking, then refit the model with the remaining genes, and iterate. For RFE applied to ν-RR, according to the recursive procedure, at the kth step of the iteration we fit the model with the remaining p_k genes,

    f_k(x) = \langle w_k, x\rangle + b_k = b_k + \sum_{j=1}^{p_k} w_{kj}\, x_j,

and eliminate the genes with the overall smallest 10% of |w_{kj}| values. So at each step of RFE we may eliminate a different number of genes. Notice that once a gene is removed from the model, it never comes back into the model. So the parameters used in RFE at the very beginning have a significant influence on the whole procedure of keeping genes at each step. The number of genes in the final model is selected by cross-validation, and the performance of the final model is evaluated on the test samples. Before applying ν-RR, we standardize each sample, as is usually done in microarray studies (see Dudoit et al. (2002) and Guyon et al. (2002)), so that the mean and standard deviation of the expression levels are 0 and 1, respectively.

2.2.1 Choosing λ and ν

We randomly divide the training data into 10 groups. At each step we use one group as the cross-validation data set, and the other 9 groups are used as the training data from which the classifier is built. So each sample in the training data is used as cross-validation data exactly once and as training data 9 times. The RFE method is more sensitive to the parameters, so we determine (λ, ν) by applying RFE and use the same parameter setting for UR. When searching for the λ and ν used in the final ν-RR model, we follow the rule below: among the λ and ν values which result in the minimal 10-fold CV error on the whole training data without gene selection, we want the parameters with the minimal summation of the CV errors across all iterations of gene elimination,

    (\lambda, \nu) = \arg\min_{(\lambda,\nu)\,\in\,\arg\min_{\lambda\ge 0,\;0\le\nu\le 1} CV_1}\;\sum_{k=1}^{l} Misc_k,

where l is the total number of iterations, CV_1 is the 10-fold CV error using all genes in the training data, and Misc_k is the number of misclassifications based on cross-validation at the kth iteration. We care about the CV error without gene selection because of a disadvantage of the RFE method: if the parameters do not perform well at the beginning, there is a larger chance that they will not do a good job in the gene elimination procedure.
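The two-stage rule above can be summarized in a few lines of code. The skeleton below is only an illustration of the selection logic; cv_error_all_genes and rfe_cv_misclassifications are hypothetical helpers standing in for a 10-fold cross-validated ν-RR fit on all genes and for the per-iteration CV misclassification counts of the RFE procedure.

```python
def choose_lambda_nu(X, y, lambda_grid, nu_grid,
                     cv_error_all_genes, rfe_cv_misclassifications):
    """Two-stage selection of (lambda, nu): first keep the pairs minimizing the 10-fold
    CV error with all genes (CV1), then among those pick the pair minimizing the summed
    misclassifications Misc_1 + ... + Misc_l across the RFE iterations."""
    cv1 = {(lam, nu): cv_error_all_genes(X, y, lam, nu)
           for lam in lambda_grid for nu in nu_grid}
    best_cv1 = min(cv1.values())
    candidates = [pair for pair, err in cv1.items() if err == best_cv1]

    def total_misclassifications(pair):
        lam, nu = pair
        return sum(rfe_cv_misclassifications(X, y, lam, nu))   # sum over k = 1, ..., l

    return min(candidates, key=total_misclassifications)
```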
In each iteration, we can update the ν value to fit the data with the remaining genes. Through this parameter selection, we can confidently claim that the parameters λ and ν work generally well at each step. In practice, however, we cannot try all possible values of λ and ν. Table 2.1 shows the numerical parameter selection process for the leukemia data based on 12 results. The minimal summation of the CV errors occurs at initial values λ = 25 and ν = 0.9, hence (25, 0.9) are chosen as the regularization parameters in the ν-RR model used to fit the entire training data. Recall that we can adjust the ν value later on. In Table 2.1, the RFE column is the total number of 10-fold misclassifications across all iterations of gene elimination using RFE. Similarly, the UR column is the total number of 10-fold misclassifications across all iterations of gene elimination using UR.

Table 2.1. Parameter selection.

    λ      ν     CV1     RFE    UR
    25     0.1   2/38    37     41
           0.3   3/38
           0.5   4/38
           0.7   3/38
           0.9   2/38    20     42
    100    0.1   2/38    43     35
           0.3   2/38    30     40
           0.5   3/38
           0.7   2/38    26     40
           0.9   3/38
    200    0.1   4/38
           0.3   2/38    47     34
           0.5   2/38    37     41
           0.7   4/38
           0.9   3/38

2.2.2 Leukemia data

The leukemia data (Golub et al. 1999) consist of 38 training samples and 34 test samples of two types of acute leukemia, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). Each sample contains 7129 gene expression levels. 10-fold cross-validation is used as the criterion to determine the parameters λ and ν. As in Table 2.1, λ = 25 and ν = 0.9 are used in the ν-RR model, and the performance of the classifier is evaluated on the test data. For the classification methods SVM and PLR, we follow the results from Zhu (2004). We can see that when using the same set of genes (i.e., using UR), ν-RR yields fewer significant genes with reduced CV error and test error compared with SVM and PLR. When using RFE, ν-RR performs the best, with the fewest relevant genes and the smallest CV error and test error among the three listed methods. ν-RR with UR obtained a more manageable set of genes than ν-RR with RFE, at the cost of more test errors (the test error increases from 0 to 2). The minimal cross-validation error for ν-RR occurs at 7 genes and 11 genes for UR and RFE, respectively. The 7 genes chosen by ν-RR UR are identified as 1745, 1834, 2020, 2288, 3320, 4847 and 5772. The 11 genes chosen by ν-RR RFE are identified as 1249, 1779, 1796, 1834, 1846, 1882, 2288, 2402, 4847, 5039 and 5950. The results of Golub et al. (1999), Tibshirani et al. (2002) and Zhu (2003) are summarized in Table 2.3; m is the number of selected genes. In Table 2.2, ν-RR sacrifices two test errors by eliminating 4 genes. The 7 genes selected by ν-RR with UR are not necessarily all contained in the set of 11 genes selected by ν-RR with RFE.

Table 2.2. Comparison of leukemia classification methods.

    Method      10-fold CV error   Test error   m
    SVM UR      2/38               3/34         22
    PLR UR      2/38               3/34         16
    ν-RR UR     0/38               2/34         7
    SVM RFE     2/38               1/34         31
    PLR RFE     2/38               1/34         26
    ν-RR RFE    0/38               0/34         11

Table 2.3. Published results of leukemia classification methods.

    Method              10-fold CV error   Test error   m
    Golub et al.        3/38               4/34         50
    Tibshirani et al.   2/38               2/34         21
    L1-norm SVM         2/38               2/34         17

2.2.3 Breast cancer prognosis data

The breast cancer prognosis data (van't Veer et al. (2002) and Shieh (2004)) consist of 97 primary breast tumor samples hybridized to cDNA arrays with 24481 gene expression levels.
The two categories are the good prognosis group (patients who remained disease free for at least 5 years) and the poor prognosis group (patients who developed distant metastases within 5 years). We picked 4918 significantly regulated genes (at least a 2-fold difference and p-value < 0.01 in more than 3 samples) among the 24481 genes. We further removed two samples with missing values in the 4918 gene expression levels. Therefore, the breast cancer prognosis data in this chapter consist of 76 training samples and 19 test samples across 4918 genes.

When we apply ν-RR with RFE to the 76 training samples, each with 4918 genes, a cross-validation set formed by 4 samples from the good group and 3 samples from the poor group is randomly taken from the training data to determine the tuning parameters. For each λ value, we choose the ν ∈ (0, 1) which leads to the smallest CV error rate and evaluate the performance of the classifier on the test samples. Due to the randomness of the cross-validation set, we repeat the procedure 10 times for each pair of λ and ν, and take the average number of misclassifications across the 10 simulations. The average number of misclassifications at the kth iteration of gene selection is used as Misc_k in the parameter selection criterion. Then our ν-RR method is trained on all 76 training samples to obtain the classifier. When applying RFE for gene selection, we found that any λ > 5 provides the same gene selection result: the genes identified as 1865 and 2294 are selected, with 2 classification errors on the test data. The result varies as λ decreases from 5. We compare the published results of van't Veer et al. (2002) and BMA (Yeung (2004)) with our selected genes and the corresponding classification errors. The comparison is summarized in Table 2.4. Only one gene expression level, identified as AL080059, appears in all of the above four sets of selected genes.

Table 2.4. Comparison of breast cancer classification methods.

    Method              Test error   m
    van't Veer et al.   2/19         70
    Yeung               3/19         6
    ν-RR RFE            2/19         2
    ν-RR UR             4/19         2

2.3 Discussion

Among all the above classification methods with UR, ν-RR found 7 genes that distinguish acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL) with a lower error rate than the SVM and PLR employed in Zhu (2004). That is to say, ν-RR can reduce the test error with an even smaller set of genes. This may imply that the weights used in ν-RR capture the interactions among genes better than the SVM and PLR models. Comparing the results of the classification methods with RFE, we found that ν-RR dramatically improves all three criteria (CV error, test error and number of genes) relative to the existing gene selection results on the leukemia and breast cancer prognosis data. Combined with either UR or RFE, ν-RR obtained a smaller set of related genes without any cost in CV error or test error. We can tell from Table 2.2 that ν-RR with UR tends to select fewer (or the same number of) genes than ν-RR with RFE, but UR leads to a higher test error; it is a trade-off. Through the simulations, we can tell that RFE is more sensitive to the values of the regularization parameters than UR, so more attention should be paid to choosing the parameters in RFE than in UR. The set of genes chosen by ν-RR with UR for the leukemia data has 3 genes in common with the set of genes chosen by ν-RR with RFE. For the breast cancer prognosis data, they have one common selected gene, named AL080059, which is also in the lists of van't Veer et al. (2002) and Yeung (2004).
It is also worth noting that in the simulations we can update the regularization parameters to obtain a picture of the overall performance across all iteration steps, rather than keeping the best-performing parameters at an individual iteration step, as we did for the breast cancer prognosis data. However, we cannot try out all λ values. In our approach, the summation of the misclassifications over all steps is one criterion for the parameters, so the model with the chosen parameter λ guarantees a good overall performance on the cross-validation data, which is consistent with our simulation results on the leukemia data and the breast cancer data.

Here we have proposed a new method inspired by ridge regression for two-class classification. In going from two-class to multi-class classification, the one-vs-rest scheme is often used: given K classes, the problem is divided into a series of K one-vs-rest problems, and each one-vs-rest problem is addressed by a different class-specific ν-RR classifier; then a new sample takes the class corresponding to the classifier with the largest real-valued output. This is one approach to multi-class classification. How to extend ν-RR directly to the multi-class case with less work is still called for, and I am working on it. For general information about multi-class classification, see Dietterich (1991), Ramaswamy (1998) and Lee and Lee (2002).

2.4 Conclusion

We have proposed a ν-ridge regression (ν-RR) model for two-class microarray cancer diagnosis classification. The simulation results on the real leukemia data and breast cancer data show that, when using the same set of genes (using UR), ν-RR identifies fewer significant genes with smaller CV error and test error than SVM and PLR. When using the recursive feature elimination method to select relevant genes, ν-RR works perfectly on the leukemia data, with 0 CV error and 0 test error, and it attains a comparable test error on the breast cancer data. What is more, ν-RR with RFE selects fewer genes than SVM, PLR, and other published results. This good property makes the set of chosen genes more manageable. Therefore, we claim that ν-RR is a good classification method for two-class microarray data.

CHAPTER 3

Density Curve Estimation and Classification of Microarray by Conditional Logistic Regression Model

The support vector machine (SVM) has been successfully applied to microarray cancer diagnosis problems. Khan (2001) proposed a classification method using gene expression profiling and artificial neural networks. However, one weakness of the SVM and of Khan's method is that, given a tumor sample, it only predicts a cancer class label and does not provide any estimate of the underlying probability. Penalized logistic regression has the advantage of additionally providing an estimate of the underlying probability of being assigned to a class, but it does not offer any estimate of the probability of the class y conditional on an individual gene variable. We propose the conditional logistic regression (CLR) model, an alternative for microarray cancer diagnosis classification, for the underlying probability of the response given any gene variable. Meanwhile, since gene selection is a primary goal in microarray cancer diagnosis, we propose a new method called modified univariate ranking (MUR) as a new choice for dimension reduction. We show that, when applied to microarray data for classification, CLR performs comparably to classification methods such as
SVM, PLR and BMA, but CLR has the advantage of providing the probability of the class y conditional on any individual gene variable. Empirical results on the leukemia and breast cancer data indicate that CLR combined with one of the gene selection methods (MUR, BSS/WSS or RFE) tends to perform better than SVM and PLR on both the CV error and the test error rate.

The support vector machine (SVM) is one of the leading methods that have been successfully applied to the classification of cancer diagnoses; see Lee & Lee (2002), Mukherjee et al. (1999), and Ramaswamy et al. (2001). In two-class classification, the linear SVM fits a model f(x) = b_0 + x'_{1×p}β_{p×1} that minimizes

    \sum_{i=1}^{n}\big(1 - y_i f(x_i)\big)_+ + \lambda\|\beta\|^2.

The classification decision is then made according to sign[f(x)]. However, one weakness of the SVM is that it only estimates sign[p(x) − 0.5], but does not offer the probabilities of interest p(x) or p_i(x), where p(x) = P(Class = 1 | X = x) is the conditional probability of a point being in class 1 given the gene measurements x, and p_i(x) = P(Class = 1 | ith gene expression level). Recently, Zhu (2005) proposed a penalized logistic regression (PLR) classifier and an estimate of the underlying probability p(x), but it still lacks an investigation of the estimation of p_i(x). In this chapter, we use the CLR model to gain insight into the p_i(x) and take the p_i(x) into consideration for the classifier. The classification rule is given by sign[η(x) − a], where η(x) is the ratio statistic defined in Section 3.1.4 and a is a cut-off value depending on the gene variables used in CLR. The CLR not only performs as well as SVM, PLR and BMA in two-class classification, but also provides an estimate of the probability of interest p_i(x).

Maximum likelihood with the Newton–Raphson algorithm is the traditional way to solve CLR numerically. However, this computation is tedious for microarray data. Instead, we use a sequential minimal optimization (SMO) algorithm (Platt (1998)) to solve CLR in this chapter. An SMO algorithm for the PLR model in two-class classification was first proposed in Keerthi et al. (2002); we modify it to be applicable to the CLR model.

Besides predicting the correct cancer class for a given tumor sample, a primary challenge in microarray cancer diagnosis is to identify the relevant genes that contribute most to the outcome. We apply three gene selection methods here: univariate ranking (UR) (Dudoit et al. (2002), Golub et al. (1999) and Zhu (2004)), recursive feature elimination (RFE) (Guyon et al. (2002)), and the ratio of between-group to within-group sum of squares (BSS/WSS) (Dudoit et al. (2002)). Furthermore, we propose a modified univariate ranking (MUR) to improve the classification performance and eliminate genes. The comparison of SVM, PLR and CLR with external gene selection methods is done on two frequently studied microarray data sets. Our simulation results on the real data sets indicate that CLR with MUR tends to select fewer genes, with exceptional error rates, than SVM, PLR and BMA combined with a gene selection method. A more detailed report of the simulations is in Section 3.2. The formulation of CLR, the description of UR, RFE, BSS/WSS and MUR, and the cut-off value are described in Section 3.1.

3.1 Approach

In standard two-class classification problems, we are given a set of training data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where the input is x_i = (x_{i1}, x_{i2}, ..., x_{ip}) and the output y_i is qualitative and takes values in {+1, −1}.
We wish to find a classification rule from the training data, so that when given a new input x_·, we can assign a class label from {+1, −1} to it. Meanwhile, we wish to understand more about the relationship between the binary response and each gene variable. The relationship is described by the probability of y conditional on x_{·j} under a logistic regression model. A statistic based on these conditional probabilities accompanies each tissue sample and is compared with a cut-off value for classification.

3.1.1 Conditional Logistic Regression Formulation

Usually it is assumed that the training data are an independently and identically distributed sample from an unknown probability distribution. To estimate the curves of the conditional probabilities, we adopt a parametric approach and use the most prominent logistic models, which have the form

    P(y \mid x_{\cdot 1}) = \frac{1}{1 + \exp(-y f_1(x_{\cdot 1}))}, \quad
    P(y \mid x_{\cdot 2}) = \frac{1}{1 + \exp(-y f_2(x_{\cdot 2}))}, \quad \dots, \quad
    P(y \mid x_{\cdot p}) = \frac{1}{1 + \exp(-y f_p(x_{\cdot p}))},

where x_{·j} represents the jth gene expression level of the tissue sample x_·, and f_j(x) = b_{j0} + b_{j1}x, j = 1, 2, ..., p. Logistic regression models are usually fit by maximum likelihood. Given the training set, the negative log-likelihood is

    -\sum_{i=1}^{n}\sum_{j=1}^{p}\log P(y = y_i \mid x_{ij}) = \sum_{i=1}^{n}\sum_{j=1}^{p}\log\big(1 + \exp(-y_i f_j(x_{ij}))\big).

We hope that two similar gene expression values measured on the same gene variable lead to very close conditional probabilities. Therefore, the coefficients b_{j1} for each gene variable should be reasonably small. Additionally, to avoid the special case in which some optimal b_{j1} would be infinite, we consider the negative log-likelihood with an L2 penalty term:

    \min_{b_{j0},\,b_{j1}}\Big[C\sum_{i=1}^{n}\sum_{j=1}^{p}\log\big(1 + \exp(-y_i f_j(x_{ij}))\big) + \frac{1}{2}\sum_{j=1}^{p} b_{j1}^2\Big].   (3.1)

With gene expression arrays, typically p ≫ n. Notice that we have 2p parameters in the optimization. Since p is often in the thousands, and motivated by the logistic assumption, we turn to an SMO algorithm.

3.1.2 Algorithm

The spirit of the conditional logistic regression SMO algorithm follows the two-class SMO algorithm for penalized logistic regression by Keerthi et al. (2002); we extend the algorithm to our case. To minimize (3.1), we rewrite it as

    \min_{b_{j0},\,b_{j1}}\Big[\frac{1}{2}\sum_{j=1}^{p} b_{j1}^2 + C\sum_{i=1}^{n}\sum_{j=1}^{p} g(\xi_{ij})\Big],   (3.2)

subject to the conditions

    \xi_{ij} = -y_i\,(b_{j0} + b_{j1} x_{ij}),   (3.3)

    g(\xi_{ij}) = \log\big(1 + e^{\xi_{ij}}\big), \quad \forall\, i, j,   (3.4)

where C is the regularization parameter. The Lagrangian dual form of the problem is

    L = \frac{1}{2}\sum_{j=1}^{p} b_{j1}^2 + C\sum_{i=1}^{n}\sum_{j=1}^{p} g(\xi_{ij}) + \sum_{i=1}^{n}\sum_{j=1}^{p}\alpha_{ij}\big[-\xi_{ij} - y_i(b_{j0} + b_{j1} x_{ij})\big],

where the α_{ij} are the Lagrangian dual multipliers. By the KKT optimality conditions, we have

    \frac{\partial L}{\partial b_{j1}} = b_{j1} - \sum_{i=1}^{n}\alpha_{ij} y_i x_{ij} = 0, \quad \forall\, j,

    \frac{\partial L}{\partial b_{j0}} = \sum_{i=1}^{n}\alpha_{ij}(-y_i) = 0, \quad \forall\, j,

    \frac{\partial L}{\partial \xi_{ij}} = C\,g'(\xi_{ij}) - \alpha_{ij} = 0, \quad \forall\, i, j,

so that b_{j1} and ξ_{ij} can be expressed as functions of the α_{ij}:

    b_{j1} = \sum_{i=1}^{n}\alpha_{ij} y_i x_{ij}, \qquad \xi_{ij} = (g')^{-1}\Big(\frac{\alpha_{ij}}{C}\Big).

Define δ = α_{ij}/C and G(δ) = δ ξ_{ij} − g(ξ_{ij}); then G(δ) = δ log(δ) + (1 − δ) log(1 − δ). Therefore,

    \frac{\partial G(\delta)}{\partial \delta} = \xi_{ij} + \big(\delta - g'(\xi_{ij})\big)\frac{\partial \xi_{ij}}{\partial \delta} = \xi_{ij} = (g')^{-1}(\delta).

So the dual form of the optimization problem (3.2)–(3.4) becomes

    \max_{\alpha}\Big[-\frac{1}{2}\sum_{j=1}^{p} b_{j1}^2 - C\sum_{i=1}^{n}\sum_{j=1}^{p} G\Big(\frac{\alpha_{ij}}{C}\Big)\Big],   (3.5)

subject to the condition

    \sum_{i=1}^{n} y_i\,\alpha_{ij} = 0, \quad \forall\, j = 1, 2, \dots, p,   (3.6)

where b_{j1} = \sum_{i=1}^{n}\alpha_{ij} y_i x_{ij}, ∀ j = 1, 2, ..., p. Rewriting the above in minimization form and introducing multipliers φ_j for the equality constraints (3.6), we minimize

    L_D = \frac{1}{2}\sum_{j=1}^{p} b_{j1}^2 + C\sum_{i=1}^{n}\sum_{j=1}^{p} G\Big(\frac{\alpha_{ij}}{C}\Big) - \sum_{j=1}^{p}\phi_j\sum_{i=1}^{n} y_i\,\alpha_{ij}.
Denote

    F_{ij} = x_{ij}\, b_{j1} = x_{ij}\sum_{k=1}^{n}\alpha_{kj}\, y_k\, x_{kj}, \qquad
    H_{ij} = F_{ij} + y_i\, G'\Big(\frac{\alpha_{ij}}{C}\Big).

By the KKT conditions, we have

    \frac{\partial L_D}{\partial \alpha_{ij}} = y_i F_{ij} + G'\Big(\frac{\alpha_{ij}}{C}\Big) - \phi_j y_i = (H_{ij} - \phi_j)\, y_i = 0, \quad \forall\, i, j.

So the optimal solution vector α = (α_{11}, ..., α_{n1}, ..., α_{1p}, ..., α_{np})' has to satisfy H_{ij} = φ_j for all i = 1, 2, ..., n and for each j = 1, 2, ..., p. That is to say, if we denote

    i_{up}(j) = \arg\max_i H_{ij}, \qquad i_{low}(j) = \arg\min_i H_{ij}, \quad \forall\, j,

we must have

    H_{i_{up}(j)\,j} = H_{i_{low}(j)\,j}, \quad \forall\, j.   (3.7)

For some j ∈ {1, 2, ..., p}, suppose (i_1, i_2) satisfies H_{i_1 j} ≠ H_{i_2 j}. We define

    \tilde\alpha_{i_1 j} = \alpha_{i_1 j} + \frac{t_j}{y_{i_1}}, \qquad
    \tilde\alpha_{i_2 j} = \alpha_{i_2 j} - \frac{t_j}{y_{i_2}}, \qquad
    \tilde\alpha_{ij} = \alpha_{ij}, \;\; \forall\, i \ne i_1, i_2,\ \forall\, j,   (3.8)

and let ψ_j(t_j) be the part of the dual objective involving gene j, evaluated at α̃, that is, ψ_j(t_j) = \frac{1}{2} b_{j1}^2 + C\sum_{i=1}^{n} G(\tilde\alpha_{ij}/C). Therefore,

    \psi_j'(t_j) = F_{i_1 j} + \frac{1}{y_{i_1}}\, G'\Big(\frac{\tilde\alpha_{i_1 j}}{C}\Big) - F_{i_2 j} - \frac{1}{y_{i_2}}\, G'\Big(\frac{\tilde\alpha_{i_2 j}}{C}\Big) = H_{i_1 j} - H_{i_2 j},

where H_{i_1 j} and H_{i_2 j} are evaluated at α̃. Since H_{i_1 j} − H_{i_2 j} ≠ 0 at t_j = 0 for some j, a decrease in ψ_j is possible by choosing t_j suitably away from 0. Through this procedure, we can tell that b_{j0} = −φ_j, ∀ j = 1, 2, ..., p. Therefore, at the optimal solution α, we have

    -b_{j0} = H_{i_{up}(j)\,j} = H_{i_{low}(j)\,j}, \quad \forall\, j = 1, 2, \dots, p.

Since in a numerical solution it is usually not possible to achieve optimality exactly, we need to define approximate optimality conditions. We denote by τ an upper bound on how much violation we can tolerate. The exact KKT condition can be replaced by

    H_{i_{up}(j)\,j} - H_{i_{low}(j)\,j} \le \tau, \quad \forall\, j.

As a consequence, we may define

    -b_{j0} = \frac{H_{i_{up}(j)\,j} + H_{i_{low}(j)\,j}}{2}.

The SMO algorithm can now be described as follows:

1. Choose α^0 = (α_{11}, α_{12}, ..., α_{np})' satisfying condition (3.6). One possible initial vector is

    \alpha_{ij} = \frac{C}{m_1} \text{ for } i \text{ with } y_i = +1,\ \forall\, j, \qquad
    \alpha_{ij} = \frac{C}{m_2} \text{ for } i \text{ with } y_i = -1,\ \forall\, j,

where m_1 and m_2 denote the number of training examples in class {+1} and {−1}, respectively. Set l = 1.

2. If α^l satisfies (3.7), within the tolerance τ, for every j ∈ {1, 2, ..., p}, stop. If (3.7) does not hold for some j ∈ {1, 2, ..., p}, let

    t_j^{*} = \arg\min_{t_j}\,\psi_j(t_j), \qquad
    \alpha^{l+1}_{i_{low}(j)\,j} = \alpha^{l}_{i_{low}(j)\,j} - \frac{t_j^{*}}{y_{i_{low}(j)}}, \qquad
    \alpha^{l+1}_{i_{up}(j)\,j} = \alpha^{l}_{i_{up}(j)\,j} + \frac{t_j^{*}}{y_{i_{up}(j)}}.   (3.9)

3. Update α^{l+1} according to the above formulas and, if necessary, go back to step 2.

As stated in Keerthi (2002), the value of C should not be large. For our simulated microarray data sets, when C > 0.5 the initial value of α is outside the domain of the log function. Recall that

    b_{j0} = -\frac{H_{i_{up}(j)\,j} + H_{i_{low}(j)\,j}}{2} \quad \text{and} \quad b_{j1} = \sum_{i=1}^{n}\alpha_{ij}\, y_i\, x_{ij},

so the probability path of class y conditional on the jth gene expression level x_{·j} is

    P(y \mid x_{\cdot j}) = \frac{1}{1 + \exp\big(-y\,(b_{j0} + b_{j1} x_{\cdot j})\big)}.

3.1.3 Feature Selection

Besides predicting the correct cancer class for a new tumor sample, another challenge in microarray cancer diagnosis is to identify the relevant genes which contribute the most to the classification. In our study, we used the ratio of between-group to within-group sum of squares (BSS/WSS) (Dudoit et al. (2002)) and recursive feature elimination (RFE) to determine the initial gene order. Regarding BSS/WSS: intuitively, genes with relatively large variation between classes and relatively small variation within classes are the most likely candidates for relevant genes. BSS/WSS is a univariate gene selection method in which genes with large BSS/WSS ratios are good candidate relevant genes. For a gene j, let D_{ij} denote the expression level of gene j for sample i, \bar D_{kj} denote the average expression level of gene j over samples in class k ∈ {+1, −1}, and \bar D_{j} denote the average expression level of gene j over all samples. The BSS/WSS ratio for gene j is defined as

    \frac{BSS(j)}{WSS(j)} = \frac{\sum_{i}\sum_{k} 1(y_i = k)\,(\bar D_{kj} - \bar D_{j})^2}{\sum_{i}\sum_{k} 1(y_i = k)\,(D_{ij} - \bar D_{kj})^2}.   (3.10)

We compute the BSS/WSS ratio for each of the p genes and order the genes in descending order of the BSS/WSS ratio.
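A small sketch of (3.10) is given below, computing the ratio for all genes at once; it assumes a numpy matrix D of expression levels with samples in rows and genes in columns and labels y in {+1, −1}, and the function name is ours rather than the dissertation's.

```python
import numpy as np


def bss_wss_ratio(D, y):
    """BSS(j)/WSS(j) from (3.10), computed for every gene j at once."""
    overall = D.mean(axis=0)
    bss = np.zeros(D.shape[1])
    wss = np.zeros(D.shape[1])
    for k in (+1, -1):
        Dk = D[y == k]
        class_mean = Dk.mean(axis=0)
        bss += Dk.shape[0] * (class_mean - overall) ** 2   # sum_i 1(y_i = k)(Dbar_kj - Dbar_j)^2
        wss += ((Dk - class_mean) ** 2).sum(axis=0)        # sum_i 1(y_i = k)(D_ij - Dbar_kj)^2
    return bss / wss
```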
The gene selection method RFE depends on the classification method: we rank the genes in descending order of the absolute value of b_{j1}. The coefficient b_{j1} reflects how a change in the expression level of gene j affects the conditional probability; the larger |b_{j1}| is, the more strongly gene j influences the assignment of class, and therefore the more likely gene j is to be a relevant gene for the classification.

Aiming at gene selection, we apply BSS/WSS and RFE. As another choice, we propose a new univariate gene ranking method, judged in terms of classification performance, called modified univariate ranking (MUR). The criterion is defined as

\rho_j = \frac{\delta\,|\bar x_j^{+} - \bar x_j^{-}| + (1-\delta)\,|\sigma_j^{+} - \sigma_j^{-}|}{\sigma_j^{+} + \sigma_j^{-}},   (3.11)

where \delta \in [0, 1], and \bar x_j^{k} and \sigma_j^{k} denote the average and standard deviation of the expression levels of gene j over all samples of class k \in \{+1, -1\}. Our ranking statistic is motivated by the univariate ranking statistic

s_j = \frac{|\bar x_j^{+} - \bar x_j^{-}|}{\sigma_j^{+} + \sigma_j^{-}},

which was introduced by Dudoit et al. (2002). The statistic s_j mainly captures the location difference between the two classes for any gene j and cannot detect any difference when the means of the two classes are equal but the standard deviations differ substantially. In contrast, \rho_j considers both the location and the scale difference, even when one of them is hard to detect. Through our simulation study, we conclude that \rho_j fits some data sets, including the leukemia data, better than UR and BSS/WSS in selecting groups of genes for classification.

3.1.4 Cut-off statistic a_m

The probability path of class y conditional on the jth gene expression value x_{.j} is

P(y | x_{.j}) = \frac{1}{1 + \exp(-y (b_{j0} + b_{j1} x_{.j}))}.

Outline of the CLR procedure with iterative RFE for gene selection (a short computational sketch of the MUR ranking and the \eta statistic follows this outline):

- Input: training set D with p genes and n samples.
- Pre-processing step: use all genes for classification and rank the p genes by the value of |b_{j1}|.
- Determine a_m: at the jth iteration step, let q_j denote the ordered list consisting of the m = p \times (0.9)^{j-1} top-ranked genes. Compute the statistic

\eta_i = \frac{\prod_{j \in q_j} P(y = +1 | x_{ij})}{\prod_{j \in q_j} P(y = -1 | x_{ij})}

for each tissue sample in the training set, i = 1, 2, ..., n. The classification rule is based on comparing the \eta value with a cut-off value a_m: assign a new sample x with gene expression values x_j, j = 1, 2, ..., p, to class {+1} if \eta > a_m, and assign x to {-1} otherwise. Consider the two groups of indices C_1 = \{i \in 1, ..., n : y_i = +1\} and C_2 = \{i \in 1, ..., n : y_i = -1\}. Ideally we wish \max_{i \in C_2} \eta_i < \min_{i \in C_1} \eta_i, so that a cut-off value a_m which classifies all training samples correctly exists. In practice, we find a value a_m giving the least misclassification on the training set. The classifier \eta > a_m is applied to the test samples, and the performance of the selected m genes is evaluated by the test error.
- Iterate the previous steps until m = 1.
- Output: the probability distribution of the class y conditional on each gene variable, the selected genes at each iteration (q_m) with the associated cut-off value a_m, the training error (CV error if cross-validation is done), and the test error.

When applying BSS/WSS to select a_m, we rank all p genes by BSS(j)/WSS(j), so the gene set q_m changes at each iteration; everything else is exactly the same as the steps for RFE.
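The following sketch illustrates the MUR criterion (3.11) and the \eta statistic used with the cut-off a_m. It assumes the per-gene coefficients b_{j0} and b_{j1} have already been fitted and are stored in arrays b0 and b1; those names and the default \delta = 0.8 (the value used in Table 3.1) are assumptions made for illustration only.

```python
import numpy as np


def mur_scores(D, y, delta=0.8):
    """Modified univariate ranking rho_j of (3.11); larger means more relevant."""
    mp, mn = D[y == +1].mean(axis=0), D[y == -1].mean(axis=0)
    sp, sn = D[y == +1].std(axis=0, ddof=1), D[y == -1].std(axis=0, ddof=1)
    return (delta * np.abs(mp - mn) + (1 - delta) * np.abs(sp - sn)) / (sp + sn)


def eta_statistic(D, genes, b0, b1):
    """eta_i = prod_j P(+1|x_ij) / prod_j P(-1|x_ij) over the selected genes."""
    f = b0[genes] + b1[genes] * D[:, genes]   # f_j(x_ij) for each selected gene
    # for the logistic model, log[P(+1|x)/P(-1|x)] = f, so eta is exp of the sum
    return np.exp(f.sum(axis=1))


# classify sample i to class +1 if eta_i > a_m and to -1 otherwise, where a_m is
# chosen to minimize the misclassification rate on the training samples.
```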
Table 3.1. Comparison of classification methods for breast cancer prognosis data.

Method                    Training error   Test error   No. of genes
van't Veer et al.             15/76           2/19           70
BMA                           16/76           3/19            6
CLR MUR (delta = 0.8)         16/76           3/19            1
CLR UR                        16/76           3/19            1
CLR BSS/WSS                   16/76           3/19            1
CLR RFE                       16/76           3/19            1

Table 3.2. Output of CLR on breast cancer data.

Gene         b_{j0}    b_{j1}
AL080059     0.613    -1.193

3.2 Results

3.2.1 Breast cancer prognosis data

Refer to the description of the breast cancer prognosis data in Section 2.2.3. Using all 4918 genes and C = 0.3, our CLR algorithm provides the probability distribution of the class y conditional on each individual gene variable. For the breast cancer data, we use all 76 training samples and obtain the ratio statistic corresponding to each sample. The cut-off a_m used to define the classifier for a given set of selected genes is the one which separates the training samples with the lowest error rate; if several such a_m values exist, we take their median. Our iterative CLR algorithm combined with BSS/WSS or RFE produced 3 classification errors out of 19 test samples with 1 selected gene, AL080059. van't Veer et al. (2002) reported 2 classification errors on the test set using 70 relevant genes, and Yeung et al. (2005) reported 3 classification errors out of 19 test samples using 6 selected genes. Furthermore, our selected gene, AL080059, is the only gene common to the 70-gene set of van't Veer et al. and the 6-gene set of Yeung et al. Under our CLR classification method, the single gene AL080059 performs similarly to those two sets of genes. The gene AL080059 is ranked first by BSS/WSS and it has the largest absolute value of the coefficient b_{j1}; to some degree this confirms the internal consistency of CLR, in the sense that the top-ranked BSS/WSS gene also has a top-ranked coefficient, as we would wish. The results in Table 3.1 and Table 3.2 are obtained with C = 0.3 and a_1 = 1.29. The gene selected by CLR with MUR, UR, BSS/WSS or RFE is AL080059, which is also one of the 70 genes selected by van't Veer et al. (2002) and one of the 6 genes chosen by Yeung et al. (2005).

3.2.2 Leukemia data

This data set consists of 38 training samples and 34 test samples of the two types of acute leukemia, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) (Golub et al. (1999)). Each sample is a vector of 7129 genes. The value of C is limited to at most 0.2 because of the domain of the log function. The performance of the classifier is evaluated on the test tissue samples. C = 0.1 is used in the simulation for the results in Table 3.3. The results of Golub et al. (1999), Tibshirani et al. (2002), SVM with RFE and PLR with RFE are summarized in Table 3.4; the CV error in Table 3.4 is the 10-fold cross-validation error. The result in Table 3.5 is obtained with C = 0.1 and a_m = 0.37 x 3^6; the result in Table 3.6 is obtained with C = 0.1 and a_m = 0.34 x 3^9.

Table 3.3. Comparison of CLR on leukemia data.

Method          Training error   Test error   No. of genes
CLR MUR             0/38            1/34            6
CLR UR              0/38            2/34            8
CLR BSS/WSS         0/38            1/34            9
CLR RFE             0/38            1/34            9

Table 3.4. Comparison of leukemia classification methods.

Method               CV error   Test error   No. of genes
Golub et al.           3/38        4/34           50
Tibshirani et al.      2/38        2/34           21
SVM RFE                2/38        1/34           31
PLR RFE                2/38        1/34           26

Table 3.5. Parameters for the chosen genes in the final classifier from CLR MUR.

Gene index    1834    1882    2288    2642    4847    5772
b_{j0}        0.96    0.95    0.96    0.97    0.97    0.99
b_{j1}       -0.72   -0.67   -0.69    0.56   -0.78    0.68
Table 3.6. Parameters for the selected genes in the final classifier from CLR RFE.

Gene index     461     745    1834    2020    3320    3847    4196    4847    5039
b_{j0}        0.99    0.97    0.96    1.00    0.99    0.99    0.98    0.97    0.97
b_{j1}       -0.73   -0.73   -0.72   -0.79   -0.79   -0.73   -0.71   -0.78   -0.75

3.3 Conclusion

We have proposed CLR for the 2-class microarray cancer diagnosis classification problem. Besides performing classification, CLR offers an estimate of the conditional probability P(y | x_{.j}) for each gene variable. The simulation results on the real leukemia data and breast cancer prognosis data show that, when using the same set of genes, CLR identifies fewer significant genes with a smaller error rate than SVM and PLR. When the modified univariate ranking method is used to select relevant genes for leukemia classification, CLR attains a 10-fold CV error of 0 and a single test error. Moreover, it removes more genes than SVM and PLR, which makes the set of chosen genes more manageable. We therefore consider CLR a good classification method for two-class microarray cancer data, with the added benefit that it can estimate the underlying distribution.

CHAPTER 4

Asymptotics and Variable Screening for Microarrays with Fixed Small Sample Size and Large Dimension

DNA microarray is a new and promising biotechnology which allows the monitoring of expression levels for thousands of genes simultaneously. Microarrays are being applied more and more often in biological and medical research to address a wide range of problems, including identifying a set of candidate genes that are most likely related to the outcome of the experiment. Statistical considerations are frequently to the fore in the analysis of microarray data, as researchers sift through massive amounts of data and adjust for various sources of variability in order to identify the most important genes among the many which are measured. However, in almost all studies there are many more candidate genes in a microarray than available samples, which makes the application of traditional statistical techniques improper. Some existing statistical methods deal with a single gene at a time (see Chen et al. (2003)), or deal with the case in which the sample size n tends to infinity (see Shao (2006)). The present chapter provides insight into the analysis of microarray data in the more realistic case where the sample size is fixed and the dimension of the variables p tends to infinity.

By applying some type of shrinkage estimation to a linear model, we gain more clues for interpreting the effects of the predictors. For example, the best subset selection of size k shrinks the coefficients by setting some of them exactly to zero, which makes the data much easier to interpret; in other words, it identifies the important variables for the outcome. This process is discrete, since each gene is either retained or dropped, so the result obtained after variable selection can be extremely unstable. Another type of shrinkage estimation, the lasso (see Zhu (2003, 2004) and Tibshirani (1996)), could be a very good choice. Knight and Fu (2000) proved that the lasso estimator is consistent when the regularization parameter divided by the sample size tends to 0 as the sample size increases to infinity and the number of predictors p is fixed. But this condition does not hold for real microarray data, and it is hardly appropriate to rely on asymptotics for n tending to infinity when in reality the sample size n is fixed. In practice, it is often true that a given outcome of interest is affected significantly by only a few genes among the large number of candidate genes, and the rest of the genes are approximately irrelevant to the outcome.
Under such an idea, our task is to identify these important genes based on the sampled data. This is actually a variable selection problem. In the current statistical literature, however, there is no established variable selection procedure that can deal with the number of variables p tending to infinity while the sample size n is fixed. Zheng and Loh (1997) considered linear model selection with high-dimensional covariates, but they assumed that the dimension of the covariates over the sample size tends to 0 as the sample size increases to infinity. In this chapter, we use the ridge regression estimator (see Hoerl and Kennard (1970)), which shrinks the coefficients but does not set any of them exactly to 0. The problem considered concerns the asymptotic properties of microarray data analysis with fixed sample size n and very large dimension p of gene variables. Furthermore, we propose a variable screening method to eliminate the insignificant candidate genes and prove its consistency. The shrinkage of ridge regression is continuous, and therefore quite stable. After screening out genes, if necessary, we may apply an established variable selection method, such as AIC or BIC, to select a final set of genes and fit a linear model which interprets the relationship between the predictors and the outcome in a concise way. Mean squared error (MSE) consistency of the ridge regression estimators, as the dimension p tends to infinity and the sample size is fixed, is proved in Section 4.1. The variable screening method is described in Section 4.2. Finally, a simulation study is presented in Section 4.3 to investigate the performance of the variable screening method.

4.1 Mean Squared Error Consistency

Consider the model Y = X\beta + \varepsilon, where E(\varepsilon) = 0 and Var(\varepsilon) = \sigma_p^2 I_n, X is an n x p matrix and \beta is a p x 1 vector. Apply the ridge regression method to obtain an estimator of \beta, i.e.,

\min_{\beta} (Y - X\beta)'(Y - X\beta),   (4.1)

subject to the condition \sum_{j=1}^{p} \beta_j^2 \le t_p. According to Fu (2004), this is equivalent to the L2-norm SVM model

\min_{\beta} (Y - X\beta)'(Y - X\beta) + h_p \sum_{j=1}^{p} \beta_j^2,   (4.2)

where h_p is a regularization parameter. The equivalence is in the sense that for any positive constant t_p there exists a positive constant h_p such that model (4.1) and model (4.2) share the same optimal solution \hat\beta, so we can stay focused on model (4.2). The convex objective function makes it feasible to obtain the optimal solution in closed form,

\hat\beta = (X'X + h_p I_p)^{-1} X'Y.   (4.3)

Throughout the chapter, X is the gene expression data and Y is the outcome vector for the n samples.
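As an aside, the ridge estimator (4.3) can be computed without forming the p x p matrix X'X by using the identity (X'X + h_p I_p)^{-1} X' = X'(XX' + h_p I_n)^{-1}, which is the dual-variable form of ridge regression in the spirit of Saunders and Gammerman (1998). The sketch below, with hypothetical names, only assumes that p >> n.

```python
import numpy as np


def ridge_estimator(X, Y, h_p):
    """beta_hat = (X'X + h_p I_p)^{-1} X'Y, computed through the n x n dual
    system, which avoids the p x p matrix when p >> n."""
    n = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + h_p * np.eye(n), Y)   # n x n linear system
    return X.T @ alpha                                       # p-dimensional estimate
```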
In this section, we discuss the MSE consistency of \hat\beta when n is fixed while p tends to infinity. Because p enters through the dimension of X, we state the conditions on the components of X as the dimension p tends to infinity.

Assumption A. There exists a constant 0 \le \delta < 0.5 such that each component of X is O(p^{\delta}); moreover, \sigma_p = O(1), \sigma_p^2/h_p = o(1) and h_p/p^{1-2\delta} = O(1) as p \to \infty.

Theorem 1. Under Assumption A, \max_{1 \le i \le p} var(\hat\beta_i) \to 0 as p \to \infty.

Proof.

var(\hat\beta) = (X'X + h_p I_p)^{-1} X' (\sigma_p^2 I_n) X (X'X + h_p I_p)^{-1}
            = \frac{\sigma_p^2}{h_p} \Big(\frac{X'X}{h_p} + I_p\Big)^{-1} \frac{X'X}{h_p} \Big(\frac{X'X}{h_p} + I_p\Big)^{-1}
            = \frac{\sigma_p^2}{h_p} \Big[ \Big(\frac{X'X}{h_p} + I_p\Big)^{-1} - \Big(\frac{X'X\,X'X}{h_p^2} + \frac{2\,X'X}{h_p} + I_p\Big)^{-1} \Big].

So we have

\frac{h_p}{\sigma_p^2}\, var(\hat\beta) = A^{-1} - B^{-1},

where A = X'X/h_p + I_p and B = X'X\,X'X/h_p^2 + 2\,X'X/h_p + I_p = A^2. Because each component of X is O(p^{\delta}) and h_p/p^{1-2\delta} = O(1) under Assumption A, both A^{-1} and B^{-1} are O(1)_{p \times p}; indeed, since A \succeq I_p,

\sup_{\|a\|=1} |a' A^{-1} a - a' B^{-1} a| \le 1/4.

Hence the diagonal elements of (h_p/\sigma_p^2)\, var(\hat\beta) are uniformly bounded, and since \sigma_p^2/h_p = o(1),

\max_{1 \le i \le p} var(\hat\beta_i) \to 0, \quad \text{as } p \to \infty.

For the analysis of the bias vector, we need the following two assumptions.

Assumption B. Only p_0 components of \beta are nonzero (p_0 does not depend on p). Furthermore, \beta lies in the linear space generated by the rows of X'X for sufficiently large p, i.e., there is a p x 1 vector b such that \beta = X'X b when p is large enough.

As p \to \infty, n is fixed throughout the chapter, and X'X has at most n positive eigenvalues. Let \lambda_{ip} be the ith nonzero eigenvalue of X'X, i = 1, 2, ..., n; without loss of generality we assume \lambda_{ip} > 0.

Assumption C. There is a sequence of positive numbers \xi_p \to \infty such that h_p = o(\xi_p). Moreover, there exists a finite positive constant c, not depending on p, such that \lambda_{ip} \ge c\,\xi_p for all i = 1, 2, ..., n. (Normally we can take \xi_p = p.)

Theorem 2. Under Assumptions B and C, \max_{1 \le i \le p} |bias(\hat\beta_i)| \to 0 as p \to \infty.

Proof. When p is large enough, let \Gamma be an orthogonal matrix such that

\Gamma' X'X \Gamma = \begin{pmatrix} \Lambda_{n \times n} & 0_{n \times (p-n)} \\ 0_{(p-n) \times n} & 0_{(p-n) \times (p-n)} \end{pmatrix},

where \Lambda_{n \times n} is a diagonal matrix with elements \lambda_{ip}, i = 1, 2, ..., n. It then follows that

bias(\hat\beta) = E(\hat\beta) - \beta = (X'X + h_p I_p)^{-1} X'X \beta - \beta = -h_p (X'X + h_p I_p)^{-1} \beta
             = -\Gamma \Big(\frac{\Gamma' X'X \Gamma}{h_p} + I_p\Big)^{-1} \Gamma' \beta = -\Gamma\, C\, \Gamma' \beta,

where C = (\Gamma' X'X \Gamma / h_p + I_p)^{-1} is a diagonal matrix whose first n diagonal elements are h_p/(h_p + \lambda_{ip}), i = 1, 2, ..., n, and whose remaining p - n diagonal elements all equal 1. Under Assumption B,

\Gamma' \beta = \Gamma' X'X b = (\Gamma' X'X \Gamma)\, \Gamma' b = \begin{pmatrix} \Lambda_{n \times n} & 0_{n \times (p-n)} \\ 0_{(p-n) \times n} & 0_{(p-n) \times (p-n)} \end{pmatrix} \Gamma' b.

Notice \Gamma'\beta is a p x 1 vector whose only possibly nonzero components are among its first n components, and by the assumption that only p_0 components of \beta are nonzero, the first n components of \Gamma'\beta are bounded. Since \lambda_{ip} \ge c\,\xi_p for all i = 1, 2, ..., n, we obtain \max_{1 \le i \le n} h_p/(h_p + \lambda_{ip}) = O(h_p/\xi_p) = o(1). So \max_{1 \le i \le p} |bias(\hat\beta_i)| \to 0.

From the above facts the following result is also of interest.

Lemma 1. Under Assumptions A, B and C, the estimator \hat\beta_i is MSE consistent for \beta_i, for all i = 1, 2, ..., p, as p \to \infty.

Proof. MSE(\hat\beta_i) = E(\hat\beta_i - \beta_i)^2 = var(\hat\beta_i) + bias^2(\hat\beta_i) \to 0, since both terms approach 0 as p \to \infty.

4.2 Variable Screening Procedure

Let a_p be a sequence of positive numbers satisfying a_p \to 0 as p \to \infty. For each value of p, we screen out the ith gene if and only if |\hat\beta_i| \le a_p. Therefore, after the screening procedure, only genes with |\hat\beta_i| > a_p remain in the model as predictors. The sequence a_p acts as a filter in the process and eliminates genes with relatively small coefficients (a short computational sketch of the procedure follows below). We will prove the consistency of the procedure when only finitely many components of \beta are nonzero.
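A minimal sketch of the screening step is given below. The default tuning constants follow the simulation settings of Section 4.3 (h_p = p^{0.25}, a_p = p^{-1/11}); they are illustrative choices only, and the function name is hypothetical.

```python
import numpy as np


def screen_genes(X, Y, h_p=None, a_p=None):
    """Keep gene i iff |beta_hat_i| > a_p, with beta_hat the ridge estimate (4.3)."""
    n, p = X.shape
    h_p = p ** 0.25 if h_p is None else h_p            # illustrative default
    a_p = p ** (-1.0 / 11.0) if a_p is None else a_p   # illustrative default
    beta_hat = X.T @ np.linalg.solve(X @ X.T + h_p * np.eye(n), Y)
    keep = np.flatnonzero(np.abs(beta_hat) > a_p)      # indices surviving the filter
    return keep, beta_hat
```

After screening, the retained columns of X can be passed to an ordinary variable selection step such as AIC or BIC, as described above.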
Theorem 3. Suppose that Assumptions A, B and C hold and that only finitely many components of \beta are nonzero. Assume E|\varepsilon_i|^{2k} < \infty for some positive integer k, and that a_p, h_p and \xi_p are chosen so that h_p/\xi_p = o(a_p) and

p \Big(\frac{h_p}{\xi_p a_p}\Big)^{2k} \to 0, \qquad \frac{p^{1+2k\delta}}{(h_p a_p)^{2k}} \to 0, \qquad \text{as } p \to \infty.

Then

\lim_{p \to \infty} P\big( |\hat\beta_i| > a_p \ \text{for all } i \text{ with } \beta_i \neq 0 \ \text{and} \ |\hat\beta_i| \le a_p \ \text{for all } i \text{ with } \beta_i = 0 \big) = 1.   (4.4)

Proof. Result (4.4) is equivalent to the pair of statements

\lim_{p \to \infty} P\big( |\hat\beta_i| > a_p \ \text{for all } i \text{ with } \beta_i \neq 0 \big) = 1,   (4.5)

\lim_{p \to \infty} P\big( |\hat\beta_i| \le a_p \ \text{for all } i \text{ with } \beta_i = 0 \big) = 1.   (4.6)

For (4.5),

P(|\hat\beta_i| > a_p \ \text{for all } i \text{ with } \beta_i \neq 0)
  = 1 - P(\text{there exists at least one } i \text{ with } \beta_i \neq 0 \text{ satisfying } |\hat\beta_i| \le a_p)
  \ge 1 - \sum_{\{i:\ \beta_i \neq 0\}} P(|\hat\beta_i| \le a_p).

By Lemma 1, each \hat\beta_i with \beta_i \neq 0 is MSE consistent for \beta_i, and a_p \to 0, so each term P(|\hat\beta_i| \le a_p) \to 0. Since only finitely many components of \beta are nonzero, this finishes the proof of (4.5).

If \beta_i = 0, then

\hat\beta - E(\hat\beta) = (X'X + h_p I_p)^{-1} X'(X\beta + \varepsilon) - (X'X + h_p I_p)^{-1} X'X\beta = (X'X + h_p I_p)^{-1} X'\varepsilon = B_{p \times n}\, \varepsilon_{n \times 1},

where

B = (X'X + h_p I_p)^{-1} X' = \frac{1}{h_p}\Big(\frac{X'X}{h_p} + I_p\Big)^{-1} X' = \big( O(p^{\delta}/h_p) \big)_{p \times n}.

Let m_i denote the ith component of B\varepsilon, so that \hat\beta_i = bias(\hat\beta_i) + m_i when \beta_i = 0. By Markov's inequality,

P(|\hat\beta_i| > a_p) \le \frac{E|\hat\beta_i|^{2k}}{a_p^{2k}} = \frac{E|bias(\hat\beta_i) + m_i|^{2k}}{a_p^{2k}} \le \frac{2^{2k-1}\big( |bias(\hat\beta_i)|^{2k} + E|m_i|^{2k} \big)}{a_p^{2k}}.

From the discussion in the previous section, bias(\hat\beta_i) = O(h_p/\xi_p); from the analysis of the matrix B, together with E|\varepsilon_l|^{2k} < \infty and n fixed, E|m_i|^{2k} = O\big( (p^{\delta}/h_p)^{2k} \big). Consequently,

P(|\hat\beta_i| > a_p \ \text{for at least one } i \le p \text{ with } \beta_i = 0)
  \le \sum_{\{i:\ \beta_i = 0\}} P(|\hat\beta_i| > a_p)
  \le p \max_i P(|\hat\beta_i| > a_p)
  = O\Big( p \Big(\frac{h_p}{\xi_p a_p}\Big)^{2k} \Big) + O\Big( \frac{p^{1+2k\delta}}{(h_p a_p)^{2k}} \Big).

For the choices of (h_p, \xi_p, a_p) given in Theorem 3, both O(\cdot) terms in the above expression are o(1), which establishes result (4.6) and completes the proof.

The next result establishes the consistency of the variable screening method when the moment condition on \varepsilon and the constraints on X and h_p are replaced by a normality assumption on the residuals.

Theorem 4. Suppose Assumptions A-C hold and \varepsilon is normally distributed. Additionally, a_p, h_p and \xi_p are chosen so that a_p = o(1), h_p/\xi_p = o(a_p) and h_p a_p^2/(\sigma_p^2 \log \xi_p) \to \infty. Then result (4.4) holds.

Proof. Again, (4.4) is equivalent to (4.5) and (4.6), and the proof of (4.5) is exactly the same as in the proof of Theorem 3; it remains to show (4.6). If \beta_i = 0, then under the normality assumption on \varepsilon, \hat\beta_i is normally distributed for each i, so

P(|\hat\beta_i| > a_p) = P(\hat\beta_i > a_p) + P(\hat\beta_i < -a_p)
  = \Phi\Big(\frac{bias(\hat\beta_i) - a_p}{sd(\hat\beta_i)}\Big) + \Phi\Big(\frac{-bias(\hat\beta_i) - a_p}{sd(\hat\beta_i)}\Big),

where \Phi is the standard normal distribution function. In Theorem 2 we proved that bias(\hat\beta_i) = O(h_p/\xi_p) for all i = 1, 2, ..., p, so by the condition h_p/\xi_p = o(a_p) the bias part is o(a_p). From Theorem 1, sd(\hat\beta_i) is bounded by a constant multiple of \sigma_p/\sqrt{h_p} for all i = 1, 2, ..., p. Hence, for all large enough p there exists a constant t \in (0, 1) such that

P(|\hat\beta_i| > a_p) \le 2\,\Phi\big( -t \sqrt{h_p}\, a_p / \sigma_p \big).

By the assumption h_p a_p^2/(\sigma_p^2 \log \xi_p) \to \infty, we have t \sqrt{h_p}\, a_p/\sigma_p \ge \sqrt{2 q \log \xi_p} for all large enough p, where q is a constant such that p/\xi_p^{q} \to 0. Now apply the inequality \Phi(-x) \le e^{-x^2/2} for any x \ge 1; the probability of interest becomes

P(|\hat\beta_i| > a_p) \le 2\,\Phi\big( -\sqrt{2 q \log \xi_p} \big) \le 2\, e^{-q \log \xi_p} = 2\, \xi_p^{-q}.

It follows that

P(|\hat\beta_i| > a_p \ \text{for at least one } i \text{ with } \beta_i = 0) \le \sum_{\{i:\ \beta_i = 0\}} P(|\hat\beta_i| > a_p) \le 2p/\xi_p^{q} \to 0,

so \lim_{p \to \infty} P(|\hat\beta_i| \le a_p \ \text{for all } i \text{ with } \beta_i = 0) = 1. That finishes the proof of (4.6).

4.3 Simulation Result

A simulation study was carried out to investigate the performance of the proposed variable screening method with fixed n and increasing p, in order to study the asymptotic effect. We consider two values of p, namely p = 360 and p = 600, with the same n = 60 to evaluate the MSE consistency; for the variable selection part we consider p = 360 and p = 1200. Since the convergence rate can also be influenced by the variability of the residual \varepsilon, as in Theorem 1, we consider two values of the error standard deviation, \sigma = 1 and \sigma = p^{-1/9}, combined with each p. We assume that only the first 5 of all candidate genes are related to Y (p_0 = 5). A flowchart for generating the microarray data X is shown in Figure 4.1. Our n samples are drawn from the multivariate normal distribution N(p^{0.2} 1_n, I_n). In each of 1,000 simulation runs, y_i was generated according to y_i = x_i'\beta + \varepsilon_i, i = 1, 2, ..., n, where the \varepsilon_i were independently generated from N(0, \sigma^2). In Table 4.1 we report the MSE for the first 10 genes. The convergence tendency is quite clear. For distinguished nonzero \beta_i values, the MSE convergence rate is slower than that of the \beta's which are zero. In the simulation, the variability of the error does not significantly affect the MSE convergence rate.
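The sketch below reproduces the flavor of one simulation run. For simplicity it draws the entries of X as i.i.d. N(0,1), which is a simplification of the generation scheme in Figure 4.1; the seed, the function name and the returned summaries are likewise assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate_run(n=60, p=360, sigma=1.0, p0=5):
    """One run in the spirit of Section 4.3: only the first p0 coefficients of
    beta are nonzero.  X is drawn i.i.d. N(0,1) here, a simplification of the
    flowchart in Figure 4.1."""
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:p0] = [1.4, -2.5, 1.8, -1.7, 1.2]
    Y = X @ beta + sigma * rng.standard_normal(n)
    # ridge estimate and screening with h_p = p^0.25, a_p = p^(-1/11)
    h_p, a_p = p ** 0.25, p ** (-1.0 / 11.0)
    beta_hat = X.T @ np.linalg.solve(X @ X.T + h_p * np.eye(n), Y)
    keep = np.flatnonzero(np.abs(beta_hat) > a_p)
    return len(keep), set(range(p0)).issubset(keep)


# tally (number of remaining variables, whether all 5 relevant genes survive):
# results = [simulate_run() for _ in range(1000)]
```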
Figure 4.1. Flowchart of data generation: n samples are first drawn from a multivariate normal distribution; at each step i, a block of N(0,1) entries (or a block of zeros) is generated and column-bound to the previous matrix; finally a diagonal matrix with random diagonal elements is column-bound to produce the simulation data X.

In Table 4.2, each simulation run was applied to the generated data X with a true \beta vector whose nonzero components are quite close to 0. We carry out the simulation with h_p = p^{0.25} and a_p = p^{-1/11}; these two parameters are chosen to satisfy the conditions of Theorem 4. The results in Table 4.2 for small coefficients indicate that the proposed variable screening method gradually leaves out the unrelated gene variables. The consistency of the screening method is strongly supported by the simulation study. To show that the screening method is stable no matter how big or how small the nonzero components of \beta are, we also provide simulation results for large nonzero \beta_i in Table 4.3, where each simulation run was applied to the generated data X with h_p = p^{0.25} and a_p = 11 x p^{-1/11}. The results in Table 4.3 are essentially perfect even for small p and large variability of the residual; the reason lies in the MSE consistency of the estimators. Tables 4.2 and 4.3 report the performance of the variable screening method in terms of two measures, with results for f1, f2, g1, g2, h1 and h2. Note that it is possible that, after screening, some of the 5 variables related to Y are not selected even though the number of remaining variables is 5 or larger.

Table 4.1. Simulation results for MSE consistency based on 1,000 runs (X fixed), with h_p = p^{0.25}, n = 60, and X generated as in the flowchart.

                               gene:      1       2       3       4       5       6       7       8       9      10
sigma = 1         p = 360   m1         0.281   0.747   0.129   0.317   0.204   0.022   0.019   0.103   0.025   0.084
                  p = 600              0.095   0.263   0.116   0.168   0.149   0.048   0.027   0.016   0.017   0.024
sigma = p^{-1/9}  p = 360              0.242   0.466   0.354   0.265   0.084   0.006   0.032   0.017   0.076   0.008
                  p = 600              0.146   0.272   0.251   0.181   0.019   0.022   0.008   0.016   0.015   0.003
sigma = 1         p = 360   m2         59.04   43.49   85.82   31.96   97.82   0.419   0.164   23.58   0.032   0.144
                  p = 600              17.11   27.75   40.77   12.91   34.21   0.013   0.424   5.763   0.097   3.638
sigma = p^{-1/9}  p = 360              38.89   84.19   63.94   81.19   40.76   15.23   1.782   2.832   1.046   2.802
                  p = 600              24.11   65.20   38.55   53.41   27.65   11.90   0.003   0.890   0.376   1.642

m1 = MSE of the first 10 gene variables with \beta = (1.4, -2.5, 1.8, -1.7, 1.2, 0, 0, ..., 0)'
m2 = MSE of the first 10 gene variables with \beta = (21.4, -22.5, 21.8, -21.7, 21.2, 0, 0, ..., 0)'
Table 4.2. Simulation results for close-to-0 nonzero \beta_i values based on 1,000 runs (X fixed). Columns give the frequency of each number of remaining variables after screening.

sigma      n    p      statistic   <=3     4     5     6     7     8     9   >=10
1          60   360    f1            0    14   833   137    16     0     0     0
                       f2            0    14   833   137    16     0     0     0
                       g1            0    15   881    97     7     0     0     0
                       g2            0    15   880    97     7     0     0     0
                       h1            0    15   921    62     2     0     0     0
                       h2            0    15   920    62     2     0     0     0
                1200   f1            0    73   891    36     0     0     0     0
                       f2            0    73   887    36     0     0     0     0
                       g1            0    75   896    29     0     0     0     0
                       g2            0    75   896    29     0     0     0     0
                       h1            0    76   904    20     0     0     0     0
                       h2            0    76   904    20     0     0     0     0
p^{-1/9}   60   360    f1            0     3   835   162     0     0     0     0
                       f2            0     3   833   162     0     0     0     0
                       g1            0     4   944    52     0     0     0     0
                       g2            0     4   944    52     0     0     0     0
                       h1            0     5   982    13     0     0     0     0
                       h2            0     5   982    13     0     0     0     0
                1200   f1-h2         0     0  1000     0     0     0     0     0

Table 4.2 is simulated with \beta = (1.4, -2.5, 1.8, -1.7, 1.2, 0, 0, ..., 0)', h_p = p^{0.25}, a_p = p^{-1/11}, p_0 = 5.
f1 = frequencies of the number of remaining variables after screening
f2 = frequencies of including all 5 relevant variables after screening
g1 = frequencies of the number of remaining variables after screening and AIC
g2 = frequencies of including all 5 relevant variables after screening and AIC
h1 = frequencies of the number of remaining variables after screening and BIC
h2 = frequencies of including all 5 relevant variables after screening and BIC
When the number of remaining variables is less than 5, f1, f2, g1, g2, h1 and h2 count only the runs in which all remaining variables are related to Y.

Table 4.3. Simulation results for distinguished nonzero \beta_i values based on 1,000 runs (X fixed). For every combination of sigma in {1, p^{-1/9}}, n = 60 and p in {360, 1200}, each of the six statistics f1, f2, g1, g2, h1 and h2 places all 1,000 runs in the "5" column and 0 in every other column; that is, the screening retained exactly the 5 relevant variables in every run. The table is obtained with h_p = p^{0.25}, a_p = 11 x p^{-1/11}, p_0 = 5 and \beta = (21.4, -22.5, 21.8, -21.7, 21.2, 0, 0, ..., 0)'. f1, f2, g1, g2, h1 and h2 are the same as those in Table 4.2.

4.4 Conclusions and Discussions

The established asymptotic results show that, for a fixed sample size n, the proposed variable screening method has good properties when the dimension p is sufficiently large. It performs essentially perfectly when the genes related to the outcome have a significant influence on the outcome. For applications, we need to carry out some sensitivity analysis on the gene expression data X to check the regularity conditions, and to determine a_p and h_p in the variable screening procedure. More research on how to reduce the requirements on X is called for.

The use of ridge regression is crucial both for the MSE consistency of the estimators of the true parameters and for the proposed variable screening method. Since ridge regression is a type of shrinkage estimation, the use of other SVM estimators might also produce a good screening method, for example the L1-norm SVM (lasso) estimator. However, the lasso solution path does not have a closed-form expression (Tibshirani (1996)), whereas the properties of the ridge regression estimator are much easier to obtain in practice because of its known solution (Saunders and Gammerman (1998)).

Penalized linear models are considered for Y and X in this chapter. In general, the regression function between Y and X may not be strictly linear.
When there exists a kernel transformation K(.) such that the regression of Y on K(X) is linear in the feature space, our results on MSE consistency and on the consistency of the screening method can both be extended to the feature space for the transformed variables. In the present chapter, our proofs and simulations are for a fixed design matrix; more theoretical and numerical work for random design matrices is necessary.

CHAPTER 5

Gene Selection Methods for Microarray Data

It is important to know which genes are most relevant to the binary classification task and to select these genes, for a variety of reasons: removing noisy or irrelevant genes may improve the performance of the classifier; a candidate list of important genes can be used to further understand the biology of the disease and to design further experiments; and a clinical device recording on the order of tens of genes is much more economical and practical than one requiring thousands of genes. The gene selection problem is an example of what is called feature selection in machine learning. Gibbs sampling (George (1993)) has been used as one variable selection method. In the context of classification, feature selection methods fall into two categories, filter methods and wrapper methods. Filter methods select features according to criteria that are independent of those that the classifier optimizes; wrapper methods, on the other hand, use the same or similar criteria as the classifier. We will discuss five feature selection approaches: univariate ranking (UR, Golub et al., 2000), recursive feature elimination (RFE, Guyon et al., 2002), the ratio of between-group sum of squares to within-group sum of squares (BSS/WSS, Dudoit et al., 2002), clustering (Sengupta (2003)), and pairwise ranking (Jonassen (2002)). They are either filter methods or wrapper methods. There are other, more biologically driven methods for gene selection; see Dudoit (2000), Chow (2001), Kim (2002) and Jaeger (2003).

5.1 Univariate Ranking

The univariate ranking method is also called signal-to-noise or P-metric. It defines a statistic S_j for each gene variable j and assigns a rank according to the value of the statistic; S_j depends only on the gene expression data. For each gene, we compute the statistic

S_j = \frac{\mu_{+}(j) - \mu_{-}(j)}{\sigma_{+}(j) + \sigma_{-}(j)},

where \mu_{+}(j) and \mu_{-}(j) are the means of classes +1 and -1 for the jth gene, and \sigma_{+}(j) and \sigma_{-}(j) are the corresponding standard deviations of the two classes for the jth gene. Genes that give the most positive values are most correlated with class +1, and genes that give the most negative values are most correlated with class -1. One selects the m/2 most positive genes and the m/2 most negative genes, and then uses this reduced data set for classification.
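A minimal sketch of the signal-to-noise ranking described above; the function names are hypothetical.

```python
import numpy as np


def ur_scores(D, y):
    """Signal-to-noise statistic S_j = (mu_+(j) - mu_-(j)) / (sigma_+(j) + sigma_-(j))."""
    mp, mn = D[y == +1].mean(axis=0), D[y == -1].mean(axis=0)
    sp, sn = D[y == +1].std(axis=0, ddof=1), D[y == -1].std(axis=0, ddof=1)
    return (mp - mn) / (sp + sn)


def select_ur(D, y, m):
    """Keep the m/2 most positive and m/2 most negative genes."""
    order = np.argsort(ur_scores(D, y))              # ascending scores
    return np.concatenate([order[-(m // 2):], order[: m // 2]])
```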
A basic question that arises for any feature selection algorithm is how many genes the classifier should use. One approach to answering this question is hypothesis and permutation testing (Golub et al. (1999)). The null hypothesis is that the UR statistic for each gene computed on the training set comes from the same distribution as that for a random data set, where a random data set is the training set with its labels randomly permuted. In detail, the permutation test procedure for the UR statistic is as follows:

(1) Generate the statistic for all genes using the actual class labels and sort the genes accordingly.

(2) Generate 100 or more random permutations of the class labels. For each set of randomized class labels, generate the statistics for all genes and sort the genes accordingly.

(3) Build a histogram from the randomly permuted statistics using various numbers of genes; call this number k. For each value of k, determine different percentiles (1%, 5%, 50%, etc.) of the corresponding histogram.

(4) Compare the actual signal-to-noise scores with the significance levels obtained from the histograms of permuted class labels for each value of k.

See the figure for an illustration. The solid curve is the rank-ordered UR statistic computed on the training set; the three dashed curves are the 5th, 50th and 95th percentiles of the same rank-ordered statistic computed from the random data. The number of significant genes is designated as the value of k at which the solid curve crosses the 5th percentile curve.

5.2 Recursive Feature Elimination

This method recursively removes features based upon the absolute magnitude of the hyperplane elements. We first outline the approach for linear SVMs. Given microarray data with p genes per sample, the SVM outputs the normal to the hyperplane, w, a vector with p components, each corresponding to the expression of a particular gene. Loosely speaking, assuming that the expression values of each gene have similar ranges, the absolute magnitude of each element of w determines its importance in classifying a sample, since the following equation holds:

f(x) = w \cdot x + b = \sum_{j=1}^{p} w_j x_j + b.

The idea behind RFE is to eliminate elements of w that have small magnitude, since they contribute little to the classification function. The SVM is trained with all genes; then we compute the statistic S(j) = |w_j| for each gene, where w_j is the jth element of w. We sort S from largest to smallest value and remove the genes corresponding to indices that fall in the bottom 10% of the sorted list S. The SVM is retrained on this smaller gene expression set, and the procedure is repeated until a desired number of genes, m, is obtained.

When a nonlinear SVM is used, the idea is to remove those features that affect the margin the least, since maximizing the margin is the objective of the SVM (Papageorgiou et al. (1998)). The nonlinear SVM has a solution of the form

f(x) = \sum_{i=1}^{l} c_i K(x_i, x) + b.

Let M denote the margin. Then we obtain

\frac{1}{M^2} = \sum_{s, r = 1}^{l} c_s c_r K(x_s, x_r) = \Big\langle \sum_{i} c_i \phi(x_i), \sum_{j} c_j \phi(x_j) \Big\rangle = \|w\|^2.

So for each gene j we compute to what extent the margin changes, using the statistic

S(j) = \Big| \frac{\partial (1/M^2)}{\partial x_j} \Big|,   (5.1)

where x_j is the jth element of the vector of expression values x. We then sort S from the largest to the smallest value and remove the genes corresponding to indices that fall in the bottom 10% of the sorted list S. The SVM is retrained and the procedure is repeated just as in the linear case.
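The linear SVM-RFE loop can be sketched as follows, here using scikit-learn's linear SVC purely as a convenient stand-in for the SVM training step; the function name and the default C are assumptions, while the 10% drop fraction follows the description above.

```python
import numpy as np
from sklearn.svm import SVC


def svm_rfe(X, y, n_keep, drop_frac=0.10, C=1.0):
    """Linear SVM-RFE: repeatedly drop the bottom fraction of surviving genes
    by |w_j| until n_keep genes remain; returns the surviving gene indices."""
    surviving = np.arange(X.shape[1])
    while surviving.size > n_keep:
        w = SVC(kernel="linear", C=C).fit(X[:, surviving], y).coef_.ravel()
        n_drop = max(1, int(drop_frac * surviving.size))
        n_drop = min(n_drop, surviving.size - n_keep)    # do not overshoot n_keep
        smallest = np.argsort(np.abs(w))[:n_drop]        # positions of smallest |w_j|
        surviving = np.delete(surviving, smallest)
    return surviving
```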
5.3 Ratio of Between-group Sum of Squares to Within-group Sum of Squares

The ratio of between-group to within-group sum of squares (BSS/WSS) was first introduced by Dudoit et al. (2002). Intuitively, genes with relatively large variation between classes and relatively small variation within classes are likely candidates for relevant genes; accordingly, BSS/WSS is a univariate gene selection method in which genes with large BSS/WSS ratios are good candidate relevant genes. For a gene j, let D_{ij} denote the expression level of gene j under sample i, \bar D_{kj} the average expression level of gene j over samples in class k \in \{+1, -1\}, and \bar D_{.j} the average expression level of gene j over all samples. The BSS/WSS ratio for gene j is defined as

\frac{BSS(j)}{WSS(j)} = \frac{\sum_{i} \sum_{k} 1(y_i = k)\,(\bar D_{kj} - \bar D_{.j})^2}{\sum_{i} \sum_{k} 1(y_i = k)\,(D_{ij} - \bar D_{kj})^2}.   (5.2)

We compute the BSS/WSS ratio for each of the p genes and order the genes in descending order of the BSS/WSS ratio.

5.4 Clustering Method

Given a series of microarray experiments for a specific tissue under different conditions, we want to find the genes most likely to be differentially expressed under these conditions; in other words, we want to find the genes that best explain the effects of these conditions. This task is also called feature selection, a commonly addressed problem in machine learning, where one has class-labeled data and wants to figure out which features best discriminate among the classes. If the genes are the features describing the cell, the problem is to select the features that have the biggest impact on describing the results and to drop the features with little or no effect. These features can then be used to classify unknown data. Noisy or irrelevant attributes make the classification task more complicated, as they can contain random correlation, so we want to filter out these features.

Typically, informative genes are selected according to a test statistic or a p-value rank from a statistical test such as the t-test. The problem here is that we may end up with many highly correlated genes. Besides being an additional computational burden, this can also skew the results and lead to misclassifications. Additionally, if there is a limit on the number of genes to choose, we might not be able to include all informative genes. Our approach, which appeared in Jaeger et al. (2003), is to first find similar genes, group them, and then select informative genes from these groups to avoid redundancy. In order to increase classification performance, we propose to use more uncorrelated genes instead of just the top genes. By using only the k best-ranking genes according to a test statistic, we would select highly correlated genes. Correlation can be a hint that two genes belong to the same pathway, are co-expressed, or come from the same chromosome; in general, we expect high correlation to have a meaningful biological explanation. If, for example, genes A and B are in the same pathway, they may have similar regulation and therefore similar expression profiles. If gene A has a good test score, it is highly likely that gene B will have one as well. Hence a typical feature selection scheme is likely to include both genes in a classifier, yet the pair of genes provides little additional information beyond either gene alone. Of course we could simply select more genes in order to capture all relevant genes, but more genes not only increase the computational complexity of classification, they can also skew the result if many of the genes come from one pathway. Furthermore, if several pathways are involved in the perturbation but one pathway has the main influence, we will probably select all genes from this pathway, and if we then have a limit on the number of genes, we might end up with genes only from this pathway. If many genes are highly correlated, we could describe this pathway with fewer genes and reach the same precision; additionally, we could replace correlated genes from this pathway with genes from other pathways and possibly increase the prediction accuracy. The same issue may arise when selecting many genes as well, but it is more compelling when we have a limited budget of genes and can select only a few.
Our method for gene selection is therefore to pre-filter the gene set and drop genes that are very similar; for the remaining genes we apply a common test statistic and pull out the highest-ranking genes. One way to find correlated genes is to calculate the correlation between all genes. Here we have two options (a computational sketch of option (2) is given at the end of this section):

(1) Select from the best genes (according to a test statistic) those that have pairwise correlation below a certain threshold.

(2) The kth selected gene is the gene with the best test-statistic p-value among all genes whose correlation with each of the first k - 1 selected genes is below the specified threshold.

Options (1) and (2) are called the "correlation method". Another approach, the "clustering method", is to cluster the genes and then select one or more representative genes from each cluster. The cluster quality is assessed by looking at the average membership probability of its elements, where an element belongs to the cluster to which it has the highest membership probability. A higher cluster quality means less dispersion, and the closer the quality gets to 0, the more scattered the cluster becomes. It is therefore favorable to take more genes from a cluster of bad quality than from a cluster of good quality. The drawback is that a cluster might represent a pathway that is totally unrelated to the discrimination we are looking for, so we mask out and exclude clusters that have a bad average test-statistic p-value.
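A sketch of option (2), the correlation method, under the assumption that a vector of test scores (larger meaning more informative) has already been computed for every gene; the threshold 0.8 and the function name are illustrative only.

```python
import numpy as np


def correlation_filtered_selection(D, scores, k, max_corr=0.8):
    """Option (2): walk down the score-ranked gene list and keep a gene only if
    its absolute correlation with every previously kept gene is below max_corr."""
    order = np.argsort(-scores)
    selected = []
    for j in order:
        if all(abs(np.corrcoef(D[:, j], D[:, s])[0, 1]) < max_corr for s in selected):
            selected.append(j)
        if len(selected) == k:
            break
    return np.array(selected)
```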
Mian, Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. Physiol Genomics, 5299-111, (2001). [6] C. Cortes and V.N. Vapnik. Support vector networks. Machine learning, 20:273-297, (1995). [7] J. Chen and C. Chen, Microarray gene expression. In Encyclopedia of Biopharrnaceutical Statistics, Ed. Chow, S.C., Marcel Dekker, Inc., New York, New York, 599-613, (2003). [8] T. Dietterich and G. Bakiri, Error—correcting output codes: A gen- eral method for improving multiclass inductive learning programs. Proc. of the Ninth National Conference on Artificial Intelligence, AAAI Press, 572- 577, (1991). [9] S. Dudoit, H. Yang, M. Callow and T. Speed, Statistical methods for identi- fying differentially expressed genes in replicated cDNA microarray experiments. Technical report 578, University of California at Berkeley, August (2000). [10] S. Dudoit, J. Fridlyand and T. Speed, Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. JASA, pp. 77-87(11), (2002). [11] M. Eisen, P. Spellman, P. Brown and D. Botstein, Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863—14868, (1998). [12] T. Evgeniou, M. Pontil and T. Poggio, Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1—50, (2000). 57 [13] K. Knight and W. Fu, Asymptotics for lasso—type estimators. The Annals of Statistics, Vol.28, No. 5, 1356—1378, (2000). [14] W. Fu, Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7 (3), 397-416, (2004). [15] E. George and R. McCulloch, Variable selection via Gibbs sampling. J. Am. Stat. Assoc, 88, 881-889, (1993). [16] T. Golub at al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531-536, (1999). [17] G. Getz, E. Levine and E. Domany, Coupled two-way clustering analysis of gene mi- croarray data. Proc Natl Acad Sci USA, 97:12097—12084, (2000). [18] I. Guyon, J. Weston, S. Barnhill and V. Vapnik, Gene selection for cancer classification using support vector machines. Machine Learning, 46:389-422, (2002). [19] A. Hoerl and R. Kennard, Ridge regression biased estimation for nonorthogonal prob- lems. Technometrics, 12, 55-67, (1970). [20] T. Hastie and R. Tibshirani, Classification by pairwise coupling. In Jordan M., Kearnsa M., Solla S., eds. Advances in Neural Information Processing Systems, MIT Press 10, (1998). [21] H. Hishigaki, K. Nakai, T. Ono, A. Tanigami and T. Takagi, Assessment of prediction accuracy of protein function from protein-protein interactiondata. Yeast, 18:523-531, (2001). [22] T. Hastie,S. Rosset,R. Tibshirani and J. Zhu, The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5, 1391-1415, (2004). [23] J. Jaeger, R. Sengupta and W.L. Ruzzo, Improved gene selection for classification of microarrays. Pac Symp Biocomput, (2003). [24] J. Khan, J. Wei, M. Ringner, L. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. Antinescu, C. Peterson and P. Meltzer, Classification and diagnostic predic- tion of cancers using gene expression profiling and artificial neural networks. Nat. M ed, 7, 673—679, (2001). [25] S. Kim, E. Dougherty, J. Barrera, Y. Chen, M. Bittner and J. 'Ifent, Strong feature sets from small samples C. J. Comput. Biol, 9, 127-146, (2002). [26] S. Keerthi, K. Duan, S. Shevade, A. 
Poo, A fast dual algorithm for kernel logistic regression. 19th International Conference on Machine Learning, (2002). [27] S. Li, Markov random field modeling in Computer vision. Springer-Verlag: Tokyo, (1995). 58 [28] Y. Lee and CK. Lee, Classification of multiple cancer types by multicategory support vector machines using gene expression data. Technical Report 1051, Department of Statistics, University of Wisconsin, Madison, WI 53706, (2002). [29] S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golyb, J. Mesirov and T. Poggio, Support vector machine classification of microarray data. Technical Report, AI Memo 1677, MIT, (2000). [30] J. Platt, Sequential minimal optimizationzA fast algorithm for training support vector machines. Microsoft Research, Technical Report MSR-TR-98-14, (1998). [31] J. Platt, Probabilistic outputs for support vector machines and comparisons to regu- larized likelihood methods. Advances in Large Margin Classifiers, MIT Press. [32] C. Peron, S. Jeffrey, M. van de Rijn, C. Rees, M. Eisen, D. Ross, A. Pergamenschikov, C. William, S. Zhu, J. Lee, D. Lashkari, D. Shalon, P. Brown and D. Botstein, Distinc- tive gene expression patterns in human mammary epithelial cells and breast cancers. Proceedings of the National Academy of Science, 96, 9212-9217, (1999). [33] B. Ripley. Pattern recognition and neural networks. Cambridge: Cambridge University Press, (1996). [34] S. Ramaswamy, P. Tamayo, R. Ritkin, S. Mukherjee, C. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J. Mesirov, T. Poggio, W. Gerald, M. Loda, E. Lander andT. Golub, Multicalss cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA, 26:15149-15154, (1998). [35] C. Saunders, A. Gammerman and V. Vovk, Ridge regression learning algorithm in dual variables. 15th International Conference on Machine Learning, ICML (1998). [36] D. Slonim, P. Tamayo, J. Mesirov, T. Golub and E. Lan- der, Class prediction and discovery using gene expression data. In Proc. of the 4th Annual International Conference on Computational Molecular Biology, Universal Academy Press, 263-272, (2000). [37] G. Shieh, C Bai, C. Lee, Identify Breast Cancer Subtypes by Gene Expression Profiles. Journal of Data Science, 2, 165-175, (2004). [38] B. Scholkopf, A.J. Smola, R.C. Williamson and D. Schuurmans, New support vector algorithms. Neural Computation, 12, 1207-1245, (2000). [39] J. Shao and SC Chow, Variable screening in predicting clinical outcome with high- dimensional microarrays. J Multivariate Analysis, (2006) To appear. [40] R. Tibshirani, Regression shrinkage and selection via the lasso. J.R.S.S.B., 58, 267-288, (1996). [41] V. Vapnik, Statistical Learning Theory. John Wiley & Sons, New York, (1998). [42] V. Vapnik, The nature of statistical learning theory. Springer Verlag, New York, (1995). 59 [43] L. van’t Veer, H. Dai, M. van de Vijver, Y. He, A. Hart, M. Mao, H. Peterse, K. van der Kooy, M. Marten, A. Witteveen et al, Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530-6, (2002). [44] M. Wang, J. Yang, G. Liu, Z. Xu, K. Chou, Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition. PubMed, 17(6):509-16, (2004). [45] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio and V. Vapnik, Feature se- lection for SVMs. Advances in Neural Information Processing Systems 13, MIT Press, (2001). [46] M. Xiong,L. J in, W. Li and E. Boerwinkle, Computational methods for gene expression- based tumor classification. 
BioTechniques, 29:1264-1270, (2000). [47] Y.H. Yang, M.J. Buckley, S. Dudoit and T.P. Speed, Comparison of methods for image analysis on cDN A microarray data. Technical report, (2000). [48] Y.H. Yang, S. Dudoit, P. Luu and T.P. Speed, Normalization of cDNA microarray data. SPIE Bios, (2001). [49] K.Y. Yeung, C. Fraley, A. Murua, A.E. Raftery and W.L. Ruzzo, Model-based clus- tering and data transformation for gene expression data. Bioinformatics, 17:977-987, (2001; [50] K. Yeung, Bayesian Model AveragingzDevelopment of an improved multi-class,gene selection and classification tool for microarray data. Bioinformatics, 21, 2394-2402, '(20051 [51] X. Zheng and W. Loh, A consistent variable selection criterion for linear models with high-dimensional covariates. Statistica Sinica, 7 (1997) 311-325. [52] J. Zhu. S. Rosset and T. Hastie, Margin maximizing loss funtions. Neural Information Processing Systems, 16, (2003). [53] J. Zhu, S. Rosset, T. Hastie and R. Tibshirani, 1-norm support vector machines. Neural Information Processing Systems 16, (2004). [54] J. Zhu and T. Hastie, Classification of gene microarrays by penalized logistic regression. Biostatistics, 5(3), 427-443, (2004). 60 [[[jigligijjj[in