"Em LIBRARY 2 00'”; Michigan §tate University This is to certify that the dissertation entitled DEPENDENCY AND PURITY IN LARGE-SCALE STATISTICAL SIGNIFICANCE TESTING: A DNA MICROARRAY PERSPECTIVE presented by Keyur Hemantkumar Desai has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering .11, Profess- ’s Signat" 09/24/2008 Date MSU is an Affirmative Action/Equal Opportunity Employer —.-.-.—a-.-o----I-u-p-u-o---a-o-n-u-u---o--o-o-I-u-o-u-u-u-u-n-o—----u. PLACE IN RETURN BOX to remove this checkout from your record. TO AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE DATE DUE 5/08 K:/Prolecc&Pres/CIRC/DateDue indd DEPENDENCY AND PURITY IN LARGE-SCALE STATISTICAL SIGNIFICANCE TESTING: A DNA MICROARRAY PERSPECTIVE By Keyur Hemantkumar Desai A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Electrical Engineering 2008 ABSTRACT DEPENDENCY AND PURITY IN LARGE-SCALE STATISTICAL SIGNIFICANCE TESTING: A DNA MICROARRAY PERSPECTIVE By Keyur Hemantkumar Desai Statistical methods for detecting differential gene expression (DGE) in DNA mi- croarray experiments are discussed. A comprehensive definition of DGE refers to “statistical dependence” between gene expression levels and biological conditions of interest, such as differing environments, treatments, time points, phenotypes, or clin- ical outcomes. The ability to detect genuine DGEs has become crucial in the effort to understand and cure difficult diseases like cancer, diabetes, and Alzheimer’s dis- ease. The statistical nature of microarray data necessitates the use of “large-scale significance testing” to detect DGE. Large-scale testing is qualitatively different than the usual one-at-a-time testing: implied information from the “other” cases can force its way into the decision rule. Two prominent attributes of gene expression data, “inter-gene dependency” and “purity,” complicate the situation even further. The for- mer refers to the statistical patterns of dependencies among gene expression levels, and the latter stems from the fact that a great majority of genes show no change in expression even under changing conditions. Inter-gene dependency is perceived commonly as a harmful force cluttering the decision-making. Contrastingly, this research takes a more sympathetic view of inter- gene dependency when seen together with purity. We argue that their combination gives rise to not only accurate but also more powerful statistical reasoning. Our effort to combine the two has culminated in two contributions, each of which has both applied and theoretical implications. The first is a method for a better assessment of the number of false positives in the presence of heavy inter-gene dependency and extreme sampling errors due to exceedingly small sample sizes. The second is a method for “exploiting” inter-gene dependency to yield substantially more powerful DGE detection procedures. The empirical evidence from real and simulated test cases suggests the usefulness of our ideas. ACKNOWLEDGEMENT This research was supported in part by a Michigan State University IRGP grant, a graduate fellowship from the MSU Quantitative Biology Program, and a research fellowship from the MSU Department of Electrical and Computer Engineering. The support of the MSU High Performance Computing Center is appreciated. 
I am grateful to Professors Christina Chan, Hayder Radha, Justin McCormick, and Jack Deller for serving on my Ph.D. committee. Their feedback and advice have been a valuable contribution to this work. Thanks to Professor Jack Deller for being a nearly perfect advisor; I owe my successful transition from engineering to genomics to his guidance, encouragement, support, and constant assurance. Thanks to Professor Justin McCormick for his generosity with his time in explaining the intricacies of cancer research to a naive engineering student. Thanks to Professor Hayder Radha for his continual interest in my progress.

I am grateful to my friends Shirish, Sharath, Amit, Akshay, Snehal, Darshak, Amit Gore, Koushik, Kiran, and Abhishek for their friendship, support, and motivation. I am indebted to Bina Desai for her interest in my early education. Many thanks to Professor S.D. Mahanti for his support and encouragement during my early days in graduate school.

This dissertation is dedicated to my mother, father, brother, and grandparents. Their love, support, and sacrifices have carried me to where I am today.

TABLE OF CONTENTS

List of Tables
List of Figures

1 Introduction and Background
  1.1 Foundations of Modern Biology
  1.2 Statistical Reasoning in Differential Gene Expression Detection
  1.3 Variability in Gene Expression Data
  1.4 Testing for Statistical Significance
  1.5 Large-scale Inferences
    1.5.1 The Scope
    1.5.2 The Goals
  1.6 An Outline
  A1: DNA Microarrays

2 Large-Scale Significance Testing
  2.1 A Statistical Formulation for DGE Detection
  2.2 Role of the Number of False Discoveries in Measuring Quality
  2.3 Literature Review and Contributions of This Research
    2.3.1 Control
    2.3.2 Power
  2.4 Dependency in Gene Expression Data
  2.5 Purity, Identifiability and Zero Assumption

3 Estimating the Number of False Discoveries
  3.1 Overview
  3.2 The Proposed Method
    3.2.1 Estimating the Moments
    3.2.2 Estimating Correlation Densities
    3.2.3 Fitting the Maximum Entropy Distribution
  3.3 Summary of the Method
  3.4 Test Cases
    3.4.1 Real Data
    3.4.2 Simulated Data
  3.5 Discussion

4 Exploiting Correlation to Improve Gene-ranking
  4.1 Overview
  4.2 The Proposed Method
    4.2.1 Choosing a Distance Metric
    4.2.2 An Intuitive Interpretation
    4.2.3 Estimating the Covariance
    4.2.4 The Final Equation
  4.3 Implementation Details
    4.3.1 Numerical Stability and Computational Complexity
    4.3.2 Algorithm for Two-State Studies
    4.3.3 Algorithm for Multi-State and Continuous-State Studies
  4.4 Test Cases
    4.4.1 Case I: Real Data With Induced Differences
    4.4.2 Case II: Simulated Data
  4.5 Discussion

5 Concluding Remarks
  5.1 Summary and Concluding Remarks
  5.2 Future Work

Bibliography

LIST OF TABLES

1.1 An example of two-state comparative experiments. The HIV study of van't Wout et al. (2003).

4.1 tEllipsoid in action with the top 100 T*_i's. Corresponding t_i's and their ranks are also shown. tEllipsoid = 22 NoFPs (number of false positives); raw t-statistics = 68 NoFPs. Truly null genes are printed in bold-sans typeface.

LIST OF FIGURES

1.1 An example of an approximately 40,000-probe spotted oligonucleotide microarray, with an enlarged inset to show detail.

3.1 Discrete support D_t (square markers) versus continuous support S_t (solid boundary). S_t is standardized to improve numerical stability.

3.2 Effect of sampling fluctuations on the empirical correlation density. (a) BRCA example; (b) HIV example. For each subfigure: the left panel is the histogram of sample correlations after applying the Fisher transformation (3.10), with a normal distribution (heavy curve) fit to it; the right panel is the histogram of de-noised correlations, with a modified beta distribution (heavy curve) fit to it. This summarizes the cumulative effect of the m(m-1)/2 gene-gene correlations in a single parameter β̂.

3.3 BRCA example: estimated (V, C) moments and a maxent distribution fit to them. The third-moment estimate of P(V, C) (left) exhibits finer details than the second-moment estimate (right).

3.4 The distributions of the number of false discoveries; panel (a) the BRCA study, panel (b) the HIV study. To show the effect of skewness corrections, the third-moment distribution (solid curve) is compared to its second-moment counterpart (dashed curve). For BRCA the second-moment mean estimate is 79, compared to 104 for the third moment; for HIV these are 19 and 8. The BRCA panel also shows 50% (solid line) and 75% (dotted line) confidence intervals.

3.5 Simulation experiments comparing conditional estimates: third-moment estimates (+ marker) and second-moment estimates (o marker). This figure corresponds to the left-sided tail-area with θ₁ = -2.0. Substantial row-wise correlation is present. The abscissa is the realized count while the ordinate is the estimated count. For the significance of the different (k, k₀, θ₀)'s refer to the main text. Third-moment skewness corrections enhance the estimation accuracy.

3.6 Left-sided tail-area with θ₁ = -2.5; otherwise the caption of Figure 3.5 applies.

4.1 FDRs for Case 1. The number of truly differential genes is 300. Panel (a) r=300; panel (b) r=100. "Exploiting correlation" considerably enhances the statistical power. Square marker = tEllipsoid. Lines: solid = raw t-statistic; dotted = SAM; dashed = EDGE.

4.2 FDRs for Case 2a. The number of truly differential genes is 1200. Panel (a) r=1200; panel (b) r=300. "Exploiting correlation" appreciably enhances the statistical power. Square marker = tEllipsoid. Lines: solid = raw t-statistic; dotted = SAM; dashed = EDGE.

4.3 FDRs for Case 2b. Panel (a) r=1200; panel (b) r=300. The sample size is smaller than that in Cases 1 and 2a, and yet "exploiting correlation" has apparent benefits. Square marker = tEllipsoid. Lines: solid = raw t-statistic; dotted = SAM; dashed = EDGE.
4.4 FDRs for simulated data. Panel (a) r=100; panel (b) r=50. (Small) sample sizes: n₁ = 10, n₂ = 10. Yet "exploiting correlation" considerably enhances the statistical power. Square marker = tEllipsoid. Lines: solid = raw t-statistic; dotted = SAM; dashed = EDGE.

Chapter 1

Introduction and Background

The objective of this research was to improve the existing statistical techniques of differential gene analysis. The literature identifies "dependency among gene expressions" as the chief obstacle in drawing meaningful conclusions from microarray data and "purity" as its vast untapped resource. This research focuses on understanding, correcting, and exploiting the effects of dependency by combining it with purity in a general setting. It has culminated in two major findings pertinent to the general approach for detecting "statistical linear dependence" between gene expression levels and measured biological states. The general approach consists of two main steps: (i) ranking the genes and (ii) evaluating the number of false positives for a significance cut-off. Our first finding establishes the importance of third-moment skewness corrections in estimating the number of false positives. The second finding provides a general technique for revising a given gene-ranking for better statistical power. A good starting point is to discuss the role of statistical reasoning in interpreting gene expression data.

Organization of the chapter. Section 1.1 reviews the foundations of modern biology. Section 1.2 begins with a "bird's eye view" of the problem of differential gene detection and the potential implications of the present work. An example of two-state microarray data and a listing of typical sources of within-state gene expression variation are given in Section 1.3. A contrast between one-at-a-time and large-scale significance testing is drawn in Section 1.4. In Section 1.5, we discuss the possible scope and the key goals of large-scale inferences and decision-making germane to expression profiling data. An outline of the entire dissertation appears in Section 1.6. A short overview of DNA microarray technology concludes the chapter.

1.1 Foundations of Modern Biology

Before we begin, it is helpful to recapitulate the foundations of modern biology. Refer to Watson et al. (2004) for a more complete discussion. Hunter (1993) provides a brief survey of biology for researchers trained in engineering or computer science.

• Life changes and develops through evolution, and all known life forms have a common origin.

• The cell is the fundamental unit of life. Cells arise from other cells through cell division, and in multicellular organisms, every cell in the organism's body is produced from a single cell in a fertilized egg and hence contains the same genotype.

• Biological form and function are passed on to the next generation by genes, which are the primary units of inheritance and are made of DNA. All information flows from the genotype, the genetic makeup of the organism, to the phenotype, the observable physical or biochemical characteristics of the organism.

• The total complement of genes in an organism or cell is known as its genome. When a gene is active, the DNA code is transcribed into an RNA copy of the gene's information.

• Gene expression refers to the transcription of DNA into messenger RNA (mRNA) by RNA polymerase. The mRNA is then translated into protein by the ribosome. In gene expression analysis, expression level refers to the amount of mRNA detected in a sample.
• Phenotypic differences can ultimately be understood in terms of the regulation of gene expression.

1.2 Statistical Reasoning in Differential Gene Expression Detection

Gene expression profiling experiments, which measure the expression levels of thousands of genes at once to create a global picture of cellular function, have become a vital component of modern biomedical research. They let us investigate complex human diseases (such as cancer, diabetes, and Alzheimer's disease) in unprecedented genetic detail. However, in order to draw meaningful conclusions from the wealth of gene expression data, we must rely on statistical reasoning for two main reasons. First, the measurements are based on mRNA molecules drawn from a cell population. Second, there are several uncontrollable sources of noise and uncertainty. Consequently, the promise of profiling (better understanding of regulatory mechanisms, more effective therapeutic targets, accurate (sub)categorization of diseases, superior clinical diagnosis and prognosis, etc.) relies heavily on pertinent statistical methodologies (Lander, 1999; Petricoin et al., 2002; Singh et al., 2002; Jones and Baylin, 2002). Surprisingly, despite major research interest in statistical methods for gene expression data analysis, several key issues have remained unresolved due to their rather difficult and unfamiliar nature (Frantz, 2005).

Statistical methods for detecting differential gene expression (DGE) between two or more biological samples have attracted considerable attention. The comprehensive definition of DGE refers to "statistical dependence" between gene expression levels and biological conditions of interest, such as differing environments, treatments, time points, phenotypes, or clinical outcomes. In practice, however, "linear dependence," also known as "second moment dependence," is pursued more frequently because its detection is readily subject to a statistically rigorous approach, and yet its existence often suggests a phenomenon of scientific interest and clinical utility.

For convenience, the biological conditions are referred to as "states." The term "two-state" refers to binary-valued states, such as healthy versus cancer; "multi-state" to multi-valued states, such as different categories of breast cancer; and "continuous-state" to continuous-valued states, such as the age of the subject or an insulin dosage. The need for detecting DGE between two states is very common, and methods customized for the two-state situation are the present focus. Note that even though the common goal is to detect linear dependence, the methods may differ in some mathematical and implementation aspects depending on the situation (two-state, multi-state, or continuous-state). In fact, detecting second moment dependence in a two-state study can also be interpreted in terms of detecting a statistical mean difference between the two states and, in turn, dealt with more efficiently through the standard (unpaired) t-statistic. Indeed, the t-test and simple linear regression are mathematically equivalent (refer to Section 2.1), implying that the present ideas for two-state methods are also applicable to multi-state or continuous-state methods, perhaps with little or no modification.
Additionally, some of the outcomes of this research also seem translatable to DGE notions other than "central tendency," for example fold change (Biotechnology, 2006) or tail-rank statistics (Coombes et al., 2008; Zheng and Pepe, 2007). These extensions will be the subjects of future research.

The overarching theme of this dissertation is to develop methods for combining inter-gene dependency (Section 2.4) and purity (Section 2.5) to improve statistical decision-making. The former refers to (statistical) dependencies among gene expression levels, and the latter to an abstract notion about the test statistics of differentially expressed (DE) genes not falling in certain intervals. Previous understanding perceived inter-gene dependency as more of a harmful force cluttering the decision-making, but newer understanding suggests that, when combined with purity, it can yield not only accurate but also more powerful statistical inference (Benjamini, 2008; Morris, 2008; Cai, 2008; Efron, 2008; Klebanov and Yakovlev, 2007). The present scope covers methods which use only the measured expression and "assigned" state labels; however, the possibility of extending these ideas to integrate legitimate "prior" knowledge like "gene grouping," as in enrichment analysis (Subramanian et al., 2005), has also shown potential.

Note. To avoid spurious mix-ups, the term "dependence" is reserved for statistical dependence between gene expression levels and biological conditions, whereas the term "dependency" is reserved for statistical dependence among gene expression levels. Throughout this dissertation, the words null and non-differential are used interchangeably; the same holds true for the words non-null and differential.

1.3 Variability in Gene Expression Data

Table 1.1 shows typical data generated by a two-state study. A distinctive characteristic of these data sets is a huge number of cases, a "large m," and very few measurements per case, a "small n." This particular data set is from van't Wout et al. (2003), who measured the mRNA levels of 7680 genes on 8 separate microarrays, 1-4 assigned to healthy cells and 5-8 to HIV-infected cells. The challenge for the statistical methods is to discover the true signal, for example the true mean difference, in the presence of substantial within-state (per-gene) expression variation. There are at least four factors contributing to these within-state gene expression variations:

1. Cross-hybridization noise, due to some mRNA molecules cross-hybridizing to probes that are supposed to detect another mRNA.

2. Measurement noise associated with the microarray technology and related biochemical processing.
3. Biological variability: the mRNA molecules are obtained from a cell population, and the individual cells can have differing gene expression levels due to a variety of influences, including (i) differences in the cells' micro-environments (e.g., nutrient and temperature gradients), (ii) growth-phase differences between cells in the culture, (iii) phase variations, and (iv) periods of rapid change in gene expression, plus multiple additional stochastic effects that cannot be controlled (Hatfield et al., 2003; Baldi and Brunak, 2001; Watson et al., 2004; Alberts et al., 2002).

4. Confounding variables, such as age, sex, race, living style, genotype, etc., in studies with differing subjects. These can introduce large-scale variation (Leek and Storey, 2007).

Table 1.1: An example of two-state comparative experiments. The HIV study of van't Wout et al. (2003).

            healthy   healthy   healthy   healthy   HIV       HIV       HIV       HIV
gene 1      9.609     7.323     5.328     13.63     8.757     21.72     0.4873    6.364
gene 2      2642      5034      537.2     766.8     1123      961.7     601.7     1016
...
gene 1000   42.03     13.94     60.91     70.26     84.03     25.91     103.9     99.96
gene 1001   135.1     115.2     111.8     134.1     151.7     145.1     135.9     166.5
...
gene 4000   1455      513.1     1159      438.1     1806      647.4     1921      759.3
gene 4001   555.2     909.3     216.8     902.8     775.9     1510      492.7     1981
...
gene 7679   15.31     41.25     16.82     14.23     45.49     55.23     10.16     32.86
gene 7680   47.73     109.7     31.99     151.7     63.14     99.14     33.04     44.3

1.4 Testing for Statistical Significance

Due to the unavoidable variability, "statistical" decision-making is necessary to detect DGE. The theory and methods of statistical hypothesis testing deal specifically with such situations (Lehmann and Romano, 2006), yet testing each gene separately can be very "agonizing." From the outset, the validity of a reference null distribution (obtained either from theory or re-sampling; see Section 2.1) may pose a concern. Next, a smaller p-value (defined in Section 2.1) does contradict the null hypothesis of "no DGE," but its role as a comprehensive measure of the inherent ambiguity is often under question (Ioannidis, 2005). Both the celebrated Neyman-Pearson lemma and Bayesian decision theory are of limited use, as they require specified knowledge of the distributions under the alternative hypothesis; moreover, the latter also requires prior distributions of hypothesis occurrence. Maybe the problem is in itself ill-posed. Berger (2003) and the accompanying "follow-up comments" highlight different points of view on this matter. Fortunately, these conceptual difficulties go away when considering several thousand genes in parallel. In this case, at least in principle, there is enough data to infer the null, the possible alternatives, and their relative proportions.

The statistical framework known as large-scale significance testing (LSST) is a natural extension of one-at-a-time significance testing. The theory and methods of LSST deal with thousands of simultaneous tests (Lehmann and Romano, 2006); in other words, this framework can deal with all rows of Table 1.1 simultaneously. However, earlier approaches in LSST were limited in their treatment of dependency among tests. In practice, most microarray data exhibit substantial inter-gene dependency among expression variations (Owen, 2005), which eventually manifests itself in dependency among tests.
A number of numerical studies point out that LSST approaches which do not properly entertain inter-gene dependency can lead to highly inaccurate reporting of the underlying facts (Qiu et al., 2005b; Kim and van de Wiel, 2008).

Having decided that a proper treatment of dependency is a worthwhile research endeavor, the other alternative we explored is cluster analysis, e.g., Eisen et al. (1998). When tailored to report two clusters, clustering can in essence deal with dependency through an appropriate similarity metric. The drawback is that clustering neglects valuable information like the assigned state values. Also, the overall framework itself is not very conducive to deducing precise statistical guarantees. Therefore, this work focuses on extending existing LSST methods to include dependency.

Specifically, we focus on extending the relevant LSST methods to include commonly observed patterns of "inter-gene correlation" in expression variations. Recall that correlation is a "scale-free" measure of statistical linear dependence, also known as second moment or second-order dependence. The present emphasis is on correcting and exploiting the effects of inter-gene correlation by combining it with purity, a statistical assumption stemming from the fact that in many comparative studies the activity of a great majority of genes remains unchanged with respect to the treatment of interest. This can also be viewed as drawing second-order conditional inferences based on cases that are fundamentally less ambiguous than the others. The scope and the goals of large-scale inferences germane to expression profiling data are discussed next.

1.5 Large-scale Inferences

1.5.1 The Scope

• The tests are related in the sense that the noise and uncertainty associated with them are of a similar statistical nature, perhaps due to common origins.

• We assume that, for convenience, the statistical reasoning may be done around per-gene univariate summary statistics, T₁, ..., T_m, and related null hypotheses, H₁, ..., H_m; however, the original measurements are available, and hence inferring the statistical structure among the T_i's is possible.

• By convention, the rejection (or significance) regions to be considered are "tail-areas." Recall that the term rejection region refers to an interval in the space of test statistics wherein we are interested in rejecting the null hypotheses. Both "one-sided" and "two-sided" tail-areas are allowed. Working with tail-areas naturally yields methods which are sequential in nature and begin by ordering the test statistics; for example, see the Benjamini and Hochberg method described in Section 2.3. The rationale behind tail-areas is that the evidence for a gene being null decreases monotonically as we move farther from the intervals wherein most of the probability mass of the null distribution is concentrated. Methods employing rejection regions other than tail-areas have been observed to yield results with certain interpretational difficulties (Storey, 2002b; Rice and Spiegelhalter, 2008).

• Moreover, the number of genes, denoted by m, may run into several thousands; for current comparative microarray experiments, m typically ranges from 5000 to 25000. Large-scale methods and algorithms should anticipate and accommodate the computational constraints and statistical convergence issues that are typical of this range.
• The methods are customized in the sense that they try to capture the scientific context affecting the data; the present emphasis is on incorporating and exploiting typical dependency structures by combining them with purity.

1.5.2 The Goals

Broadly speaking, the goal is to reliably identify as many DE genes as possible. To elaborate on more specific statistical goals, the following terminology is helpful: when a null hypothesis H_i is rejected, in other words when a gene is declared DE, then a "discovery" is made; when a gene declared DE is indeed non-DE, then a "false discovery" is made. A broad statistical goal is to report as many discoveries as possible while not exceeding a specified "false discovery proportion." Sometimes a "ball park" figure for the number of discoveries is specified, and then the goal is to minimize the false discovery proportion as much as possible. When recast in the language of statistical inference, the former amounts to an accurate estimation of the number of false discoveries for a rejection region, and the latter to revising the test statistics to increase statistical power.

1.6 An Outline

This dissertation is outlined as follows. Chapter 1 introduces the "problem." Chapter 2 presents a mathematical formulation, reviews the relevant literature, discusses inter-gene dependency and purity, and underlines our main contributions. The issue of estimating the number of false discoveries is discussed in Chapter 3, whereas Chapter 4 presents a novel gene re-ranking method. Conclusions and future work appear in Chapter 5. At the beginning of each chapter, a separate paragraph called "organization of the chapter" is provided to guide the reader through that particular chapter.

Supplement: DNA Microarrays

A DNA microarray uses the selective nature of DNA-DNA or DNA-RNA hybridization, the process in which two complementary strands of DNA (sometimes DNA and a complementary strand of RNA) bind. An array is a collection of microscopic DNA segments attached to a solid surface, such as glass, plastic, or a silicon chip. The affixed DNA segments are known as probes, thousands of which can be placed at known locations on a single array (see Fig. 1.1). In a typical experiment, the microarray is washed with a sample containing fluorescently dyed mRNA transcripts. The probe and target sequences hybridize according to their complementary nature, and the abundance of target molecules can then be identified based on the resulting fluorescence patterns. Typical numbers reported by a microarray appear in Table 1.1. Based on the way DNA microarrays are fabricated, two distinct microarray platforms have evolved, each with its own pros and cons.

[Figure 1.1: An example of an approximately 40,000-probe spotted oligonucleotide microarray, with an enlarged inset to show detail.]

Spotted microarrays or two-color microarrays. These arrays employ clones or oligonucleotides that are spotted onto glass slides, and two distinctly labeled complementary DNA (cDNA) samples are hybridized together on a single array. The advantage is that the spotting process is relatively inexpensive and it is easier to make custom arrays (Schena et al., 1995). This flexibility, however, comes at a price: the spotting process is inherently variable. This limitation is circumvented by reporting relative mRNA expression levels instead of absolute expression levels (Woo et al., 2004).
The relative levels are obtained by co-hybridizing a two-color array with two RNA samples, namely, experimental and reference. Here, the reference should not be confused with the control (for example, while comparing normal cells and their cancer mutants, the RNA sample from the normal cells is termed the control and that from the cancer cells the treatment).

Affymetrix microarrays or one-color microarrays. These arrays employ short (25-mer) oligonucleotide probes deposited on a silicon substrate through high-precision photolithography (Lockhart et al., 1996). This process is similar to the one used in the manufacturing of electronic microchips and is remarkably accurate in producing nearly identical arrays. However, the flexibility of customization is lost, since for every custom array an expensive mask must be created. One-color microarrays report absolute mRNA expression levels.

Normalization. Measurements reported by two separate microarrays, even if they belong to a single study, are seldom comparable in their original form. The major reasons are unequal quantities of starting RNA, differences in labeling or detection efficiencies between the fluorescent dyes used, and systematic biases in the measured expression levels (Quackenbush, 2002). A transformation, known as normalization, is necessary. Robust normalization methods for both one-channel and two-channel microarray data are available.

Cross-hybridization. Even after perfect normalization, the measured mRNA levels can still suffer from experimental noise due to cross-hybridization: some mRNAs may cross-hybridize to probes in the array that are supposed to detect another mRNA. This problem has been alleviated somewhat by a systematic selection of expressed sequence tags (ESTs) with high specificity-selectivity (Li and Stormo, 2001). Both microarray platforms have evolved to provide very similar data quality (Patterson et al., 2006).

Chapter 2

Large-Scale Significance Testing

The task of differential gene expression (DGE) detection is often recast as large-scale significance testing (LSST). Inter-gene dependency among gene expression variations weakens the LSST formulation of DGE detection by introducing substantial dependency among the test statistics; any other formulation is equally prone to the effects of dependency. However, within the framework of LSST itself, dependency can be understood, corrected, and exploited by combining it with purity. This chapter lays down the necessary statistical groundwork to do so. A direct treatment of generic any-order dependency is impractical, but treating second-order dependency is both useful and mathematically tractable. A scale-free version of second-order dependency is correlation, and how to include inter-gene correlation in the LSST formulation is discussed.

Organization of the chapter. Section 2.1 presents a statistical formulation of the "problem." A mapping from inter-gene correlation to inter-test correlation is given by Eqn. (2.3). Error measures, the false discovery proportion and the false discovery rate, appear in Section 2.2. Section 2.3 reviews the relevant literature and states the principal contributions of this work. A discussion of inter-gene dependency is provided in Section 2.4. Purity and two related concepts, "identifiability" and the "zero assumption," are discussed in Section 2.5.

2.1 A Statistical Formulation for DGE Detection

There are m + 1 random variables involved in this statistical reasoning: L and (X₁, ..., X_m).
Here, L refers to the "state" defined in Section 1.2, and X_i to the expression level of gene i. Their (unknown) underlying joint probability distribution is F(L, X₁, ..., X_m). A microarray experiment obtains n independent and identically distributed samples from this distribution. A "random" data set is written as (L; X₁; ...; X_m), where X_i = (X_{i1}, ..., X_{in}) denotes a random sample for gene i and L = (L₁, ..., L_n) the corresponding states. The subscript i in X_{ij} indexes the genes and j the microarrays. To increase normality, the raw X_{ij}'s are converted to a logarithmic scale. The "realized" data set is written as (l; x₁; ...; x_m), with l = (l₁, ..., l_n) denoting the "assigned" states and x_i = (x_{i1}, ..., x_{in}) the measured expression for gene i.

The overarching goal is to test whether the random variables L and X_i, i = 1, ..., m, are dependent. The existence of statistical dependence between L and X_i suggests that gene i may be crucial to the phenotypic distinctions being studied. A rigorous treatment of general statistical dependence measures like mutual information (Cover and Thomas, 1991) is often impractical. Instead, most approaches test linear (or second moment) statistical dependence. A univariate statistic measuring linear dependence between L and X_i can be obtained through simple linear regression between X_i and L.

LSST for DGE detection involves m univariate per-gene summary statistics, T₁, ..., T_m, with T_i corresponding to gene i. These T_i's are assessed using H₁, ..., H_m, the null hypotheses of independence between L and X_i, i = 1, ..., m. A T_i is gauged with respect to its null distribution p(T_i | H_i): the fact that it is "typical" under H_i suggests that L and X_i are uncorrelated. Note, however, that other forms of statistical dependence between L and X_i may go undetected. Philosophically, the bottleneck is caused by the fact that statistical independence has a concrete mathematical interpretation, but statistical dependence is just an arbitrary functional relation between random variables.

The two-state situation can be handled efficiently through the standard unpaired t-statistic:

    T_i = (X̄_{i;2} − X̄_{i;1}) / S_i,    (2.1)

where X̄_{i;k} is the mean of gene i in state k and S_i is the pooled within-state standard deviation of gene i. Mathematically, the t-statistic is equivalent to performing simple linear regression, and hence, in essence, it tests linear dependence. The usual interpretation, i.e., the t-statistic detecting a mean difference and H_i referring to the true mean difference being zero, is more obvious and well known. If the additional assumption of normality of X_i is made, then p(T_i | H_i) can be replaced by the Student's t-distribution. Doing so is not absolutely necessary: permutation calculations can estimate a putative null cumulative distribution function (cdf) common to all genes, say G₀, which may represent the basic facts more accurately (Efron, 2007a).
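To make Eqn. (2.1) and the permutation-estimated G₀ concrete, here is a minimal Python sketch. The pooled-standard-error convention and the number of permutations are our illustrative choices, not specifics from the text.

```python
import numpy as np

def t_statistics(x, labels):
    """Unpaired t-statistics in the spirit of Eqn. (2.1), one per gene (row of x).

    x      : (m, n) array of log expression values
    labels : length-n array of 1's and 2's giving the state of each array
    """
    labels = np.asarray(labels)
    g1, g2 = x[:, labels == 1], x[:, labels == 2]
    n1, n2 = g1.shape[1], g2.shape[1]
    # pooled within-state variance, scaled into a standard error for the mean difference
    sp2 = ((n1 - 1) * g1.var(axis=1, ddof=1) +
           (n2 - 1) * g2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    s = np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return (g2.mean(axis=1) - g1.mean(axis=1)) / s

def permutation_null_cdf(x, labels, n_perm=200, seed=0):
    """Pool t-statistics over label permutations to estimate a putative
    null cdf G0 common to all genes, as described in Section 2.1."""
    rng = np.random.default_rng(seed)
    null = np.concatenate([t_statistics(x, rng.permutation(labels))
                           for _ in range(n_perm)])
    null.sort()
    # G0(t) = fraction of permutation statistics <= t (vectorized in t)
    return lambda t: np.searchsorted(null, t, side="right") / null.size
```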
Dependency among "truly null" genes has profound implications for the "significance cut-off," i.e., the threshold beyond which, if a T_i occurs, its corresponding gene is declared non-null. Dependency among null and/or non-null genes has profound implications for the possibility of mapping (T₁, ..., T_m) to (T₁*, ..., T_m*) for better statistical power. Again, a general treatment of dependency among the T_i's is impractical, but correlation among the T_i's is tractable, both mathematically and computationally, and will be pursued.

When the T_i's are obtained through simple linear regression, correlation among the T_i's is easy to track. Lemma 1 of Owen (2005) states that for T_i's measuring the "linear correlation" between L and X_i, if the expression of gene i and gene i' is independent of state L, then

    Cor(T_i, T_{i'} | x_i, x_{i'}) = ρ̂_{ii'}.    (2.2)

In Eqn. (2.2), ρ̂_{ii'} is the "Pearson product-moment correlation coefficient" between the samples x_i and x_{i'}. It can be verified through computer simulations that the same fact applies to t-statistics obtained through simple linear regression. The discussion surrounding Eqn. (2.2) suggests the following: when the expression of gene i and gene i' is independent of state L, and the statistics T_i and T_{i'} are due to simple linear regression, then

    Cor(T_i, T_{i'}) = ρ_{ii'},    (2.3)

where ρ_{ii'} is the "theoretical" correlation between gene i and gene i'. For exceedingly small n, the sampling error in ρ̂_{ii'} is a concern. This concern, when making inferences for millions of inter-gene correlation coefficients simultaneously, can be circumvented through clever data processing (see Section 3.2.2). Owen (2005, Section 4) and Efron (2007a, Section 2) both adopt a similar point of view.

For analytical convenience, the T_i's can be converted to z-values:

    Z_i = Φ⁻¹{G₀(T_i)},  i = 1, ..., m,    (2.4)

where G₀ is the putative null cdf mentioned above and Φ⁻¹ is the inverse cdf of N(0,1). For "truly null" cases we expect Z_i ~ N(0,1). Note that Eqn. (2.4) is a nonlinear, order-preserving monotone transformation. The relationship between Cor(T_i, T_{i'}) and Cor(Z_i, Z_{i'}) must be calibrated; in practice, the approximation Cor(Z_i, Z_{i'}) ≈ Cor(T_i, T_{i'}) is seen to work well (Efron, 2007a). Methods for direct estimation of Cor(Z_i, Z_{i'}) are discussed by Zou and Hall (2002).

Some DGE methods convert the test statistics T₁, ..., T_m to p-values, P₁, ..., P_m, through p(T₁|H₁), ..., p(T_m|H_m). For example, a one-sided p-value corresponding to T_i = t_i is obtained as p_i = min[Pr(T_i < t_i | H_i), Pr(T_i > t_i | H_i)]; a two-sided p-value is obtained as p_i = Pr(|T_i| ≥ |t_i| | H_i).
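The z-value and p-value maps are equally short. A sketch following Eqn. (2.4), where G₀ may be any vectorized null cdf (e.g., the permutation estimate sketched earlier) and the clipping constant is an implementation detail of ours:

```python
import numpy as np
from scipy.stats import norm

def z_values(t, G0, eps=1e-6):
    """Z_i = Phi^{-1}(G0(T_i)), Eqn. (2.4).  Clipping away from 0 and 1
    keeps Phi^{-1} finite when a statistic falls outside the null range."""
    u = np.clip(G0(np.asarray(t)), eps, 1.0 - eps)
    return norm.ppf(u)

def two_sided_p(t, G0):
    """p_i = Pr(|T_i| >= |t_i| | H_i), evaluated under the cdf G0."""
    t = np.abs(np.asarray(t))
    return (1.0 - G0(t)) + G0(-t)
```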
The “dot” in EF(V) emphasizes the fact that it is an “all null” expectation, which may be contrasted with EP(V), the true (unknown) expectation for that 1". These two expectations start deviating appreciably as the proportion of null genes, 7r0, reduces. If the putative null cdf G0 is obtained through permutations, then Eqn.(2.5) is a part of the Yekutieli- Benjamini F DP Estimator described in (Qiu and Yakovlev, 2006, Section 3.2). Benjamini and Hochberg showed that their procedure, which estimates the realized v/r through EF(V)/r with r referring to the realized value of R, can “bound” the average F DP, i.e., E(V/ R), at an arbitrary prescribed level. The average FDP is now widely known as the False Discovery Rate (FDR). The bounding of FDR is termed as “control of FDR” and discussed in the next section. 20 2.3 Literature Review and Contributions of This Research 2.3.1 Control. “Control of an error measure” is a concept that developed in the Frequentist-framework of LSST. Here, the error measures are compound involving a hypothetical data en- semble which can be generated by the experiment. The goal is to develop methods that can keep the error measure below a prescribed “level.” Among all the methods that can do so, the ones with more average power (the proportion of true discoveries which are reported) are sought after. Controlling FDR: Independent tests. The sequential p-value algorithm of Ben- jamini and Hochberg (1995), which controls the FDR at level a, has following steps: 1. Let H1...Hm be the null hypotheses and P1...Pm their corresponding p- values 2. Order p-values in increasing order and denote them by P(l) . . . P(m) 3. Find R = argmax (P(k)m) 3 (ka), where k is an integer ngSm 4. Reject all H“) for i = 1,. . .,R. This algorithm assures that the reported discoveries have an “estimated F DP” always below a. Benjamini and Hochberg proved the following: When truly null cases are statistically independent, then in the long run behavior the above procedure guaran- tees E(V/ R) S (1. Storey et a1. (2004) provides an alternative and a more concise proof. That the equality holds, requires additional assumptions. 21 According to Benjamini (2008) the motivation behind their 1995 method was to deal efficiently with around 100 statistically independent hypotheses in a clinical setting. Their method is now applied to situations with thousands and sometimes even millions of cases and yet whenever independency holds there is no reason to disbelieve its efficacy. By the time it was realized that the method might be critically susceptible to inter-case dependency was realized, it had become a standard technique. In DGE detection, dependency is intrinsic because the gene expression data rep- resent the life process, a fundamentally coordinated activity. Yet the exact nature, form, and amount of dependency are still being debated—Section 2.4 discusses this in detail. This issue of dependency has profoundly affected the LSST research. Despite strong interest, the issue remains unresolved. Not only very little is understood about how to exploit dependency, but even to determine the adverse effects of dependency on methods developed with the assumption of independency has been found very challenging. In relation to dependency, a major emphasis in the LSST research has been to establish “validity” of any proposed procedure. Validity refers to the fact that for a given procedure, the FDR can be bounded below an arbitrarily prescribed level. Validity: Dependent tests. 
According to Benjamini (2008), the motivation behind their 1995 method was to deal efficiently with around 100 statistically independent hypotheses in a clinical setting. Their method is now applied to situations with thousands and sometimes even millions of cases, and yet, whenever independency holds, there is no reason to disbelieve its efficacy. By the time it was realized that the method might be critically susceptible to inter-case dependency, it had become a standard technique.

In DGE detection, dependency is intrinsic because gene expression data represent the life process, a fundamentally coordinated activity. Yet the exact nature, form, and amount of dependency are still being debated; Section 2.4 discusses this in detail. The issue of dependency has profoundly affected LSST research. Despite strong interest, the issue remains unresolved. Not only is very little understood about how to exploit dependency, but even determining the adverse effects of dependency on methods developed under the assumption of independency has proved very challenging. In relation to dependency, a major emphasis in LSST research has been to establish the "validity" of any proposed procedure. Validity refers to the fact that, for a given procedure, the FDR can be bounded below an arbitrarily prescribed level.

Validity: dependent tests. Benjamini and Yekutieli (2001) show that for tests with a positive-regression dependency condition, estimating the FDP as Ė_Γ(V)/r still ensures validity, and if this FDP estimator is modified as Ė_Γ(V) · (Σ_{i=1}^{m} i⁻¹)/r, then validity is ensured for arbitrary dependency. However, this implies a very conservative FDP estimation; for example, with m = 25000 one has Σ_{i=1}^{m} 1/i ≈ 10.7, so the working significance level is effectively scaled by 1/10.7. A slightly different but more provocative interpretation is the following: in order to ensure validity in the presence of dependency, the realized data are judged at a much more stringent FDP level than what is needed. Benjamini (2008) argues that real situations fall between positive regression dependency and arbitrary dependency, and hence a less conservative scaling should work. Instead, van der Laan et al. (2004) propose to satisfy the probabilistic constraint Pr(FDP > γ) < α by modifying existing procedures that satisfy the probabilistic constraint Pr(V > γ) < α; this conversion does not require independent tests, but what happens to the average power is not clear. Lehmann and Romano (2005) propose two more methods to satisfy the probabilistic constraint Pr(FDP > γ) < α under dependency. The first method works under an abstract mathematical condition on the dependency structure claimed to be reasonable. The second method works for arbitrary dependency, but the same problem as in Benjamini and Yekutieli (2001) persists: the realized data are judged at a more stringent FDP level than necessary. Since the existing FDR procedures that are capable of ensuring validity under arbitrary dependency are highly conservative, most investigators still rely on the original Benjamini and Hochberg procedure.

Validity and variance. After proving the validity of a procedure, for arbitrary dependency or for a special dependency structure of interest, further evaluations are still needed. An important measure is the variance of the FDP estimation, E(FDP − α)²; here, the actual FDP is contrasted with the intended level α. Owen (2005) argues that this could be a serious issue because the inter-gene dependency found in most microarray data tends to inflate the variance of V considerably compared to what it would be under independency; in the cases considered by Owen, the inflation is nearly 100-fold. Therefore, approximating the realized v by an all-null theoretical expectation under heavy dependency may be critically unrealistic. A simulation in (Efron, 2007a, Section 5) emphasizes this issue. Qiu et al. (Qiu and Yakovlev, 2006; Qiu et al., 2005a, 2006, 2005b) also explore this effect and warn against neglecting the effect of dependency on an overall, as well as a case-to-case, basis. An attempt to develop a better estimate of v may be worthwhile; however, the ways to do so are not obvious.

Conditional v estimates. Efron suggests "conditional v inferences" to account for inter-gene dependency. The notion of purity as in Section 2.5 is crucial for conditioning to work. Efron (2004) proposes a technique called the "empirical null" to perform conditioning; this technique works solely on the realized t_i histogram, and the dependency observed in the actual data is not entertained. In a follow-up paper (Efron, 2007a), Efron develops a second moment theory for the "null T_i histogram" to gain insight into the behavior of V. He makes a crucial observation: under heavy inter-gene correlation, the behavior of the null T_i histogram is highly regulated, in the sense that between its tail and its center there is extreme negative correlation.
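Both effects, the variance inflation of V stressed by Owen (2005) and the tail-center negative correlation observed by Efron (2007a), already show up in a toy one-factor simulation. Everything below (m, the common correlation ρ, the cut-offs, the replication count) is an illustrative choice of ours, not a value from the text:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m, reps, rho = 5000, 2000, 0.3
cut = norm.ppf(0.99)                       # one-sided tail region: z > cut

tail = np.empty(reps)                      # tail counts, correlated nulls
center = np.empty(reps)                    # center counts, correlated nulls
tail_ind = np.empty(reps)                  # tail counts, independent benchmark
for b in range(reps):
    u = rng.standard_normal()              # shared factor => pairwise corr. rho
    z = np.sqrt(rho) * u + np.sqrt(1 - rho) * rng.standard_normal(m)
    tail[b] = np.sum(z > cut)
    center[b] = np.sum(np.abs(z) < 1.0)
    tail_ind[b] = np.sum(rng.standard_normal(m) > cut)

print("Var(V), independent nulls:", tail_ind.var())   # roughly m * .01 * .99
print("Var(V), correlated nulls :", tail.var())       # far larger
print("corr(tail, center)       :", np.corrcoef(tail, center)[0, 1])
```

Under correlation, the tail count variance typically inflates by orders of magnitude, and the correlation between the tail and center counts comes out strongly negative, mirroring the regulated behavior described above.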
He makes a crucial obser- vation that under heavy inter—gene correlation the behavior of the null T,- histogram is highly regulated in the sense that between its tail and its center there is extreme negative correlation. Contribution 1. Based on Efron’s observation, this work builds a moment-based estimator of the number of false discoveries. Sections 3.1—3.3 develop the methodol- ogy behind the proposed estimator and Section 3.4 uses real and simulated examples for verification. The key issue, the proposed estimator addresses, is the exceedingly small n situations as encountered in the HIV example of Section 1.3 where extracting any useful information about the dependency structure in itself is a major challenge. A notable feature of the proposed approach is that it naturally leads to an estimate of the entire probability distribution of the random variable V. 24 2.3.2 Power For the present LSST formulation of DGE, the test statistics T1,...,Tm are “de- signed” to capture linear dependence between random variables L and X1, . . . ,Xm. The observation that there might be ways of mapping the original set of statistics (T1, . . . ,Tm) to a newer set (Ti’, . . . , T3,) for better statistical power is very appeal- . ing. There are two notable and successful attempts in this regard, both, at some level, relying on the assumption of independent or weakly dependent Ti’s. Estimating proportion of null cases. Such attempts can be seen as identically scaling the original (T1, . . . ,Tm) in terms of their ability to detect linear dependence. The philosophy behind this adjustment is: if the proportion of null genes (no) is ap- preciably small, then the case for linear dependence is strengthen accordingly. Storey (2002a) and Pawitan et al. (2005) present and review successful attempts. Most no estimators require the assumption of independency or weak dependency among Ti’s. Borrowing strength. The understanding that a better estimate of variance in the standard t-statistic [Eqn. 2.1] should yield more power, led to the idea of a “stabilized variance estimation,” also known as “smooth t-test.” The idea has been reintroduced in several forms. Baldi and Brunak (2001) use a Bayesian approach to trade-off between the sample variance for the gene of interest and a combined sample variance from similar looking samples; it is called a “shrinkage” estimate. Newton et al. (2001) use a Gamma-Poisson model which achieves a similar effect. The “fudge factor” of the SAM algorithm, Tusher et al. (2001), is also of the same variety. All these approaches make an implicit assumption that amount of inter-gene correlation is small enough so a combined sample variance from multiple genes will not suffer from excessive random 25 fluctuations. Borrowing strength through signal-induced correlation. Tibshirani and Wasser- man (2006) present a scheme to combine Ti’s using the inter-T,- correlation for ob- taining Ti*’s with better statistical power. Their scheme averages a T,- with other Ti’s in its “correlation neighborhood.” They assume that the correlation among Ti’s is primarily due to the treatment effect. Therefore, “truly” non-null Ti’s would add up together and map to stronger Cl}*’s by reducing the noise through averaging, and truly null Ti’s would largely be unaffect by the averaging. Philosophically their idea is to add-up signal strength among non-null cases through the device of sample correlation. Storey et al. (2007) present an interesting, more general approach, of accomplish- ing a similar effect. 
Borrowing strength through signal-induced correlation. Tibshirani and Wasserman (2006) present a scheme to combine the T_i's using the inter-T_i correlation, obtaining T_i*'s with better statistical power. Their scheme averages a T_i with the other T_i's in its "correlation neighborhood." They assume that the correlation among the T_i's is primarily due to the treatment effect. Therefore, "truly" non-null T_i's would add up together and map to stronger T_i*'s by reducing the noise through averaging, while truly null T_i's would be largely unaffected by the averaging. Philosophically, their idea is to add up signal strength among the non-null cases through the device of sample correlation. Storey et al. (2007) present an interesting and more general approach to accomplishing a similar effect.

Exploiting residual correlation. Microarray samples that are assigned the same "state" do exhibit substantial inter-gene correlation, which suggests that the source of inter-gene correlation is more intrinsic, since hypothetically there are no treatment differences between identical states. Therefore, we conclude that the source or sources of correlation affect both null and non-null genes. However, the possibility that this "intrinsic" correlation among the T_i's can be "exploited" to yield a superior (T₁*, ..., T_m*) is unexplored, at least in the LSST formulation of DGE. This work attempts to do so. Our contribution in this regard is described below.

Contribution 2. The basis of Contribution 1 is to perform conditioning on evidentially more likely null genes and, while doing so, incorporate the observed inter-gene correlation. Philosophically, we employ the same idea, but this time to yield a better (T₁, ..., T_m) → (T₁*, ..., T_m*) mapping. Sections 4.1-4.3 develop the methodology behind the proposed re-ranking procedure, and Section 4.4 uses real and simulated examples for verification. The approach relies on the residuals of simple linear regression. This contribution is complementary to the first, as the proposed approach yields gains in moderate n situations, for example the prostate cancer data of Singh et al. (2002) as in Section 4.4. If substantial null-null correlation is present and purity holds, then the gain in statistical power is notable.

2.4 Dependency in Gene Expression Data

Dependency refers to the fact that in a random data set the expression levels X_{ij} and X_{i'j} are statistically dependent. Formally, p(X_{ij}, X_{i'j}) ≠ p(X_{ij}) p(X_{i'j}). It is helpful to think from the point of view of the mutual information between X_{ij} and X_{i'j}. Recall that the mutual information (Cover and Thomas, 1991) between random variables X and Y is defined as

    I(X; Y) = ∫∫_{X,Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy.

[…]

If r_{ij} models the correlation of a z-value pair (Z_i, Z_j), then, by adding bivariate normality, Eqn. (3.5) can be approximated as

    μ₂[k, l] ≈ (m²Δ² / (2π√(1 − r_{ij}²))) exp( (2 r_{ij} z[k] z[l] − z[k]² − z[l]²) / (2(1 − r_{ij}²)) ) − ⟨Y_k⟩⟨Y_l⟩ + 1_{k=l}⟨Y_k⟩,    (3.6)

where 1_{k=l} equals 1 if k = l and 0 otherwise. Now, let g(r) denote the empirical density of the pairwise correlations of the m(m−1)/2 (Z_i, Z_j) pairs. This quantity, together with Eqn. (3.6), yields a useful approximation for the second moments, as stated in the following proposition.

Proposition 3.2.1. The (central) second moment of a histogram count pair (Y_k, Y_l), where k may equal l, is given by

    μ₂[k, l] ≈ m²Δ² G_g(z[k], z[l]) + 1_{k=l}⟨Y_k⟩ − ⟨Y_k⟩⟨Y_l⟩,    (3.7)

where G_g(x, y) = ∫ φ_r(x, y) g(r) dr, with φ_r denoting the bivariate normal density having standard normal marginals and correlation r.
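Proposition 3.2.1 involves only a one-dimensional integral against g(r), so μ₂[k, l] is straightforward to evaluate numerically. A sketch, in which the (1 − r²)^β form for g(r) and all grid parameters are placeholder assumptions in the spirit of Section 3.2.2:

```python
import numpy as np

def make_g(beta, n_grid=2001):
    """Correlation density g(r) proportional to (1 - r^2)^beta on a uniform
    grid, normalized by a Riemann sum (a stand-in for the modified beta
    family fitted in Section 3.2.2)."""
    r = np.linspace(-0.999, 0.999, n_grid)
    w = (1.0 - r**2) ** beta
    return r, w / (w.sum() * (r[1] - r[0]))

def G_g(x, y, r, g):
    """G_g(x, y) = integral of phi_r(x, y) g(r) dr, with phi_r the bivariate
    normal density having standard normal marginals and correlation r."""
    phi = np.exp((2 * r * x * y - x**2 - y**2) / (2 * (1 - r**2))) \
          / (2 * np.pi * np.sqrt(1 - r**2))
    return np.sum(phi * g) * (r[1] - r[0])

def mu2(k, l, z, m, delta, r, g, Ybar):
    """Central second moment mu2[k, l] of a histogram count pair, Eqn. (3.7).
    z[k] are bin centers and Ybar[k] the expected counts <Y_k>."""
    val = m**2 * delta**2 * G_g(z[k], z[l], r, g) - Ybar[k] * Ybar[l]
    return val + (Ybar[k] if k == l else 0.0)
```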
For the third moments, the analogous expansion of a product of three histogram counts is useful:

    E(Y_k Y_l Y_j) = E{ (Σ_{p=1}^{m} 1[Z_p ∈ Z_k]) (Σ_{q=1}^{m} 1[Z_q ∈ Z_l]) (Σ_{r=1}^{m} 1[Z_r ∈ Z_j]) }
                   = Σ_{p≠q≠r} Pr(Z_p ∈ Z_k, Z_q ∈ Z_l, Z_r ∈ Z_j)
                     + 1_{k=j} Σ_{p≠q} Pr(Z_p ∈ Z_k, Z_q ∈ Z_l)
                     + 1_{l=j} Σ_{p≠q} Pr(Z_p ∈ Z_k, Z_q ∈ Z_l)
                     + 1_{k=l} Σ_{p≠q} Pr(Z_p ∈ Z_k, Z_q ∈ Z_j)
                     + 1_{k=l=j} Σ_p Pr(Z_p ∈ Z_k),

where 1_{k=l=j} = 1 if k = l = j, else 0. Let

    R₃[i, j, k] = [ 1       r_{ij}   r_{ik}
                    r_{ij}  1        r_{jk}
                    r_{ik}  r_{jk}   1      ]

denote the 3 × 3 covariance matrix of the triplet (Z_i, Z_j, Z_k), where i ≠ j ≠ k. Notice that this matrix is an element of the space of 3 × 3 correlation matrices, say R₃, and that the matrices in R₃ are always positive (semi-)definite. Now let h(R₃) denote the empirical density of all such R₃[i, j, k] for the true p(Z₁, ..., Z_m). Together with Eqns. (3.4)-(3.7), this yields an approximation for the third moment, which is again embodied in a proposition.

Proposition 3.2.2. The (central) third moment of a histogram count triplet (Y_k, Y_l, Y_j), where k, l, and j may be equal, is given by

    μ₃[k, l, j] ≈ m³Δ³ H_h(z[k], z[l], z[j])
                  + m²Δ² [ 1_{k=l} G_g(z[k], z[j]) + 1_{k=j} G_g(z[k], z[l]) + 1_{l=j} G_g(z[l], z[k]) ]
                  + 1_{k=l=j}⟨Y_k⟩ − ⟨Y_k⟩⟨Y_l⟩⟨Y_j⟩
                  − [ ⟨Y_k⟩ μ₂[l, j] + ⟨Y_l⟩ μ₂[k, j] + ⟨Y_j⟩ μ₂[k, l] ],

where

    H_h(x, y, z) = ∫_{R₃} (2π)^{−3/2} |R₃|^{−1/2} exp( −(1/2) [x, y, z] R₃⁻¹ [x, y, z]ᵀ ) h(R₃) dR₃,    (3.8)

and G_g(x, y) and μ₂[i, j] are defined in Proposition 3.2.1. The error in this approximation is of O(Δ³).

The integral in Eqn. (3.8) is computed over three dimensions. A k-th moment calculation would involve a (k choose 2)-dimensional integral and require integration over R_k. Next, to get the moments of (V, C), we combine the moments of the corresponding Y_k's (here Γ_v and Γ_c denote the tail and center regions associated with V and C, respectively). For example,

    E(V − E(V))² = Σ_{k,l : Z_k, Z_l ⊂ Γ_v} μ₂[k, l],    (3.9a)
    E(C − E(C))² = Σ_{k,l : Z_k, Z_l ⊂ Γ_c} μ₂[k, l],    (3.9b)
    E{(V − E(V))(C − E(C))} = Σ_{k : Z_k ⊂ Γ_v} Σ_{l : Z_l ⊂ Γ_c} μ₂[k, l].    (3.9c)

[…]

For the transformation Σ → (S, P), the Jacobian is given by 2^m (Π_i s_i)^m [see Theorem 3, Olkin (1953)]. Thus, after marginalization over S,

    f_m(P | ν) ∝ |P|^{−(ν+m+1)/2} Π_{i=1}^{m} ∫₀^∞ s_i^{−(ν+1)} exp( −p^{ii} / (2 s_i²) ) ds_i,    (3.16)

where p^{ii} is the i-th diagonal element of P⁻¹. The product occurs because of the independence of the s_i's. Similarly to (Barnard et al., 2000), substituting w_i = p^{ii}/(2 s_i²) yields

    f_m(P | ν) ∝ |P|^{−(ν+m+1)/2} Π_{i=1}^{m} (p^{ii})^{−ν/2} Π_{i=1}^{m} ∫₀^∞ w_i^{(ν−2)/2} exp(−w_i) dw_i,    (3.17)

which leads to an expression for the probability density of the correlation matrix P:

    f_m(P | ν) ∝ |P|^{(ν−1)(m−1)/2 − 1} Π_{i=1}^{m} |P_ii|^{−ν/2},    (3.18)

where P_ii is the i-th principal sub-matrix of P, and p^{ii} = |P_ii| / |P|, with |A| denoting the determinant of A.

For P with probability density Eqn. (3.18), the marginal density of an arbitrary κ × κ correlation sub-matrix, denoted by P_κ, is obtained as follows. Let the κ × κ covariance sub-matrix Σ_κ undergo the transformation Σ_κ → (S_κ, P_κ). Then, due to the marginalization property of the inverse-Wishart, Σ_κ ~ W_κ⁻¹(I, ν − m + κ). Steps (3.13)-(3.18) for Σ_κ yield:

    f_m(P_κ | ν) ∝ |P_κ|^{(ν−m+κ−1)(κ−1)/2 − 1} Π_{i=1}^{κ} |(P_κ)_ii|^{−(ν−m+κ)/2}.

Substituting κ = 2 in (3.15) yields

    f_m(P₂ | ν) ≡ f_m(ρ₁₂ | ν) ∝ (1 − ρ₁₂²)^{(ν−m−1)/2},  |ρ₁₂| ≤ 1,

which has the same parametric form as Eqn. (3.11). By setting ν − m = 2β + 1, we can force the inherent marginal densities of the entries of P [ρ_{ij} (i ≠ j)] to equal g(ρ), the specific aim with which the derivation began. Finally, substituting κ = 3 in (3.15) gives:

    h(P₃) ∝ (1 − ρ₁₂² − ρ₂₃² − ρ₁₃² + 2 ρ₁₂ ρ₂₃ ρ₁₃)^{2(β+1)} / [ (1 − ρ₁₂²)(1 − ρ₂₃²)(1 − ρ₁₃²) ]^{β+2}.    (3.19)

It should be noted that even though for large m the inverse-Wishart, Eqn. (3.13), is a tenuous assumption, its use is strictly to estimate h(P₃) from g(ρ); there is no concern for the entire P. Also, since single-parameter probability densities on a positive definite matrix space are very few, this one is chosen for its useful "marginalization" property. The justification of this choice rests on the fact that Eqn. (3.19) is an empirically realistic estimate, as evident in the results of Section 3.4. The "Bayesian correlation priors" point of view (Liechty et al., 2004) was especially helpful in formulating these ideas. Exploring other ways to obtain h(P₃) is a subject for future research.
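Eqn. (3.19) itself is inexpensive to evaluate. A small helper, unnormalized and with a guard for matrices outside the correlation space, might look like the following (the function name and interface are ours):

```python
import numpy as np

def h_P3(r12, r13, r23, beta):
    """Density h(P3) of Eqn. (3.19), up to its normalizing constant.

    The numerator's base is |P3| = 1 - r12^2 - r13^2 - r23^2 + 2*r12*r13*r23;
    it is nonpositive exactly when (r12, r13, r23) does not form a valid
    3 x 3 correlation matrix, in which case the density is zero."""
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    if det <= 0:
        return 0.0
    num = det ** (2 * (beta + 1))
    den = ((1 - r12**2) * (1 - r13**2) * (1 - r23**2)) ** (beta + 2)
    return num / den
```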
2 S Vmax, Cmin S j S Cmax, O S 2+]. S m}: Where Vmin = max([(V) — l - Std(V)_|,0), Vmax = min (RV) + l - Std(V)l,m); Cmin = max (L(C) — l - Std(C)j,0), Cmax = min ([(C) +l - Std(C)l,m). Here, Chebyshev’s inequality guides the choice of parameter I: For l 2 6, the loss of accuracy due to truncation is negligible. Computation can be reduced even further by recognizing the fact that the distri- butions imposed on the basis of a small number of moment constraints often enjoy a high-level of regularity and a sparser mesh should suffice. In fact, the computation— accuracy trade-off is easy to deal with, if the task is changed to learning a continuous probability density p(x, y) over continuous support St = {(x,y) : x E range(V),y E range(C), #22 S x +y S 1 — W}. 47 (0,0) V ——> (m,0) (“m f C) (1 _%),_—_(n¥)) Figure 3.1: Discrete support Dt (Cl markers) versus continuous support St (solid boundary). St is standardized to improve numerical stability. range(Y) is [W, W] (see Figure 3.1). The above standardization offers numerical stability, but the moment constraints must be scaled appropriately. Let ’Pc denote the space of feasible p(x,y)’s, then Vp(x, y) 6 ’PC: [8 xiyjp(x,y) dx dy = pij, and 0 S i +j s K; (3.20) t where ”v = E{(V — (V))’(C’ — (0))1'} /mi+J’ . (3.21) In Eqn. (3.21) (i, j):(0,0) corresponds to the constraint fSt p(x,y) dx dy = 1, while (i,j):(1, 0) together with (i,j):(0, 1) imply that Vp 6 ’Pc has mean (0,0). Selection of a unique p(x, y) relies on the principle of entropy maximization (max- ent) which seeks a p(x, y) 6 PC with maximum information entropy (Jaynes, 2003). The information entropy essentially measures the spread of the distribution, and hence, maxent can be seen as a criterion, which within the knowledge constraints, chooses a least “assuming” probability density—arguably, a correct approach in the framework of statistical inference. 48 The application of maxent leads to the following optimization problem: p*(x,y) = argmax {—fs p(x, y) lnp(x,y) dxdy}. (3.22) t p(x,y)€Pc The solution takes the following exponential form: EXP (ZlgHjSK Aijxiyj) f5, exp (219+ng Atjxiyj) d2: dy 1910?, y) = (3-23) Solving (3.22) requires concepts from the Calculus of variations and Lagrange multi- pliers. The reasoning leading to the exponential form Eqn. (3.23) and a procedure to determine optimal Aij’s will now be discussed. In addition to the fact that information entropy functional is concave (Cover and Thomas, 1991), the constraints in Eqn. (3.21) are also linear in p(x,y). Thus, the problem in Eqn. (3.22) is a convex program which can readily be solved in a Lagrangian dual framework, where one works with an unconstrained upper-bound (lower-bound if minimization) that is easy to optimize. More importantly, in the present case, the framework allows the conversion of the original infinite dimension problem of functional variation into a finite dimensional problem with as few variables as the number of constraints. Proposition 3.2.3. The dual \IJ(/\) of the concave optimization problem Eqn. (3.22) is given by: \II()\)=ln [Sexp Z Aijxiyj dxdy — Z Aijpzj, (3.24) t 19+ng 29+ng where )‘ij is the Lagrange multiplier corresponding to the (ij)th constraint and )1”, 49 and i,j, K are defined in Eqn. (3.21). Proof. By the definition of the Lagrangian dual function, the Lagrangian III(/\) : 311p [- [Stp($,y)1np(x,y)dx dy+ Z An (thiyjpwwwx dy - #2“) p(x,y)el’ i+ng (3.25) Taking the functional variation of the square bracketed term in Eqn. 
(3.25) with respect to the unknown density p(x, y) and using the fact that f St p(x, y) dx dy = 1, the maximizer of Eqn. (3.25) is obtained, eXp (219+ng 42739319) 1010:, y) = (3.26) fSt ‘3’“) (219% _<_K Atjmiyj) 611' dy Inserting Eqn. (3.26) into Eqn. (3.25), yields \IJ()\) =/Sp(x.y)1n [.9 exp. 2 Win) dandy dxdy— Z w” t t 19+ng 19+ng 1 I = In [/ exp ( Z Aiszgfl) d2) dyJ — ZZSi+jSK Aijpzj, (3.27) 5‘ 132+ng . _ 00 _ 10 _ 01 _. where the facts fStp(x,y) dx dy u — 0, u — 0, and ,u — 0 from Eqn. (3.21) have been used. [I It is easy to verify that the Hessian of Eqn. (3.24) is positive definite and hence \I'(/\) is convex. Suppose X" is the minimum of \II(/\), then the corresponding pri- mal solution p” (x, y)——obtained via Eqn. (3.26)—indeed maximizes Eqn. (3.22). To verify this, let p0(x,y) be the maximizer of Eqn. (3.22), then from Eqn. (3.25) \II(A) Z ’H{f°(x,y)},V)\. Now from the general optimization theory, the functional 50 variation of Lagrangian with respect to p(x,y) evaluated at p0(x,y) must be zero, which implies p0(x, y) could be written in the form Eqn. (3.26) for some A0. But then, ‘1’(/\*) S ‘I’(/\0) => ‘1’(/\*) S H{f0($, 31)}; consequently, WW) = H{f°($, 31)} => X“ = A0; hence va (x,y) is the maximizer of Eqn. (3.22). What remains is the need for an efficient method of minimizing \II(/\). “Newton’s method” is used. According to Newton’s method, if a multi-variable function \II(/\) is twice difl'erentiable and the initial guess point A0 is in the “neighborhood” of X", then the sequence n+1 = An — amnion—lawn). n 2 o (3.28) converges to X". In (3.28) A\II(/\n) denotes the gradient of \II()\) evaluated at An and H 0110171)} the Hessian. The parameter 7 > 0 allows finer control of step sizes to avoid numerical instabilities. At the nth iteration, \II(/\) is replaced by its second- order Taylor expansion around An and then minimized exactly, which produces the minimum An+1. At the (n + 1)th iteration, An+1 becomes the point of expansion and the method continues until it convergences. ~ The elements of the gradient A\II()\) are given by: ~ . . exp :1<'+'"1 {Co(t,-)}, Eqn. (3.2). ti’s are linear regression test statistics (Section 2.1), Q is the test statistic cdf, and (D‘l is the standard normal inverse cdf. 2. Determine fl, Eqn. (3.12), summarizing the inter-Z,- correlations. 3. Partition [—5, 5], the Zz- sample space, into K = 100 equal width bins (A = 0.1) and use Eqn. (3.4), Proposition 3.2.1, and Proposition 3.2.2 to compute the first three moments of the Z,- histogram 4. Use determined 2,; histogram moments to compute the first three moments of V and C, Eqn. (3.9) 5. Use determined moments with the maxent, Section 3.2.3, to estimate P(V, C) 6. Use the estimated P(V, C) to determine P(Vlc) 53 3.4 Test Cases MAT LAB code for the algorithm developed above can be requested via email at keyurdesai@gmail.com. The approach was tested on two real data sets, both showing a significant amount of inter-gene correlation and unusual null count behavior. Ca1- culations below are for 61 = —2.5 and 60 = 1. Comparisons with the second-order i) of Efron (2007a) are also made. 3.4.1 Real Data The BReast CAncer (BRCA) study of Hedenfalk et al. (2001) has m = 3226 and n = 15 with 7 samples assigned to BRCAl mutations and 8 to BRCA2. The study sought to identifying genuine mRNA activity differences between these two categories. The study used two—color microarrays and hence the measurements are in terms of “ratios.” The logarithms of these ratios are used to raise normality (Tsai et al., 2003). 
The HIV study of van’t Wout et al. (2003) has m = 7680 and n = 8 with 4 samples assigned to an HIV infected condition and the remaining 4 to the control. The control (CD4-T cell lines) was infected by the HIV-IBRU virus. This study reported raw mRN A levels which were also converted to logarithms for the present purpose. The approach reduces the entire data matrix to just two parameters: The observed C and the ,8. As evident in Figure 3.2, the parametrization of Section 3.2.2 is realistic. For BRCA example 8:17.77 and for HIV )8—351. Additional details are provided in the caption. The next step is to compute the moments of (V, C) per Propositions 1 and 2. These 54 x 10 x 10 - ii,- —histogram - Tij —histogram ——N(0, 0.3382) (1 — r2)17'77 4 - , 4 - G 5 c c a) I I a) 3 3 c- c7 9 “2 LI. I LL 2 r 2 r . I . ( I ”I 'III 0 H . O ..IIII. . I. '2 0 2 "I 0 1 r r (a) Figure 3.2: Caption appears on the next page below Figure 3.2. calculations require m, 8, c, 6, and A. A = 0.1 is selected. A maxent distribution was fit to these moments. Figure 3.3 reports the moments and the corresponding maxent distribution for the BRCA data. Here, (V, C) show strong negative correlation of -0.89; a similar figure is reported by (Efron, 2007a, Table 1). Furthermore, V shows significant positive skewness, which causes C to show negative skewness. This is not surprising as V is bounded below by 0 and yet has small mean but inflated variance. In effect, third-moment provides additional detail about the joint behavior of (V, C). 55 6 6 x10 x10 3.25 ’ - fij—histogram4 ’ _ (1 _ 7.2)351 3 " —N(0, 0.6542) rs-histogram 3 L a 2’ a c 5 s 3 3 3 II 11. 1 - 1 - e o O -2 o 2 —1 p 1 (b) Figure 3.2: Effect of sampling fluctuations on the empirical correlation density. (a) BRCA example (b) HIV example. For each sub figure: Left panel is the histogram of sample correlations after applying the Fisher transformation (3.10) and a normal distribution (heavy curve) fit to it; Right panel is the histogram of de-noised corre- lations and a modified beta distribution fit to it (heavy curve). This summarizes the cumulative effect of (75”) gene-gene correlations in a single parameter ,6. During the maxent numerical optimization a 100 x 500 equal-spaced mesh was found sufficient for the BRCA study; however, for the HIV study, it was necessary to expand the mesh to 400 x 2000 because of both larger m and heavier (V, C) correlation. The BRCA Optimization took 30 iterations to converge, whereas the HIV optimization 56 289.7 33113.3 10299.8 -1399069.6 -66210.9 3882259 2600 \ 2400 2200 2000 (7 1800 100 2200 2000 ' 1800 0 V C 0 V Figure 3.3: BRCA example: Estimated (V, C) moments and a maxent distribution fit to it. Third moment estimate of P(V, C) (left) exhibits finer details than the second moment estimate (right). took 70. Figure 3.4 reports the estimated P(V|c). Second moment and third moment estimates are shown separately. In the framework of statistical inference, such a distribution is the ultimate goal. Point estimates and associated confidence intervals 57 0.05 . . f .‘x 97—109 0.04 - .' ‘. . ' I "—1 I 73—85 : 1| O : I. a :Iznxrnuxn.‘ ............ 53 0-02" : 69—89 ‘ 9 —113 ' I ' | ' 0 0.01 - ,' 1‘ ' | ' 0 ' \ ' s . . _v' . 1 1‘. __ _ 1 00 20 40 60 80 100 120 The Number of False Discoveries (NoFD) (a) 0.1 0.08 - a 0.06 - LL. o E, . 5: 0.04 3.‘ ‘Q ~~ 0.02 - N,‘ 40 50 60 70 0 1’0 20 30 The Number of False Discoveries (NoFD) (b) Figure 3.4: The distributions of the number of false discoveries; Panel (a) the BRCA study, Panel (b) the HIV study. 
To show the effect of skewness corrections, the third moment distribution (solid curve) is compared to its second moment counterpart (dashed curve). For BRCA the second moment mean estimate is 79 compared to 104 for the third-moment; while for HIV these are 19 and 8. The BRCA panel also shows 50% (solid line) and 75% (dotted line) confidence intervals. 58 can be extracted in accordance with a ‘loss function.” If the mean of the estimated P(Vlc) is used to determine the realized v, then for the BRCA study, third moment calculations suggests 104 false discoveries versus 79 for second moment. Contrastingly, the “all-null—theoretical—expectation” yields only 20 false discoveries. These numbers must be put in perspective by noting that the actual 2,; count falling in the left-sided tail-area [00, —2.5] is 116. The standard Benjamini and Hochberg 1995 procedure evaluates the F DP coincident with the left-sided tail- area [00, -2.5] as 0.1724, the second moment as 0.68, the third moment as 0.9, and the Benjamini and Yekutieli 2001 procedure as 1 (actually as 1.4759). For HIV example, third moment calculates 8 false discoveries compared to 19 for second-moment. Contrastingly, the “all-null-theoretical-expectation” suggests 48 false discoveries. This time the actual 2,- count in the left-sided tail-area [00, —2.5] is 46. In this case, the F DP coincident with the left-sided tail-area [00, —2.5] is evaluated by Benjamini and Hochberg 1995 procedure as 1 (actually as 1.0435), the second moment as 0.41, the third moment as 0.17, and the Benjamini and Yekutieli 2001 procedure as 1 (actually as 9.94). That the conservative scaling in Benjamini and Yekutieli 2001 procedure to ensure “validity” can cause a significant loss in statistical power (Section 2.3), becomes ap- parent in the HIV example. Whereas “all-null-theoretical-expectation” of Benjamini and Hochberg 1995 procedure can let through a potentially powerless data set, be- comes apparent in the BRCA example. However, extensive treatment of inter-gene correlation combined with identifiability may lead to more realistic conclusions. The second-order i) developed in (Efron, 2007a) found 77 false discoveries for the BRCA study. Efron compares that to the results of nonparametric analysis and concludes underestimation, but the issue is left unexplored. Our main finding is that 59 moments higher than second are important in describing the null Zz' histogram. 3.4.2 Simulated Data It is helpful to test the method on simulated data where the true answer is known. In simulation below all genes are attributed as “null” and hence no treatment effect is added. The objective of this all-null simulation is to evaluate the estimation accuracy of the developed method. The estimated tail-counts are contrasted with the true tail- counts. In each trial, a 3226 x 15 matrix with entries simulating raw mRNA levels is generated based on the Gamma-Gamma model described below. First 7 columns are assigned “state 1” and the remaining 8 are assigned “state 2.” The standard two-sample t-statistic is used enroute to the z-values, 21, . . . , 23226 The number of zi’s falling in the tail-area is the “true count” which is compared with the “estimated count” from the method. Let the mRNA level Xij of gene i, measured by jth microarray, be Xij ~ Gamma(k,9i), for j = 1, . . . ,n, (3.32) where Gamma(k, 6) is the Gamma distribution with the shape parameter k > 0 and the scale parameter 6’ > 0, similarly to the Gamma-Gamma model in (Newton et al., 2001). 
In (3.32) the shape parameter k is common to all genes. Note that the index variable I: of Section 3.2.1 has no connection with this k. The 6,- scale parameters characterize the underlying mRN A levels which vary from gene to gene: 6,- i'lé" Gamma(k0, 90), for i = 1, . . . ,m. (3.33) 60 The intuition that genes with bigger underlying mRNA levels should have higher variance is consistent with model (3.32) since the mean of the ith gene is k6,- and variance is 1062-2. The parameters (10, 100,60) may be chosen on the basis of the overall gene expression histogram of some real microarray data set. Three sets of results, (1,0.6, 500), (2, 0.39, 384), and (3, 0.33, 300), are presented. These numbers were chosen to preserve the total sample variance. These particular values are based on the HIV data of van’t Wout et al. (2003) which were collected using Affymetrix microarrays. In particular, case kzl implies that the Xz-j’s are exponentially distributed, while case 1022 implies a unimodal distribution with heavy tails and a noticeable departure from Gaussianity. Case 1023 is characterized by a more normal (Gaussian) looking distribution, however, with slightly heavier tails. Following technique was employed to add substantial row-wise correlation: Through the normal cdf, map the entries of a m x n matrix of correlated Gaussian random variables 20 = LTz, where 2,,- ‘kf’ N(0, 1), (3.34) to their p-values Pij = (ij), and further map these p-values into Xij’s through the inverse Gamma cdf’s as in accordance with Eqn. (3.32). In Eqn. (3.34), LTL = R is the Cholesky factorization of the correlation matrix R in which L is a lower triangular matrix. Developing proper correlation matrices for simulation is a challenging problem. Even the methods described in the comprehensive work of Marsaglia and Olkin (1984) fail to generate arbitrarily dense correlation matrices for a large m (m > 1000). In- stead we take a novel approach based on the recent work of Higham (2002) whose basic idea is to generate a symmetric matrix and covert it to a nearest correlation matrix based on Frobenius norm minimization. There are many ways to solve the 61 300’ (k = 1,160 = 0.6,00 = 500) J 41 250- + 0* 'E 3 + 3’ 8 200~ + 1 3 ‘5150- 5 .§ iii “100’ -I 50" 0 50 100 130 200 230 300 Realized count (3) . . , 1 . +1 300- (k = 2,190 = 0.391, 60 = 384) 1 + 250- +1.? I ’S’ o 8 200- + + 09 "U ‘1’ 5 a 150- 9 ) .§ 1‘3 100» 50- . 0 50 100 _150 200 250 300 ReaIIzed count (b) Figure 3.5: 61 = —2.0. For the caption, refer to panel (c) on the next page. 62 300- (k = 3, k0 = 0.333, 90 = 300) + + 250- 1 ++ E + ++#‘I§-* O 3 O 8 200- "U +. O .22 co 150- .E ‘0-0 6“ Q 11"} 100- 50 0 0 5‘0 100 130 200 250 300 ReaIIzed count (C) Figure 3.5: Simulation experiments comparing conditional estimates: third moment estimates + marker and second moment 0 marker. This figure corresponds to the left-sided tail-area with (51 = —2.0. Substantial row-wise correlation is present. The abscissa is the realized count while the ordinate is the estimated count. For the significance of different (10, k0, 60)’s refer to the main text. Third moment skewness corrections enhance the estimation accuracy. numerical Optimization proposed in Higham (2002); however, recently proposed N ew- ton method of Qi and Sun (2006) is noteworthy for its speed and accuracy. Higham (2002) provides a lucid exposition of the problem of finding the nearest correlation matrix. 
Figures 3.5 and 3.6 compare the second moment estimates versus the third moment estimates for 6 = —2.0 and 6 = —2.5, respectively. Both cases use 60 = 1 for the center-area boundary. The choice of a center-area is discussed in Section 3.5. For each sets of (19,100,190), 800 test data were simulated. On each test data the approach was applied in its entirety and no additional knowledge was assumed. Throughout the 63 Estimated count Estimated count Figure 3.6: For the caption, refer to panel (c) on the next page. 120 100- 80- 60- 40 20- 120: 100- 80- 60- 40- 20' (’9 = 11k0 = 06,00 = 50%)].1 + ++-‘ «I? +: + + #* ++ 0 +0) cg o ++++Qi+ +++J$ O¢8 . O + + O s cc 3 2° 4° 60 80 100 120 Realized count (3) (’9 = 2, k0 = 0391,190 = 384) :- ++ + + +++++‘ +fi+++++ +0 + ++qd++d$ogl + $ 60 l is 2’0 4’0 610 8’0 100 120 Realized count (b) 64 15,, (k = 3, k0 = 0.333, 00 = 300) + 75120- + + it o B 90’ ++ +08 0 ...- 00 (D O 0 .§ 1;; 60- I.“ 30- ,r 0 i) 610 . 90 120 130 Realized count (C) Figure 3.6: Left-sided tail-area with 61 = —2.5, else the caption for Figure 3.5 is applicable. mean of the estimated P(V/c) was used as the final estimate of the tail-area count. Note that the all-null-theoretical-estimates are always 20 for 61 = —2.5 and 73 for 61 = —2.0, regardless of the test data analyzed. For all three sets of (k, k0,190), third moment skewness corrections enhance the estimation accuracy. Evidently, the third moment estimates are significantly less prone to “over / under estimation events” than the second moment estimates. A positive results in three very different simulations emphasize the over-all usefulness of the developed method. 65 3.5 Discussion Improved DNA microarray technology, refined standardization procedures, and a careful execution of laboratory protocols collectively lead to testing situations with individually accurate but strongly dependent null hypotheses. Inter-test dependency arising from intrinsic gene-gene interactions cannot be circumvented by experimental design. Therefore, the effects of dependency on the accuracy of decision-making must be carefully analyzed. In particular, due to dependency, the “realized” v/r may vary substantially on a case-to-case basis and the control of E(V/ R) may no longer ensure the quality of reported discoveries. Treating general dependency is impractical, but admitting second-order depen- dency is possible and widely attempted. The developments in this chapter emphasize that an “explicit” combination of estimated correlation with identifiability can miti- gate the unfavorable effects of inter-gene correlation, and that the moment theory of null statistic histogram allows to do so. It is tempting to conclude that a large number of vectors drawn from the distri- bution N(0, R) can yield the required moment estimates and the mathematical work in Sections 3.2.1—3.2.2 is unneeded. However, such a conclusion neglects excessive sampling errors that are present in sample correlation coefficients. With substantial sampling errors, a realistic estimate of the underlying R cannot be obtained but the quantities g(r) and h(R3) are still obtainable. Permutation calculations, as in Section 4 of (Efron, 2007a), present an alternative way to estimate the moments. They too can run into computational difficulties, especially when the test statistic is computation intensive. An even bigger difficulty is innate when samples are few: For a two-state study like HIV, 4—4 samples each, only 70 unique permutations are available. 
Nevertheless the permutation based approach 66 is promising as a subject of further research. When a direct extraction of inter-test correlation is not feasible, the “single om- nibus parameter models” remain useful: They allow the user to systematically exer- cise judgement by selecting different values for fl to examine a range of “correlation effects.” The distribution of interest P(V, C) tends to have complicated support like ’D (Figure 3.1), and the maxent algorithm is well-suited to such complicated support regions. At a more fundamental level, moment is intended to minimize the amount of unintentional prior information brought into the inference. For the present approach, apart from the numerical parameters A (bin width) and the mesh resolution in maxent, the only open choice is 60, the center-area boundary. The choice 60 = 1 in this work is based on the first eigenvector analysis of (Efron, 2007a) which suggests that (within certain approximations) the interval [—1, 1] has completely opposite count behavior from the rest of the Z space. Does more inter-ZZ- correlation translate into more extreme Cor(V, C)? The answer is surprisingly no. In the BRCA example, Cor(V, C) is —0.89, whereas in the HIV example it is reduced to -0.75. Further insight into this behavior should be a useful contribution. Here, the aim was to mitigate the unfavorable effects of inter-gene correlation when estimating the number false discoveries. In fact, inter-gene correlation can be “exploited” to yield a superior gene-ranking. This is the topic of the next chapter. 67 Chapter 4 Exploiting Correlation to Improve Gene-ranking In the LSST DGE detection formulation (Section 2.1), genes are represented by uni- variate summary statistics. Moreover, there is correlation among test statistics due to inter-gene correlation among gene expression variations. The observation that in high-dimensional inference there can be ways of mapping the original set of statistics (t1, . . . , tm) to a newer set (t*, . . . , tin) with more statistical power is very appealing. The research described in this chapter develops a technique to obtain one such map- ping. The key idea is to combine correlation with purity through a distance metric that can account for the effect of correlation on the joint distribution of Ti’s. As a spe- cial case, we develop a method that builds upon the widely used two-sample t-statistic approach suitable for two-state studies. The method uses the Mahalanobis distance as the distance metric. An extension accommodating multi-state and continuous-state studies is also discussed. 68 Organization of the chapter. Section 4.1 provides an overview of the proposed approach. Section 4.2 obtains closed form expressions for the minimum Mahalanobis distance estimates. Section 4.3 builds on the theory of Section 4.2 to develop the tEllipsoid gene-ranking method. In Section 4.4, we apply tEllipsoid to the prostate cancer data of Singh et al. (2002) and evaluate the gain in statistical power. Sec- tion 4.5 discusses the implications of these results. 4.1 Overview Detecting differentially expressed genes in the presence of substantial inter-gene cor- relation is a challenging problem. Research has focused largely on understanding the harmful effects of correlation on the threshold settings demarcating null and non-null genes. The research discussed in Chapter 3 developed a method which explicitly admits the observed sample correlation in the analysis and offers a more accurate assessment of significance cut-offs. 
The research described in this chapter shows that correlation can, in fact, be exploited to share information across tests, which, in turn, can increase statistical power. The key contributions in Chapter 3 were in part mo- tivated by the exceedingly small sample situations (72. ~ 5-10), whereas the work in this chapter is intended to benefit situations with n 2 20. It is helpful to think in terms of mapping the original set of observed statistics (t1, . . . , tm) to a newer set (t*, . . . ,tfn) for better statistical power (Section 2.3). The literature is not devoid of attempts to develop such mappings that exploit correlation among (the random variable interpretation of these) test statistics, but such efforts have not produced compelling results. We posit that the limitations of such develop- ments are due, at least in part, to neglecting identifiabz’lz’ty—the act of incorporating 69 the genes whose test statistics fall near the center of the null distribution as explicitly null in the inference. The statistical risk of imposing identifiability weigh less than not doing so because of the fact that in most comparative studies the activity of a great majority of genes remains unchanged with respect to the treatment of interest (Section 2.5). This chapter presents a framework to obtain a better gene-ranking by combin- ing correlation and identifiability through an optimization criterion. The framework builds upon the widely-used two-sample t-statistic approach and uses the Maha- lanobis distance as the optimality criterion. Although the initial motivation was to improve statistical inference in two-state microarray studies, the framework readily generalizes to multi-states and continuous-states as well as to other multiple compar- ison applications. The connection between the standard t-statistic and simple linear regression as discussed in Section 2.1 is crucial to this generalization. Recall the basic LSST formulation from Section 2.1: Genes are represented by summary statistics and the corresponding null hypotheses, H1,H2,...,Hm T1,T2,...,Tm t1,t2,...,tm, where T,- refers to a “random value” and t,- the “observed value.” The magnitudes of ti’s establish a gene-ranking, and the top 7' << m genes with the largest t scores are reported as statistically significant discoveries. The investigator can either explicitly supply r or rely on the false discovery rate (FDR) calculations to find a maximal r with the allowable FDR. The present discussion assumes that r is fixed. 70 The issue of correctly estimating the F DP in the presence of correlation has re- ceived much recent attention because highly correlated tests increase the variance of the F DP leading to unreliable results (Owen, 2005). As discussed in Efron (2007a) and Chapter 3, for “over powered” data sets, there may be significantly fewer tail- area null counts than expected, while for “under powered” data sets, the situation can worsen with many more tail-area null counts than expected. Importantly though, techniques for correctly estimating the F DP do not change the gene-ranking, but only the size of the reported list. The research discussed in this chapter was motivated by the notion that, for “under powered” data sets, it might be possible to exploit correlation among test statistics to establish a gene ranking that has better statistical power than the original t2- based ranking. The method that resulted from an exploration of this question indeed seems to improve the statistical power of all data sets. 
The proposed method uses, (i) a vector of the observed statistics t = [t1,t2, . . . ,tm]T and (ii) an estimate of the covariance matrix of the vector T = [T1, T2, . . . ,Tm]T, to output a substantially revised version of t, denoted ult, whose corresponding entries can be used to establish an improved gene-ranking. The method is summarized as follows. Let T ~ (u, 2). Note that no distributional information nor higher order statistics of T are assumed. Now based on the observed value t, we can estimate ult, but while doing so we invoke the zero assumption (ZA) (Efron, 2008) that the smallest P0(%) of ti’s are associated with null genes. Based on the ZA, we can set corresponding entries of u to zero. For the remaining entries of u we obtain minimum Mahalanobis distance (Mahalanobis, 1936) estimates. Inter-gene correlation causes the vector T to distribute around the center of mass 11 in an hyperellipsoidal manner, and the Mahalanobis distance is a natural way to 71 measure vector distances in such a distribution. In fact, to emphasize the geometric intuition of tracking the center of an ellipsoid, the method is named as “tEllipsoid.” Through extensive experimentation with both real and simulated data, it has been found that for a truly null t,- which happens to be in tail-area, the corresponding uz-It consistently tends to zero (its theoretical value). TWO prior research efforts, Storey et al. (2007) and Tibshirani and Wasserman (2006), were particularly useful in formulating the present approach. Interestingly, both approaches aim at exploiting the signal structure among non-null tests, whereas the present approach aims at exploiting the structure among null tests; see Section 2.3 for details. 4.2 The Proposed Method An m x n matrix of gene expressions, for m genes and n samples, is given. For the present discussion we assume that the samples fall into two states 19 = 1 and k = 2 and there are ”k samples in group I: with n1 + n2 2 n. The generalization to multi- state and continuous-state is discussed at the end of the section. We start with the standard (unpaired) t-statistic: t2- = ———' (4.1) where ink is the mean of gene i in group k and s7; is the pooled within-group standard deviation of gene i. If the ith gene is indeed null, then we expect the random variable T,- ~ (0, u/ (12—2)). Here, the degrees of freedom 11 is obtained from either the unpaired t-test theory or the permutation null calculations as discussed in Section 2.1. Note that T,- is a random variable and t,- is its observed value. If gene i is non-null, then 72 we expect T,- ~ (11.2303). For non-null genes, the values of uz- and 0',- depend on the amount of up / down regulation, the number of samples in each group, and 11. Without loss of generality, we may assume that the genes are indexed so that W S |t2| S S ltml- (4-2) Then, a reasonable way to impose identifiability on null genes is through the ZA, namely, that P0(%) of the genes—those with the smallest t statistics—are null. Sec- tion 2.5 provides a comprehensive discussion regarding the rationale behind the ZA. The use of the ZA is justified in the present situation as long as P0 is sufficiently small so that the bottom P0(%) genes would almost certainly be declared null for reason- able FDR’s. Potentially, the statistical risk of “imposing” identifiability weigh less than not imposing it, especially when purity holds. Empirical evidence of Section 4.4 supports this intuition. 
Formally, the ZA is stated as follows: Let c be the largest integer (gene index) such that c/m 3 P0 / 100, denoted c = [0.01mPol, (4.3) then genes with indices 1,2,. . . ,c are assumed null. Let us partition the set of t statistics into those corresponding to genes declared null under the ZA, {t1, t2, . . . , tc}, and those for the remaining m — c genes which continue to compete for the non-null designation, {tc+1, tc+2, . . . , tm}. (The present c is different from the c in Chapter 3.) For convenience, we introduce the following vector notation, T T _ T T _ T T t - l to town l - l ta» to) l - 73 Then the random vector T is distributed in the following way: T ~ (11, E), (4.4) where u is the underlying mean vector and 2 the covariance matrix. The correspond- ing partitions of (u, E) are denoted u(0) and 2: 2(00) 2(01) “(1) 23(10) 2(11) u: The central hypothesis here is that there is a vector, say ult, whose elements represent a reordering of the elements of the t, such that gene-ranking represented by u|t has better statistical power for detecting non-null genes than that based on is itself. The present effort focuses mainly on the second moment distributional character- istics of t. However, in fact, if the gene expressions are normally distributed, then, perhaps, t is described more accurately by the multivariate Student distribution. Exploitation of this additional structure will be considered in future work. 4.2.1 Choosing a Distance Metric We are interested in obtaining an estimate of the vector mean 11 based on the obser- vation t. This requires an appropriate metric in the space of t vectors, with which to quantify the distance of the observed t from the center of mass 11, say dist(t, u). The 82 norm induces a useful metric between t and u provided that we first decorrelate 74 the vector elements as 2—1/2 (t — 11), thus yielding dist(t,u) = \/n>:-1/2(t—u)ll2 = fl; _ u)T2-1(t — u). (M) This weighted Euclidean distance is sometimes called the Mahalanobis distance in the pattern classification literature as in Deller et al. (1999) and Dejviver and Kittler (1982). We can relate t to u through the Mahalanobis distance but while doing so we invoke the ZA, which, in turn, implies that the first c entries of u are zero. This yields the estimate 0 u* = , where uh) = argmin (4.6) 1121) 11(1)ElRm_c T -1 t(0) " 0 23(00) ”(01) t(0) “ 0 to) - “(1) 2(10) 23(11) to) - “(1) In effect, ufl) combines the identifiability information based on the ZA with the information about the covariance structure of T which too can be obtained from the measured X itself. Notably the optimization in Eqn. (4.6) enjoys closed form solution: 1121) = 13(1) — 2(10)2&)}))t(0). (4.7) The derivation leading from Eqn. (4.6) to Eqn. (4.7) follows. 75 Suppose that .4 -A B . 2 = C D and u(1) = ta) — 11(1). (4.8) We also have C 2 BT. Substituting these in Eqn. (4.6) yields: 11:1) -_- fiSgEIElE—ctfg’fitw) + zeacqo) + afipam (4.9) 1 In Eqn. (4.9), by setting the gradient w.r.t 11(1) to 0, we obtain: 11),) = —CTD—1t(0). (4.10) Now for 2‘1, we can appeal to the matrix inversion lemma (Golub and Van Loan, 1996): ”(0%) (1+ 2(01)Q_12(10)2(_0l))) “Zahzwnq'l _Q*12(10)2(-0%)) Q—l 2‘1: 1 where Q = 201) — 2(10)2(—0%))2(01). Plugging this in Eqn. (4.10) yields: 4 _ _1 Combining Eqn. (4.11) with Eqn. (4.8) provides the desired expression: *_ —1 “(1) - t0) — 2<10>’3(oo)t(0) '3'- 76 4.2.2 An Intuitive Interpretation If t0) is written as “(1) + e(1), then Eqn. 
(4.7) can be seen as 2|: —1 Ul(1) : 11(1) + [9(1) ’ 23(10)2(00)t(0)l ' Then, the proposed method is interpreted in the following way: The method uses the identifiable null cases to predict and, in turn, suppress the “null energy” in the competing cases. 4.2.3 Estimating the Covariance Notice that Eqn. (4.7) involves theoretical covariances whose estimates must be ob- tained from the data set at hand. To estimate the required entries of )3, we make two observations. The validity of these observations can be established through computer simulations. An intuitive argument also appears. Note that Eqn. (4.7) does not re- quire the covariance between two non-null Ti’s. The realization of the fact that the t-statistic is a “standardized” statistic helps understand these covariance—correlation relationships. Observation 1. If genes i and i’ both are null, then u COV (Ti? Tzl) % COI‘ (Xi? Xi’) :5 (4.12) This observation maybe intuitive to the reader from Eqn. (4.1) itself, or it is easily verified through a computer simulation. Efron (2007a), Owen (2005), and the method deveIOped in Chapter 3 all use this observation for their respective conditional FDR calculations. 77 Observation 2. Similarly, if the gene i is null and i’ non-null (or conversely), then 1/ ’ u—2 Cov (T,, T,,) 2 Cor (X,, X,,) (4.13) where Cor(ff'i, X24) denotes the “residual” correlation between gene i and i’. Residual correlation is the correlation in the fraction of gene expression that is unaffected by the treatment of interest. For null genes the fraction is one because by definition the gene expression X,- for a null gene is independent of the state variable L. For a non-null gene, the fraction is determined by how strongly that gene is affected by the treatment. Equations (4.12) and (4.13) suggest to use “residual” sample correlations to esti- mate Cov(t,-, tiz): (“22' 55%) (532' 530-) where :iiz-j denotes the (observed) residual corresponding to the ith gene and the jth microarray. The scale factors cancel in the terms 200) and 23—1 so that (00)’ estimating u/(u—2) is not required. These arguments hold for two-state, multi-state, 607! (E,Til) CX , (4.14) and even continuous-state studies provided the test statistics are obtained through simple linear regression. Here, the entries of the residual matrix are the differences between the observed expression values and the corresponding predicted expression values from regression equations X,- = a + bL + e,i = 1,. . . ,m. Note that the two- sample t-statistic can be interpreted as simple linear regression with a binary-valued covariate. The situations where test statistics are not from simple linear regression will be considered in future work. 78 4.2.4 The Final Equation In light of Eqn. (4.14), Eqn. (4.7) takes the practical form 62:1) = t0) — C(10)?)(Bbtm), (4.15) where C is the sample correlation matrix of the residuals. In most cases computing the . . ~ _1 . . ~ _ 1 full matrix inverse (C(00)) IS not necessary and solvmg the term C (00)t(0) through an efficient linear solver reduces the computation considerably. 4.3 Implementation Details This section outlines a self-contained differential analysis algorithm based on the ideas discussed in Section 4.2. Its name tEllipsoid was coined to emphasize the geometric intuition of tracking the center of an hyper-ellipsoid. TEllipsoid takes a gene expression matrix X and assigned biological conditions and provides a specified number, say r, of the most differentially expressed genes. 
In principle, the ranking is based on the set {11;} from Eqn. (4.6). In practice, we rely on the estimates {212‘} from Eqn. (4.15). Two-state implementation begins by re-indexing the genes based on their two- sample t-statistics [Eqn. (4.2)]. Then, based on the ZA, the first c genes are identified as null, as specified in Eqn. (4.3). By default, P0 is set to 50(%). Although the choice 50% is somewhat arbitrary, this fraction has worked well empirically in the data sets tested. A P0 as low as 10(%) is found to enhance the power. Future research may yield more rigorous methods for choosing P0. In the two-state implementation, in order to nullify any genuine treatment differ- 79 ences, X is converted to «It: by subtracting each gene’s average response within each treatment group. The residual sample correlation matrix C of «l; is computed sub- sequently. The crucial step is to compute fifll based on Eqn. (4.15). The elements T of (1‘1“)T = [03x1 (fi(1)) ] determine the gene-ranking: A gene with bigger li’Zfl is assigned a higher rank. The first r genes are reported as top r statistical discoveries. The multi-state / continuous-state implementation beings by re-indexing the genes based on their simple linear regression t-statistic. The entries of the residual matrix it" are the differences between the observed expression values and the corresponding predicted expression values from regression equations X,- = a + bL + e,i = 1, . . . , m. 4.3.1 Numerical stability and Computational Complexity Because the number of samples n is often less than the number of genes m, the residual sample correlation matrix turns out to be singular and hence non—invertible. Therefore, we add a very small correction term (= 10-10) to its diagonal entries to make it nonsingular, and in effect, invertible. After this correction, tEllipsoid shows excellent numerical stability. Equation (4.15) involves matrix inversion, which, if performed in a naive way, could be a prohibitive operation, since microarray data sets may have several tens —1 (00) neous linear equations (Cmmx = tan) is much faster than explicitly computing of thousand genes. Indeed, solving the term C t(0) as a system of simulta- C(_0})). In particular, we can employ the Cholesky decomposition to exploit the fact that the matrix C(00) is symmetric and positive definite. MATLAB imple— mentation of tEllipsoid uses the in-built linslove with appropriate settings, which, in turn, uses highly optimized routines of LAPACK (Linear Algebra PACKage— http://www.netlib.org/1apack/). 80 For the Prostate data (used in Section 4.4) with 12625 genes and 102 samples, tEllipsoid, running on a computer with a 2.2 GHz dual-core AMD Opteron processor with 8 GB of RAM and MAT LAB version R2006b, requires just under 40 seconds to report the final gene list. For the same settings, the implementation with explicit matrix inversions takes ~ 10 minutes. 4.3.2 Algorithm for Two-State Studies tEllipsoid: An enhanced gene-ranking for differential gene expression detection Input: X = Labeled m x n gene expression matrix; r = Size of gene list Output: The gene list containing top r differentially expressed genes 1. Calculate two-sample (unpaired) t-statistics: t,- = (573,-;2 — fall/3i 2. Reindex genes such that |t1| g |t2| S S |tm| 3. Gather first c : [0.01mP0l ti’s in a vector t(0); By default P0250(%) 4. Convert X to X by subtracting each gene’s average response within each treat- ment group 5. 
Compute C 2 the (residual) sample correlation matrix of X 6 Find (fi*)T = T (13* )T where G* : t — C C_1 t - cx1 (1) , (1) (1) (10) (00) (0) 7. Determine gene-ranking: Assign a gene with bigger |iZ:‘| a higher rank 8. Report top r genes as statistical discoveries 81 4.3.3 Algorithm for Multi-State and Continuous-State Studies tEllipsoid: An enhanced gene-ranking for differential gene expression detection Input: X = Labeled m x n gene expression matrix; r = Size of gene list Output: The gene list containing top r differentially expressed genes 1. Calculate per gene simple linear regression t-statistics 2. Reindex genes such that |t1| S |t2| g g ltml 3. Gather first c : [0.01mP0] ti’s in a vector t(0); By default P0:5O(%) 4. Obtain the residual matrix X whose entries are the differences between the observed expression values and the corresponding predicted expression values 5. Compute C : the sample correlation matrix of X 6 Find (fi*)T = T (13* )T where ii" = t — G GT1 t ° cxl (1) , (1) (1) (10) (00) (0) 7. Determine gene-ranking: Assign a gene with bigger Iiifl a higher rank 8. Report top r genes as statistical discoveries 4.4 Test Cases To appreciate the increase in statistical power attributable to “exploiting” correla- tion, the performance of tEllipsoid is contrasted with three leading techniques. The first is the raw t-statistic itself and the other two are SAM [Significance Analysis 82 of Microarrays (Tusher et al., 2001)] and EDGE [Extraction and analysis of Dif- ferential Gene Expression (Leek et al., 2006)]. SAM adds a small exchangeability factor so to the pooled sample variance when computing the two-sample t-statistic: dz- 2 (53,-;2 — ii;1)/(si + so); Whereas EDGE is based on a general framework for sharing information across tests (see Storey et al. (2007)). EDGE is reported to show substantial improvement (in terms of statistical power) over five of the leading tech- niques including SAM (Storey et al., 2007). The other four are: (i) t/F—test of Kerr et al. (2000) and Dudoit et al. (2002); (ii) Shrunken t/F—test of Cui et al. (2005); (iii) The empirical Bayes local FDR of Efron et al. (2001); (iv) The a posteriori probabil- ity approach of Lonnstedt and Speed (2002). It should be noted that tEllipsoid can also serve as an additional layer to SAM and EDGE and enhance their power. To determine the performance quality of various techniques, we focus primarily on the empirical FDR in the reported gene list: Empirical FDR 2 NoFP /r, where N oF P 2 the number of false positives. Broadly speaking, smaller the FDR better the technique. 4.4.1 Case I: Real Data With Induced Differences Singh et al. (2002) studied m 2 12625 genes on n 2 102 oligonucleotide microarrays, comparing n1 2 50 healthy males with n2 2 52 prostate cancer patients. The purpose of their study was to identify genes that might anticipate the clinical behavior of Prostate cancer. We downloaded the .CEL files from http://www-genome .wi .mit. edu/MPR/prostate. The software RMAExpress (Irizarry et al., 2003) was used to obtain high quality gene expressions from these .CEL files. We let RMAExpress apply its in-built background adjustment, however, the quantile normalization was skipped. Each gene was represented in the final expression matrix X by the logarithm 83 (base 10) of its expression level. Taking the log is thought to increase normality and stabilize across group standard deviations (Tsai et al., 2003). Algorithm testing required an expression matrix X with the knowledge of truly non-differential genes. 
At the same time, we wanted the inter-gene correlation in X to resemble that in the real microarray data. These two seemingly conflicting requirements were satisfied concurrently by row standardizing a real X. The prostate cancer matrix X was transformed to X by subtracting each gene’s average response within each treatment group, and by normalizing within group sample mean squares. That is, for each group k 6 {1,2}, (1/nk) 23- ii,- 2 0 and (l/nk) 23- X12]- 2 1. Here, the sum runs over corresponding ”k samples only. With this transformation, all genes have equal energy and yet the same within group inter-gene correlation structure as the original X .Note. Normalizing within group sample mean squares to unity is not implemented in the tEllipsoid algorithm. To generate a test data set from i, its 102 columns were randomly divided into groups of 50 (2711) and 52 (2n2). Next mu (md) genes were randomly chosen for up (down) regulation by adding a positive (negative) offset xu (red) to the corresponding entries in group 2. Various choices of (mwmd, mu,a:d) were tested to represent a range of differential analysis scenarios encountered in practice. Two cases were studied. In the first, the proportion of truly differential genes, say p1, was taken to be relatively small: p1 ~ 0.01-0.05. The second case em- ployed a larger p1 ~ 0.1. The former simulates microarray studies seeking genes that distinguish subtypes of cancer, diabetes, etc., whereas the latter resembles studies comparing healthy versus diseased cell activity. Results were obtained using the subroutines samr.r from the package “samr” and statex . r from the package “edge.” Both routines compute their native gene summary 84 statistics which, in turn, can be used to determine top r genes. Case 1 [pl 2 0.025, mu2200, md2100, $1,201, and xd2-0.1]. Figure 4.1(a) shows plots of the F DRs for 40 different data sets with the size of the reported list, r2300. A large value of r coincides with an attempt to extract as many differential expressions as possible, a desired goal especially in microarray studies performed to identify genes that are to be explored further — experimentally or computationally — to gain better understanding of underlying gene networks. Since the differential signal $11,201 and xd2-0.1 is rather weak, recovering a good list is not easy as evident from the results — among all methods only tEllipsoid achieved sufficiently low FDRs to rescue a few X’s. Figure 4.1(b) presents results for 72100. A smaller r would be chosen to identify high-quality class distinguishing features for gene-expression-profiling-based clinical diagnosis and prognosis, where the goal is to build accurate classifiers and predictors. Whereas Singh et a1. (2002) build a classifier around only 16 of 12625 features, they do discuss the need to include as many reliable features as possible. Remarkably, for 37 out of 40 X matrices, tEllipsoid reports gene lists with no false discoveries at all, while the other techniques fail to obtain a single gene list with the FDR < 0.5. Case 2a [pl 2 0.1, mu2600, md2600, xu20.02, and xd2-0.02]. In this set of experiments, p1 is increased, but the differential signal is reduced. This situation also proves to be challenging for the existing techniques. However, tEllipsoid provides the FDR of ~ 0.5 for r21200, and, again for 72300, while it reports most gene lists with no false discoveries at all. Case 2b [p1 20.1, mu2600, md2600, $1,201, and xd2-0.1]. 
This subcase is de- signed to assess the effects of small sample sizes on performance. n1 and n2 are both reduced to 20. We randomly chose 20 columns per group from the original prostate 85 1 - D D [I Cl U E El Ct] El U Cl D D 0.5 [3 D a U D D D :1 13 E] [II CD D D DC! [:1 1:1 :1 DD :1 U :1 O . i . 0 10 20 30 40 Data Index (a) 1 - I 5', If “‘ o 0.5- " LI. 0 1:1 1:1 0 "ac-$1351 a 20 30 40 Data Index (b) Figure 4.1: FDRs for Case 1. The number of truly differential gene is 300. Panel (a) r2300; Panel (b) r2100. “Exploiting correlation” considerably enhances the statisti- cal power. Square (El) marker 2 tEllipsoid. Lines: solid 2 raw t—statistic; dotted 2 SAM; dashed 2 EDGE. 86 El I E 0.542111%]thqu QEDUDWJ O0 10 2'0 30 40 Data Index (a) 1 L: D 0.5- Cl OWE] W 0 10 20 30 40 Data Index 0)) Figure 4.2: F DRs for Case 2a. The number of truly differential gene is 1200. Panel (a) r21200; Panel (b) r2300. “Exploiting correlation” appreciably enhances the statistical power. Square (CI) marker 2 tEllipsoid. Lines: solid 2 raw t-statistic; dotted 2 SAM; dashed 2 EDGE. 87 cancer X, and then applied the data generation process (including row standardiza- tion) detailed in Subsection 4.4.1. Reduction in the number of samples is compensated by increase in the differential signal. The FDRs for tEllipsoid, Fig. 4.3, are excellent suggesting that tEllipsoid increases power of small sample data sets too. 4.4.2 Case ll: Simulated Data Before devising the prostate data test setup, tEllipsoid was tested on several simulated data sets. Below we discuss some simulation results that shed further light on the small sample behavior. Let us denote by X“) the ith column of a simulated expression matrix X. We assume that the random vector X“) is multivariate Gaussian with mean 0 and covari- ance matrix W. Each such column represents m23226 genes with a covariance matrix W that introduces roughly the same amount of correlation as found in the BRCA data of Hedenfalk et al. (2001). We choose mu 2 50, md 2 50, ran 2 1, 1rd 2 --1,n1 2 10, and n2 2 10. Figure 4.4 shows plots of the F DRs for r250 and r2100. Table 4.1 shows results for some data point from Fig. 4.4(b). Shown are the top 100 values of it: and each corresponding original t,- with concomitant rank. With smaller 71., preeminence of tEllipsoid with respect to existing techniques scales down a bit. Nev- ertheless, for r250 case, for 25 out of 40 simulated X realizations, tEllipsoid achieves a low FDR of ~ 0.1 or less. Interestingly, with a smaller n, SAM outperforms the other two techniques. This is not entirely surprising as a smaller n can make the noise in the per gene pooled vari- ance s,- (and possibly the equivalent quantity in the EDGE algorithm) more promi- nent. Nevertheless, SAM does mitigate this issue in some measure by using the exchangeability factor so to adjust the effective pooled variance (Tusher et al., 2001). 
88 1—50 51-100 11,: rank 11’; t,- t,- rank it; rank it”; t,- t,- rank 1 4.22 5.87 1 51 2.18 3.45 23 2 -4.17 -555 2 52 2.17 3.05 42 3 —3.93 -4.26 5 53 2.16 2.80 82 4 -3.74 -4.12 7 54 -215 -257 122 5 -3.58 -4.49 4 55 2.15 1.96 357 6 -349 -334 28 56 -2.14 -147 751 7 -345 -425 6 57 2.13 2.25 229 8 3.35 3.87 10 58 2.13 1.77 486 9 -333 -320 35 59 2.13 1.44 785 10 3.33 3.77 13 60 2.12 2.14 273 11 3.25 3.42 25 61 -211 -1.48 744 12 -3.16 -2.18 260 62 2.10 1.80 453 13 3.14 4.54 3 63 2.09 2.60 114 14 3.10 2.87 65 64 -209 -205 312 15 -3.08 -354 17 65 2.09 2.70 96 16 -3.07 -2.80 80 66 2.09 2.23 237 17 3.06 3.49 20 67 2.08 2.34 188 18 3.02 2.29 213 68 -2.08 -224 232 19 -299 -334 27 69 -2.06 -253 130 20 -293 -3.13 38 70 -204 -211 283 21 -292 -292 57 71 -204 -295 54 22 2.86 3.26 31 72 -2.o3 -3.08 40 23 -2.83 -2.82 74 73 -202 -230 210 24 2.82 2.37 180 74 -2.01 -3.67 15 25 -2.81 -213 276 75 2.00 2.62 109 26 2.81 3.48 21 76 -1.98 -2.38 171 27 -279 -301 47 77 1.98 1.43 795 28 2.70 2.87 64 78 1.96 1.69 549 29 -2.66 -315 37 79 -1.95 -1.47 746 30 -2.58 -3.85 11 80 1.95 1.95 361 31 —2.56 -2.84 71 81 1.95 2.81 77 32 -255 -1.72 524 82 -194 -141 813 33 -254 -2.63 106 83 1.94 3.40 26 34 -254 -2.69 98 84 1.94 1.30 948 35 2.53 2.30 209 85 -1.94 -3.27 30 36 2.48 2.45 148 86 -1.93 -1.11 1190 37 -247 -229 212 87 -193 ~1.37 872 38 -2.46 -321 33 88 -1.93 -3.44 24 39 -243 -244 154 , 89 -1.92 -3.07 41 40 2.43 2.71 94 90 -1.92 4.50 726 41 2.40 2.86 66 91 1.90 3.62 16 42 -234 -2.60 115 92 .1.90 -2.82 75 43 -234 -2.98 50 93 1.89 1.25 1007 44 -233 -3.80 12 94 1.87 3.89 8 45 -232 -2.06 306 95 -l.86 -3.49 19 46 2.29 1.81 444 96 -l.86 -2.08 300 47 -227 -1.17 1110 97 1.85 1.20 1074 48 2.26 1.97 347 98 -1.83 -290 60 49 2.24 3.75 14 99 1.83 1.39 833 50 2.20 3.88 9 100 -132 -195 367 Table 4.1: TEllipsoid in action with Top 100 it: ’8. Corresponding ti’s and their rank are also shown. TEllipsoid 2 22 NoFPs; raw t-statistics 2 68 NoFPs. Truly null genes are printed in bold—sans typeface. 89 0 1’0 2‘0 30 40 Data Index (a) D Cl E] El 0 1 0 29 30 40 Data Index 0)) Figure 4.3: F DRs for Case 2b. Panel (a) r21200; Panel (b) r2300. The sample size is smaller than that in Cases 1 & 2a and yet “Exploiting correlation” has apparent benefits. Square (El) marker 2 tEllipsoid. Lines: solid 2 raw t-statistic; dotted 2 SAM; dashed 2 EDGE. 90 1 1 :1: ~ . , ‘ . E 0.5”}!04 . “”3 ,s-JH-N I“! " ll 'r‘b’, Eb ' 4 ”an? 1. 3100000 50 Duct! U CF Chm 0 D 9C.” D , O 10 29 30 40 Data Index (a) 1 2 3O 29 Data Index 0)) Figure 4.4: F DRs for simulated data. Panel (a) r2100; Panel (b) 7250. (Small) sample sizes: n1210, n2210. Yet, “Exploiting correlation” considerably enhances the statistical power. Square (El) marker 2 tEllipsoid. Lines: solid 2 raw t-statistic; dotted 2 SAM; dashed 2 EDGE. 91 4.5 Discussion By allowing researchers to examine the simultaneous expressions of enormous numbers of genes, microarrays promised to revolutionize the understanding of complex diseases and usher in an era of personalized medicine. However, the shift in perception of that promise is palpable in the literature. A 1999 Nature Genetics article (Lander, 1999) is entitled “Array of hope," but a 2005 Nature Reviews article (Frantz, 2005) is entitled “An array of problems." It is not unusual for impacts of new technologies to be overestimated when first deployed, then to have the expectations moderated as the technologies reveal new complexities in the problems they are designed to solve. 
In the study of microarray data, the need for exceeding care in the design and regularization of experiments and data collection are understood to be critical, but the biggest hindrance to progress has been the data interpretation. In particular, the biggest challenge seems to be the treatment of intrinsic inter-gene correlation. In most microarray data there are at least three vital resources: (i) identifiability (ii) immense parallel structure, and (iii) inter-gene correlation itself. In this light, tEllipsoid can be viewed as exploiting more than correlation as a means of sharing information across tests, as it also involves identifiability. A crucial step in formulating tEllipsoid was the comprehension of the effects of inter-gene correlation on Cov(Tz-, Til). In light of Observations 1 and 2, the choice of the Mahalanobis distance was intuitive, as it is already known to give computationally attractive solutions through the matrix inversion lemma. Limited time and resources—and perhaps also the necessity for scientific focus— often require biomedical researchers to work on only a small number of “hot (gene) prospects.” Even under such highly conservative conditions, however, misleading results can occur, as evident in the results of Figs. 4.1—4.4. For all their careful 92 develOpment and statistical power, even state-of—the—art tools that do not account for correlation can report spurious gene lists. The extra statistical power available by exploiting inter-gene correlation promises to further guard against anomalous results that can have serious consequences for the trajectory of a study of gene function, causation, and interaction. In summary, this chapter has reported the development and testing of a novel framework for the detection of differential gene expression. The framework combines the exploitation of inter-gene correlation to share information across tests, with iden- tifiability — the fact that in most microarray data sets, a large proportion of genes can be identified a priori as non-differential. When applied to the widely used two-sample t-statistic approach, this viewpoint yielded an elegant differential analysis technique, which requires as inputs only a gene expression matrix, related two-sample labels, and the size of desired gene-list r. Empirical evidence suggests that exploiting corre- lation substantially enhances statistical power. Usually, with increase in microarray samples, power tends to increase considerably, but, even for small sample sizes, the performance improvement is noticeable. 93 Chapter 5 Concluding Remarks 5.1 Summary and Concluding Remarks This work has investigated the effects of inter-gene dependency on statistical methods for differential gene analysis. Differentially expressed genes are pivotal in understand- ing highly intricate genotype—phenotype relations and devising appropriate therapeu- tic interventions. The effort focused on understanding, correcting, and exploiting the effects of second-order inter-gene dependency within the large-scale significance test- ing formulation for differential gene analysis. It was shown that combining inter-gene dependency with gene expression purity yields more accurate and powerful statistical inferences yielding a more accurate list of differentially expressed genes. The main statistical theme of this work has been to draw second-order conditional inferences based on cases that are theoretically more likely to be null. 
Limited time and resources, and perhaps also the necessity for scientific focus, often require biomedical researchers to work on only a small number of "hot (gene) prospects." Even under such highly conservative conditions, however, misleading results can occur, as is evident in the results of Figs. 4.1–4.4. For all their careful development and statistical power, even state-of-the-art tools that do not account for correlation can report spurious gene lists. The extra statistical power available by exploiting inter-gene correlation promises to further guard against anomalous results that can have serious consequences for the trajectory of a study of gene function, causation, and interaction.

In summary, this chapter has reported the development and testing of a novel framework for the detection of differential gene expression. The framework combines the exploitation of inter-gene correlation, used to share information across tests, with identifiability: the fact that in most microarray data sets a large proportion of genes can be identified a priori as non-differential. Applied to the widely used two-sample t-statistic, this viewpoint yielded an elegant differential analysis technique that requires as inputs only a gene expression matrix, the associated two-sample labels, and the size r of the desired gene list. Empirical evidence suggests that exploiting correlation substantially enhances statistical power. Power naturally grows with the number of microarray samples, but the improvement is noticeable even for small sample sizes.

Chapter 5

Concluding Remarks

5.1 Summary and Concluding Remarks

This work has investigated the effects of inter-gene dependency on statistical methods for differential gene analysis. Differentially expressed genes are pivotal in understanding highly intricate genotype-phenotype relations and in devising appropriate therapeutic interventions. The effort focused on understanding, correcting, and exploiting the effects of second-order inter-gene dependency within the large-scale significance testing formulation of differential gene analysis. It was shown that combining inter-gene dependency with gene expression purity yields more accurate and more powerful statistical inferences and, in turn, a more reliable list of differentially expressed genes. The main statistical theme of this work has been to draw second-order conditional inferences based on the cases that are theoretically more likely to be null.

This research resulted in novel ways of combining inter-gene correlation with purity. A method for mitigating the unfavorable effects of correlation on false discovery rate calculations was developed, and a framework for exploiting the beneficial effects of correlation to enhance gene-ranking was constructed. The findings are general in that they apply to any test statistics obtained through simple linear regression.

In particular, a novel approach to the false discovery calculations was explored. The approach produced an inferential technique capable of handling exceedingly small sample sizes and of reporting the entire distribution of a random variable that models the number of false discoveries. The technique first summarizes the effect of millions of pairwise correlation coefficients in a single parameter, then explicitly incorporates this parameter into the inference of the number of false discoveries.
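As a rough illustration of such a single-number summary, the sketch below computes the root-mean-square of sampled pairwise gene correlations, in the spirit of the RMS correlation of Efron (2007a). It is a simplified stand-in: the method of Chapter 3 works through estimated correlation densities and a maximum entropy fit, and the function name and sampling scheme here are assumptions made for illustration only:

    import numpy as np

    def rms_correlation(X, n_pairs=100000, seed=0):
        # Summarize inter-gene correlation by the RMS of Pearson
        # correlations over randomly sampled gene pairs.
        # X is a genes-by-samples expression matrix.
        rng = np.random.default_rng(seed)
        p = X.shape[0]
        i = rng.integers(0, p, n_pairs)
        j = rng.integers(0, p, n_pairs)
        keep = i != j                           # discard self-pairs
        Z = X - X.mean(axis=1, keepdims=True)
        Z /= Z.std(axis=1, keepdims=True)       # standardize each gene
        corrs = (Z[i[keep]] * Z[j[keep]]).mean(axis=1)
        return float(np.sqrt(np.mean(corrs ** 2)))

Sampling pairs avoids forming all p(p-1)/2 coefficients, which for p of around 20,000 genes would mean roughly 2 x 10^8 correlations.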
The possibility of exploiting correlation to improve gene-ranking was also explored. This effort culminated in a powerful framework employing statistical distance measures that account for the effect of correlation on the joint distribution of the test statistics. The extra statistical power made available by exploiting inter-gene correlation promises to further guard against anomalous results that can have serious consequences for the trajectory of a study of gene function, causation, and interaction.

The urgency for more accurate differential gene analysis methods motivated this research, but the principal contributions are firmly rooted in the fundamental theory and methods of large-scale significance testing and enjoy broader applicability. Large-scale significance testing has become a key statistical tool for exploratory data analysis in single nucleotide polymorphism detection and other high-throughput genomic research endeavors. In most large-scale significance testing applications, a proper treatment of dependency appears necessary for drawing meaningful conclusions.

5.2 Future Work

Research topics for future work concerning the FDR estimation method of Chapter 3 are listed below:

1. Further investigation of Cor(V, C) across a range of microarray data sets.

2. Evaluation of the effect of the center-area boundary δ0 on Cor(V, C) and on overall accuracy.

3. Extension of the single-parameter correlation model to a multi-parameter correlation model to increase modeling accuracy.

4. Exploration of connections between the maxent exponential density and the exponential parametric form of the "empirical null density" as in Efron (2004).

Research topics for future work concerning the gene-ranking framework of Chapter 4 are listed below:

1. Investigation of the role of P0 in overall performance.

2. Empirical and/or theoretical understanding of the relation between the residual correlation and the gain in statistical power; the eigenvalue spread of the residual correlation matrix seems crucial here.

3. Empirical and/or theoretical understanding of the relation between the number of samples and the gain in statistical power.

4. Development of false discovery rate estimates for the proposed gene-ranking.

5. Incorporation of prior information, such as "gene grouping" or a priori non-null identity.

Bibliography

B. Alberts, D. Bray, J. Lewis, M. Raff, K. Roberts, and J.D. Watson. Molecular biology of the cell. Garland Science, 2002.

A. Almudevar, L.B. Klebanov, X. Qiu, P. Salzman, and A.Y. Yakovlev. Utility of correlation measures in analysis of gene expression. NeuroRx, 3(3):384–395, 2006.

P. Baldi and S. Brunak. Bioinformatics: the machine learning approach. MIT Press, Cambridge, MA, 2001.

J. Barnard, R. McCulloch, and X.L. Meng. Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica, 10(4):1281–1311, 2000.

Y. Benjamini. Comment: Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23(1):23–28, 2008.

Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1):289–300, 1995.

Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4):1165–1188, 2001.

J.O. Berger. Could Fisher, Jeffreys and Neyman have agreed on testing? Statistical Science, 18(1):1–32, 2003.

MAQC Consortium. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology, 24:1151–1161, 2006.

B.M. Bolstad, R.A. Irizarry, M. Åstrand, and T.P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185–193, 2003.

T.T. Cai. Comment: Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23(1):29–33, 2008.

K.R. Coombes, J. Wang, and K.A. Baggerly. The tail-rank statistic for finding biomarkers from microarray data, with application to prostate cancer. Preprint at http://bioinformatics.mdanderson.org/TailRank, 2008.

T.M. Cover and J.A. Thomas. Elements of information theory. Wiley, New York, 1991.

X. Cui, J.T.G. Hwang, J. Qiu, N.J. Blades, and G.A. Churchill. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics, 6(1):59–75, 2005.

P.A. Devijver and J. Kittler. Pattern recognition: a statistical approach. Prentice Hall International, 1982.

J.R. Deller, Jr., J.H.L. Hansen, and J.G. Proakis. Discrete-time processing of speech signals. IEEE Press, 2000.

S. Dudoit, Y.H. Yang, M.J. Callow, and T.P. Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12(1):111–139, 2002.

B. Efron. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association, 99(465):96–105, 2004.

B. Efron. Microarrays, empirical Bayes, and the two-groups model. Preprint, Dept. of Statistics, Stanford University, 2006.

B. Efron. Correlation and large-scale simultaneous significance testing. Journal of the American Statistical Association, 102(477):93–103, 2007a.

B. Efron. Size, power and false discovery rates. Annals of Statistics, 35:1351–1377, 2007b.

B. Efron. Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23(1):1–22, 2008.

B. Efron, R. Tibshirani, J.D. Storey, and V. Tusher. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96(456):1151–1161, 2001.

M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868, 1998.

S. Frantz. An array of problems. Nature Reviews Drug Discovery, 4:362–363, 2005.

C. Genovese and L. Wasserman. A stochastic process approach to false discovery control. Annals of Statistics, 32(3):1035–1061, 2004.

G.H. Golub and C.F. Van Loan.
Matrix computations. Johns Hopkins University Press, 1996.

G.W. Hatfield, S. Hung, and P. Baldi. Differential analysis of DNA microarray gene expression data. Molecular Microbiology, 47(4):871–877, 2003.

I. Hedenfalk, D. Duggan, Y. Chen, et al. Gene-expression profiles in hereditary breast cancer. New England Journal of Medicine, 344(8):539–548, 2001.

N.J. Higham. Computing the nearest correlation matrix – a problem from finance. IMA Journal of Numerical Analysis, 22(3):329–343, 2002.

L. Hunter. Molecular biology for computer scientists. In Artificial Intelligence and Molecular Biology, Cambridge, 1993.

J.P.A. Ioannidis. Why most published research findings are false. PLoS Medicine, 2(8):e124, 2005.

R.A. Irizarry, B.M. Bolstad, F. Collin, L.M. Cope, B. Hobbs, and T.P. Speed. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31(4):e15, 2003.

E.T. Jaynes. Probability theory: the logic of science. Cambridge University Press, 2003.

P.A. Jones and S.B. Baylin. The fundamental role of epigenetic events in cancer. Nature Reviews Genetics, 3(6):415–428, 2002.

M.K. Kerr, M. Martin, and G.A. Churchill. Analysis of variance for gene expression microarray data. Journal of Computational Biology, 7(6):819–837, 2000.

K.I. Kim and M.A. van de Wiel. Effects of dependence in high-dimensional multiple testing problems. BMC Bioinformatics, 9:114, 2008.

L. Klebanov and A. Yakovlev. Diverse correlation structures in microarray gene expression data and their utility in improving statistical inference. Annals of Applied Statistics, 1(2):538–559, 2007.

L. Klebanov, C. Jordan, and A. Yakovlev. A new type of stochastic dependence revealed in gene expression data. Statistical Applications in Genetics and Molecular Biology, 5(1):Article 7, 2006.

E.S. Lander. Array of hope. Nature Genetics, 21(1):3–4, 1999.

M. Langaas, E. Ferkingstad, and B.H. Lindqvist. Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society, Series B, 67:555–572, 2005.

M.L.T. Lee, F.C. Kuo, G.A. Whitmore, and J. Sklar. Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proceedings of the National Academy of Sciences, 97(18):9834–9839, 2000.

J.T. Leek and J.D. Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3(9):e161, 2007.

J.T. Leek, E. Monsen, A.R. Dabney, and J.D. Storey. EDGE: extraction and analysis of differential gene expression. Bioinformatics, 22(4):507–508, 2006.

E.L. Lehmann and J.P. Romano. Generalizations of the familywise error rate. Annals of Statistics, 33(3):1138–1154, 2005.

E.L. Lehmann and J.P. Romano. Testing statistical hypotheses. Springer, 2006.

F. Li and G.D. Stormo. Selection of optimal DNA oligos for gene expression arrays. Bioinformatics, 17(11):1067–1076, 2001.

J.C. Liechty, M.W. Liechty, and P. Müller. Bayesian correlation estimation. Biometrika, 91(1):1–14, 2004.

D.J. Lockhart, H. Dong, M.C. Byrne, M.T. Follettie, M.V. Gallo, M.S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Norton, et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnology, 14:1675–1680, 1996.

I. Lönnstedt and T. Speed. Replicated microarray data. Statistica Sinica, 12(1):31–46, 2002.

P.C. Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India, 2(1):49–55, 1936.

G. Marsaglia and I. Olkin.
Generating correlation matrices. SIAM Journal on Scientific and Statistical Computing, 5:470–475, 1984.

C.N. Morris. Comment: Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23(1):34–40, 2008.

M.A. Newton, C.M. Kendziorski, C.S. Richmond, F.R. Blattner, and K.W. Tsui. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology, 8(1):37–52, 2001.

I. Olkin. Note on 'The Jacobians of certain matrix transformations useful in multivariate analysis'. Biometrika, 40(1–2):43, 1953.

A.B. Owen. Variance of the number of false discoveries. Journal of the Royal Statistical Society, Series B, 67:411–426, 2005.

T.A. Patterson, E.K. Lobenhofer, S.B. Fulmer-Smentek, et al. Performance comparison of one-color and two-color platforms within the Microarray Quality Control (MAQC) project. Nature Biotechnology, 24:1140–1150, 2006.

Y. Pawitan, K.R.K. Murthy, S. Michiels, and A. Ploner. Bias in the estimation of false discovery rate in microarray studies. Bioinformatics, 21(20):3865–3872, 2005.

E.F. Petricoin, J.L. Hackett, L.J. Lesko, et al. Medical applications of microarray technologies: a regulatory science perspective. Nature Genetics, 32(Suppl):474–479, 2002.

H. Qi and D. Sun. A quadratically convergent Newton method for computing the nearest correlation matrix. SIAM Journal on Matrix Analysis and Applications, 28(2):360–385, 2006.

X. Qiu and A. Yakovlev. Some comments on instability of false discovery rate estimation. Journal of Bioinformatics and Computational Biology, 4(5):1057–1068, 2006.

X. Qiu, A.I. Brooks, L. Klebanov, and A. Yakovlev. The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics, 6:120, 2005a.

X. Qiu, L. Klebanov, and A. Yakovlev. Correlation between gene expression levels and limitations of the empirical Bayes methodology for finding differentially expressed genes. Statistical Applications in Genetics and Molecular Biology, 4(1):Article 34, 2005b.

X. Qiu, Y. Xiao, A. Gordon, and A. Yakovlev. Assessing stability of gene selection in microarray data analysis. BMC Bioinformatics, 7:50, 2006.

J. Quackenbush. Microarray data normalization and transformation. Nature Genetics, 32(Suppl):496–501, 2002.

A. Reiner-Benaim. FDR control by the BH procedure for two-sided correlated tests with implications to gene expression data analysis. Biometrical Journal, 49(1):107–126, 2007.

K. Rice and D. Spiegelhalter. Comment: Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23(1):41–44, 2008.

M. Schena, D. Shalon, R.W. Davis, and P.O. Brown. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270(5235):467–470, 1995.

D. Singh, P.G. Febbo, K. Ross, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203–209, 2002.

J.D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64(3):479–498, 2002a.

J.D. Storey. False discovery rates: theory and applications to DNA microarrays. Ph.D. thesis, Stanford University, 2002b.

J.D. Storey, J.E. Taylor, and D. Siegmund. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society, Series B, 66(1):187–205, 2004.

J.D. Storey, J.Y. Dai, and J.T. Leek.
The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments. Biostatistics, 8(2):414–432, 2007.

A. Subramanian, P. Tamayo, V.K. Mootha, S. Mukherjee, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43):15545–15550, 2005.

R. Tibshirani and L. Wasserman. Correlation-sharing for detection of differential gene expression. arXiv preprint math.ST/0608061, 2006.

C.A. Tsai, Y.J. Chen, and J.J. Chen. Testing for differentially expressed genes with microarray data. Nucleic Acids Research, 31(9):e52, 2003.

V. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to ionizing radiation response. Proceedings of the National Academy of Sciences, 98:5116–5121, 2001.

M.J. van der Laan, S. Dudoit, and K.S. Pollard. Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives. Statistical Applications in Genetics and Molecular Biology, 3(1):Article 15, 2004.

A.B. van't Wout, G.K. Lehrman, S.A. Mikheeva, et al. Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4+ T-cell lines. Journal of Virology, 77(2):1392–1402, 2003.

J.D. Watson, T.A. Baker, S.P. Bell, A. Gann, and R. Losick. Molecular biology of the gene (5th edition). Benjamin Cummings, 2004.

Y. Woo, J. Affourtit, S. Daigle, et al. A comparison of cDNA, oligonucleotide, and Affymetrix GeneChip gene expression microarray platforms. Journal of Biomolecular Techniques, 15(4):276–284, 2004.

Y. Zheng and M. Pepe. A practical multifaceted approach to selecting differentially expressed genes. Cancer Informatics, 2:113–122, 2007.

K.H. Zou and W.J. Hall. On estimating a transformation correlation coefficient. Journal of Applied Statistics, 29:745–760, 2002.