"Em LIBRARY 2 00'”; Michigan §tate University This is to certify that the dissertation entitled DEPENDENCY AND PURITY IN LARGE-SCALE STATISTICAL SIGNIFICANCE TESTING: A DNA MICROARRAY PERSPECTIVE presented by Keyur Hemantkumar Desai has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering .11, Profess- ’s Signat" 09/24/2008 Date MSU is an Affirmative Action/Equal Opportunity Employer —.-.-.—a-.-o----I-u-p-u-o---a-o-n-u-u---o--o-o-I-u-o-u-u-u-u-n-o—----u. PLACE IN RETURN BOX to remove this checkout from your record. TO AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE DATE DUE 5/08 K:/Prolecc&Pres/CIRC/DateDue indd DEPENDENCY AND PURITY IN LARGE-SCALE STATISTICAL SIGNIFICANCE TESTING: A DNA MICROARRAY PERSPECTIVE By Keyur Hemantkumar Desai A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Electrical Engineering 2008 ABSTRACT DEPENDENCY AND PURITY IN LARGE-SCALE STATISTICAL SIGNIFICANCE TESTING: A DNA MICROARRAY PERSPECTIVE By Keyur Hemantkumar Desai Statistical methods for detecting differential gene expression (DGE) in DNA mi- croarray experiments are discussed. A comprehensive definition of DGE refers to “statistical dependence” between gene expression levels and biological conditions of interest, such as differing environments, treatments, time points, phenotypes, or clin- ical outcomes. The ability to detect genuine DGEs has become crucial in the effort to understand and cure difficult diseases like cancer, diabetes, and Alzheimer’s dis- ease. The statistical nature of microarray data necessitates the use of “large-scale significance testing” to detect DGE. Large-scale testing is qualitatively different than the usual one-at-a-time testing: implied information from the “other” cases can force its way into the decision rule. Two prominent attributes of gene expression data, “inter-gene dependency” and “purity,” complicate the situation even further. The for- mer refers to the statistical patterns of dependencies among gene expression levels, and the latter stems from the fact that a great majority of genes show no change in expression even under changing conditions. Inter-gene dependency is perceived commonly as a harmful force cluttering the decision-making. Contrastingly, this research takes a more sympathetic view of inter- gene dependency when seen together with purity. We argue that their combination gives rise to not only accurate but also more powerful statistical reasoning. Our effort to combine the two has culminated in two contributions, each of which has both applied and theoretical implications. The first is a method for a better assessment of the number of false positives in the presence of heavy inter-gene dependency and extreme sampling errors due to exceedingly small sample sizes. The second is a method for “exploiting” inter-gene dependency to yield substantially more powerful DGE detection procedures. The empirical evidence from real and simulated test cases suggests the usefulness of our ideas. ACKNOWLEDGEMENT This research was supported in part by a Michigan State University IRGP grant, a graduate fellowship from the MSU Quantitative Biology Program, and a research fellowship from the MSU Department of Electrical and Computer Engineering. The support of the MSU High Performance Computing Center is appreciated. 
I am grateful to Professors Christina Chan, Hayder Radha, Justin McCormick, and Jack Deller for serving on my Ph.D. committee. Their feedback and advice have been a valuable contribution to this work. Thanks to Professor Jack Deller for being a nearly perfect advisor; I owe my successful transition from engineering to genomics to his guidance, encouragement, support, and constant assurance. Thanks to Professor Justin McCormick for his generosity with his time in explaining the intricacies of cancer research to a naive engineering student. Thanks to Professor Hayder Radha for his continual interest in my progress.

I am grateful to my friends Shirish, Sharath, Amit, Akshay, Snehal, Darshak, Amit Gore, Koushik, Kiran, and Abhishek for their friendship, support, and motivation. I am indebted to Bina Desai for her interest in my early education. Many thanks to Professor S.D. Mahanti for his support and encouragement during my early days in graduate school.

This dissertation is dedicated to my mother, father, brother, and grandparents. Their love, support, and sacrifices have carried me to where I am today.

TABLE OF CONTENTS

List of Tables
List of Figures

1 Introduction and Background
  1.1 Foundations of Modern Biology
  1.2 Statistical Reasoning in Differential Gene Expression Detection
  1.3 Variability in Gene Expression Data
  1.4 Testing for Statistical Significance
  1.5 Large-scale Inferences
    1.5.1 The Scope
    1.5.2 The Goals
  1.6 An Outline
  A1: DNA Microarrays

2 Large-Scale Significance Testing
  2.1 A Statistical Formulation for DGE Detection
  2.2 Role of the Number of False Discoveries in Measuring Quality
  2.3 Literature Review and Contributions of This Research
    2.3.1 Control
    2.3.2 Power
  2.4 Dependency in Gene Expression Data
  2.5 Purity, Identifiability and Zero Assumption

3 Estimating the Number of False Discoveries
  3.1 Overview
  3.2 The Proposed Method
    3.2.1 Estimating the Moments
    3.2.2 Estimating Correlation Densities
    3.2.3 Fitting the Maximum Entropy Distribution
  3.3 Summary of the Method
  3.4 Test Cases
    3.4.1 Real Data
    3.4.2 Simulated Data
  3.5 Discussion

4 Exploiting Correlation to Improve Gene-ranking
  4.1 Overview
  4.2 The Proposed Method
    4.2.1 Choosing a Distance Metric
    4.2.2 An Intuitive Interpretation
    4.2.3 Estimating the Covariance
    4.2.4 The Final Equation
  4.3 Implementation Details
    4.3.1 Numerical Stability and Computational Complexity
    4.3.2 Algorithm for Two-State Studies
    4.3.3 Algorithm for Multi-State and Continuous-State Studies
  4.4 Test Cases
    4.4.1 Case I: Real Data With Induced Differences
    4.4.2 Case II: Simulated Data
  4.5 Discussion

5 Concluding Remarks
  5.1 Summary and Concluding Remarks
  5.2 Future Work

Bibliography

LIST OF TABLES

1.1 An example of two-state comparative experiments. The HIV study of van't Wout et al. (2003).

4.1 tEllipsoid in action with the top 100 T*_i's. Corresponding t_i's and their ranks are also shown. tEllipsoid = 22 NoFPs (number of false positives); raw t-statistics = 68 NoFPs. Truly null genes are printed in bold-sans typeface.

LIST OF FIGURES

1.1 An example of an approximately 40,000-probe spotted oligonucleotide microarray, with an enlarged inset to show detail.

3.1 Discrete support D_t (square markers) versus continuous support S_t (solid boundary). S_t is standardized to improve numerical stability.

3.2 Effect of sampling fluctuations on the empirical correlation density. (a) BRCA example; (b) HIV example. For each subfigure: the left panel is the histogram of sample correlations after applying the Fisher transformation (3.10), with a normal distribution (heavy curve) fit to it; the right panel is the histogram of de-noised correlations, with a modified beta distribution (heavy curve) fit to it. This summarizes the cumulative effect of the m(m-1)/2 gene-gene correlations in a single parameter β̂.

3.3 BRCA example: estimated (V, C) moments and a maxent distribution fit to them. The third-moment estimate of P(V, C) (left) exhibits finer details than the second-moment estimate (right).

3.4 The distributions of the number of false discoveries; panel (a) the BRCA study, panel (b) the HIV study. To show the effect of skewness corrections, the third-moment distribution (solid curve) is compared to its second-moment counterpart (dashed curve). For BRCA the second-moment mean estimate is 79, compared to 104 for the third moment; for HIV these are 19 and 8. The BRCA panel also shows 50% (solid line) and 75% (dotted line) confidence intervals.

3.5 Simulation experiments comparing conditional estimates: third-moment estimates (+ marker) and second-moment estimates (o marker). This figure corresponds to the left-sided tail-area with θ₁ = -2.0. Substantial row-wise correlation is present. The abscissa is the realized count while the ordinate is the estimated count. For the significance of the different (k, k₀, θ₀)'s refer to the main text. Third-moment skewness corrections enhance the estimation accuracy.

3.6 Left-sided tail-area with θ₁ = -2.5; otherwise the caption of Figure 3.5 applies.

4.1 FDRs for Case 1. The number of truly differential genes is 300. Panel (a) r=300; panel (b) r=100. "Exploiting correlation" considerably enhances the statistical power. Square marker = tEllipsoid. Lines: solid = raw t-statistic; dotted = SAM; dashed = EDGE.

4.2 FDRs for Case 2a. The number of truly differential genes is 1200. Panel (a) r=1200; panel (b) r=300. "Exploiting correlation" appreciably enhances the statistical power. Square marker = tEllipsoid. Lines: solid = raw t-statistic; dotted = SAM; dashed = EDGE.

4.3 FDRs for Case 2b. Panel (a) r=1200; panel (b) r=300. The sample size is smaller than that in Cases 1 and 2a, and yet "exploiting correlation" has apparent benefits. Square marker = tEllipsoid. Lines: solid = raw t-statistic; dotted = SAM; dashed = EDGE.
4.4 FDRs for simulated data. Panel (a) r=100; panel (b) r=50. (Small) sample sizes: n₁ = 10, n₂ = 10. Yet "exploiting correlation" considerably enhances the statistical power. Square marker = tEllipsoid. Lines: solid = raw t-statistic; dotted = SAM; dashed = EDGE.

Chapter 1

Introduction and Background

The objective of this research was to improve the existing statistical techniques of differential gene analysis. The literature identifies "dependency among gene expressions" as the chief obstacle in drawing meaningful conclusions from microarray data and "purity" as its vast untapped resource. This research focuses on understanding, correcting, and exploiting the effects of dependency by combining it with purity in a general setting. It has culminated in two major findings pertinent to the general approach for detecting "statistical linear dependence" between gene expression levels and measured biological states. The general approach consists of two main steps: (i) ranking the genes and (ii) evaluating the number of false positives for a significance cut-off. Our first finding establishes the importance of third-moment skewness corrections in estimating the number of false positives. The second finding provides a general technique for revising a given gene-ranking for better statistical power. A good starting point is to discuss the role of statistical reasoning in interpreting gene expression data.

Organization of the chapter. Section 1.1 reviews the foundations of modern biology. Section 1.2 begins with a "bird's eye view" of the problem of differential gene detection and the potential implications of the present work. An example of two-state microarray data and a listing of typical sources of within-state gene expression variation are given in Section 1.3. A contrast between one-at-a-time and large-scale significance testing is drawn in Section 1.4. In Section 1.5, we discuss the possible scope and the key goals of large-scale inferences and decision-making germane to expression profiling data. An outline of the entire dissertation appears in Section 1.6. A short overview of DNA microarray technology concludes the chapter.

1.1 Foundations of Modern Biology

Before we begin, it is helpful to recapitulate the foundations of modern biology. Refer to Watson et al. (2004) for a more complete discussion. Hunter (1993) provides a brief survey of biology for researchers trained in engineering or computer science.

• Life changes and develops through evolution, and all known life forms have a common origin.

• The cell is the fundamental unit of life. Cells arise from other cells through cell division, and in multicellular organisms, every cell in the organism's body is produced from a single cell in a fertilized egg and hence contains the same genotype.

• Biological form and function are passed on to the next generation by genes, which are the primary units of inheritance and are made of DNA. All information flows from the genotype, the genetic makeup of the organism, to the phenotype, the observable physical or biochemical characteristics of the organism.

• The total complement of genes in an organism or cell is known as its genome. When a gene is active, the DNA code is transcribed into an RNA copy of the gene's information.

• Gene expression refers to the transcription of DNA into messenger RNA (mRNA) by RNA polymerase. The mRNA is then translated into protein by the ribosome. In gene expression analysis, expression level refers to the amount of mRNA detected in a sample.
• Phenotypic differences can ultimately be understood in terms of the regulation of gene expression.

1.2 Statistical Reasoning in Differential Gene Expression Detection

Gene expression profiling experiments, which measure the expression levels of thousands of genes at once to create a global picture of cellular function, have become a vital component of modern biomedical research. They let us investigate complex human diseases (such as cancer, diabetes, and Alzheimer's disease) in unprecedented genetic detail. However, in order to draw meaningful conclusions from the wealth of gene expression data, we must rely on statistical reasoning for two main reasons. First, the measurements are based on mRNA molecules drawn from a cell population. Second, there are several uncontrollable sources of noise and uncertainty. Consequently, the promise of profiling (better understanding of regulatory mechanisms, more effective therapeutic targets, accurate (sub)categorization of diseases, superior clinical diagnosis and prognosis, etc.) relies heavily on pertinent statistical methodologies (Lander, 1999; Petricoin et al., 2002; Singh et al., 2002; Jones and Baylin, 2002). Surprisingly, despite major research interest in statistical methods for gene expression data analysis, several key issues have remained unresolved due to their rather difficult and unfamiliar nature (Frantz, 2005).

Statistical methods for detecting differential gene expression (DGE) between two or more biological samples have attracted considerable attention. The comprehensive definition of DGE refers to "statistical dependence" between gene expression levels and biological conditions of interest, such as differing environments, treatments, time points, phenotypes, or clinical outcomes. In practice, however, "linear dependence," also known as "second moment dependence," is pursued more frequently because its detection is readily subject to a statistically rigorous approach, and yet its existence often suggests a phenomenon of scientific interest and clinical utility.

For convenience, the biological conditions are referred to as "states." The term "two-state" refers to binary-valued states, such as healthy versus cancer; "multi-state" to multi-valued states, such as different categories of breast cancer; and "continuous-state" to continuous-valued states, such as the age of the subject or an insulin dosage. The need for detecting DGE between two states is very common, and methods customized for the two-state situation are the present focus. Note that even though the common goal is to detect linear dependence, the methods may differ in some mathematical and implementation aspects depending on the situation (two-state, multi-state, or continuous-state). In fact, detecting second moment dependence in a two-state study can also be interpreted in terms of detecting a statistical mean difference between the two states and, in turn, dealt with more efficiently through the standard (unpaired) t-statistic. Indeed, the t-test and simple linear regression are mathematically equivalent (refer to Section 2.1), implying that the present ideas for two-state methods are also applicable to multi-state or continuous-state methods, perhaps with little or no modification.
Additionally, some of the outcomes of this research also seem translatable to DGE notions other than "central tendency," for example fold change (Biotechnology, 2006) or tail-rank statistics (Coombes et al., 2008; Zheng and Pepe, 2007). These extensions will be the subjects of future research.

The overarching theme of this dissertation is to develop methods for combining inter-gene dependency (Section 2.4) and purity (Section 2.5) to improve statistical decision-making. The former refers to (statistical) dependencies among gene expression levels, and the latter to an abstract notion about the test statistics of differentially expressed (DE) genes not falling in certain intervals. Previous understanding perceived inter-gene dependency as more of a harmful force cluttering the decision-making, but newer understanding suggests that, when combined with purity, it can yield not only accurate but also more powerful statistical inference (Benjamini, 2008; Morris, 2008; Cai, 2008; Efron, 2008; Klebanov and Yakovlev, 2007). The present scope covers methods which use only the measured expression and "assigned" state labels; however, the possibility of extending these ideas to integrate legitimate "prior" knowledge like "gene grouping," as in enrichment analysis (Subramanian et al., 2005), has also shown potential.

Note. To avoid spurious mix-ups, the term "dependence" is reserved for statistical dependence between gene expression levels and biological conditions, whereas the term "dependency" is reserved for statistical dependence among gene expression levels. Throughout this dissertation, the words null and non-differential are used interchangeably; the same holds true for the words non-null and differential.

1.3 Variability in Gene Expression Data

Table 1.1 shows typical data generated by a two-state study. A distinctive characteristic of these data sets is a huge number of cases, a "large m," and very few measurements per case, a "small n." This particular data set is from van't Wout et al. (2003), who measured the mRNA levels of 7680 genes on 8 separate microarrays, 1-4 assigned to healthy cells and 5-8 to HIV-infected cells. The challenge for the statistical methods is to discover the true signal, for example the true mean difference, in the presence of substantial within-state (per-gene) expression variation. There are at least four factors contributing to these within-state gene expression variations:

1. Cross-hybridization noise, due to some mRNA molecules cross-hybridizing to probes that are supposed to detect another mRNA.

2. Measurement noise associated with the microarray technology and related biochemical processing.
3. Biological variability: the mRNA molecules are obtained from a cell population, and the individual cells can have differing gene expression levels due to a variety of influences, including (i) differences in the cells' micro-environments (e.g., nutrient and temperature gradients), (ii) growth-phase differences between cells in the culture, (iii) phase variations, and (iv) periods of rapid change in gene expression, plus multiple additional stochastic effects that cannot be controlled (Hatfield et al., 2003; Baldi and Brunak, 2001; Watson et al., 2004; Alberts et al., 2002).

4. Confounding variables, such as age, sex, race, living style, genotype, etc., in studies with differing subjects. These can introduce large-scale variation (Leek and Storey, 2007).

Table 1.1: An example of two-state comparative experiments. The HIV study of van't Wout et al. (2003).

            healthy   healthy   healthy   healthy   HIV       HIV       HIV       HIV
gene 1      9.609     7.323     5.328     13.63     8.757     21.72     0.4873    6.364
gene 2      2642      5034      537.2     766.8     1123      961.7     601.7     1016
...
gene 1000   42.03     13.94     60.91     70.26     84.03     25.91     103.9     99.96
gene 1001   135.1     115.2     111.8     134.1     151.7     145.1     135.9     166.5
...
gene 4000   1455      513.1     1159      438.1     1806      647.4     1921      759.3
gene 4001   555.2     909.3     216.8     902.8     775.9     1510      492.7     1981
...
gene 7679   15.31     41.25     16.82     14.23     45.49     55.23     10.16     32.86
gene 7680   47.73     109.7     31.99     151.7     63.14     99.14     33.04     44.3

1.4 Testing for Statistical Significance

Due to the unavoidable variability, "statistical" decision-making is necessary to detect DGE. The theory and methods of statistical hypothesis testing deal specifically with such situations (Lehmann and Romano, 2006), yet testing each gene separately can be very "agonizing." From the outset, the validity of a reference null distribution (obtained either from theory or re-sampling; see Section 2.1) may pose a concern. Next, a smaller p-value (defined in Section 2.1) does contradict the null hypothesis of "no DGE," but its role as a comprehensive measure of the inherent ambiguity is often under question (Ioannidis, 2005). Both the celebrated Neyman-Pearson lemma and Bayesian decision theory are of limited use, as they require specified knowledge of the distributions under the alternative hypothesis; moreover, the latter also requires prior distributions of hypothesis occurrence. Maybe the problem is in itself ill-posed. Berger (2003) and the accompanying "follow-up comments" highlight different points of view on this matter. Fortunately, these conceptual difficulties go away when considering several thousand genes in parallel. In this case, at least in principle, there is enough data to infer the null, the possible alternatives, and their relative proportions.

The statistical framework known as large-scale significance testing (LSST) is a natural extension of one-at-a-time significance testing. The theory and methods of LSST deal with thousands of simultaneous tests (Lehmann and Romano, 2006); in other words, this framework can deal with all rows of Table 1.1 simultaneously. However, earlier approaches in LSST were limited in their treatment of dependency among tests. In practice, most microarray data exhibit substantial inter-gene dependency among expression variations (Owen, 2005), which eventually manifests itself in dependency among tests.
A number of numerical studies point out that LSST approaches which do not properly entertain inter-gene dependency can lead to highly inaccurate reporting of the underlying facts (Qiu et al., 2005b; Kim and van de Wiel, 2008).

Having decided that a proper treatment of dependency is a worthwhile research endeavor, the other alternative we explored is cluster analysis, e.g., Eisen et al. (1998). When tailored to report two clusters, clustering can in essence deal with dependency through an appropriate similarity metric. The drawback is that clustering neglects valuable information like the assigned state values. Also, the overall framework itself is not very conducive to deducing precise statistical guarantees. Therefore, this work focuses on extending existing LSST methods to include dependency.

Specifically, we focus on extending the relevant LSST methods to include commonly observed patterns of "inter-gene correlation" in expression variations. Recall that correlation is a "scale-free" measure of statistical linear dependence, also known as second moment or second-order dependence. The present emphasis is on correcting and exploiting the effects of inter-gene correlation by combining it with purity, a statistical assumption stemming from the fact that in many comparative studies the activity of a great majority of genes remains unchanged with respect to the treatment of interest. This can also be viewed as drawing second-order conditional inferences based on cases that are fundamentally less ambiguous than the others. The scope and the goals of large-scale inferences germane to expression profiling data are discussed next.

1.5 Large-scale Inferences

1.5.1 The Scope

• The tests are related in the sense that the noise and uncertainty associated with them are of a similar statistical nature, perhaps due to common origins.

• We assume that, for convenience, the statistical reasoning may be done around per-gene univariate summary statistics, T₁, ..., T_m, and related null hypotheses, H₁, ..., H_m; however, the original measurements are available, and hence inferring the statistical structure among the T_i's is possible.

• By convention, the rejection (or significance) regions to be considered are "tail-areas." Recall that the term rejection region refers to an interval in the space of test statistics wherein we are interested in rejecting the null hypotheses. Both "one-sided" and "two-sided" tail-areas are allowed. Working with tail-areas naturally yields methods which are sequential in nature and begin by ordering the test statistics; for example, see the Benjamini and Hochberg method described in Section 2.3. The rationale behind tail-areas is that the evidence for a gene being null decreases monotonically as we move farther from the intervals wherein most of the probability mass of the null distribution is concentrated. Methods employing rejection regions other than tail-areas have been observed to yield results with certain interpretational difficulties (Storey, 2002b; Rice and Spiegelhalter, 2008).

• Moreover, the number of genes, denoted by m, may run into several thousands; for current comparative microarray experiments, m typically ranges from 5000 to 25000. Large-scale methods and algorithms should anticipate and accommodate the computational constraints and statistical convergence issues that are typical of this range.
• The methods are customized in the sense that they try to capture the scientific context affecting the data; the present emphasis is on incorporating and exploiting typical dependency structures by combining them with purity.

1.5.2 The Goals

Broadly speaking, the goal is to reliably identify as many DE genes as possible. To elaborate on more specific statistical goals, the following terminology is helpful: when a null hypothesis H_i is rejected, in other words when a gene is declared DE, then a "discovery" is made; when a gene declared DE is indeed non-DE, then a "false discovery" is made. A broad statistical goal is to report as many discoveries as possible while not exceeding a specified "false discovery proportion." Sometimes a "ball park" figure for the number of discoveries is specified, and then the goal is to minimize the false discovery proportion as much as possible. When recast in the language of statistical inference, the former amounts to an accurate estimation of the number of false discoveries for a rejection region, and the latter to revising the test statistics to increase statistical power.

1.6 An Outline

This dissertation is outlined as follows. Chapter 1 introduces the "problem." Chapter 2 presents a mathematical formulation, reviews the relevant literature, discusses inter-gene dependency and purity, and underlines our main contributions. The issue of estimating the number of false discoveries is discussed in Chapter 3, whereas Chapter 4 presents a novel gene re-ranking method. Conclusions and future work appear in Chapter 5. At the beginning of each chapter, a separate paragraph called "organization of the chapter" is provided to guide the reader through that particular chapter.

Supplement: DNA Microarrays

A DNA microarray uses the selective nature of DNA-DNA or DNA-RNA hybridization, the process in which two complementary strands of DNA (sometimes DNA and a complementary strand of RNA) bind. An array is a collection of microscopic DNA segments attached to a solid surface, such as glass, plastic, or a silicon chip. The affixed DNA segments are known as probes, thousands of which can be placed at known locations on a single array (see Fig. 1.1). In a typical experiment, the microarray is washed with a sample containing fluorescently dyed mRNA transcripts. The probe and target sequences hybridize according to their complementary nature, and the abundance of target molecules can then be identified based on the resulting fluorescence patterns. Typical numbers reported by a microarray appear in Table 1.1. Based on the way DNA microarrays are fabricated, two distinct microarray platforms have evolved, each with its own pros and cons.

[Figure 1.1: An example of an approximately 40,000-probe spotted oligonucleotide microarray, with an enlarged inset to show detail.]

Spotted microarrays or two-color microarrays. These arrays employ clones or oligonucleotides that are spotted onto glass slides, and two distinctly labeled complementary DNA (cDNA) samples are hybridized together on a single array. The advantage is that the spotting process is relatively inexpensive and it is easier to make custom arrays (Schena et al., 1995). This flexibility, however, comes at a price: the spotting process is inherently variable. This limitation is circumvented by reporting relative mRNA expression levels instead of absolute expression levels (Woo et al., 2004).
The relative levels are obtained by co-hybridizing a two-color array with two RNA samples, namely, experimental and reference. Here, the reference should not be confused with the control (for example, while comparing normal cells and their cancer mutants, the RNA sample from the normal cells is termed the control and that from the cancer cells the treatment).

Affymetrix microarrays or one-color microarrays. These arrays employ short (25-mer) oligonucleotide probes deposited on a silicon substrate through high-precision photolithography (Lockhart et al., 1996). This process is similar to the one used in the manufacturing of electronic microchips and is remarkably accurate in producing nearly identical arrays. However, the flexibility of customization is lost, since for every custom array an expensive mask must be created. One-color microarrays report absolute mRNA expression levels.

Normalization. Measurements reported by two separate microarrays, even if they belong to a single study, are seldom comparable in their original form. The major reasons are unequal quantities of starting RNA, differences in labeling or detection efficiencies between the fluorescent dyes used, and systematic biases in the measured expression levels (Quackenbush, 2002). A transformation, known as normalization, is necessary. Robust normalization methods for both one-channel and two-channel microarray data are available.

Cross-hybridization. Even after perfect normalization, the measured mRNA levels can still suffer from experimental noise due to cross-hybridization: some mRNAs may cross-hybridize to probes in the array that are supposed to detect another mRNA. This problem has been alleviated somewhat by a systematic selection of expressed sequence tags (ESTs) with high specificity-selectivity (Li and Stormo, 2001). Both microarray platforms have evolved to provide very similar data quality (Patterson et al., 2006).

Chapter 2

Large-Scale Significance Testing

The task of differential gene expression (DGE) detection is often recast as large-scale significance testing (LSST). Inter-gene dependency among gene expression variations weakens the LSST formulation of DGE detection by introducing substantial dependency among the test statistics; any other formulation is equally prone to the effects of dependency. However, within the framework of LSST itself, dependency can be understood, corrected, and exploited by combining it with purity. This chapter lays down the necessary statistical groundwork to do so. A direct treatment of generic any-order dependency is impractical, but treating second-order dependency is both useful and mathematically tractable. A scale-free version of second-order dependency is correlation, and how to include inter-gene correlation in the LSST formulation is discussed.

Organization of the chapter. Section 2.1 presents a statistical formulation of the "problem." A mapping from inter-gene correlation to inter-test correlation is given by Eqn. (2.3). Error measures, the false discovery proportion and the false discovery rate, appear in Section 2.2. Section 2.3 reviews the relevant literature and states the principal contributions of this work. A discussion of inter-gene dependency is provided in Section 2.4. Purity and two related concepts, "identifiability" and the "zero assumption," are discussed in Section 2.5.

2.1 A Statistical Formulation for DGE Detection

There are m + 1 random variables involved in this statistical reasoning: L and (X₁, ..., X_m).
Here, L refers to the "state" defined in Section 1.2, and X_i to the expression level of gene i. Their (unknown) underlying joint probability distribution is F(L, X₁, ..., X_m). A microarray experiment obtains n independent and identically distributed samples from this distribution. A "random" data set is written as (L; X₁; ...; X_m), where X_i = (X_{i1}, ..., X_{in}) denotes a random sample for gene i and L = (L₁, ..., L_n) the corresponding states. The subscript i in X_{ij} indexes the genes and j the microarrays. To increase normality, the raw X_{ij}'s are converted to a logarithmic scale. The "realized" data set is written as (l; x₁; ...; x_m), with l = (l₁, ..., l_n) denoting the "assigned" states and x_i = (x_{i1}, ..., x_{in}) the measured expression for gene i.

The overarching goal is to test whether the random variables L and X_i, i = 1, ..., m, are dependent. The existence of statistical dependence between L and X_i suggests that gene i may be crucial to the phenotypic distinctions being studied. A rigorous treatment of general statistical dependence measures like mutual information (Cover and Thomas, 1991) is often impractical. Instead, most approaches test linear (or second moment) statistical dependence. A univariate statistic measuring linear dependence between L and X_i can be obtained through simple linear regression between X_i and L.

LSST for DGE detection involves m univariate per-gene summary statistics, T₁, ..., T_m, with T_i corresponding to gene i. These T_i's are assessed using H₁, ..., H_m, the null hypotheses of independence between L and X_i, i = 1, ..., m. A T_i is gauged with respect to its null distribution p(T_i | H_i): the fact that it is "typical" under H_i suggests that L and X_i are uncorrelated. Note, however, that other forms of statistical dependence between L and X_i may go undetected. Philosophically, the bottleneck is caused by the fact that statistical independence has a concrete mathematical interpretation, but statistical dependence is just an arbitrary functional relation between random variables.

The two-state situation can be handled efficiently through the standard unpaired t-statistic:

    T_i = (X̄_{i;2} − X̄_{i;1}) / S_i,    (2.1)

where X̄_{i;k} is the mean of gene i in state k and S_i is the pooled within-state standard deviation of gene i. Mathematically, the t-statistic is equivalent to performing simple linear regression, and hence, in essence, it tests linear dependence. The usual interpretation, i.e., the t-statistic detecting a mean difference and H_i referring to the true mean difference being zero, is more obvious and well known. If the additional assumption of normality of X_i is made, then p(T_i | H_i) can be replaced by the Student's t-distribution. Doing so is not absolutely necessary: permutation calculations can estimate a putative null cumulative distribution function (cdf) common to all genes, say G₀, which may represent the basic facts more accurately (Efron, 2007a).
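To make Eqn. (2.1) and the permutation-estimated G₀ concrete, here is a minimal Python sketch. The pooled-standard-error convention and the number of permutations are our illustrative choices, not specifics from the text.

```python
import numpy as np

def t_statistics(x, labels):
    """Unpaired t-statistics in the spirit of Eqn. (2.1), one per gene (row of x).

    x      : (m, n) array of log expression values
    labels : length-n array of 1's and 2's giving the state of each array
    """
    labels = np.asarray(labels)
    g1, g2 = x[:, labels == 1], x[:, labels == 2]
    n1, n2 = g1.shape[1], g2.shape[1]
    # pooled within-state variance, scaled into a standard error for the mean difference
    sp2 = ((n1 - 1) * g1.var(axis=1, ddof=1) +
           (n2 - 1) * g2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    s = np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return (g2.mean(axis=1) - g1.mean(axis=1)) / s

def permutation_null_cdf(x, labels, n_perm=200, seed=0):
    """Pool t-statistics over label permutations to estimate a putative
    null cdf G0 common to all genes, as described in Section 2.1."""
    rng = np.random.default_rng(seed)
    null = np.concatenate([t_statistics(x, rng.permutation(labels))
                           for _ in range(n_perm)])
    null.sort()
    # G0(t) = fraction of permutation statistics <= t (vectorized in t)
    return lambda t: np.searchsorted(null, t, side="right") / null.size
```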
Dependency among "truly null" genes has profound implications for the "significance cut-off," i.e., the threshold beyond which, if a T_i occurs, its corresponding gene is declared non-null. Dependency among null and/or non-null genes has profound implications for the possibility of mapping (T₁, ..., T_m) to (T₁*, ..., T_m*) for better statistical power. Again, a general treatment of dependency among the T_i's is impractical, but correlation among the T_i's is tractable, both mathematically and computationally, and will be pursued.

When the T_i's are obtained through simple linear regression, correlation among the T_i's is easy to track. Lemma 1 of Owen (2005) states that for T_i's measuring the "linear correlation" between L and X_i, if the expression of gene i and gene i' is independent of state L, then

    Cor(T_i, T_{i'} | x_i, x_{i'}) = ρ̂_{ii'}.    (2.2)

In Eqn. (2.2), ρ̂_{ii'} is the "Pearson product-moment correlation coefficient" between the samples x_i and x_{i'}. It can be verified through computer simulations that the same fact applies to t-statistics obtained through simple linear regression. The discussion surrounding Eqn. (2.2) suggests the following: when the expression of gene i and gene i' is independent of state L, and the statistics T_i and T_{i'} are due to simple linear regression, then

    Cor(T_i, T_{i'}) = ρ_{ii'},    (2.3)

where ρ_{ii'} is the "theoretical" correlation between gene i and gene i'. For exceedingly small n, the sampling error in ρ̂_{ii'} is a concern. This concern, when making inferences for millions of inter-gene correlation coefficients simultaneously, can be circumvented through clever data processing (see Section 3.2.2). Owen (2005, Section 4) and Efron (2007a, Section 2) both adopt a similar point of view.

For analytical convenience, the T_i's can be converted to z-values:

    Z_i = Φ⁻¹{G₀(T_i)},  i = 1, ..., m,    (2.4)

where G₀ is the putative null cdf mentioned above and Φ⁻¹ is the inverse cdf of N(0,1). For "truly null" cases we expect Z_i ~ N(0,1). Note that Eqn. (2.4) is a nonlinear, order-preserving monotone transformation. The relationship between Cor(T_i, T_{i'}) and Cor(Z_i, Z_{i'}) must be calibrated; in practice, the approximation Cor(Z_i, Z_{i'}) ≈ Cor(T_i, T_{i'}) is seen to work well (Efron, 2007a). Methods for direct estimation of Cor(Z_i, Z_{i'}) are discussed by Zou and Hall (2002).

Some DGE methods convert the test statistics T₁, ..., T_m to p-values, P₁, ..., P_m, through p(T₁|H₁), ..., p(T_m|H_m). For example, a one-sided p-value corresponding to T_i = t_i is obtained as p_i = min[Pr(T_i < t_i | H_i), Pr(T_i > t_i | H_i)]; a two-sided p-value is obtained as p_i = Pr(|T_i| ≥ |t_i| | H_i).
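The z-value and p-value maps are equally short. A sketch following Eqn. (2.4), where G₀ may be any vectorized null cdf (e.g., the permutation estimate sketched earlier) and the clipping constant is an implementation detail of ours:

```python
import numpy as np
from scipy.stats import norm

def z_values(t, G0, eps=1e-6):
    """Z_i = Phi^{-1}(G0(T_i)), Eqn. (2.4).  Clipping away from 0 and 1
    keeps Phi^{-1} finite when a statistic falls outside the null range."""
    u = np.clip(G0(np.asarray(t)), eps, 1.0 - eps)
    return norm.ppf(u)

def two_sided_p(t, G0):
    """p_i = Pr(|T_i| >= |t_i| | H_i), evaluated under the cdf G0."""
    t = np.abs(np.asarray(t))
    return (1.0 - G0(t)) + G0(-t)
```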
The “dot” in EF(V) emphasizes the fact that it is an “all null” expectation, which may be contrasted with EP(V), the true (unknown) expectation for that 1". These two expectations start deviating appreciably as the proportion of null genes, 7r0, reduces. If the putative null cdf G0 is obtained through permutations, then Eqn.(2.5) is a part of the Yekutieli- Benjamini F DP Estimator described in (Qiu and Yakovlev, 2006, Section 3.2). Benjamini and Hochberg showed that their procedure, which estimates the realized v/r through EF(V)/r with r referring to the realized value of R, can “bound” the average F DP, i.e., E(V/ R), at an arbitrary prescribed level. The average FDP is now widely known as the False Discovery Rate (FDR). The bounding of FDR is termed as “control of FDR” and discussed in the next section. 20 2.3 Literature Review and Contributions of This Research 2.3.1 Control. “Control of an error measure” is a concept that developed in the Frequentist-framework of LSST. Here, the error measures are compound involving a hypothetical data en- semble which can be generated by the experiment. The goal is to develop methods that can keep the error measure below a prescribed “level.” Among all the methods that can do so, the ones with more average power (the proportion of true discoveries which are reported) are sought after. Controlling FDR: Independent tests. The sequential p-value algorithm of Ben- jamini and Hochberg (1995), which controls the FDR at level a, has following steps: 1. Let H1...Hm be the null hypotheses and P1...Pm their corresponding p- values 2. Order p-values in increasing order and denote them by P(l) . . . P(m) 3. Find R = argmax (P(k)m) 3 (ka), where k is an integer ngSm 4. Reject all H“) for i = 1,. . .,R. This algorithm assures that the reported discoveries have an “estimated F DP” always below a. Benjamini and Hochberg proved the following: When truly null cases are statistically independent, then in the long run behavior the above procedure guaran- tees E(V/ R) S (1. Storey et a1. (2004) provides an alternative and a more concise proof. That the equality holds, requires additional assumptions. 21 According to Benjamini (2008) the motivation behind their 1995 method was to deal efficiently with around 100 statistically independent hypotheses in a clinical setting. Their method is now applied to situations with thousands and sometimes even millions of cases and yet whenever independency holds there is no reason to disbelieve its efficacy. By the time it was realized that the method might be critically susceptible to inter-case dependency was realized, it had become a standard technique. In DGE detection, dependency is intrinsic because the gene expression data rep- resent the life process, a fundamentally coordinated activity. Yet the exact nature, form, and amount of dependency are still being debated—Section 2.4 discusses this in detail. This issue of dependency has profoundly affected the LSST research. Despite strong interest, the issue remains unresolved. Not only very little is understood about how to exploit dependency, but even to determine the adverse effects of dependency on methods developed with the assumption of independency has been found very challenging. In relation to dependency, a major emphasis in the LSST research has been to establish “validity” of any proposed procedure. Validity refers to the fact that for a given procedure, the FDR can be bounded below an arbitrarily prescribed level. Validity: Dependent tests. 
According to Benjamini (2008), the motivation behind their 1995 method was to deal efficiently with around 100 statistically independent hypotheses in a clinical setting. Their method is now applied to situations with thousands and sometimes even millions of cases, and yet, whenever independency holds, there is no reason to disbelieve its efficacy. By the time it was realized that the method might be critically susceptible to inter-case dependency, it had become a standard technique.

In DGE detection, dependency is intrinsic because gene expression data represent the life process, a fundamentally coordinated activity. Yet the exact nature, form, and amount of dependency are still being debated; Section 2.4 discusses this in detail. The issue of dependency has profoundly affected LSST research. Despite strong interest, the issue remains unresolved. Not only is very little understood about how to exploit dependency, but even determining the adverse effects of dependency on methods developed under the assumption of independency has proved very challenging. In relation to dependency, a major emphasis in LSST research has been to establish the "validity" of any proposed procedure. Validity refers to the fact that, for a given procedure, the FDR can be bounded below an arbitrarily prescribed level.

Validity: dependent tests. Benjamini and Yekutieli (2001) show that for tests with a positive-regression dependency condition, estimating the FDP as Ė_Γ(V)/r still ensures validity, and if this FDP estimator is modified as Ė_Γ(V) · (Σ_{i=1}^{m} i⁻¹)/r, then validity is ensured for arbitrary dependency. However, this implies a very conservative FDP estimation; for example, with m = 25000 one has Σ_{i=1}^{m} 1/i ≈ 10.7, so the working significance level is effectively scaled by 1/10.7. A slightly different but more provocative interpretation is the following: in order to ensure validity in the presence of dependency, the realized data are judged at a much more stringent FDP level than what is needed. Benjamini (2008) argues that real situations fall between positive regression dependency and arbitrary dependency, and hence a less conservative scaling should work. Instead, van der Laan et al. (2004) propose to satisfy the probabilistic constraint Pr(FDP > γ) < α by modifying existing procedures that satisfy the probabilistic constraint Pr(V > γ) < α; this conversion does not require independent tests, but what happens to the average power is not clear. Lehmann and Romano (2005) propose two more methods to satisfy the probabilistic constraint Pr(FDP > γ) < α under dependency. The first method works under an abstract mathematical condition on the dependency structure claimed to be reasonable. The second method works for arbitrary dependency, but the same problem as in Benjamini and Yekutieli (2001) persists: the realized data are judged at a more stringent FDP level than necessary. Since the existing FDR procedures that are capable of ensuring validity under arbitrary dependency are highly conservative, most investigators still rely on the original Benjamini and Hochberg procedure.

Validity and variance. After proving the validity of a procedure, for arbitrary dependency or for a special dependency structure of interest, further evaluations are still needed. An important measure is the variance of the FDP estimation, E(FDP − α)²; here, the actual FDP is contrasted with the intended level α. Owen (2005) argues that this could be a serious issue because the inter-gene dependency found in most microarray data tends to inflate the variance of V considerably compared to what it would be under independency; in the cases considered by Owen, the inflation is nearly 100-fold. Therefore, approximating the realized v by an all-null theoretical expectation under heavy dependency may be critically unrealistic. A simulation in (Efron, 2007a, Section 5) emphasizes this issue. Qiu et al. (Qiu and Yakovlev, 2006; Qiu et al., 2005a, 2006, 2005b) also explore this effect and warn against neglecting the effect of dependency on an overall, as well as a case-to-case, basis. An attempt to develop a better estimate of v may be worthwhile; however, the ways to do so are not obvious.

Conditional v estimates. Efron suggests "conditional v inferences" to account for inter-gene dependency. The notion of purity as in Section 2.5 is crucial for conditioning to work. Efron (2004) proposes a technique called the "empirical null" to perform conditioning; this technique works solely on the realized t_i histogram, and the dependency observed in the actual data is not entertained. In a follow-up paper (Efron, 2007a), Efron develops a second moment theory for the "null T_i histogram" to gain insight into the behavior of V. He makes a crucial observation: under heavy inter-gene correlation, the behavior of the null T_i histogram is highly regulated, in the sense that between its tail and its center there is extreme negative correlation.
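Both effects, the variance inflation of V stressed by Owen (2005) and the tail-center negative correlation observed by Efron (2007a), already show up in a toy one-factor simulation. Everything below (m, the common correlation ρ, the cut-offs, the replication count) is an illustrative choice of ours, not a value from the text:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m, reps, rho = 5000, 2000, 0.3
cut = norm.ppf(0.99)                       # one-sided tail region: z > cut

tail = np.empty(reps)                      # tail counts, correlated nulls
center = np.empty(reps)                    # center counts, correlated nulls
tail_ind = np.empty(reps)                  # tail counts, independent benchmark
for b in range(reps):
    u = rng.standard_normal()              # shared factor => pairwise corr. rho
    z = np.sqrt(rho) * u + np.sqrt(1 - rho) * rng.standard_normal(m)
    tail[b] = np.sum(z > cut)
    center[b] = np.sum(np.abs(z) < 1.0)
    tail_ind[b] = np.sum(rng.standard_normal(m) > cut)

print("Var(V), independent nulls:", tail_ind.var())   # roughly m * .01 * .99
print("Var(V), correlated nulls :", tail.var())       # far larger
print("corr(tail, center)       :", np.corrcoef(tail, center)[0, 1])
```

Under correlation, the tail count variance typically inflates by orders of magnitude, and the correlation between the tail and center counts comes out strongly negative, mirroring the regulated behavior described above.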
He makes a crucial obser- vation that under heavy inter—gene correlation the behavior of the null T,- histogram is highly regulated in the sense that between its tail and its center there is extreme negative correlation. Contribution 1. Based on Efron’s observation, this work builds a moment-based estimator of the number of false discoveries. Sections 3.1—3.3 develop the methodol- ogy behind the proposed estimator and Section 3.4 uses real and simulated examples for verification. The key issue, the proposed estimator addresses, is the exceedingly small n situations as encountered in the HIV example of Section 1.3 where extracting any useful information about the dependency structure in itself is a major challenge. A notable feature of the proposed approach is that it naturally leads to an estimate of the entire probability distribution of the random variable V. 24 2.3.2 Power For the present LSST formulation of DGE, the test statistics T1,...,Tm are “de- signed” to capture linear dependence between random variables L and X1, . . . ,Xm. The observation that there might be ways of mapping the original set of statistics (T1, . . . ,Tm) to a newer set (Ti’, . . . , T3,) for better statistical power is very appeal- . ing. There are two notable and successful attempts in this regard, both, at some level, relying on the assumption of independent or weakly dependent Ti’s. Estimating proportion of null cases. Such attempts can be seen as identically scaling the original (T1, . . . ,Tm) in terms of their ability to detect linear dependence. The philosophy behind this adjustment is: if the proportion of null genes (no) is ap- preciably small, then the case for linear dependence is strengthen accordingly. Storey (2002a) and Pawitan et al. (2005) present and review successful attempts. Most no estimators require the assumption of independency or weak dependency among Ti’s. Borrowing strength. The understanding that a better estimate of variance in the standard t-statistic [Eqn. 2.1] should yield more power, led to the idea of a “stabilized variance estimation,” also known as “smooth t-test.” The idea has been reintroduced in several forms. Baldi and Brunak (2001) use a Bayesian approach to trade-off between the sample variance for the gene of interest and a combined sample variance from similar looking samples; it is called a “shrinkage” estimate. Newton et al. (2001) use a Gamma-Poisson model which achieves a similar effect. The “fudge factor” of the SAM algorithm, Tusher et al. (2001), is also of the same variety. All these approaches make an implicit assumption that amount of inter-gene correlation is small enough so a combined sample variance from multiple genes will not suffer from excessive random 25 fluctuations. Borrowing strength through signal-induced correlation. Tibshirani and Wasser- man (2006) present a scheme to combine Ti’s using the inter-T,- correlation for ob- taining Ti*’s with better statistical power. Their scheme averages a T,- with other Ti’s in its “correlation neighborhood.” They assume that the correlation among Ti’s is primarily due to the treatment effect. Therefore, “truly” non-null Ti’s would add up together and map to stronger Cl}*’s by reducing the noise through averaging, and truly null Ti’s would largely be unaffect by the averaging. Philosophically their idea is to add-up signal strength among non-null cases through the device of sample correlation. Storey et al. (2007) present an interesting, more general approach, of accomplish- ing a similar effect. 
Borrowing strength through signal-induced correlation. Tibshirani and Wasserman (2006) present a scheme to combine the T_i's using the inter-T_i correlation, obtaining T_i*'s with better statistical power. Their scheme averages a T_i with the other T_i's in its "correlation neighborhood." They assume that the correlation among the T_i's is primarily due to the treatment effect. Therefore, "truly" non-null T_i's would add up together and map to stronger T_i*'s by reducing the noise through averaging, while truly null T_i's would be largely unaffected by the averaging. Philosophically, their idea is to add up signal strength among the non-null cases through the device of sample correlation. Storey et al. (2007) present an interesting and more general approach to accomplishing a similar effect.

Exploiting residual correlation. Microarray samples that are assigned the same "state" do exhibit substantial inter-gene correlation, which suggests that the source of inter-gene correlation is more intrinsic, since hypothetically there are no treatment differences between identical states. Therefore, we conclude that the source or sources of correlation affect both null and non-null genes. However, the possibility that this "intrinsic" correlation among the T_i's can be "exploited" to yield a superior (T₁*, ..., T_m*) is unexplored, at least in the LSST formulation of DGE. This work attempts to do so. Our contribution in this regard is described below.

Contribution 2. The basis of Contribution 1 is to perform conditioning on evidentially more likely null genes and, while doing so, incorporate the observed inter-gene correlation. Philosophically, we employ the same idea, but this time to yield a better (T₁, ..., T_m) → (T₁*, ..., T_m*) mapping. Sections 4.1-4.3 develop the methodology behind the proposed re-ranking procedure, and Section 4.4 uses real and simulated examples for verification. The approach relies on the residuals of simple linear regression. This contribution is complementary to the first, as the proposed approach yields gains in moderate n situations, for example the prostate cancer data of Singh et al. (2002) as in Section 4.4. If substantial null-null correlation is present and purity holds, then the gain in statistical power is notable.

2.4 Dependency in Gene Expression Data

Dependency refers to the fact that in a random data set the expression levels X_{ij} and X_{i'j} are statistically dependent. Formally, p(X_{ij}, X_{i'j}) ≠ p(X_{ij}) p(X_{i'j}). It is helpful to think from the point of view of the mutual information between X_{ij} and X_{i'j}. Recall that the mutual information (Cover and Thomas, 1991) between random variables X and Y is defined as

    I(X; Y) = ∫∫_{X,Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy.

[…]

If r_{ij} models the correlation of a z-value pair (Z_i, Z_j), then, by adding bivariate normality, Eqn. (3.5) can be approximated as

    μ₂[k, l] ≈ (m²Δ² / (2π√(1 − r_{ij}²))) exp( (2 r_{ij} z[k] z[l] − z[k]² − z[l]²) / (2(1 − r_{ij}²)) ) − ⟨Y_k⟩⟨Y_l⟩ + 1_{k=l}⟨Y_k⟩,    (3.6)

where 1_{k=l} equals 1 if k = l and 0 otherwise. Now, let g(r) denote the empirical density of the pairwise correlations of the m(m−1)/2 (Z_i, Z_j) pairs. This quantity, together with Eqn. (3.6), yields a useful approximation for the second moments, as stated in the following proposition.

Proposition 3.2.1. The (central) second moment of a histogram count pair (Y_k, Y_l), where k may equal l, is given by

    μ₂[k, l] ≈ m²Δ² G_g(z[k], z[l]) + 1_{k=l}⟨Y_k⟩ − ⟨Y_k⟩⟨Y_l⟩,    (3.7)

where G_g(x, y) = ∫ φ_r(x, y) g(r) dr, with φ_r denoting the bivariate normal density having standard normal marginals and correlation r.
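Proposition 3.2.1 involves only a one-dimensional integral against g(r), so μ₂[k, l] is straightforward to evaluate numerically. A sketch, in which the (1 − r²)^β form for g(r) and all grid parameters are placeholder assumptions in the spirit of Section 3.2.2:

```python
import numpy as np

def make_g(beta, n_grid=2001):
    """Correlation density g(r) proportional to (1 - r^2)^beta on a uniform
    grid, normalized by a Riemann sum (a stand-in for the modified beta
    family fitted in Section 3.2.2)."""
    r = np.linspace(-0.999, 0.999, n_grid)
    w = (1.0 - r**2) ** beta
    return r, w / (w.sum() * (r[1] - r[0]))

def G_g(x, y, r, g):
    """G_g(x, y) = integral of phi_r(x, y) g(r) dr, with phi_r the bivariate
    normal density having standard normal marginals and correlation r."""
    phi = np.exp((2 * r * x * y - x**2 - y**2) / (2 * (1 - r**2))) \
          / (2 * np.pi * np.sqrt(1 - r**2))
    return np.sum(phi * g) * (r[1] - r[0])

def mu2(k, l, z, m, delta, r, g, Ybar):
    """Central second moment mu2[k, l] of a histogram count pair, Eqn. (3.7).
    z[k] are bin centers and Ybar[k] the expected counts <Y_k>."""
    val = m**2 * delta**2 * G_g(z[k], z[l], r, g) - Ybar[k] * Ybar[l]
    return val + (Ybar[k] if k == l else 0.0)
```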
For the third moments, the analogous expansion of a product of three histogram counts is useful:

    E(Y_k Y_l Y_j) = E{ (Σ_{p=1}^{m} 1[Z_p ∈ Z_k]) (Σ_{q=1}^{m} 1[Z_q ∈ Z_l]) (Σ_{r=1}^{m} 1[Z_r ∈ Z_j]) }
                   = Σ_{p≠q≠r} Pr(Z_p ∈ Z_k, Z_q ∈ Z_l, Z_r ∈ Z_j)
                     + 1_{k=j} Σ_{p≠q} Pr(Z_p ∈ Z_k, Z_q ∈ Z_l)
                     + 1_{l=j} Σ_{p≠q} Pr(Z_p ∈ Z_k, Z_q ∈ Z_l)
                     + 1_{k=l} Σ_{p≠q} Pr(Z_p ∈ Z_k, Z_q ∈ Z_j)
                     + 1_{k=l=j} Σ_p Pr(Z_p ∈ Z_k),

where 1_{k=l=j} = 1 if k = l = j, else 0. Let

    R₃[i, j, k] = [ 1       r_{ij}   r_{ik}
                    r_{ij}  1        r_{jk}
                    r_{ik}  r_{jk}   1      ]

denote the 3 × 3 covariance matrix of the triplet (Z_i, Z_j, Z_k), where i ≠ j ≠ k. Notice that this matrix is an element of the space of 3 × 3 correlation matrices, say R₃, and that the matrices in R₃ are always positive (semi-)definite. Now let h(R₃) denote the empirical density of all such R₃[i, j, k] for the true p(Z₁, ..., Z_m). Together with Eqns. (3.4)-(3.7), this yields an approximation for the third moment, which is again embodied in a proposition.

Proposition 3.2.2. The (central) third moment of a histogram count triplet (Y_k, Y_l, Y_j), where k, l, and j may be equal, is given by

    μ₃[k, l, j] ≈ m³Δ³ H_h(z[k], z[l], z[j])
                  + m²Δ² [ 1_{k=l} G_g(z[k], z[j]) + 1_{k=j} G_g(z[k], z[l]) + 1_{l=j} G_g(z[l], z[k]) ]
                  + 1_{k=l=j}⟨Y_k⟩ − ⟨Y_k⟩⟨Y_l⟩⟨Y_j⟩
                  − [ ⟨Y_k⟩ μ₂[l, j] + ⟨Y_l⟩ μ₂[k, j] + ⟨Y_j⟩ μ₂[k, l] ],

where

    H_h(x, y, z) = ∫_{R₃} (2π)^{−3/2} |R₃|^{−1/2} exp( −(1/2) [x, y, z] R₃⁻¹ [x, y, z]ᵀ ) h(R₃) dR₃,    (3.8)

and G_g(x, y) and μ₂[i, j] are defined in Proposition 3.2.1. The error in this approximation is of O(Δ³).

The integral in Eqn. (3.8) is computed over three dimensions. A k-th moment calculation would involve a (k choose 2)-dimensional integral and require integration over R_k. Next, to get the moments of (V, C), we combine the moments of the corresponding Y_k's (here Γ_v and Γ_c denote the tail and center regions associated with V and C, respectively). For example,

    E(V − E(V))² = Σ_{k,l : Z_k, Z_l ⊂ Γ_v} μ₂[k, l],    (3.9a)
    E(C − E(C))² = Σ_{k,l : Z_k, Z_l ⊂ Γ_c} μ₂[k, l],    (3.9b)
    E{(V − E(V))(C − E(C))} = Σ_{k : Z_k ⊂ Γ_v} Σ_{l : Z_l ⊂ Γ_c} μ₂[k, l].    (3.9c)

[…]

For the transformation Σ → (S, P), the Jacobian is given by 2^m (Π_i s_i)^m [see Theorem 3, Olkin (1953)]. Thus, after marginalization over S,

    f_m(P | ν) ∝ |P|^{−(ν+m+1)/2} Π_{i=1}^{m} ∫₀^∞ s_i^{−(ν+1)} exp( −p^{ii} / (2 s_i²) ) ds_i,    (3.16)

where p^{ii} is the i-th diagonal element of P⁻¹. The product occurs because of the independence of the s_i's. Similarly to (Barnard et al., 2000), substituting w_i = p^{ii}/(2 s_i²) yields

    f_m(P | ν) ∝ |P|^{−(ν+m+1)/2} Π_{i=1}^{m} (p^{ii})^{−ν/2} Π_{i=1}^{m} ∫₀^∞ w_i^{(ν−2)/2} exp(−w_i) dw_i,    (3.17)

which leads to an expression for the probability density of the correlation matrix P:

    f_m(P | ν) ∝ |P|^{(ν−1)(m−1)/2 − 1} Π_{i=1}^{m} |P_ii|^{−ν/2},    (3.18)

where P_ii is the i-th principal sub-matrix of P, and p^{ii} = |P_ii| / |P|, with |A| denoting the determinant of A.

For P with probability density Eqn. (3.18), the marginal density of an arbitrary κ × κ correlation sub-matrix, denoted by P_κ, is obtained as follows. Let the κ × κ covariance sub-matrix Σ_κ undergo the transformation Σ_κ → (S_κ, P_κ). Then, due to the marginalization property of the inverse-Wishart, Σ_κ ~ W_κ⁻¹(I, ν − m + κ). Steps (3.13)-(3.18) for Σ_κ yield:

    f_m(P_κ | ν) ∝ |P_κ|^{(ν−m+κ−1)(κ−1)/2 − 1} Π_{i=1}^{κ} |(P_κ)_ii|^{−(ν−m+κ)/2}.

Substituting κ = 2 in (3.15) yields

    f_m(P₂ | ν) ≡ f_m(ρ₁₂ | ν) ∝ (1 − ρ₁₂²)^{(ν−m−1)/2},  |ρ₁₂| ≤ 1,

which has the same parametric form as Eqn. (3.11). By setting ν − m = 2β + 1, we can force the inherent marginal densities of the entries of P [ρ_{ij} (i ≠ j)] to equal g(ρ), the specific aim with which the derivation began. Finally, substituting κ = 3 in (3.15) gives:

    h(P₃) ∝ (1 − ρ₁₂² − ρ₂₃² − ρ₁₃² + 2 ρ₁₂ ρ₂₃ ρ₁₃)^{2(β+1)} / [ (1 − ρ₁₂²)(1 − ρ₂₃²)(1 − ρ₁₃²) ]^{β+2}.    (3.19)

It should be noted that even though for large m the inverse-Wishart, Eqn. (3.13), is a tenuous assumption, its use is strictly to estimate h(P₃) from g(ρ); there is no concern for the entire P. Also, since single-parameter probability densities on a positive definite matrix space are very few, this one is chosen for its useful "marginalization" property. The justification of this choice rests on the fact that Eqn. (3.19) is an empirically realistic estimate, as evident in the results of Section 3.4. The "Bayesian correlation priors" point of view (Liechty et al., 2004) was especially helpful in formulating these ideas. Exploring other ways to obtain h(P₃) is a subject for future research.
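Eqn. (3.19) itself is inexpensive to evaluate. A small helper, unnormalized and with a guard for matrices outside the correlation space, might look like the following (the function name and interface are ours):

```python
import numpy as np

def h_P3(r12, r13, r23, beta):
    """Density h(P3) of Eqn. (3.19), up to its normalizing constant.

    The numerator's base is |P3| = 1 - r12^2 - r13^2 - r23^2 + 2*r12*r13*r23;
    it is nonpositive exactly when (r12, r13, r23) does not form a valid
    3 x 3 correlation matrix, in which case the density is zero."""
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    if det <= 0:
        return 0.0
    num = det ** (2 * (beta + 1))
    den = ((1 - r12**2) * (1 - r13**2) * (1 - r23**2)) ** (beta + 2)
    return num / den
```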
2 S Vmax, Cmin S j S Cmax, O S 2+]. S m}: Where Vmin = max([(V) — l - Std(V)_|,0), Vmax = min (RV) + l - Std(V)l,m); Cmin = max (L(C) — l - Std(C)j,0), Cmax = min ([(C) +l - Std(C)l,m). Here, Chebyshev’s inequality guides the choice of parameter I: For l 2 6, the loss of accuracy due to truncation is negligible. Computation can be reduced even further by recognizing the fact that the distri- butions imposed on the basis of a small number of moment constraints often enjoy a high-level of regularity and a sparser mesh should suffice. In fact, the computation— accuracy trade-off is easy to deal with, if the task is changed to learning a continuous probability density p(x, y) over continuous support St = {(x,y) : x E range(V),y E range(C), #22 S x +y S 1 — W}. 47 (0,0) V ——> (m,0) (“m f C) (1 _%),_—_(n¥)) Figure 3.1: Discrete support Dt (Cl markers) versus continuous support St (solid boundary). St is standardized to improve numerical stability. range(Y) is [W, W] (see Figure 3.1). The above standardization offers numerical stability, but the moment constraints must be scaled appropriately. Let ’Pc denote the space of feasible p(x,y)’s, then Vp(x, y) 6 ’PC: [8 xiyjp(x,y) dx dy = pij, and 0 S i +j s K; (3.20) t where ”v = E{(V — (V))’(C’ — (0))1'} /mi+J’ . (3.21) In Eqn. (3.21) (i, j):(0,0) corresponds to the constraint fSt p(x,y) dx dy = 1, while (i,j):(1, 0) together with (i,j):(0, 1) imply that Vp 6 ’Pc has mean (0,0). Selection of a unique p(x, y) relies on the principle of entropy maximization (max- ent) which seeks a p(x, y) 6 PC with maximum information entropy (Jaynes, 2003). The information entropy essentially measures the spread of the distribution, and hence, maxent can be seen as a criterion, which within the knowledge constraints, chooses a least “assuming” probability density—arguably, a correct approach in the framework of statistical inference. 48 The application of maxent leads to the following optimization problem: p*(x,y) = argmax {—fs p(x, y) lnp(x,y) dxdy}. (3.22) t p(x,y)€Pc The solution takes the following exponential form: EXP (ZlgHjSK Aijxiyj) f5, exp (219+ng Atjxiyj) d2: dy 1910?, y) = (3-23) Solving (3.22) requires concepts from the Calculus of variations and Lagrange multi- pliers. The reasoning leading to the exponential form Eqn. (3.23) and a procedure to determine optimal Aij’s will now be discussed. In addition to the fact that information entropy functional is concave (Cover and Thomas, 1991), the constraints in Eqn. (3.21) are also linear in p(x,y). Thus, the problem in Eqn. (3.22) is a convex program which can readily be solved in a Lagrangian dual framework, where one works with an unconstrained upper-bound (lower-bound if minimization) that is easy to optimize. More importantly, in the present case, the framework allows the conversion of the original infinite dimension problem of functional variation into a finite dimensional problem with as few variables as the number of constraints. Proposition 3.2.3. The dual \IJ(/\) of the concave optimization problem Eqn. (3.22) is given by: \II()\)=ln [Sexp Z Aijxiyj dxdy — Z Aijpzj, (3.24) t 19+ng 29+ng where )‘ij is the Lagrange multiplier corresponding to the (ij)th constraint and )1”, 49 and i,j, K are defined in Eqn. (3.21). Proof. By the definition of the Lagrangian dual function, the Lagrangian III(/\) : 311p [- [Stp($,y)1np(x,y)dx dy+ Z An (thiyjpwwwx dy - #2“) p(x,y)el’ i+ng (3.25) Taking the functional variation of the square bracketed term in Eqn. 
(3.25) with respect to the unknown density p(x, y) and using the fact that f St p(x, y) dx dy = 1, the maximizer of Eqn. (3.25) is obtained, eXp (219+ng 42739319) 1010:, y) = (3.26) fSt ‘3’“) (219% _<_K Atjmiyj) 611' dy Inserting Eqn. (3.26) into Eqn. (3.25), yields \IJ()\) =/Sp(x.y)1n [.9 exp. 2 Win) dandy dxdy— Z w” t t 19+ng 19+ng 1 I = In [/ exp ( Z Aiszgfl) d2) dyJ — ZZSi+jSK Aijpzj, (3.27) 5‘ 132+ng . _ 00 _ 10 _ 01 _. where the facts fStp(x,y) dx dy u — 0, u — 0, and ,u — 0 from Eqn. (3.21) have been used. [I It is easy to verify that the Hessian of Eqn. (3.24) is positive definite and hence \I'(/\) is convex. Suppose X" is the minimum of \II(/\), then the corresponding pri- mal solution p” (x, y)——obtained via Eqn. (3.26)—indeed maximizes Eqn. (3.22). To verify this, let p0(x,y) be the maximizer of Eqn. (3.22), then from Eqn. (3.25) \II(A) Z ’H{f°(x,y)},V)\. Now from the general optimization theory, the functional 50 variation of Lagrangian with respect to p(x,y) evaluated at p0(x,y) must be zero, which implies p0(x, y) could be written in the form Eqn. (3.26) for some A0. But then, ‘1’(/\*) S ‘I’(/\0) => ‘1’(/\*) S H{f0($, 31)}; consequently, WW) = H{f°($, 31)} => X“ = A0; hence va (x,y) is the maximizer of Eqn. (3.22). What remains is the need for an efficient method of minimizing \II(/\). “Newton’s method” is used. According to Newton’s method, if a multi-variable function \II(/\) is twice difl'erentiable and the initial guess point A0 is in the “neighborhood” of X", then the sequence n+1 = An — amnion—lawn). n 2 o (3.28) converges to X". In (3.28) A\II(/\n) denotes the gradient of \II()\) evaluated at An and H 0110171)} the Hessian. The parameter 7 > 0 allows finer control of step sizes to avoid numerical instabilities. At the nth iteration, \II(/\) is replaced by its second- order Taylor expansion around An and then minimized exactly, which produces the minimum An+1. At the (n + 1)th iteration, An+1 becomes the point of expansion and the method continues until it convergences. ~ The elements of the gradient A\II()\) are given by: ~ . . exp :1<'+'"1 {Co(t,-)}, Eqn. (3.2). ti’s are linear regression test statistics (Section 2.1), Q is the test statistic cdf, and (D‘l is the standard normal inverse cdf. 2. Determine fl, Eqn. (3.12), summarizing the inter-Z,- correlations. 3. Partition [—5, 5], the Zz- sample space, into K = 100 equal width bins (A = 0.1) and use Eqn. (3.4), Proposition 3.2.1, and Proposition 3.2.2 to compute the first three moments of the Z,- histogram 4. Use determined 2,; histogram moments to compute the first three moments of V and C, Eqn. (3.9) 5. Use determined moments with the maxent, Section 3.2.3, to estimate P(V, C) 6. Use the estimated P(V, C) to determine P(Vlc) 53 3.4 Test Cases MAT LAB code for the algorithm developed above can be requested via email at keyurdesai@gmail.com. The approach was tested on two real data sets, both showing a significant amount of inter-gene correlation and unusual null count behavior. Ca1- culations below are for 61 = —2.5 and 60 = 1. Comparisons with the second-order i) of Efron (2007a) are also made. 3.4.1 Real Data The BReast CAncer (BRCA) study of Hedenfalk et al. (2001) has m = 3226 and n = 15 with 7 samples assigned to BRCAl mutations and 8 to BRCA2. The study sought to identifying genuine mRNA activity differences between these two categories. The study used two—color microarrays and hence the measurements are in terms of “ratios.” The logarithms of these ratios are used to raise normality (Tsai et al., 2003). 
The HIV study of van’t Wout et al. (2003) has m = 7680 and n = 8 with 4 samples assigned to an HIV infected condition and the remaining 4 to the control. The control (CD4-T cell lines) was infected by the HIV-IBRU virus. This study reported raw mRN A levels which were also converted to logarithms for the present purpose. The approach reduces the entire data matrix to just two parameters: The observed C and the ,8. As evident in Figure 3.2, the parametrization of Section 3.2.2 is realistic. For BRCA example 8:17.77 and for HIV )8—351. Additional details are provided in the caption. The next step is to compute the moments of (V, C) per Propositions 1 and 2. These 54 x 10 x 10 - ii,- —histogram - Tij —histogram ——N(0, 0.3382) (1 — r2)17'77 4 - , 4 - G 5 c c a) I I a) 3 3 c- c7 9 “2 LI. I LL 2 r 2 r . I . ( I ”I 'III 0 H . O ..IIII. . I. '2 0 2 "I 0 1 r r (a) Figure 3.2: Caption appears on the next page below Figure 3.2. calculations require m, 8, c, 6, and A. A = 0.1 is selected. A maxent distribution was fit to these moments. Figure 3.3 reports the moments and the corresponding maxent distribution for the BRCA data. Here, (V, C) show strong negative correlation of -0.89; a similar figure is reported by (Efron, 2007a, Table 1). Furthermore, V shows significant positive skewness, which causes C to show negative skewness. This is not surprising as V is bounded below by 0 and yet has small mean but inflated variance. In effect, third-moment provides additional detail about the joint behavior of (V, C). 55 6 6 x10 x10 3.25 ’ - fij—histogram4 ’ _ (1 _ 7.2)351 3 " —N(0, 0.6542) rs-histogram 3 L a 2’ a c 5 s 3 3 3 II 11. 1 - 1 - e o O -2 o 2 —1 p 1 (b) Figure 3.2: Effect of sampling fluctuations on the empirical correlation density. (a) BRCA example (b) HIV example. For each sub figure: Left panel is the histogram of sample correlations after applying the Fisher transformation (3.10) and a normal distribution (heavy curve) fit to it; Right panel is the histogram of de-noised corre- lations and a modified beta distribution fit to it (heavy curve). This summarizes the cumulative effect of (75”) gene-gene correlations in a single parameter ,6. During the maxent numerical optimization a 100 x 500 equal-spaced mesh was found sufficient for the BRCA study; however, for the HIV study, it was necessary to expand the mesh to 400 x 2000 because of both larger m and heavier (V, C) correlation. The BRCA Optimization took 30 iterations to converge, whereas the HIV optimization 56 289.7 33113.3 10299.8 -1399069.6 -66210.9 3882259 2600 \ 2400 2200 2000 (7 1800 100 2200 2000 ' 1800 0 V C 0 V Figure 3.3: BRCA example: Estimated (V, C) moments and a maxent distribution fit to it. Third moment estimate of P(V, C) (left) exhibits finer details than the second moment estimate (right). took 70. Figure 3.4 reports the estimated P(V|c). Second moment and third moment estimates are shown separately. In the framework of statistical inference, such a distribution is the ultimate goal. Point estimates and associated confidence intervals 57 0.05 . . f .‘x 97—109 0.04 - .' ‘. . ' I "—1 I 73—85 : 1| O : I. a :Iznxrnuxn.‘ ............ 53 0-02" : 69—89 ‘ 9 —113 ' I ' | ' 0 0.01 - ,' 1‘ ' | ' 0 ' \ ' s . . _v' . 1 1‘. __ _ 1 00 20 40 60 80 100 120 The Number of False Discoveries (NoFD) (a) 0.1 0.08 - a 0.06 - LL. o E, . 5: 0.04 3.‘ ‘Q ~~ 0.02 - N,‘ 40 50 60 70 0 1’0 20 30 The Number of False Discoveries (NoFD) (b) Figure 3.4: The distributions of the number of false discoveries; Panel (a) the BRCA study, Panel (b) the HIV study. 
To show the effect of skewness corrections, the third moment distribution (solid curve) is compared to its second moment counterpart (dashed curve). For BRCA the second moment mean estimate is 79 compared to 104 for the third-moment; while for HIV these are 19 and 8. The BRCA panel also shows 50% (solid line) and 75% (dotted line) confidence intervals. 58 can be extracted in accordance with a ‘loss function.” If the mean of the estimated P(Vlc) is used to determine the realized v, then for the BRCA study, third moment calculations suggests 104 false discoveries versus 79 for second moment. Contrastingly, the “all-null—theoretical—expectation” yields only 20 false discoveries. These numbers must be put in perspective by noting that the actual 2,; count falling in the left-sided tail-area [00, —2.5] is 116. The standard Benjamini and Hochberg 1995 procedure evaluates the F DP coincident with the left-sided tail- area [00, -2.5] as 0.1724, the second moment as 0.68, the third moment as 0.9, and the Benjamini and Yekutieli 2001 procedure as 1 (actually as 1.4759). For HIV example, third moment calculates 8 false discoveries compared to 19 for second-moment. Contrastingly, the “all-null-theoretical-expectation” suggests 48 false discoveries. This time the actual 2,- count in the left-sided tail-area [00, —2.5] is 46. In this case, the F DP coincident with the left-sided tail-area [00, —2.5] is evaluated by Benjamini and Hochberg 1995 procedure as 1 (actually as 1.0435), the second moment as 0.41, the third moment as 0.17, and the Benjamini and Yekutieli 2001 procedure as 1 (actually as 9.94). That the conservative scaling in Benjamini and Yekutieli 2001 procedure to ensure “validity” can cause a significant loss in statistical power (Section 2.3), becomes ap- parent in the HIV example. Whereas “all-null-theoretical-expectation” of Benjamini and Hochberg 1995 procedure can let through a potentially powerless data set, be- comes apparent in the BRCA example. However, extensive treatment of inter-gene correlation combined with identifiability may lead to more realistic conclusions. The second-order i) developed in (Efron, 2007a) found 77 false discoveries for the BRCA study. Efron compares that to the results of nonparametric analysis and concludes underestimation, but the issue is left unexplored. Our main finding is that 59 moments higher than second are important in describing the null Zz' histogram. 3.4.2 Simulated Data It is helpful to test the method on simulated data where the true answer is known. In simulation below all genes are attributed as “null” and hence no treatment effect is added. The objective of this all-null simulation is to evaluate the estimation accuracy of the developed method. The estimated tail-counts are contrasted with the true tail- counts. In each trial, a 3226 x 15 matrix with entries simulating raw mRNA levels is generated based on the Gamma-Gamma model described below. First 7 columns are assigned “state 1” and the remaining 8 are assigned “state 2.” The standard two-sample t-statistic is used enroute to the z-values, 21, . . . , 23226 The number of zi’s falling in the tail-area is the “true count” which is compared with the “estimated count” from the method. Let the mRNA level Xij of gene i, measured by jth microarray, be Xij ~ Gamma(k,9i), for j = 1, . . . ,n, (3.32) where Gamma(k, 6) is the Gamma distribution with the shape parameter k > 0 and the scale parameter 6’ > 0, similarly to the Gamma-Gamma model in (Newton et al., 2001). 
In (3.32) the shape parameter k is common to all genes. Note that the index variable I: of Section 3.2.1 has no connection with this k. The 6,- scale parameters characterize the underlying mRN A levels which vary from gene to gene: 6,- i'lé" Gamma(k0, 90), for i = 1, . . . ,m. (3.33) 60 The intuition that genes with bigger underlying mRNA levels should have higher variance is consistent with model (3.32) since the mean of the ith gene is k6,- and variance is 1062-2. The parameters (10, 100,60) may be chosen on the basis of the overall gene expression histogram of some real microarray data set. Three sets of results, (1,0.6, 500), (2, 0.39, 384), and (3, 0.33, 300), are presented. These numbers were chosen to preserve the total sample variance. These particular values are based on the HIV data of van’t Wout et al. (2003) which were collected using Affymetrix microarrays. In particular, case kzl implies that the Xz-j’s are exponentially distributed, while case 1022 implies a unimodal distribution with heavy tails and a noticeable departure from Gaussianity. Case 1023 is characterized by a more normal (Gaussian) looking distribution, however, with slightly heavier tails. Following technique was employed to add substantial row-wise correlation: Through the normal cdf, map the entries of a m x n matrix of correlated Gaussian random variables 20 = LTz, where 2,,- ‘kf’ N(0, 1), (3.34) to their p-values Pij = (ij), and further map these p-values into Xij’s through the inverse Gamma cdf’s as in accordance with Eqn. (3.32). In Eqn. (3.34), LTL = R is the Cholesky factorization of the correlation matrix R in which L is a lower triangular matrix. Developing proper correlation matrices for simulation is a challenging problem. Even the methods described in the comprehensive work of Marsaglia and Olkin (1984) fail to generate arbitrarily dense correlation matrices for a large m (m > 1000). In- stead we take a novel approach based on the recent work of Higham (2002) whose basic idea is to generate a symmetric matrix and covert it to a nearest correlation matrix based on Frobenius norm minimization. There are many ways to solve the 61 300’ (k = 1,160 = 0.6,00 = 500) J 41 250- + 0* 'E 3 + 3’ 8 200~ + 1 3 ‘5150- 5 .§ iii “100’ -I 50" 0 50 100 130 200 230 300 Realized count (3) . . , 1 . +1 300- (k = 2,190 = 0.391, 60 = 384) 1 + 250- +1.? I ’S’ o 8 200- + + 09 "U ‘1’ 5 a 150- 9 ) .§ 1‘3 100» 50- . 0 50 100 _150 200 250 300 ReaIIzed count (b) Figure 3.5: 61 = —2.0. For the caption, refer to panel (c) on the next page. 62 300- (k = 3, k0 = 0.333, 90 = 300) + + 250- 1 ++ E + ++#‘I§-* O 3 O 8 200- "U +. O .22 co 150- .E ‘0-0 6“ Q 11"} 100- 50 0 0 5‘0 100 130 200 250 300 ReaIIzed count (C) Figure 3.5: Simulation experiments comparing conditional estimates: third moment estimates + marker and second moment 0 marker. This figure corresponds to the left-sided tail-area with (51 = —2.0. Substantial row-wise correlation is present. The abscissa is the realized count while the ordinate is the estimated count. For the significance of different (10, k0, 60)’s refer to the main text. Third moment skewness corrections enhance the estimation accuracy. numerical Optimization proposed in Higham (2002); however, recently proposed N ew- ton method of Qi and Sun (2006) is noteworthy for its speed and accuracy. Higham (2002) provides a lucid exposition of the problem of finding the nearest correlation matrix. 
Figures 3.5 and 3.6 compare the second moment estimates versus the third moment estimates for 6 = —2.0 and 6 = —2.5, respectively. Both cases use 60 = 1 for the center-area boundary. The choice of a center-area is discussed in Section 3.5. For each sets of (19,100,190), 800 test data were simulated. On each test data the approach was applied in its entirety and no additional knowledge was assumed. Throughout the 63 Estimated count Estimated count Figure 3.6: For the caption, refer to panel (c) on the next page. 120 100- 80- 60- 40 20- 120: 100- 80- 60- 40- 20' (’9 = 11k0 = 06,00 = 50%)].1 + ++-‘ «I? +: + + #* ++ 0 +0) cg o ++++Qi+ +++J$ O¢8 . O + + O s cc 3 2° 4° 60 80 100 120 Realized count (3) (’9 = 2, k0 = 0391,190 = 384) :- ++ + + +++++‘ +fi+++++ +0 + ++qd++d$ogl + $ 60 l is 2’0 4’0 610 8’0 100 120 Realized count (b) 64 15,, (k = 3, k0 = 0.333, 00 = 300) + 75120- + + it o B 90’ ++ +08 0 ...- 00 (D O 0 .§ 1;; 60- I.“ 30- ,r 0 i) 610 . 90 120 130 Realized count (C) Figure 3.6: Left-sided tail-area with 61 = —2.5, else the caption for Figure 3.5 is applicable. mean of the estimated P(V/c) was used as the final estimate of the tail-area count. Note that the all-null-theoretical-estimates are always 20 for 61 = —2.5 and 73 for 61 = —2.0, regardless of the test data analyzed. For all three sets of (k, k0,190), third moment skewness corrections enhance the estimation accuracy. Evidently, the third moment estimates are significantly less prone to “over / under estimation events” than the second moment estimates. A positive results in three very different simulations emphasize the over-all usefulness of the developed method. 65 3.5 Discussion Improved DNA microarray technology, refined standardization procedures, and a careful execution of laboratory protocols collectively lead to testing situations with individually accurate but strongly dependent null hypotheses. Inter-test dependency arising from intrinsic gene-gene interactions cannot be circumvented by experimental design. Therefore, the effects of dependency on the accuracy of decision-making must be carefully analyzed. In particular, due to dependency, the “realized” v/r may vary substantially on a case-to-case basis and the control of E(V/ R) may no longer ensure the quality of reported discoveries. Treating general dependency is impractical, but admitting second-order depen- dency is possible and widely attempted. The developments in this chapter emphasize that an “explicit” combination of estimated correlation with identifiability can miti- gate the unfavorable effects of inter-gene correlation, and that the moment theory of null statistic histogram allows to do so. It is tempting to conclude that a large number of vectors drawn from the distri- bution N(0, R) can yield the required moment estimates and the mathematical work in Sections 3.2.1—3.2.2 is unneeded. However, such a conclusion neglects excessive sampling errors that are present in sample correlation coefficients. With substantial sampling errors, a realistic estimate of the underlying R cannot be obtained but the quantities g(r) and h(R3) are still obtainable. Permutation calculations, as in Section 4 of (Efron, 2007a), present an alternative way to estimate the moments. They too can run into computational difficulties, especially when the test statistic is computation intensive. An even bigger difficulty is innate when samples are few: For a two-state study like HIV, 4—4 samples each, only 70 unique permutations are available. 
Nevertheless the permutation based approach 66 is promising as a subject of further research. When a direct extraction of inter-test correlation is not feasible, the “single om- nibus parameter models” remain useful: They allow the user to systematically exer- cise judgement by selecting different values for fl to examine a range of “correlation effects.” The distribution of interest P(V, C) tends to have complicated support like ’D (Figure 3.1), and the maxent algorithm is well-suited to such complicated support regions. At a more fundamental level, moment is intended to minimize the amount of unintentional prior information brought into the inference. For the present approach, apart from the numerical parameters A (bin width) and the mesh resolution in maxent, the only open choice is 60, the center-area boundary. The choice 60 = 1 in this work is based on the first eigenvector analysis of (Efron, 2007a) which suggests that (within certain approximations) the interval [—1, 1] has completely opposite count behavior from the rest of the Z space. Does more inter-ZZ- correlation translate into more extreme Cor(V, C)? The answer is surprisingly no. In the BRCA example, Cor(V, C) is —0.89, whereas in the HIV example it is reduced to -0.75. Further insight into this behavior should be a useful contribution. Here, the aim was to mitigate the unfavorable effects of inter-gene correlation when estimating the number false discoveries. In fact, inter-gene correlation can be “exploited” to yield a superior gene-ranking. This is the topic of the next chapter. 67 Chapter 4 Exploiting Correlation to Improve Gene-ranking In the LSST DGE detection formulation (Section 2.1), genes are represented by uni- variate summary statistics. Moreover, there is correlation among test statistics due to inter-gene correlation among gene expression variations. The observation that in high-dimensional inference there can be ways of mapping the original set of statistics (t1, . . . , tm) to a newer set (t*, . . . , tin) with more statistical power is very appealing. The research described in this chapter develops a technique to obtain one such map- ping. The key idea is to combine correlation with purity through a distance metric that can account for the effect of correlation on the joint distribution of Ti’s. As a spe- cial case, we develop a method that builds upon the widely used two-sample t-statistic approach suitable for two-state studies. The method uses the Mahalanobis distance as the distance metric. An extension accommodating multi-state and continuous-state studies is also discussed. 68 Organization of the chapter. Section 4.1 provides an overview of the proposed approach. Section 4.2 obtains closed form expressions for the minimum Mahalanobis distance estimates. Section 4.3 builds on the theory of Section 4.2 to develop the tEllipsoid gene-ranking method. In Section 4.4, we apply tEllipsoid to the prostate cancer data of Singh et al. (2002) and evaluate the gain in statistical power. Sec- tion 4.5 discusses the implications of these results. 4.1 Overview Detecting differentially expressed genes in the presence of substantial inter-gene cor- relation is a challenging problem. Research has focused largely on understanding the harmful effects of correlation on the threshold settings demarcating null and non-null genes. The research discussed in Chapter 3 developed a method which explicitly admits the observed sample correlation in the analysis and offers a more accurate assessment of significance cut-offs. 
The research described in this chapter shows that correlation can, in fact, be exploited to share information across tests, which, in turn, can increase statistical power. The key contributions in Chapter 3 were in part mo- tivated by the exceedingly small sample situations (72. ~ 5-10), whereas the work in this chapter is intended to benefit situations with n 2 20. It is helpful to think in terms of mapping the original set of observed statistics (t1, . . . , tm) to a newer set (t*, . . . ,tfn) for better statistical power (Section 2.3). The literature is not devoid of attempts to develop such mappings that exploit correlation among (the random variable interpretation of these) test statistics, but such efforts have not produced compelling results. We posit that the limitations of such develop- ments are due, at least in part, to neglecting identifiabz’lz’ty—the act of incorporating 69 the genes whose test statistics fall near the center of the null distribution as explicitly null in the inference. The statistical risk of imposing identifiability weigh less than not doing so because of the fact that in most comparative studies the activity of a great majority of genes remains unchanged with respect to the treatment of interest (Section 2.5). This chapter presents a framework to obtain a better gene-ranking by combin- ing correlation and identifiability through an optimization criterion. The framework builds upon the widely-used two-sample t-statistic approach and uses the Maha- lanobis distance as the optimality criterion. Although the initial motivation was to improve statistical inference in two-state microarray studies, the framework readily generalizes to multi-states and continuous-states as well as to other multiple compar- ison applications. The connection between the standard t-statistic and simple linear regression as discussed in Section 2.1 is crucial to this generalization. Recall the basic LSST formulation from Section 2.1: Genes are represented by summary statistics and the corresponding null hypotheses, H1,H2,...,Hm T1,T2,...,Tm t1,t2,...,tm, where T,- refers to a “random value” and t,- the “observed value.” The magnitudes of ti’s establish a gene-ranking, and the top 7' << m genes with the largest t scores are reported as statistically significant discoveries. The investigator can either explicitly supply r or rely on the false discovery rate (FDR) calculations to find a maximal r with the allowable FDR. The present discussion assumes that r is fixed. 70 The issue of correctly estimating the F DP in the presence of correlation has re- ceived much recent attention because highly correlated tests increase the variance of the F DP leading to unreliable results (Owen, 2005). As discussed in Efron (2007a) and Chapter 3, for “over powered” data sets, there may be significantly fewer tail- area null counts than expected, while for “under powered” data sets, the situation can worsen with many more tail-area null counts than expected. Importantly though, techniques for correctly estimating the F DP do not change the gene-ranking, but only the size of the reported list. The research discussed in this chapter was motivated by the notion that, for “under powered” data sets, it might be possible to exploit correlation among test statistics to establish a gene ranking that has better statistical power than the original t2- based ranking. The method that resulted from an exploration of this question indeed seems to improve the statistical power of all data sets. 
The proposed method uses, (i) a vector of the observed statistics t = [t1,t2, . . . ,tm]T and (ii) an estimate of the covariance matrix of the vector T = [T1, T2, . . . ,Tm]T, to output a substantially revised version of t, denoted ult, whose corresponding entries can be used to establish an improved gene-ranking. The method is summarized as follows. Let T ~ (u, 2). Note that no distributional information nor higher order statistics of T are assumed. Now based on the observed value t, we can estimate ult, but while doing so we invoke the zero assumption (ZA) (Efron, 2008) that the smallest P0(%) of ti’s are associated with null genes. Based on the ZA, we can set corresponding entries of u to zero. For the remaining entries of u we obtain minimum Mahalanobis distance (Mahalanobis, 1936) estimates. Inter-gene correlation causes the vector T to distribute around the center of mass 11 in an hyperellipsoidal manner, and the Mahalanobis distance is a natural way to 71 measure vector distances in such a distribution. In fact, to emphasize the geometric intuition of tracking the center of an ellipsoid, the method is named as “tEllipsoid.” Through extensive experimentation with both real and simulated data, it has been found that for a truly null t,- which happens to be in tail-area, the corresponding uz-It consistently tends to zero (its theoretical value). TWO prior research efforts, Storey et al. (2007) and Tibshirani and Wasserman (2006), were particularly useful in formulating the present approach. Interestingly, both approaches aim at exploiting the signal structure among non-null tests, whereas the present approach aims at exploiting the structure among null tests; see Section 2.3 for details. 4.2 The Proposed Method An m x n matrix of gene expressions, for m genes and n samples, is given. For the present discussion we assume that the samples fall into two states 19 = 1 and k = 2 and there are ”k samples in group I: with n1 + n2 2 n. The generalization to multi- state and continuous-state is discussed at the end of the section. We start with the standard (unpaired) t-statistic: t2- = ———' (4.1) where ink is the mean of gene i in group k and s7; is the pooled within-group standard deviation of gene i. If the ith gene is indeed null, then we expect the random variable T,- ~ (0, u/ (12—2)). Here, the degrees of freedom 11 is obtained from either the unpaired t-test theory or the permutation null calculations as discussed in Section 2.1. Note that T,- is a random variable and t,- is its observed value. If gene i is non-null, then 72 we expect T,- ~ (11.2303). For non-null genes, the values of uz- and 0',- depend on the amount of up / down regulation, the number of samples in each group, and 11. Without loss of generality, we may assume that the genes are indexed so that W S |t2| S S ltml- (4-2) Then, a reasonable way to impose identifiability on null genes is through the ZA, namely, that P0(%) of the genes—those with the smallest t statistics—are null. Sec- tion 2.5 provides a comprehensive discussion regarding the rationale behind the ZA. The use of the ZA is justified in the present situation as long as P0 is sufficiently small so that the bottom P0(%) genes would almost certainly be declared null for reason- able FDR’s. Potentially, the statistical risk of “imposing” identifiability weigh less than not imposing it, especially when purity holds. Empirical evidence of Section 4.4 supports this intuition. 
Formally, the ZA is stated as follows: Let c be the largest integer (gene index) such that c/m 3 P0 / 100, denoted c = [0.01mPol, (4.3) then genes with indices 1,2,. . . ,c are assumed null. Let us partition the set of t statistics into those corresponding to genes declared null under the ZA, {t1, t2, . . . , tc}, and those for the remaining m — c genes which continue to compete for the non-null designation, {tc+1, tc+2, . . . , tm}. (The present c is different from the c in Chapter 3.) For convenience, we introduce the following vector notation, T T _ T T _ T T t - l to town l - l ta» to) l - 73 Then the random vector T is distributed in the following way: T ~ (11, E), (4.4) where u is the underlying mean vector and 2 the covariance matrix. The correspond- ing partitions of (u, E) are denoted u(0) and 2: 2(00) 2(01) “(1) 23(10) 2(11) u: The central hypothesis here is that there is a vector, say ult, whose elements represent a reordering of the elements of the t, such that gene-ranking represented by u|t has better statistical power for detecting non-null genes than that based on is itself. The present effort focuses mainly on the second moment distributional character- istics of t. However, in fact, if the gene expressions are normally distributed, then, perhaps, t is described more accurately by the multivariate Student distribution. Exploitation of this additional structure will be considered in future work. 4.2.1 Choosing a Distance Metric We are interested in obtaining an estimate of the vector mean 11 based on the obser- vation t. This requires an appropriate metric in the space of t vectors, with which to quantify the distance of the observed t from the center of mass 11, say dist(t, u). The 82 norm induces a useful metric between t and u provided that we first decorrelate 74 the vector elements as 2—1/2 (t — 11), thus yielding dist(t,u) = \/n>:-1/2(t—u)ll2 = fl; _ u)T2-1(t — u). (M) This weighted Euclidean distance is sometimes called the Mahalanobis distance in the pattern classification literature as in Deller et al. (1999) and Dejviver and Kittler (1982). We can relate t to u through the Mahalanobis distance but while doing so we invoke the ZA, which, in turn, implies that the first c entries of u are zero. This yields the estimate 0 u* = , where uh) = argmin (4.6) 1121) 11(1)ElRm_c T -1 t(0) " 0 23(00) ”(01) t(0) “ 0 to) - “(1) 2(10) 23(11) to) - “(1) In effect, ufl) combines the identifiability information based on the ZA with the information about the covariance structure of T which too can be obtained from the measured X itself. Notably the optimization in Eqn. (4.6) enjoys closed form solution: 1121) = 13(1) — 2(10)2&)}))t(0). (4.7) The derivation leading from Eqn. (4.6) to Eqn. (4.7) follows. 75 Suppose that .4 -A B . 2 = C D and u(1) = ta) — 11(1). (4.8) We also have C 2 BT. Substituting these in Eqn. (4.6) yields: 11:1) -_- fiSgEIElE—ctfg’fitw) + zeacqo) + afipam (4.9) 1 In Eqn. (4.9), by setting the gradient w.r.t 11(1) to 0, we obtain: 11),) = —CTD—1t(0). (4.10) Now for 2‘1, we can appeal to the matrix inversion lemma (Golub and Van Loan, 1996): ”(0%) (1+ 2(01)Q_12(10)2(_0l))) “Zahzwnq'l _Q*12(10)2(-0%)) Q—l 2‘1: 1 where Q = 201) — 2(10)2(—0%))2(01). Plugging this in Eqn. (4.10) yields: 4 _ _1 Combining Eqn. (4.11) with Eqn. (4.8) provides the desired expression: *_ —1 “(1) - t0) — 2<10>’3(oo)t(0) '3'- 76 4.2.2 An Intuitive Interpretation If t0) is written as “(1) + e(1), then Eqn. 
(4.7) can be seen as 2|: —1 Ul(1) : 11(1) + [9(1) ’ 23(10)2(00)t(0)l ' Then, the proposed method is interpreted in the following way: The method uses the identifiable null cases to predict and, in turn, suppress the “null energy” in the competing cases. 4.2.3 Estimating the Covariance Notice that Eqn. (4.7) involves theoretical covariances whose estimates must be ob- tained from the data set at hand. To estimate the required entries of )3, we make two observations. The validity of these observations can be established through computer simulations. An intuitive argument also appears. Note that Eqn. (4.7) does not re- quire the covariance between two non-null Ti’s. The realization of the fact that the t-statistic is a “standardized” statistic helps understand these covariance—correlation relationships. Observation 1. If genes i and i’ both are null, then u COV (Ti? Tzl) % COI‘ (Xi? Xi’) :5 (4.12) This observation maybe intuitive to the reader from Eqn. (4.1) itself, or it is easily verified through a computer simulation. Efron (2007a), Owen (2005), and the method deveIOped in Chapter 3 all use this observation for their respective conditional FDR calculations. 77 Observation 2. Similarly, if the gene i is null and i’ non-null (or conversely), then 1/ ’ u—2 Cov (T,, T,,) 2 Cor (X,, X,,) (4.13) where Cor(ff'i, X24) denotes the “residual” correlation between gene i and i’. Residual correlation is the correlation in the fraction of gene expression that is unaffected by the treatment of interest. For null genes the fraction is one because by definition the gene expression X,- for a null gene is independent of the state variable L. For a non-null gene, the fraction is determined by how strongly that gene is affected by the treatment. Equations (4.12) and (4.13) suggest to use “residual” sample correlations to esti- mate Cov(t,-, tiz): (“22' 55%) (532' 530-) where :iiz-j denotes the (observed) residual corresponding to the ith gene and the jth microarray. The scale factors cancel in the terms 200) and 23—1 so that (00)’ estimating u/(u—2) is not required. These arguments hold for two-state, multi-state, 607! (E,Til) CX , (4.14) and even continuous-state studies provided the test statistics are obtained through simple linear regression. Here, the entries of the residual matrix are the differences between the observed expression values and the corresponding predicted expression values from regression equations X,- = a + bL + e,i = 1,. . . ,m. Note that the two- sample t-statistic can be interpreted as simple linear regression with a binary-valued covariate. The situations where test statistics are not from simple linear regression will be considered in future work. 78 4.2.4 The Final Equation In light of Eqn. (4.14), Eqn. (4.7) takes the practical form 62:1) = t0) — C(10)?)(Bbtm), (4.15) where C is the sample correlation matrix of the residuals. In most cases computing the . . ~ _1 . . ~ _ 1 full matrix inverse (C(00)) IS not necessary and solvmg the term C (00)t(0) through an efficient linear solver reduces the computation considerably. 4.3 Implementation Details This section outlines a self-contained differential analysis algorithm based on the ideas discussed in Section 4.2. Its name tEllipsoid was coined to emphasize the geometric intuition of tracking the center of an hyper-ellipsoid. TEllipsoid takes a gene expression matrix X and assigned biological conditions and provides a specified number, say r, of the most differentially expressed genes. 
In principle, the ranking is based on the set {11;} from Eqn. (4.6). In practice, we rely on the estimates {212‘} from Eqn. (4.15). Two-state implementation begins by re-indexing the genes based on their two- sample t-statistics [Eqn. (4.2)]. Then, based on the ZA, the first c genes are identified as null, as specified in Eqn. (4.3). By default, P0 is set to 50(%). Although the choice 50% is somewhat arbitrary, this fraction has worked well empirically in the data sets tested. A P0 as low as 10(%) is found to enhance the power. Future research may yield more rigorous methods for choosing P0. In the two-state implementation, in order to nullify any genuine treatment differ- 79 ences, X is converted to «It: by subtracting each gene’s average response within each treatment group. The residual sample correlation matrix C of «l; is computed sub- sequently. The crucial step is to compute fifll based on Eqn. (4.15). The elements T of (1‘1“)T = [03x1 (fi(1)) ] determine the gene-ranking: A gene with bigger li’Zfl is assigned a higher rank. The first r genes are reported as top r statistical discoveries. The multi-state / continuous-state implementation beings by re-indexing the genes based on their simple linear regression t-statistic. The entries of the residual matrix it" are the differences between the observed expression values and the corresponding predicted expression values from regression equations X,- = a + bL + e,i = 1, . . . , m. 4.3.1 Numerical stability and Computational Complexity Because the number of samples n is often less than the number of genes m, the residual sample correlation matrix turns out to be singular and hence non—invertible. Therefore, we add a very small correction term (= 10-10) to its diagonal entries to make it nonsingular, and in effect, invertible. After this correction, tEllipsoid shows excellent numerical stability. Equation (4.15) involves matrix inversion, which, if performed in a naive way, could be a prohibitive operation, since microarray data sets may have several tens —1 (00) neous linear equations (Cmmx = tan) is much faster than explicitly computing of thousand genes. Indeed, solving the term C t(0) as a system of simulta- C(_0})). In particular, we can employ the Cholesky decomposition to exploit the fact that the matrix C(00) is symmetric and positive definite. MATLAB imple— mentation of tEllipsoid uses the in-built linslove with appropriate settings, which, in turn, uses highly optimized routines of LAPACK (Linear Algebra PACKage— http://www.netlib.org/1apack/). 80 For the Prostate data (used in Section 4.4) with 12625 genes and 102 samples, tEllipsoid, running on a computer with a 2.2 GHz dual-core AMD Opteron processor with 8 GB of RAM and MAT LAB version R2006b, requires just under 40 seconds to report the final gene list. For the same settings, the implementation with explicit matrix inversions takes ~ 10 minutes. 4.3.2 Algorithm for Two-State Studies tEllipsoid: An enhanced gene-ranking for differential gene expression detection Input: X = Labeled m x n gene expression matrix; r = Size of gene list Output: The gene list containing top r differentially expressed genes 1. Calculate two-sample (unpaired) t-statistics: t,- = (573,-;2 — fall/3i 2. Reindex genes such that |t1| g |t2| S S |tm| 3. Gather first c : [0.01mP0l ti’s in a vector t(0); By default P0250(%) 4. Convert X to X by subtracting each gene’s average response within each treat- ment group 5. 
Compute C 2 the (residual) sample correlation matrix of X 6 Find (fi*)T = T (13* )T where G* : t — C C_1 t - cx1 (1) , (1) (1) (10) (00) (0) 7. Determine gene-ranking: Assign a gene with bigger |iZ:‘| a higher rank 8. Report top r genes as statistical discoveries 81 4.3.3 Algorithm for Multi-State and Continuous-State Studies tEllipsoid: An enhanced gene-ranking for differential gene expression detection Input: X = Labeled m x n gene expression matrix; r = Size of gene list Output: The gene list containing top r differentially expressed genes 1. Calculate per gene simple linear regression t-statistics 2. Reindex genes such that |t1| S |t2| g g ltml 3. Gather first c : [0.01mP0] ti’s in a vector t(0); By default P0:5O(%) 4. Obtain the residual matrix X whose entries are the differences between the observed expression values and the corresponding predicted expression values 5. Compute C : the sample correlation matrix of X 6 Find (fi*)T = T (13* )T where ii" = t — G GT1 t ° cxl (1) , (1) (1) (10) (00) (0) 7. Determine gene-ranking: Assign a gene with bigger Iiifl a higher rank 8. Report top r genes as statistical discoveries 4.4 Test Cases To appreciate the increase in statistical power attributable to “exploiting” correla- tion, the performance of tEllipsoid is contrasted with three leading techniques. The first is the raw t-statistic itself and the other two are SAM [Significance Analysis 82 of Microarrays (Tusher et al., 2001)] and EDGE [Extraction and analysis of Dif- ferential Gene Expression (Leek et al., 2006)]. SAM adds a small exchangeability factor so to the pooled sample variance when computing the two-sample t-statistic: dz- 2 (53,-;2 — ii;1)/(si + so); Whereas EDGE is based on a general framework for sharing information across tests (see Storey et al. (2007)). EDGE is reported to show substantial improvement (in terms of statistical power) over five of the leading tech- niques including SAM (Storey et al., 2007). The other four are: (i) t/F—test of Kerr et al. (2000) and Dudoit et al. (2002); (ii) Shrunken t/F—test of Cui et al. (2005); (iii) The empirical Bayes local FDR of Efron et al. (2001); (iv) The a posteriori probabil- ity approach of Lonnstedt and Speed (2002). It should be noted that tEllipsoid can also serve as an additional layer to SAM and EDGE and enhance their power. To determine the performance quality of various techniques, we focus primarily on the empirical FDR in the reported gene list: Empirical FDR 2 NoFP /r, where N oF P 2 the number of false positives. Broadly speaking, smaller the FDR better the technique. 4.4.1 Case I: Real Data With Induced Differences Singh et al. (2002) studied m 2 12625 genes on n 2 102 oligonucleotide microarrays, comparing n1 2 50 healthy males with n2 2 52 prostate cancer patients. The purpose of their study was to identify genes that might anticipate the clinical behavior of Prostate cancer. We downloaded the .CEL files from http://www-genome .wi .mit. edu/MPR/prostate. The software RMAExpress (Irizarry et al., 2003) was used to obtain high quality gene expressions from these .CEL files. We let RMAExpress apply its in-built background adjustment, however, the quantile normalization was skipped. Each gene was represented in the final expression matrix X by the logarithm 83 (base 10) of its expression level. Taking the log is thought to increase normality and stabilize across group standard deviations (Tsai et al., 2003). Algorithm testing required an expression matrix X with the knowledge of truly non-differential genes. 
At the same time, we wanted the inter-gene correlation in X to resemble that in the real microarray data. These two seemingly conflicting requirements were satisfied concurrently by row standardizing a real X. The prostate cancer matrix X was transformed to X by subtracting each gene’s average response within each treatment group, and by normalizing within group sample mean squares. That is, for each group k 6 {1,2}, (1/nk) 23- ii,- 2 0 and (l/nk) 23- X12]- 2 1. Here, the sum runs over corresponding ”k samples only. With this transformation, all genes have equal energy and yet the same within group inter-gene correlation structure as the original X .Note. Normalizing within group sample mean squares to unity is not implemented in the tEllipsoid algorithm. To generate a test data set from i, its 102 columns were randomly divided into groups of 50 (2711) and 52 (2n2). Next mu (md) genes were randomly chosen for up (down) regulation by adding a positive (negative) offset xu (red) to the corresponding entries in group 2. Various choices of (mwmd, mu,a:d) were tested to represent a range of differential analysis scenarios encountered in practice. Two cases were studied. In the first, the proportion of truly differential genes, say p1, was taken to be relatively small: p1 ~ 0.01-0.05. The second case em- ployed a larger p1 ~ 0.1. The former simulates microarray studies seeking genes that distinguish subtypes of cancer, diabetes, etc., whereas the latter resembles studies comparing healthy versus diseased cell activity. Results were obtained using the subroutines samr.r from the package “samr” and statex . r from the package “edge.” Both routines compute their native gene summary 84 statistics which, in turn, can be used to determine top r genes. Case 1 [pl 2 0.025, mu2200, md2100, $1,201, and xd2-0.1]. Figure 4.1(a) shows plots of the F DRs for 40 different data sets with the size of the reported list, r2300. A large value of r coincides with an attempt to extract as many differential expressions as possible, a desired goal especially in microarray studies performed to identify genes that are to be explored further — experimentally or computationally — to gain better understanding of underlying gene networks. Since the differential signal $11,201 and xd2-0.1 is rather weak, recovering a good list is not easy as evident from the results — among all methods only tEllipsoid achieved sufficiently low FDRs to rescue a few X’s. Figure 4.1(b) presents results for 72100. A smaller r would be chosen to identify high-quality class distinguishing features for gene-expression-profiling-based clinical diagnosis and prognosis, where the goal is to build accurate classifiers and predictors. Whereas Singh et a1. (2002) build a classifier around only 16 of 12625 features, they do discuss the need to include as many reliable features as possible. Remarkably, for 37 out of 40 X matrices, tEllipsoid reports gene lists with no false discoveries at all, while the other techniques fail to obtain a single gene list with the FDR < 0.5. Case 2a [pl 2 0.1, mu2600, md2600, xu20.02, and xd2-0.02]. In this set of experiments, p1 is increased, but the differential signal is reduced. This situation also proves to be challenging for the existing techniques. However, tEllipsoid provides the FDR of ~ 0.5 for r21200, and, again for 72300, while it reports most gene lists with no false discoveries at all. Case 2b [p1 20.1, mu2600, md2600, $1,201, and xd2-0.1]. 
This subcase is de- signed to assess the effects of small sample sizes on performance. n1 and n2 are both reduced to 20. We randomly chose 20 columns per group from the original prostate 85 1 - D D [I Cl U E El Ct] El U Cl D D 0.5 [3 D a U D D D :1 13 E] [II CD D D DC! [:1 1:1 :1 DD :1 U :1 O . i . 0 10 20 30 40 Data Index (a) 1 - I 5', If “‘ o 0.5- " LI. 0 1:1 1:1 0 "ac-$1351 a 20 30 40 Data Index (b) Figure 4.1: FDRs for Case 1. The number of truly differential gene is 300. Panel (a) r2300; Panel (b) r2100. “Exploiting correlation” considerably enhances the statisti- cal power. Square (El) marker 2 tEllipsoid. Lines: solid 2 raw t—statistic; dotted 2 SAM; dashed 2 EDGE. 86 El I E 0.542111%]thqu QEDUDWJ O0 10 2'0 30 40 Data Index (a) 1 L: D 0.5- Cl OWE] W 0 10 20 30 40 Data Index 0)) Figure 4.2: F DRs for Case 2a. The number of truly differential gene is 1200. Panel (a) r21200; Panel (b) r2300. “Exploiting correlation” appreciably enhances the statistical power. Square (CI) marker 2 tEllipsoid. Lines: solid 2 raw t-statistic; dotted 2 SAM; dashed 2 EDGE. 87 cancer X, and then applied the data generation process (including row standardiza- tion) detailed in Subsection 4.4.1. Reduction in the number of samples is compensated by increase in the differential signal. The FDRs for tEllipsoid, Fig. 4.3, are excellent suggesting that tEllipsoid increases power of small sample data sets too. 4.4.2 Case ll: Simulated Data Before devising the prostate data test setup, tEllipsoid was tested on several simulated data sets. Below we discuss some simulation results that shed further light on the small sample behavior. Let us denote by X“) the ith column of a simulated expression matrix X. We assume that the random vector X“) is multivariate Gaussian with mean 0 and covari- ance matrix W. Each such column represents m23226 genes with a covariance matrix W that introduces roughly the same amount of correlation as found in the BRCA data of Hedenfalk et al. (2001). We choose mu 2 50, md 2 50, ran 2 1, 1rd 2 --1,n1 2 10, and n2 2 10. Figure 4.4 shows plots of the F DRs for r250 and r2100. Table 4.1 shows results for some data point from Fig. 4.4(b). Shown are the top 100 values of it: and each corresponding original t,- with concomitant rank. With smaller 71., preeminence of tEllipsoid with respect to existing techniques scales down a bit. Nev- ertheless, for r250 case, for 25 out of 40 simulated X realizations, tEllipsoid achieves a low FDR of ~ 0.1 or less. Interestingly, with a smaller n, SAM outperforms the other two techniques. This is not entirely surprising as a smaller n can make the noise in the per gene pooled vari- ance s,- (and possibly the equivalent quantity in the EDGE algorithm) more promi- nent. Nevertheless, SAM does mitigate this issue in some measure by using the exchangeability factor so to adjust the effective pooled variance (Tusher et al., 2001). 
88 1—50 51-100 11,: rank 11’; t,- t,- rank it; rank it”; t,- t,- rank 1 4.22 5.87 1 51 2.18 3.45 23 2 -4.17 -555 2 52 2.17 3.05 42 3 —3.93 -4.26 5 53 2.16 2.80 82 4 -3.74 -4.12 7 54 -215 -257 122 5 -3.58 -4.49 4 55 2.15 1.96 357 6 -349 -334 28 56 -2.14 -147 751 7 -345 -425 6 57 2.13 2.25 229 8 3.35 3.87 10 58 2.13 1.77 486 9 -333 -320 35 59 2.13 1.44 785 10 3.33 3.77 13 60 2.12 2.14 273 11 3.25 3.42 25 61 -211 -1.48 744 12 -3.16 -2.18 260 62 2.10 1.80 453 13 3.14 4.54 3 63 2.09 2.60 114 14 3.10 2.87 65 64 -209 -205 312 15 -3.08 -354 17 65 2.09 2.70 96 16 -3.07 -2.80 80 66 2.09 2.23 237 17 3.06 3.49 20 67 2.08 2.34 188 18 3.02 2.29 213 68 -2.08 -224 232 19 -299 -334 27 69 -2.06 -253 130 20 -293 -3.13 38 70 -204 -211 283 21 -292 -292 57 71 -204 -295 54 22 2.86 3.26 31 72 -2.o3 -3.08 40 23 -2.83 -2.82 74 73 -202 -230 210 24 2.82 2.37 180 74 -2.01 -3.67 15 25 -2.81 -213 276 75 2.00 2.62 109 26 2.81 3.48 21 76 -1.98 -2.38 171 27 -279 -301 47 77 1.98 1.43 795 28 2.70 2.87 64 78 1.96 1.69 549 29 -2.66 -315 37 79 -1.95 -1.47 746 30 -2.58 -3.85 11 80 1.95 1.95 361 31 —2.56 -2.84 71 81 1.95 2.81 77 32 -255 -1.72 524 82 -194 -141 813 33 -254 -2.63 106 83 1.94 3.40 26 34 -254 -2.69 98 84 1.94 1.30 948 35 2.53 2.30 209 85 -1.94 -3.27 30 36 2.48 2.45 148 86 -1.93 -1.11 1190 37 -247 -229 212 87 -193 ~1.37 872 38 -2.46 -321 33 88 -1.93 -3.44 24 39 -243 -244 154 , 89 -1.92 -3.07 41 40 2.43 2.71 94 90 -1.92 4.50 726 41 2.40 2.86 66 91 1.90 3.62 16 42 -234 -2.60 115 92 .1.90 -2.82 75 43 -234 -2.98 50 93 1.89 1.25 1007 44 -233 -3.80 12 94 1.87 3.89 8 45 -232 -2.06 306 95 -l.86 -3.49 19 46 2.29 1.81 444 96 -l.86 -2.08 300 47 -227 -1.17 1110 97 1.85 1.20 1074 48 2.26 1.97 347 98 -1.83 -290 60 49 2.24 3.75 14 99 1.83 1.39 833 50 2.20 3.88 9 100 -132 -195 367 Table 4.1: TEllipsoid in action with Top 100 it: ’8. Corresponding ti’s and their rank are also shown. TEllipsoid 2 22 NoFPs; raw t-statistics 2 68 NoFPs. Truly null genes are printed in bold—sans typeface. 89 0 1’0 2‘0 30 40 Data Index (a) D Cl E] El 0 1 0 29 30 40 Data Index 0)) Figure 4.3: F DRs for Case 2b. Panel (a) r21200; Panel (b) r2300. The sample size is smaller than that in Cases 1 & 2a and yet “Exploiting correlation” has apparent benefits. Square (El) marker 2 tEllipsoid. Lines: solid 2 raw t-statistic; dotted 2 SAM; dashed 2 EDGE. 90 1 1 :1: ~ . , ‘ . E 0.5”}!04 . “”3 ,s-JH-N I“! " ll 'r‘b’, Eb ' 4 ”an? 1. 3100000 50 Duct! U CF Chm 0 D 9C.” D , O 10 29 30 40 Data Index (a) 1 2 3O 29 Data Index 0)) Figure 4.4: F DRs for simulated data. Panel (a) r2100; Panel (b) 7250. (Small) sample sizes: n1210, n2210. Yet, “Exploiting correlation” considerably enhances the statistical power. Square (El) marker 2 tEllipsoid. Lines: solid 2 raw t-statistic; dotted 2 SAM; dashed 2 EDGE. 91 4.5 Discussion By allowing researchers to examine the simultaneous expressions of enormous numbers of genes, microarrays promised to revolutionize the understanding of complex diseases and usher in an era of personalized medicine. However, the shift in perception of that promise is palpable in the literature. A 1999 Nature Genetics article (Lander, 1999) is entitled “Array of hope," but a 2005 Nature Reviews article (Frantz, 2005) is entitled “An array of problems." It is not unusual for impacts of new technologies to be overestimated when first deployed, then to have the expectations moderated as the technologies reveal new complexities in the problems they are designed to solve. 
In the study of microarray data, the need for exceeding care in the design and regularization of experiments and data collection are understood to be critical, but the biggest hindrance to progress has been the data interpretation. In particular, the biggest challenge seems to be the treatment of intrinsic inter-gene correlation. In most microarray data there are at least three vital resources: (i) identifiability (ii) immense parallel structure, and (iii) inter-gene correlation itself. In this light, tEllipsoid can be viewed as exploiting more than correlation as a means of sharing information across tests, as it also involves identifiability. A crucial step in formulating tEllipsoid was the comprehension of the effects of inter-gene correlation on Cov(Tz-, Til). In light of Observations 1 and 2, the choice of the Mahalanobis distance was intuitive, as it is already known to give computationally attractive solutions through the matrix inversion lemma. Limited time and resources—and perhaps also the necessity for scientific focus— often require biomedical researchers to work on only a small number of “hot (gene) prospects.” Even under such highly conservative conditions, however, misleading results can occur, as evident in the results of Figs. 4.1—4.4. For all their careful 92 develOpment and statistical power, even state-of—the—art tools that do not account for correlation can report spurious gene lists. The extra statistical power available by exploiting inter-gene correlation promises to further guard against anomalous results that can have serious consequences for the trajectory of a study of gene function, causation, and interaction. In summary, this chapter has reported the development and testing of a novel framework for the detection of differential gene expression. The framework combines the exploitation of inter-gene correlation to share information across tests, with iden- tifiability — the fact that in most microarray data sets, a large proportion of genes can be identified a priori as non-differential. When applied to the widely used two-sample t-statistic approach, this viewpoint yielded an elegant differential analysis technique, which requires as inputs only a gene expression matrix, related two-sample labels, and the size of desired gene-list r. Empirical evidence suggests that exploiting corre- lation substantially enhances statistical power. Usually, with increase in microarray samples, power tends to increase considerably, but, even for small sample sizes, the performance improvement is noticeable. 93 Chapter 5 Concluding Remarks 5.1 Summary and Concluding Remarks This work has investigated the effects of inter-gene dependency on statistical methods for differential gene analysis. Differentially expressed genes are pivotal in understand- ing highly intricate genotype—phenotype relations and devising appropriate therapeu- tic interventions. The effort focused on understanding, correcting, and exploiting the effects of second-order inter-gene dependency within the large-scale significance test- ing formulation for differential gene analysis. It was shown that combining inter-gene dependency with gene expression purity yields more accurate and powerful statistical inferences yielding a more accurate list of differentially expressed genes. The main statistical theme of this work has been to draw second-order conditional inferences based on cases that are theoretically more likely to be null. 
Limited time and resources, and perhaps also the necessity for scientific focus, often require biomedical researchers to work on only a small number of "hot (gene) prospects." Even under such highly conservative conditions, however, misleading results can occur, as is evident in the results of Figs. 4.1–4.4. For all their careful development and statistical power, even state-of-the-art tools that do not account for correlation can report spurious gene lists. The extra statistical power available by exploiting inter-gene correlation promises to further guard against anomalous results that can have serious consequences for the trajectory of a study of gene function, causation, and interaction.

In summary, this chapter has reported the development and testing of a novel framework for the detection of differential gene expression. The framework combines the exploitation of inter-gene correlation, used to share information across tests, with identifiability: the fact that in most microarray data sets a large proportion of genes can be identified a priori as non-differential. Applied to the widely used two-sample t-statistic, this viewpoint yielded an elegant differential analysis technique that requires as inputs only a gene expression matrix, the associated two-sample labels, and the size r of the desired gene list. Empirical evidence suggests that exploiting correlation substantially enhances statistical power. Power naturally grows with the number of microarray samples, but the improvement is noticeable even for small sample sizes.

Chapter 5

Concluding Remarks

5.1 Summary and Concluding Remarks

This work has investigated the effects of inter-gene dependency on statistical methods for differential gene analysis. Differentially expressed genes are pivotal in understanding highly intricate genotype-phenotype relations and in devising appropriate therapeutic interventions. The effort focused on understanding, correcting, and exploiting the effects of second-order inter-gene dependency within the large-scale significance testing formulation of differential gene analysis. It was shown that combining inter-gene dependency with gene expression purity yields more accurate and more powerful statistical inferences and, in turn, a more reliable list of differentially expressed genes. The main statistical theme of this work has been to draw second-order conditional inferences based on the cases that are theoretically more likely to be null.

This research resulted in novel ways of combining inter-gene correlation with purity. A method for mitigating the unfavorable effects of correlation on false discovery rate calculations was developed, and a framework for exploiting the beneficial effects of correlation to enhance gene-ranking was constructed. The findings are general in that they apply to any test statistics obtained through simple linear regression.

In particular, a novel approach to the false discovery calculations was explored. The approach produced an inferential technique capable of handling exceedingly small sample sizes and of reporting the entire distribution of a random variable that models the number of false discoveries. The technique first summarizes the effect of millions of pairwise correlation coefficients in a single parameter, then explicitly incorporates this parameter into the inference of the number of false discoveries.
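As a rough illustration of such a single-number summary, the sketch below computes the root-mean-square of sampled pairwise gene correlations, in the spirit of the RMS correlation of Efron (2007a). It is a simplified stand-in: the method of Chapter 3 works through estimated correlation densities and a maximum entropy fit, and the function name and sampling scheme here are assumptions made for illustration only:

    import numpy as np

    def rms_correlation(X, n_pairs=100000, seed=0):
        # Summarize inter-gene correlation by the RMS of Pearson
        # correlations over randomly sampled gene pairs.
        # X is a genes-by-samples expression matrix.
        rng = np.random.default_rng(seed)
        p = X.shape[0]
        i = rng.integers(0, p, n_pairs)
        j = rng.integers(0, p, n_pairs)
        keep = i != j                           # discard self-pairs
        Z = X - X.mean(axis=1, keepdims=True)
        Z /= Z.std(axis=1, keepdims=True)       # standardize each gene
        corrs = (Z[i[keep]] * Z[j[keep]]).mean(axis=1)
        return float(np.sqrt(np.mean(corrs ** 2)))

Sampling pairs avoids forming all p(p-1)/2 coefficients, which for p of around 20,000 genes would mean roughly 2 x 10^8 correlations.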
The possibility of exploiting correlation to improve gene-ranking was also explored. This effort culminated in a powerful framework employing statistical distance measures that account for the effect of correlation on the joint distribution of the test statistics. The extra statistical power made available by exploiting inter-gene correlation promises to further guard against anomalous results that can have serious consequences for the trajectory of a study of gene function, causation, and interaction.

The urgency for more accurate differential gene analysis methods motivated this research, but the principal contributions are firmly rooted in the fundamental theory and methods of large-scale significance testing and enjoy broader applicability. Large-scale significance testing has become a key statistical tool for exploratory data analysis in single nucleotide polymorphism detection and other high-throughput genomic research endeavors. In most large-scale significance testing applications, a proper treatment of dependency appears necessary for drawing meaningful conclusions.

5.2 Future Work

Research topics for future work concerning the FDR estimation method of Chapter 3 are listed below:

1. Further investigation of Cor(V, C) across a range of microarray data sets.

2. Evaluation of the effect of the center-area boundary δ0 on Cor(V, C) and on overall accuracy.

3. Extension of the single-parameter correlation model to a multi-parameter correlation model to increase modeling accuracy.

4. Exploration of connections between the maxent exponential density and the exponential parametric form of the "empirical null density" as in Efron (2004).

Research topics for future work concerning the gene-ranking framework of Chapter 4 are listed below:

1. Investigation of the role of P0 in overall performance.

2. Empirical and/or theoretical understanding of the relation between the residual correlation and the gain in statistical power; the eigenvalue spread of the residual correlation matrix seems crucial here.

3. Empirical and/or theoretical understanding of the relation between the number of samples and the gain in statistical power.

4. Development of false discovery rate estimates for the proposed gene-ranking.

5. Incorporation of prior information, such as "gene grouping" or a priori non-null identity.

Bibliography

B. Alberts, D. Bray, J. Lewis, M. Raff, K. Roberts, and J.D. Watson. Molecular biology of the cell. Garland Science, 2002.

A. Almudevar, L.B. Klebanov, X. Qiu, P. Salzman, and A.Y. Yakovlev. Utility of correlation measures in analysis of gene expression. NeuroRx, 3(3):384–395, 2006.

P. Baldi and S. Brunak. Bioinformatics: the machine learning approach. MIT Press, Cambridge, MA, 2001.

J. Barnard, R. McCulloch, and X.L. Meng. Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica, 10(4):1281–1311, 2000.

Y. Benjamini. Comment: Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23(1):23–28, 2008.

Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1):289–300, 1995.

Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4):1165–1188, 2001.

J.O. Berger. Could Fisher, Jeffreys and Neyman have agreed on testing? Statistical Science, 18(1):1–32, 2003.

MAQC Consortium. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology, 24:1151–1161, 2006.

B.M. Bolstad, R.A. Irizarry, M. Åstrand, and T.P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185–193, 2003.

T.T. Cai. Comment: Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23(1):29–33, 2008.

K.R. Coombes, J. Wang, and K.A. Baggerly. The tail-rank statistic for finding biomarkers from microarray data, with application to prostate cancer. Preprint at http://bioinformatics.mdanderson.org/TailRank, 2008.

T.M. Cover and J.A. Thomas. Elements of information theory. Wiley, New York, 1991.

X. Cui, J.T.G. Hwang, J. Qiu, N.J. Blades, and G.A. Churchill. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics, 6(1):59–75, 2005.

P.A. Devijver and J. Kittler. Pattern recognition: a statistical approach. Prentice Hall International, 1982.

J.R. Deller, Jr., J.H.L. Hansen, and J.G. Proakis. Discrete-time processing of speech signals. IEEE Press, 2000.

S. Dudoit, Y.H. Yang, M.J. Callow, and T.P. Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12(1):111–139, 2002.

B. Efron. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association, 99(465):96–105, 2004.

B. Efron. Microarrays, empirical Bayes, and the two-groups model. Preprint, Dept. of Statistics, Stanford University, 2006.

B. Efron. Correlation and large-scale simultaneous significance testing. Journal of the American Statistical Association, 102(477):93–103, 2007a.

B. Efron. Size, power and false discovery rates. Annals of Statistics, 35:1351–1377, 2007b.

B. Efron. Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23(1):1–22, 2008.

B. Efron, R. Tibshirani, J.D. Storey, and V. Tusher. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96(456):1151–1161, 2001.

M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868, 1998.

S. Frantz. An array of problems. Nature Reviews Drug Discovery, 4:362–363, 2005.

C. Genovese and L. Wasserman. A stochastic process approach to false discovery control. Annals of Statistics, 32(3):1035–1061, 2004.

G.H. Golub and C.F. Van Loan.
Matrix computations. Johns Hopkins University Press, 1996.

G.W. Hatfield, S. Hung, and P. Baldi. Differential analysis of DNA microarray gene expression data. Molecular Microbiology, 47(4):871–877, 2003.

I. Hedenfalk, D. Duggan, Y. Chen, et al. Gene-expression profiles in hereditary breast cancer. New England Journal of Medicine, 344(8):539–548, 2001.

N.J. Higham. Computing the nearest correlation matrix – a problem from finance. IMA Journal of Numerical Analysis, 22(3):329–343, 2002.

L. Hunter. Molecular biology for computer scientists. In Artificial Intelligence and Molecular Biology, Cambridge, 1993.

J.P.A. Ioannidis. Why most published research findings are false. PLoS Medicine, 2(8):e124, 2005.

R.A. Irizarry, B.M. Bolstad, F. Collin, L.M. Cope, B. Hobbs, and T.P. Speed. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31(4):e15, 2003.

E.T. Jaynes. Probability theory: the logic of science. Cambridge University Press, 2003.

P.A. Jones and S.B. Baylin. The fundamental role of epigenetic events in cancer. Nature Reviews Genetics, 3(6):415–428, 2002.

M.K. Kerr, M. Martin, and G.A. Churchill. Analysis of variance for gene expression microarray data. Journal of Computational Biology, 7(6):819–837, 2000.

K.I. Kim and M.A. van de Wiel. Effects of dependence in high-dimensional multiple testing problems. BMC Bioinformatics, 9:114, 2008.

L. Klebanov and A. Yakovlev. Diverse correlation structures in microarray gene expression data and their utility in improving statistical inference. Annals of Applied Statistics, 1(2):538–559, 2007.

L. Klebanov, C. Jordan, and A. Yakovlev. A new type of stochastic dependence revealed in gene expression data. Statistical Applications in Genetics and Molecular Biology, 5(1):Article 7, 2006.

E.S. Lander. Array of hope. Nature Genetics, 21(1):3–4, 1999.

M. Langaas, E. Ferkingstad, and B.H. Lindqvist. Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society, Series B, 67:555–572, 2005.

M.L.T. Lee, F.C. Kuo, G.A. Whitmore, and J. Sklar. Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proceedings of the National Academy of Sciences, 97(18):9834–9839, 2000.

J.T. Leek and J.D. Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3(9):e161, 2007.

J.T. Leek, E. Monsen, A.R. Dabney, and J.D. Storey. EDGE: extraction and analysis of differential gene expression. Bioinformatics, 22(4):507–508, 2006.

E.L. Lehmann and J.P. Romano. Generalizations of the familywise error rate. Annals of Statistics, 33(3):1138–1154, 2005.

E.L. Lehmann and J.P. Romano. Testing statistical hypotheses. Springer, 2006.

F. Li and G.D. Stormo. Selection of optimal DNA oligos for gene expression arrays. Bioinformatics, 17(11):1067–1076, 2001.

J.C. Liechty, M.W. Liechty, and P. Müller. Bayesian correlation estimation. Biometrika, 91(1):1–14, 2004.

D.J. Lockhart, H. Dong, M.C. Byrne, M.T. Follettie, M.V. Gallo, M.S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Norton, et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnology, 14:1675–1680, 1996.

I. Lönnstedt and T. Speed. Replicated microarray data. Statistica Sinica, 12(1):31–46, 2002.

P.C. Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India, 2(1):49–55, 1936.

G. Marsaglia and I. Olkin.
Generating correlation matrices. SIAM Journal on Scientific and Statistical Computing, 5:470–475, 1984.

C.N. Morris. Comment: Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23(1):34–40, 2008.

M.A. Newton, C.M. Kendziorski, C.S. Richmond, F.R. Blattner, and K.W. Tsui. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology, 8(1):37–52, 2001.

I. Olkin. Note on 'The Jacobians of certain matrix transformations useful in multivariate analysis'. Biometrika, 40(1–2):43, 1953.

A.B. Owen. Variance of the number of false discoveries. Journal of the Royal Statistical Society, Series B, 67:411–426, 2005.

T.A. Patterson, E.K. Lobenhofer, S.B. Fulmer-Smentek, et al. Performance comparison of one-color and two-color platforms within the Microarray Quality Control (MAQC) project. Nature Biotechnology, 24:1140–1150, 2006.

Y. Pawitan, K.R.K. Murthy, S. Michiels, and A. Ploner. Bias in the estimation of false discovery rate in microarray studies. Bioinformatics, 21(20):3865–3872, 2005.

E.F. Petricoin, J.L. Hackett, L.J. Lesko, et al. Medical applications of microarray technologies: a regulatory science perspective. Nature Genetics, 32(Suppl):474–479, 2002.

H. Qi and D. Sun. A quadratically convergent Newton method for computing the nearest correlation matrix. SIAM Journal on Matrix Analysis and Applications, 28(2):360–385, 2006.

X. Qiu and A. Yakovlev. Some comments on instability of false discovery rate estimation. Journal of Bioinformatics and Computational Biology, 4(5):1057–1068, 2006.

X. Qiu, A.I. Brooks, L. Klebanov, and A. Yakovlev. The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics, 6:120, 2005a.

X. Qiu, L. Klebanov, and A. Yakovlev. Correlation between gene expression levels and limitations of the empirical Bayes methodology for finding differentially expressed genes. Statistical Applications in Genetics and Molecular Biology, 4(1):Article 34, 2005b.

X. Qiu, Y. Xiao, A. Gordon, and A. Yakovlev. Assessing stability of gene selection in microarray data analysis. BMC Bioinformatics, 7:50, 2006.

J. Quackenbush. Microarray data normalization and transformation. Nature Genetics, 32(Suppl):496–501, 2002.

A. Reiner-Benaim. FDR control by the BH procedure for two-sided correlated tests with implications to gene expression data analysis. Biometrical Journal, 49(1):107–126, 2007.

K. Rice and D. Spiegelhalter. Comment: Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23(1):41–44, 2008.

M. Schena, D. Shalon, R.W. Davis, and P.O. Brown. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270(5235):467–470, 1995.

D. Singh, P.G. Febbo, K. Ross, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203–209, 2002.

J.D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64(3):479–498, 2002a.

J.D. Storey. False discovery rates: theory and applications to DNA microarrays. Ph.D. thesis, Stanford University, 2002b.

J.D. Storey, J.E. Taylor, and D. Siegmund. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society, Series B, 66(1):187–205, 2004.

J.D. Storey, J.Y. Dai, and J.T. Leek.
The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments. Biostatistics, 8(2):414–432, 2007.

A. Subramanian, P. Tamayo, V.K. Mootha, S. Mukherjee, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43):15545–15550, 2005.

R. Tibshirani and L. Wasserman. Correlation-sharing for detection of differential gene expression. arXiv preprint math.ST/0608061, 2006.

C.A. Tsai, Y.J. Chen, and J.J. Chen. Testing for differentially expressed genes with microarray data. Nucleic Acids Research, 31(9):e52, 2003.

V. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to ionizing radiation response. Proceedings of the National Academy of Sciences, 98:5116–5121, 2001.

M.J. van der Laan, S. Dudoit, and K.S. Pollard. Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives. Statistical Applications in Genetics and Molecular Biology, 3(1):Article 15, 2004.

A.B. van't Wout, G.K. Lehrman, S.A. Mikheeva, et al. Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4+ T-cell lines. Journal of Virology, 77(2):1392–1402, 2003.

J.D. Watson, T.A. Baker, S.P. Bell, A. Gann, and R. Losick. Molecular biology of the gene (5th edition). Benjamin Cummings, 2004.

Y. Woo, J. Affourtit, S. Daigle, et al. A comparison of cDNA, oligonucleotide, and Affymetrix GeneChip gene expression microarray platforms. Journal of Biomolecular Techniques, 15(4):276–284, 2004.

Y. Zheng and M. Pepe. A practical multifaceted approach to selecting differentially expressed genes. Cancer Informatics, 2:113–122, 2007.

K.H. Zou and W.J. Hall. On estimating a transformation correlation coefficient. Journal of Applied Statistics, 29:745–760, 2002.