IMPLEMENTING VALIDATION PROCEDURES TO STUDY THE PROPERTIES
OF WIDELY USED STATISTICAL ANALYSIS METHODS OF RNA
SEQUENCING EXPERIMENTS
By
Pablo Daniel Reeb
A DISSERTATION
Submitted
to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Fisheries and Wildlife
-
Doctor of Philosophy
2015
ABSTRACT
IMPLEMENTING VALIDATION PROCEDURES TO STUDY THE PROPERTIES
OF WIDELY USED STATISTICAL ANALYSIS METHODS
OF RNA
SEQUENCING EXPERIMENTS
By
Pablo Daniel Reeb
RNA sequencing (RNA
-
seq) technology is being rapidly adopted as the platform of choice
for tran
scriptomic studies. Although its
major focus has been gene expression profiling, other
interests, such as single nucleotide profiling, are emerging as the technology evolves. In addition,
applications are being rapidly expanding in model and nonmodel organisms.
T
he
overall objective
of t
his dissertation was to propose and implement validation procedures based on experimental
data to estimate the properties of widely used statistical analysis methods of RNA
-
seq
experiments.
The first study evaluated differential expression methods based o
n count data distribution
and Gaussian transformed models. Parametric simulations and plasmode datasets derived from
RNA
-
seq experiments were generated to compare the statistical models in terms of type I error
rate, power and null p
-
value distribution. Ov
erall, Gaussian models presented p
-
values closer to
nominal significance levels and a p
-
value distribution closer to the expected uniform distribution.
Researchers using models with these properties
will
have less false positives when inferring
differentia
lly expresses transcripts. Additionally, the use of Gaussian transformations enables the
applications of all the well
-
known theory of linear models for instance to account for complex
experimental designs.
The second study assessed the properties of dissim
ilarity measures for agglomerative
hierarchical cluster analysis. The validation comprise
d
dissimilarity measures based on Euclidean
distance, correlation
-
based dissimilarities and count data
-
based dissimilarities. I used plasmode
datasets generated from t
wo RNA
-
seq experiments with different
sample structures and
simulated
scenarios based on informative and non
informative transcripts. In addition
, I proposed
two measures, agreement and consistency, for comparing dendrograms. Dissimilarity measures
based o
n non
-
transformed data resulted in dendrograms that did not resemble the expected
sample structure, whereas dissimilarities calculated with appropriate transformations for count
data were consistent in reproducing the expected dendrograms under different s
cenarios
.
The third study compared variant calling programs
that used reference
genotypes
obtained from a SNP
chip. The evaluation included multiple samples and multiple tissue datasets
and considered the effect of per base read depth. Sensitivity
and false discovery rates were
computed separately for heterozygous and homozygous sites in order to provide information for
potentially different applications such as allele
-
specific expression or RNA
-
editing. Additionally, I
explored the use of SNP calle
d from RNA
-
seq to compute relationship matrices in population
studies. Heterozygous sites with more than 10 reads per base and per sample were called with
high sensitivity and low false discovery rates. Homozygous sites were called with higher sensitivity
than heterozygous irrespective of depth but presented higher false discovery rates. A relationship
matrix based on accurate genotypes obtained with RNA
-
seq presented a high correlation with a
relation matrix based on genotypes from a SNP
chip.
In conclusion
, u
sing synthetic and reference datasets, I compared statistical models to
perform differential expression analysis, sampled
-
base hierarchical cluster analysis, and variant
calling and genotyping
. This validation framework can be extended to
evaluate other
methods of
RNA
-
seq analysis as well as to evaluate
the periodic
publication of new and updated analysis
methods.
Choosing the most appropriate software can help
researcher
s
to obtained better results
and to achieve the goals of the
ir
investigation
s.
Copyright by
PABLO DANIEL REEB
2015
v
To Romina, Sophia and Olivia
.
vi
ACKNOWLEDGMENTS
I dedicate this dissertation to my lovely family: Romina, Sophia and Olivia. Saying thank you will
never be sufficient. Romina and I simultaneously embraced the challenges of building a family
and moving abroad to continue my education. Both ventures have been tough but both were worth
the efforts. We can say that one challenge has arrived to the end af
ter
writing
five chapters. The
remaining challenge
has two little sprouts
, Sophia and Olivia,
and I hope
we will never finish
writing that
story. I love you!
An especial recognition goes to my advisor Dr. Juan P. Steibel for being extremely patient and
su
pporting throughout the whole program. Certainly, he has been more than a mentor while giving
guidelines and advice, and offering friendship in the right moments.
I would like to say thank you to the members of my guidance committee Drs
.
James Bence, C.
Ti
tus Brown, Weiming Li, and Robert Tempelman. They all helped me to build a comprehensive
academic background from their specific disciplines.
I shared a productive and friendly academic environment at Michigan State University, especially
at the Animal Bre
eding and Genetics Group. I am taking with me an enriched experience of
collaborative work with graduate students, professors, visiting scholars and post
-
docs.
Finally, I want to thank my large family and friends. All your short or long messag
es, phone calls,
and hugs given virtually from distant places or in person in Michigan were essential for me and
my family.
vii
TABLE OF CONTENTS
LIST OF TABLES
................................
................................
................................
.......................
ix
LIST OF FIGURES
................................
................................
................................
.....................
x
Chapter 1 Introduction
................................
................................
................................
................
1
Chapt
er 2 Evaluating statistical analysis models for RNA sequencing experiments
..................
11
2.1
Introduction
................................
................................
................................
.................
12
2.2
Material and Methods
................................
................................
................................
.
15
2.2.1 Simulations
................................
................................
................................
...............
15
2.2.2 Plas
modes
................................
................................
................................
................
18
2.2.2.1 Null dataset (Cheung)
................................
................................
.........................
18
2.2.2.2 Differentially expressed dataset (Bottomly)
................................
.........................
20
2.2.3 Comparison of alternative
analysis tools for evaluating differential expression
..........
22
2.2.3.1 Filtering and normalization
................................
................................
..................
22
2.2.3.2 Differential expression analysis
................................
................................
..........
23
2.2.3.3 Multiple comparisons
................................
................................
..........................
23
2.2.4 Evaluating and comparing results from alternative analysis packages
......................
24
2.3
Results
................................
................................
................................
........................
26
2.3.1 Simulation
s
................................
................................
................................
...............
26
2.3.2 Plasmodes
................................
................................
................................
................
28
2.3.2.1 Null dataset (Cheung)
................................
................................
.........................
28
2.3.2.2 DE dataset (Bottomly)
................................
................................
.........................
29
2.4
Discussion and conclusion
................................
................................
..........................
31
APPENDIX
................................
................................
................................
............................
35
Chapter 3 Assessing dis
similarity measures for sample
-
based hierarchical clustering of RNA
sequencing data using plasmode datasets
................................
................................
................
41
3.1
Introduction
................................
................................
................................
.................
42
3.2
Material and Methods
................................
................................
................................
.
44
3.2.1 Datasets
................................
................................
................................
....................
44
3.2.2 Plasmodes
................................
................................
................................
................
44
3.2.2.1 Plasmodes from Bottomly dataset
................................
................................
......
45
3.2.2.2 Plasmodes from MSUPRP dataset
................................
................................
.....
47
3.2.3 Clustering
................................
................................
................................
..................
49
3.2.4 Cluster validation using results from plasmodes
................................
........................
50
3.3
Results
................................
................................
................................
........................
51
3.3.1 Bottomly
................................
................................
................................
....................
51
3.3.2 MSUPRP
................................
................................
................................
..................
57
3.4
Discussion
................................
................................
................................
..................
60
3.5
Conclusion
................................
................................
................................
..................
66
viii
APPENDIX
................................
................................
................................
............................
68
Chapter 4 Assessing genotype call accuracy from RNA sequencing data
................................
.
75
4.1
Introduction
................................
................................
................................
.................
76
4.2
Material and Methods
................................
................................
................................
.
78
4.2.1 Resource population
................................
................................
................................
.
78
4.2.2 Variant calling
................................
................................
................................
...........
79
4.2.3 Assessing genotype call accuracy
................................
................................
.............
80
4.2.4 Equivalence between relationsh
ip matrices and correlation with SNP chip data
........
82
4.3
Results
................................
................................
................................
........................
83
4.3.1 Analysis across variant calling programs for 24 animals
................................
............
83
4.3.2 Comparison of genotype calls from DNA
-
seq and RNA
-
seq from multiple tissues
.....
85
4.3.3 Equi
valence between relationship matrices
................................
...............................
87
4.4
Discussion
................................
................................
................................
..................
89
Chapter 5 General discussion
................................
................................
................................
...
94
5.1
Introduction
................................
................................
................................
.................
95
5.2
Objectiv
es revisited
................................
................................
................................
.....
95
5.3
Future research directions
................................
................................
........................
101
REFERENCES
................................
................................
................................
.......................
103
ix
LIST OF TABLES
Table 1 Classification rule to compute false and true positive rates
................................
...........
25
Table 2 Consistency for each dissimilarity measure
................................
................................
..
56
Table 3 Reference similarity matrix (S) for MSUPRP plasmodes
................................
..............
72
Table 4 Reference dissimilarity matrix (D) for MSUPRP plasmodes
................................
..........
72
Table 5 Cophenetic matrix of the reference dendrogram for MSUPRP plasmod
es.
..................
73
Table 6 Correlation between dissimilarity measures and reference dissimilarity
........................
74
Table 7 Contingency table with classification rule used to compute sensitivity and error rate of
genotype calling
................................
................................
................................
........................
81
x
LIST OF FIGURES
Figure 1 Algorithm used to simulate counts from existing estimates of model parameters
........
17
Figure 2 Multidimensional scaling analysis of experimental datasets
................................
........
19
Figure 3 Algorithm used to generate plasmode datasets with no differentially expressed
transcripts under a model with one classification variable
................................
.........................
20
Figure 4 Algorithm used to generate plasmode datasets with differentially expressed transcripts
under a model with two classification variables (block + treatment)
................................
...........
21
Figure 5 Simulation results from a scenario with three biological replicates
..............................
27
Figure 6 Plasmode results from Cheung dataset:
................................
................................
......
28
Figure 7 Plasmode results from Bottomly dataset:
................................
................................
....
29
-
value results
................................
................................
30
Figure 9 Steps used to process reads to obtain matrix
of counts
................................
..............
36
Figure 10 P
-
value distribution of differential expression analysis performed with edgeR for
Bottomly data using a
model with block and treatment fixed effects
................................
..........
37
Figure 11 P
-
value distribution of non differentially expressed transcripts for a
simulated scenario
with three biological replicates
................................
................................
................................
..
38
Figure 12 P
-
value distribution of non differentially expressed transcript
s for plasmodes
generated from Cheung dataset
................................
................................
................................
39
Figure 13 P
-
value distribution of non differentially expressed transcripts for
plasmodes
generated from Bottomly dataset
................................
................................
..............................
40
Figure 14 Algorithm used to generate plasmodes from Bottomly dataset
................................
..
47
Figure 15 Multidimensional scaling analysis of MSUPRP dataset
................................
.............
48
Figure 16 Typical dendrograms obtained for plasmode datasets from Bottomly experimental
data with two dissimilarity measures under three scenarios
................................
......................
53
Figure 17 Agreement between dissimilarity measures using Bottomly plasmode datasets
........
54
xi
Figure 18 Typical dendrograms obtained for plasmode datasets from MSUPRP experimental
data with two dissimilarity measures
................................
................................
.........................
58
Figure 19 Agreement between dissimilarity measures using MSUPRP plasmode datasets
.....
59
Figure 20 Sequencing layout of 24 barcoded samples
................................
..............................
69
Figure 21 Plasmode generation for MSUPRP dataset
................................
...............................
71
Figure 22 Reference dendrogram for MSUPRP plasmodes
................................
......................
73
Figure 23 Assessing SNP calling accuracy of RNA
-
seq samples of
longisimus dorsi
................
84
Figure 24 Assessing SNP calling accuracy of DNA and RNA
-
seq samples of three tissues from
the same animal
................................
................................
................................
........................
87
Figure 25 Scatterplot of correlation between relationship matrices for the 24 F2 females
.........
89
1
Chapter
1
Introduction
Introduction
2
Transcriptome analysis through next generation sequencing technologies
(Metzker, 2010)
is known as RNA sequencing (RNA
-
seq)
(Wang
et al., 2009)
. This technology has revolutionized
transcriptomics since its first introduction in 2004
(Bennett, 2004)
and has been rapidly applied
to a variety of studies and species
(Ozsolak and Milos, 2011; Marguerat et al., 2010;
Wickramasinghe et al., 2014)
. Gene expression analysis has been the main objective in RNA
-
seq
experiments
(Oshlack et al., 2010)
but other quantitative and qualitative aspects of transcript
biology have been increasing in importance as the technology evolves. These other objectives
include studies for detecting allele
-
specific expression
(Quinn et al., 2014; Piri
nen et al., 2015;
Steibel et al., 2015)
, RNA editing
(Lee et al., 2013; Ramaswami et al., 2013)
,
alternative splicing
(Griffith et al., 2010; Alamancos et al., 2014)
, novel transcripts
(Roberts et al., 2011; Trapnell et
al., 2010)
, and nucleotide variations
(Quinn et al., 2013; Cánovas et al., 2010)
.
A typical RNA
-
seq experiment includes
the following steps
(Marguerat et a
l., 2010;
Oshlack et al., 2010)
. First, a sample is extracted from the tissue of interest,
then
RNA is purified
and submitted to a series of processes known as library preparation. This process comprises
poly
-
A RNA isolation, RNA fragmentation, reverse
transcription to cDNA, adapter ligation, size
selection, and PCR enrichment. The library preparation converts the input RNA into small
fragments of cDNA that are ready to be sequenced. Second, the cDNA libraries are placed in the
sequencing machine and the
fragments are sequenced in parallel in a predefined number of
cycles. At each cycle, fluorescent labeled nucleotides are added and the signals emitted are
recorded and converted to base
-
calls. As a result, the sequencing machine provides a number of
short
reads with length equal to the number of cycles and with read abundance proportional to
the fragment abundance. Third, the sequenced reads are filtered (for quality) and mapped to a
reference genome or transcriptome (either pre
-
existing or assembled from
the sequenced reads
themselves).
3
After reads have been aligned, qualitative and quantitative information can be extracted
from them in order to be applied for different types of analysis. Qualitative information refer
s
to
the study of the sequences of base
s
per se
and can be used, for example, for discovering and
calling single nucleotide polymorphisms (SNPs). Quantitative information, on the other hand, is
based on the characteristic that the number of reads aligning to any given region of the genome
is pr
oportional to the abundance of fragments for that region within the sample
(Mortazavi et al.,
2008)
. These read counts can be summarized and aggregated over biologically meaningful units,
such as exons, transcripts
,
or genes in order to provide a direct measure of expression
(Oshlack
et al., 2010)
. The most common use of this expression profiling is the analysis of differential
expression (DE)
(Oshlack et al., 2010)
. Differential expres
sion analysis pursue
s
the identification
of genes whose transcripts show substantial changes in abundance across experimental
conditions, such as differences between strains, tissues
,
or treated versus untreated individuals.
Finally, both qualitative and q
uantitative information can be combined in studies like allele
-
specific
expression (ASE). In ASE experiments, the goal is to identify genes for which the two alleles in
an individual are expressed at different l
evels. ASE studies, first call
for SNP and
th
en identify
heterozygous sites. For each of the heterozygous sites the number of reads supporting each of
the alleles is counted and a statistic
al
test is performed to evaluate whether the alleles are
expressed in balance.
RNA
-
seq has expanded the field o
f transcriptomics
as no other previous technology (
e.g
.
microarray),
not
only for model but also for non
model organisms in general
(Ekblom and Galindo,
2011)
. Some of the reasons that contribute to this unpr
ecedented expansion in non
model
organisms
,
and for fish and wildlife species
(Qian et al., 2014)
specifically
,
are the cost
-
effectiveness
and the independence of prior genomic knowledge of the speci
es of interest
(Ekblom and Galindo, 2011; Vijay et al., 2013)
. For instance, Salem et al.
(
2012)
used RNA
-
seq
analysis to identify SNPs markers associated with growth
-
rate in rainbow trout. The authors
4
compared two full
-
sib families, one selected for improved growth. Putative SNPs identified by
RNA
-
seq as associated with growth trait were further est
imated by genotyping individuals from
40 families using a designed array platform. Smith et al.
(
2013)
studied the ability of crimson
spotted rainbowfish to adapt to temperature stress. The authors tested for differential exp
ression
across two groups of fish exposed to different conditions using a
de novo
assembled
transcriptome. The genes identified as differentially expressed were related to critical metabolic
pathways for temperature tolerance and provided candidates to ext
end investigations of
population adaptations to climate change. Babbit et al.
(
2012)
used RNA
-
seq to compare gene
expression patterns in the ovarian tissue of juvenile and adult female baboons and were
able to
link the differentially expressed genes to selection occurring in human and chimpanzee. This
information is valuable to study the evolution of gene regulation
in humans, specifically providing
insight into the loci that contributed to shifts in de
velopmental timing and physiology during human
evolution. All the applications of RNA
-
seq aimed at making population
-
wide inferences (e.g:
differential
expression, allele
-
specific expression
),
including the cited ones, use
a sample from
the population in q
uestion, and thus, they rely on statistical analysis
to make inferences
.
Consequently
, the validity of the conclusions depend directly on the validity of the
statistical
methods used in the studies.
Even though a number of statistical analysis methods and
tools have been developed for
various applications of RNA
-
seq experiments
(Chen et al., 2011)
, limitations and challenges still
remain in determining their statistical validity due to potential variation introduced in each of the
steps of the design and execution of an RNA
-
seq study
(Ozsolak and Milos, 2011; Marguerat et
al., 2010)
. A characteristic aspect of RNA
-
seq is that the measure of expression, total number of
reads mapping to a certain region in the reference, is of discrete nature
(Anders and Huber, 2010)
in contrast to continuous intensity measures obtained in microarray analysi
s. These count data
require statistical models that can account for the discrete nature of the data when testing for
5
differential expression or when summarizing massively parallel multivariate expression records
through clustering analysis. A simple Poisso
n model seems appropriate for this type of data when
the experiment includes only technical replicates
(Marioni et al., 2008)
. H
owever, due to extra
sources of variation at the biological level, read counts are always overdispersed and other
models
have been proposed
, for example,
based
on the negative binomial distribution
(Robinson
et al., 2010; Anders and Huber, 2010; Hardcastle and Kelly, 2010; Di et al., 2011)
.
Other
characteristics of RNA
-
seq experiments such as the small sample size and large number of
re
sponses have limited the direct application of
generalized linear mixed models developed for
count data in other disciplines.
Current models based on discrete distributions (i.e. Poisson,
negative binomial) are limited to simple experimental design
s
such a
s pairwise or multiple fixed
factors. Although those designs are extremely useful in most of lab experiments, they cannot
account for random effects and hierarchical structures which are commonly present in ecological
experiments and studies. Thus, it woul
d be valuable to evaluate the
use of transformation in linear
mixed models using realistic datasets. The evaluation should include properties of interest when
inferring results such as the rate of false positives
and statistical power. Specifically
, some a
uthors
(Langmead et al., 2010; Law et al., 2014)
have proposed the use of data transformation to fit
simpler Gaussian linear model analysis, which allow a straightforward implementation of
multilevel models. Thus, there are multiple recomm
ended ways of analyzing RNA
-
seq for
differential expression and clustering whose statistical properties need to be evaluated.
Hierarchical cluster analysis has been the preferred statistical method to represent results
of samples and genes after differenti
al expression analysis
(Liu and Si, 2014)
. Hierarchical
clustering is also a good unsupervised technique to discover subpopulation structure with respect
to a set of multivariate responses
(Legendre and Legendre, 2012)
. Thus, hierarchical clustering
of samples using gene expression data from RNA
-
seq experiments holds great promise for the
study of natural variation in relation
to gene expression in ecological experiments. Hierarchical
6
clustering analysis relies on a measure of similarity/dissimilarity among the features (i.e. genes
,
transcripts or samples) to build a dendrogram. But dissimilarity measures may be greatly affected
by the distribution of the measured variables. Dissimilarity measures for RNA
-
seq derived gene
expression count data are mainly based on logarithmic transformation of raw counts. Additionally,
a specific Poisson dissimilarity measure has been proposed for
RNA
-
seq data
(Witten, 2011)
.
However, only parametric simulations and exemplar data have been used to compare the
performance of these measures for sample clustering. Due to the high dispersion of count data
derived from RNA
-
seq
, a comparison based on experimental
datasets and using a good measure
of consistency of the resulting dendrogram representation could contribute to decide on the best
dissimilarity measure
s
for hierarchical clustering implementations. Another relevant que
stion in
this area is: are transformation
s
used for differential expression analysis also appropriate for
clustering analysis? Finding a dissimi
larity measure that can reliably
represent the sample
structure before fitting any model could be of interest to
conduct
a priori
exploratory data analysis
to be followed up by inferential analyses.
Finally, another inferential problem that deserves study is the use of algorithms for calling
SNP genotyping from RNA
-
seq data. SNP genotyping from RNA
-
seq is an applica
tion of RNA
-
seq experiments
that
has proven to be useful in some studies and we can expect that its use will
increase in the future
(Wickramasinghe et al., 2014; Seeb et al., 2011; Schunter et al., 2013;
Narum et al., 2013)
. Curre
ntly, some questions remain
about the equivalence of using this
technology versus more traditional genotyping chips and DNA
-
seq. For instance, the effe
ct of per
base read depth when calling and genot
yping variants should be studied
in terms of its impact in
sensitivity and specificity for calling SNP from RNA
-
seq data. Moreover, the statistical properties
of called genotypes should be studied separately
for heterozygous and homozygous genotypes
because they are likely to be used for different purposes. Ano
ther important property
of genotypes
derived from RNA
-
seq data is how well they can represent the genetic structure of a population,
7
when compared to mo
re traditional measures of relatedness based on SNP chip genotypes and
on genealogies.
In order to compare and decide among the available statistical methods in each of the
mentioned applications of RNA
-
seq (e.g. differential expression analysis, clusteri
ng, SNP
genotyping) a validation framework should be established for each case. Consequently, in this
dissertation I followed the epistemological guidelines proposed by Mehta et al.
(
2004)
for high
-
dimensional biology. Assessing statistical validity requires an explanation of what a method is
supposed to do or what properties
an estimate or test
is supposed to exhibit. Although there is
subjectivity in selecting the desired properties, once
the criteria are chosen, methods can be
evaluated objectively
(Mehta et al., 2004)
. For instance, a desired property to compare among
rate. Once we
choose this property, we can compare the models by applying a validation, such
as simulating data or other validating procedures.
Validating procedures for RNA
-
seq analysis methods have followed the same strategies
that have been previously implemented i
n technologies such as microarray. Basically, the most
used validating procedures can be grouped into the following categories: a) theoretical
demonstration, b) exemplar dataset, c) simulation, d) gold
-
standard reference, and e) plasmode
datasets. Theoreti
cal derivations in the context of high
-
dimensional biology often lacks of a
mathematical demonstration to support their validity
(Mehta et al., 2006)
. The use of an exemplar
dataset offer the advantage of accounting for the whole complexity of the transcriptome such as
having a variance
-
covariance matrix that reflect real interactions am
ong features. However, this
approach must be considered only as an illustration in a particular dataset and not as evidence
to support a method. Computer simulations have been the most commonly used procedure
(Anders and Huber, 2010; McCarthy et al., 2012)
d
ue to ease
in creating data
sets under
hypothetical scenarios by assuming a parametric data gener
ation model. However, although
gene
8
expression data is easy to simulate, mimicking a realistic gene expr
ession dataset is quite
challenging. Gold
-
standard references obtained with a different technology have been used to
validate results
(Fang and Cui, 2011)
. Typical r
eferences for gene expression results from RNA
-
seq are qPCR data
(Bullard et al., 2
010; Rapaport et al., 2013)
. However, analysis models for
qPCR data should themselves be validated
(Steibel et al., 2009)
. Anoth
er example of gold
standard is the use of mask
-
and
-
impute for estimating imputation accuracy
(Badke et al., 2013;
Gualdrón Duarte et al., 2013)
and the use of genotyping chips to validate gen
otype calls from
genome sequencing
(Kumar et al., 2014)
. Finally, the use of plasmodes is another app
ropriate
procedure that can be applied to validate a statistical method. This approach aims at generating
datasets that preserve the characteristics of experimental data with the benefit of knowing the
true status as it happens with simulated data.
Plasmo
des are synthetic datasets generated from experimental data
(Mehta et al.
, 2004)
.
The term plasmode was introduced in 1967 by Cattel
l
et al.
(
1967)
as a tool for validating
techniques in multiv
ariate data. Unlike parametric simulations, data distributions and correlations
generated in plasmodes can be more realistic because they are taken directly
from experimental
data thus no
assumptions are required
a priori
to create the datasets. Plasmodes have to be
generated according to the validation objectives and the available experimental data
(Mehta et
al., 2006)
and one challenge is precisely how to create a plasmode for differential expression
analysis versus a plas
mode to validate results for
sample
-
or even gene
-
based
clustering.
In this diss
ertation
,
I show how to validate statistical analysis methods for RNA
-
seq data
by creating plasmode datasets. I also use parametric simulations and a gold
-
standard reference
to complement the validation analysis. Plasmodes were created in three different w
ays. First, I
created plasmodes by applying resampling
-
based methods consistent with the null hypothesis
(no differential expression). Second, I created differentially expressed plasmodes with a known
proportion of differentially expressed genes. These two
methods of creating plasmodes are
9
particularly useful to evaluate statistical properties of differential expression analysis models and,
for evaluating dissimilarity metrics for clustering. Third, I generated synthetic individuals by
combining known propo
rtions of transcript
-
specific read counts from two paternal individuals,
which is more relevant for comparing dissimilarity metrics for hierarchical clustering. Finally, I used
a gold
-
standar
d reference comparison
to assess properties of variant calling me
thods.
Given the rapid
advances in analysis tools designed for RNA
-
seq experiments, a better
understanding of the properties of results obtained by commonly used statistical analysis methods
would provide valuable
information for researchers who
either de
velop analysis models and tools
or utilize analysis tools in their investigations. A systematic comparison t
h
rough a valid framework,
as provided by plasmodes, could supply reference datasets to be used as benchmark
s
for
comparing analytic approaches. Furt
hermore, researchers interested in specific experimental
data could use plasmodes generated by similar experimental conditions to conduct pilot
in silico
studies and to choose the most suitable analysis procedure. For instance, a researcher may
evaluate wh
ether to use a generalized linear model for analyzing the read counts of a differential
expression experiment or to apply an appropriate transformation in order to use a more flexible
general linear mixed model. As the range of objectives of RNA
-
seq studie
s e
xpands (i.e. focusing
not only o
n differential expression analysis) it will be of interest to contribute with analogous
validation framework in such
specific applications. In accor
dance with this demand, we provide
metrics to evaluate dissimilarity matr
ices to represent sample relationships using hierarchical
cluster analysis, and metrics to compare variant calling software. The evaluation of dissimilarity
measures are of direct benefit to summarize and represent biological/technical variability in
exper
imental design, or relationships among individuals in population studies. The comparison of
variant calling software provide
s
information on genotyping accuracy that could improve other
related analysis such as allele
-
specific expression or RNA editing.
10
Thus, the overarching goal of this dissertation was to propose and implement validation
procedures based on experimental data to estimate the properties of widely used statistical
analysis methods of RNA
-
seq experiments.
The specific objectives were:
1.
To
evaluate statistical models for differential expression analysis in RNA
-
seq experiments
2.
To assess the properties of dissimilarity measures for agglomerative hierarchical
clustering of RNA
-
seq data
3.
To compare sensitivity and false discovery rates for SNP
genotyping in variant calling
programs
11
Chapter
2
Evaluating statistical analysis
models for RNA sequencing experiments
Evaluating statistical analysis models for
RNA sequencing experiments
Reeb, P. D., and Steibel, J. P. (2013). Evaluating statistical analysis models
for RNA sequencing experiments.
Front. Genet.
4, 1
-
9.
doi:10.3389/fgene.2013.00178.
12
2.1
Introduction
RNA sequencing (RNA
-
seq) technology is being rapidly adopted as the platform
of choice
for high
-
throughput gene expression analysis
(Ozsolak and Milos, 2011)
. Many methods have
been proposed to model
relative transcript abundances obtained in RNA
-
seq experiments but it
is still difficult to evaluate whether they provide accurate estimations and inferences.
Sound statistical analysis of RNA
-
seq data should consider not only the factors of any basic
expe
transcriptomic, etc). An RNA
-
seq experimental design must consider treatment and block
structures, and combine them according to the principles of a well
-
planned design:
randomization,
blocking and replication
(Aue
r and Doerge, 2010)
. Typically, fixed or random effects such as
library multiplexing, sequencing lane, flow cell, individual sample, tissue, or time can be crossed
or nested with treatments or other experimental conditions. Such a design is used to mode
l
thousands of correlated variables (i.e transcripts), usually, in a context of small number of
biological replicates. Although the development of reliable models that account for all these
factors is challenging, it is even more difficult to assess the va
lidity of a particular analysis model
(Pachter, 2011)
.
Validity of statistical models for differential expression analyses has been evaluated by (i)
app
lying th
e model to a novel data
set, (ii) deriving analytical proofs, (iii) using simulations, (iv)
comparing to a gold
-
standard measure, or (v) constructing plasmodes. In (i) the true status of
nature is unknown, therefore this method must only be accepted as an i
llustration and not as
evidence to support a model. However, any of the last four options, or a combination of them,
could be used to demonstrate adequacy of a model. Obtaining a mathematical demonstration (ii),
may be impossible for some models
(Gadbury et al., 2008)
. Most of the models rely on
assumptions that are difficult to verify and the consequences of departures from assumptions may
13
not be clear. Computer simulation (iii) has been the most commonly used procedure
(Anders and
Huber, 2010; McCarthy et al., 2012)
. This preference
is due to ease in creating data
sets under
diverse scenarios by controlling the set of parameters used in the simulation. Nevertheless, such
g
enerated data depend on the parameterization selected and the assumptions of the simulation
model. Moreover, these dataset may constitute a partial representation of reality as the complexity
of RNA
-
seq data is hard to mimic. Typical gold
-
standard (iv) for
gene expression are qPCR data
(Bullard et al., 2010; Rapaport et al., 2013)
. Ho
wever, analysis models for qPCR data should
themselves be validated
(Steibel et al., 2009)
. The use of plasmodes (v) is another app
ropriate
procedure that can be applied to validate a statistical method. This approach aims at generating
datasets that preserve the characteristics of experimental data with the benefit of knowing the
true status as it happens with simulated data.
A plasm
ode is a data
set obtained from experimental data but for which some truth is known
(Mehta et al., 2004)
. Plasmodes have been applied in microarrays
(Gadbury et al., 2008)
,
admixture estimation methodologies
(Vaughan et al., 2009)
and qPCR
(Steibel et al., 2009)
.
This
procedure has not been extensively applied in RNA
-
seq since it requires large sets of raw data
with an accurate description of
the experimental conditions under which they were obtained. This
information is essential to accurately develop plasmodes under null and alternative hypotheses.
Only recently, an initiative has provided a repository with ready
-
to
-
use databases from RNA
-
seq
studies
(Frazee et al., 2011)
.
Processed data obtained from RNA
-
seq experiments are essentially counts that in the
simplest model represent total number of reads mapping to a region in a reference genome or
transcriptome. A comprehensive comparison of stochastic models that have been pr
oposed is
presented in Pachter
(2011)
. Although different discrete distributions such as binomial,
multinomial, beta
-
binomial, Poi
sson, and negative binomial, have been proposed to model RNA
-
seq data, Poisson and negative binomial are the most implemented ones in RNA
-
seq analysis
14
software. A simple Poisson model seems appropriate when the experiment includes only
technical replicates
from a single source of RNA
(Marioni et al., 2008)
. In practice, however, due
to e
xtra sources of variation, the observed dispersion is larger than the expected for a simple
Poisson distribution and to correctly account for over
-
dispersion, generalized Poisson (GPseq)
(Srivastava and Chen, 2010)
, mixed Poisson (TSPM)
(Auer and Doerge, 2011)
, Poisson log
-
linear
(PoissonSeq)
(Li et al., 2012)
and negative binomial (edgeR, DESeq, baySeq, NBPSeq)
(Robinson et al., 2010; Ander
s and Huber, 2010; Hardcastle and Kelly, 2010; Di et al., 2011)
are
used instead. Regardless of the model, calculating dispersion parameters requires special
statistical and numerical approaches due to the small sample sizes and large number of
respons
es used in RNA
-
seq studies. In particular, borrowing information across transcripts when
estimating model parameters, as used in microarrays
(Smyth, 2004; Cui et al., 2005)
, has been
also proposed for RNA
-
seq
(Robinson and Smyth, 2008; Anders and Huber, 2010; Zhou et al.,
2011)
. Another challenging issue for these statistical anal
ysis models, is the ability to handle
different experimental sources of variation. Most of the models allow fitting simple effect models
and pair
-
wise comparison between treatments but only a few allow multiple factors
(McCarthy et
al., 2012)
. Currently, to the best of our knowledge, there is only one available model that can fit
random
effects
(Van De Wiel et al., 2013)
. Methods that can accommodate complex hierarchic
al
designs and provide more powerful tests to detect differentially expressed transcripts are under
active research.
On the other hand, microarray analysis models and software usually assume a
Gaussian distribution for response variables, but they accommod
ate fixed and random effects in
a straightforward manner
(Rosa et al., 2005; Cui et al., 2005)
. Consequently, an alternative to
model counts in RNA
-
seq experiments is to transform counts and use Gaussian models
(Langmead et al., 2010; Smyth et al., 201
2)
.
In any case, given the multitude of available statistical models and the complexity of
experimental design of many gene expression studies, researchers often find themselves having
15
to decide between competing models and analysis program. In other c
ases, although a
researcher may have an a priori designated software and model for RNA
-
seq data analysis, the
question is if the fitted model produces sound inferences.
In this paper, we present and apply a methodology for evaluating statistical methods fo
r RNA
-
seq experiments by combining results from computer simulations and plasmodes. We follow the
epistemological guidelines stated in Mehta et al.
(2006)
for high
-
dimensional biology and provide
a general framework that can be adapted to different experimental conditions.
2.2
Material and Methods
2.2.1
Simulations
Simulated
datasets were created conditional on estimated parameter values and results that
had been previously obtained
(Ernst et al., 2011)
. The data consisted of read co
unts from an
RNA
-
seq experiment based on a developmental expression study
(Sollero et al., 2011)
.
Experimental and alignment protocols are described in the supplemental material (
Figure
9
).
Estimations for parameters
and
were obtained by fitting generalized linear Poisson models
with log
-
library size as an offset variable using function lmer
(Bates et al., 2013)
from R
(R
Development Core Team, 2014)
.
Equation [1] represents the generalized linear model used to generate the simulated
datasets:
[1]
16
where
y
ij
is the read count for a particular transcript in treatment
i
and sample
j
,
O
ij
is a known off
-
set value (in this case the total library size),
i
is the group mean,
e
ij
is a sample
-
specific residual.
The transcript sub
-
index (g) was omitted for convenience.
Given estimates of parameters from equation [1] for transcripts, we simulated read counts by
following the algorithm described in
Figure
1
. The output from this procedure consisted of a matrix
of counts of size
T
by
2nr
with a known proportion (
p
0
) of differentially expressed transcripts and
known group effec
ts (
i
). Treatment is represented in this matrix by
nr
columns, but with only
n
independent (biological) replicates. While this simulation is not based on the negative binomial
distribution, it continues to be an over
-
dispersed Poisson process commonly use
d to simulate
RNA
-
seq counts
(Blekhman et al., 2010; Auer and Doerge, 2011; Hu et al., 2011)
. The resulting
over
-
dispersed Poisson counts will have means, variances, and treatment effects sampled from
those estimated from experimental
data. The procedure can be repeated K times to produce
several simulated datasets.
17
Figure
1
Algorithm used to simulate counts from existing estimates of model parameters
We set
K=1000
and
T=5000
, producing 1000 simulated datasets with 5000 transcripts each.
Noteworthy, when sampling transcripts in S, it is assumed that all transcripts are differentially
expressed (no significance testing is performed). But subsequently, the mean treatment
differe
nces (in the log
-
scale) are zeroed out if the transcripts are assigned to
S
0
. For transcripts
assigned to
S
1
, mean differences are kept unchanged; consequently
S
1
includes a whole
distribution of treatment effects from very small to large
according to the
distribution
estimated
from the experimental data.
We simulated nine scenarios by combining three levels of biological replication (n=3, 5, 10)
and three levels of technical replication (
r
=1, 3, 5). The proportion of differentially expressed
transcripts w
as set to 0.1.
18
2.2.2
Plasmodes
In contrast to simulation datasets based on equation [1], we generated plasmode datasets not
based on any model. Plasmodes were generated using data available in the online resource
ReCount
(Frazee et al., 2011)
. From the whole collection of analysis
-
ready datasets, we chose
to work with two RNA
-
seq experiments to illustrate t
he generation of 1) a null data
set, where there
are no obvious systematic effects that explain v
ariance in gene expression
and, 2) a data
set with
treatment and block effects.
2.2.2.1
Null dataset (Cheung)
The data originated in a study of immortalized B
-
cells from 41 (17 females and 24 males)
(Cheung et al.,
2010)
. The samples were sequenced using the Illumina Genome Analyzer. To generate a
plasmode dataset, we selecte
d the 21 samples from male individuals that were represented with
only one technical replicate. The resulting gene expression data exhibits extensive variation that
cannot be attributed to any systemati
c factor (
Figure
2
a
). Any random partition of the dataset into
two (or more) categories should shield a null dataset where no differential expression is expected
beyond the normal sample
-
to
-
sample variation. Consequently
this dataset lends itself to create
plasmodes to evaluate statistical properties of analysis models under the null hypothesis.
19
Figure
2
Multidimensional scaling analysis of experimental datasets
(A)
Cheung samples: F=Females and M=males;
(B)
Bottomly samples: labels correspond to
strain (treatment) B6=C57BL/6J, D2=DBA/2J, and colors to flowcell number (block): red=4,
black=6 and green=7. In Cheung dataset there is not clear distinction between femal
es and males
while in Bottomly samples are first grouped in two large groups corresponding to strain B6 and
D2 and then in subgroups consistent with flowcell number
To generate null data
sets, we pr
oceeded as explained in
Figure
3
. Using
n=21
samples from
males, we generated
p=10
plasmodes each with
t=2
groups and
r=10
biological replicates in each
group.
Notice that no
parametric model is used at any time. We constru
cted plasmodes by reshuffling
data and assigning an arbitrary treatment label. In this way overall distribution and gene
-
to
-
gene
correlations remain unchanged with respect to real data.
20
Figure
3
Algorithm
used to generate plasmode data
sets with no differentially expressed
transcripts under a model with one classification variable
2.2.2.2
Differentially expressed
dataset (Bottomly)
In Bottomly et al.
(2011)
, the authors arranged 21 samples from two inbred mouse strains
(B6 and D2; n for B6=10, n for D2=11) on 21 lanes of three Illumina GAIIx flowcells and they
analyzed the RNA
-
seq reads with a simple one
-
way classification (strain) model. A
fter
performing descriptive analysis of gene expression data, we found that not only strain but also
the experiment number (flowcell) explained a large amount of the variation (
Figure
2
b). For
example, the first principal dimension clearly divides samples from each strain, but the second
principal dimension shows substantial variation between flowcells, especially flowcell 4 (red)
versus the other two.
Consequently, we
blocked by experiment and used edgeR to fit a model with strain and
experiment as fixed effect, resulting in a large number of putatively differentially expressed
genes (
Figure
10
). Due to a strong experiment effect, we decided to conduct randomization for
plasmode construction within experiment number as detailed in
Figure
4
.
We g
enerated 10 plasmodes executing step 4 to 7 with p=10 and
0.20. Notice that in step
3, we used edgeR to obtain a list of DE genes (set G) to build a plasmode with some genes under
alternative hypothesis. Any other statistical software, however, can be us
ed with the only
21
requirement of defining a sufficient small q
-
value threshold. After genes are selected no model is
used at any time. Similar to the previous section plasmodes are constructed by reshuffling data,
but in this case an effect estimated from r
eal data is added to selected genes. Again, we expect
that this procedure yields plasmodes with identical distribution to real data for non differentially
expressed genes and with comparable effect sizes for differentially expressed genes.
Figure
4
Algorithm used to generate plasmode datasets with differentially expressed transcripts
under a model with two classification variables (block + treatment)
22
2.2.3
Comparison of alternative analysis tools for evaluating
differential expression
To illustrate the use of simulated datasets and plasmodes we compared three R
(R
Development Core Team, 2014)
packages from Bioconductor
(Gentleman et al., 2004)
. Two of
them, edgeR and D
ESeq, were designed specifically for statistical analyses of RNA
-
seq
experiments while the third one, MAANOVA
(Cui et al., 2005)
, was originally conceived for
analyzing microarray experiments. As mentioned before, MAANOVA has the ability of fitting
hierarchical models that can better accommodate complex experimental design assumptions.
However, such flexib
ility comes at the price of assuming a Gaussian distribution. Data
transformation and use of permutation to set significance thresholds can help alleviate these
limitations, but its performance may still be contingent upon sample size and total read counts
per
transcript. Consequently, we included MAANOVA in this study and compare it to two well
established packages for RNA
-
seq analysis.
2.2.3.1
Filtering and normalization
A double filtering cr
iterion was applied to all data
sets previous to normalization a
nd
statistical analysis. Transcripts with 2 or more reads per million in at least as many libraries as
number or biological replicates were kept in the analysis. In the simulation study, technical
replicates were summed up before filtering. Consequently, t
he technical replicate level only
represents increased sequencing depth.
Normalization aimed at accounting for differences in library size and composition not
attributable to treatments. To conduct the analysis with edgeR, data were normalized using the
sc
aling method proposed by Robinson and Oshlack
(2010)
and the logarithm of the resulting
effective library size were used by default as offsets in the model.
23
Analyses with DESeq were performed on counts previously normalized by function
estimateSizeFactors. According t
o Anders and Huber
(2010)
, this normalization method is
similar to the one proposed by Robinson and Oshlack
(Robinson and Oshlack, 2010)
in edgeR,
and it is the recommended procedure by the authors of DESeq.
Normalized values to use in MAANOVA were obtained with function voom() of the limma
package
(Smyth, 2005)
. The process, analogous to the one proposed in
(Smyth et al., 2012)
,
included adjustment for compositional structure using function calcNormFactors() of edgeR and
transformation to log2
-
counts per million.
2.2.3.2
Diff
erential expression analysis
edgeR
:
Differential expression was tested by likelihood ratio tests using the generalized linear
model functionality and estimating tagwise dispersions.
DESeq
:
To look for differentially expressed genes, function nbinomGLMTest
was applied using
the dispersion estimates generated by function estimateDispersions.
MAANOVA
:
In the linear model fit by MAANOVA lane was treated as a fixed array effect of a
single
-
color microarray. Differential expression analysis was performed using bo
th, moderated
F
-
test (Fs) and transcript by transcript F
-
test (F1). Significance was assessed using 100 sample
permutations
(Yang and Churchill, 2007)
.
2.2.3.3
Multiple
comparisons
It is recognized that correction of p
-
values when making multiple comparisons is essential in
high throughput differential expression analyses
(Storey and Tibshirani, 2003)
. The most common
procedure used is the computation of the false discovery rate or FDR
(Benjamini and Ho
chberg,
1995)
. Properties of methods to estimate FDR rely heavily on the distribution of p
-
values
(Li et
24
al., 2012)
. In this case we did not aim at selecting individual differentially expressed genes or
gene sets but we focused at studying the properties of tests in terms of type I and type II error
rates. Consequently, we concentrate on compar
ison of nominal and empirical type I and type II
error rates without applying multiple correction and we discuss how departures of assumed values
can further affect decisions when applying p
-
value corrections.
2.2.4
Evaluating and comparing results from
alternative analysis packages
To compare performances of derived tests in terms of power and type I error rates, we
generated receiver operator characteristic (ROC) curves by computing true positive rate (TPR)
and false positive rate (FPR) at given signifi
cance thresholds. The TPR was calculated as the
proportion of true positives (TP) over the total number of simulated differentially expressed
transcripts (S
1
). FPR, on the other hand, was calculated as the proportion of false positives (FP)
over the total
number of transcripts simulated with no differential expression (S
0
). See table 1 for
details.
25
Table
1
Classification rule to compute false and true positive rates
Analysis
method status
Transcript simulation status
Total
Not
differentially
expressed
Differentially
expressed
not declared
significant
T
N
FN
R
0
declared
significant
FP
TP
R
1
Total
#S
0
#S
1
G
FP=number of false positives (transcripts in S
0
set declared differentially expressed), TP=
number of true positives (transcripts in S
1
declared differentially expressed), FPR= false positive
rate= FP/#S
0
, TPR=true positive rate=TP/#S
1
Finally, distributions of p
-
values were compared by quantile
-
to
-
qu
antile plots and
histograms.
Analyses were performed at the Michigan State University High Performance Computing
Center facilities using R (version 2.15.1), edgeR (version 3.0.8.4.6), limma (version 3.14.4),
DESeq (version 1.10.1) and MAANOVA (version
1.28.0).
26
2.3
Results
2.3.1
Simulations
Figure
5
shows results obtained for a simulation with 3 biological replicates and 1 technical
replicate. Similar results were found in other simulated scenarios (data not shown).
The quantile
-
to
-
qua
ntile plot in
Figure
5
allows evaluation of the fit of observed p
-
values to the
uniform (0,1) distribution expected under null hypothesis
(
Leek and Storey 2011
)
. P
-
values
corresponding to MAANOVA showed a more characteristic pattern whereas edgeR and DESeq
presented significant departures from such distribution. Furthermore,
the logarithmic scale allows
to easily inspect the behavior of very small p
-
values. DESeq
presented larger p
-
values than
expected up to a cutoff of 0.001, while the opposite pattern occur for p
-
values smaller than 0.001.
Both MAANOVA approaches presented a close to expected pattern with a small deviation for p
-
values smaller than 0.0001. To
compute the logarithm, all p
-
values equal to zero were replaced
by the minimum observed p
-
value and thus generated the plateau at the end of the distributions
of MAANOVA results. In addition, quantile
-
to
-
quantile plots allowed us to select Fs and F1 tests
computed with permutation against the tabulated approach (
Figure
8
a
-
b). An alternative
representation of p
-
value distribution using histograms is presented in the sup
plemental material
(
Figure
11
).
27
Figure
5
Simulatio
n results from a scenario with three
biological replicates
(A)
Q
-
Q uniform plot of non differentially expressed transcripts,
(B)
type I error rate
vs. nominal
significance values, and
(C)
ROC curves. Models:
i) edgeR (blue), ii) DESeq (red), iii) MAA
-
Fs:
MAANOVA Fs moderated test using permutation (green), and iv) M
AA
-
F1: MAANOVA F1
transcript by transcript test using permutation (blue)
In concordance with the observed p
-
value distributions, the realized type I error rates
levels for DESeq and edgeR were much different than expected in comparison with MAANOVA
approaches (
Figure
5
b). All the packages presented higher realized significance levels when
evaluated at nominal values below 0.01, with edgeR being the most liberal, and MAANOVA the
least deviated from nominal values.
ROC curves had similar patterns for each of the nine sim
ulated scenarios. Power improved at
a given FPR as the number of technical and/or biological replicates increased. In the scenario
with 3 biological replicates, the enhancement in power when adding technical replicates seems
to be particularly greater than
in a scenario with 5 or 10 biological replicates (data not shown). In
the case with 3 biological replicates and 1 technical replicate (
Figure
5
c), edgeR and DEseq ha
d
similar power while the MAANOVA analyses reported less power.
28
2.3.2
Plasmodes
2.3.2.1
Null data
set (Cheung)
Q
-
Q plot in
Figure
6
a
shows the adequacy of p
-
values t
o the uniform distributio
n for each of
the plasmode data
sets analyzed with the different models. All the models presented large
dispersions with some cases being close to the expected values and some being far apart. In
particular, edgeR results tend to be above the identity line which means that observed p
-
val
ues
are smaller than expected. On the contrary, DESeq and both MAANOVA tests tend to have a
more conservative behavior as they presented larger observed p
-
values than expected. See also
Figure
6
b where edgeR presented inflated type I error rates for nominal significance threshold
smaller than 0.01.
Figure
6
Plasmode results from Cheung dataset:
(A)
Q
-
Q uniform plot of non differentially expressed transcripts,
(B)
type I error rate
vs. nominal
significance values. Models:
i) edgeR (blue), ii) DESeq (red), iii) MAA
-
Fs: MAANOVA Fs
moderated test using permutation (green), and iv) MAA
-
F1: MAANOVA F1 t
ranscript by transcript
test using permutation (blue)
29
2.3.2.2
DE dataset (Bottomly)
The p
-
value distributions (
Figure
7
a) presented similar dispersion patterns to the one
observed in the plasmodes generated from Cheung dataset utilizing edgeR and DESeq.
However, p
-
value distributions for MAANOVA tests were
more homogeneous across datasets
with the p
-
values from F1 test tabulated approach being closer to the expected values under
uniform distribution. ROC curves for DESeq and edgeR were analogous after adjusting for type
I error rates. Besides, both programs
reported higher power than analysis performed with
MAANOVA (
Figure
7
c).
Interestingly, and opposite to previous datasets, the best F test to apply
when using MAANOVA
was F1 with tabulated F values
-
compare the proximity to the red line in
Figure
8
e in contrast to the pattern in
Figure
8
f.
Figure
7
Plasmode results from Bottomly dataset:
Q
-
Q uniform plot of non differentially expressed transcripts,
(B)
type I error rate
vs. nominal
significance values, and
(C)
ROC curves. Models:
i) edgeR (blue), ii) DESeq (red), iii) MAA
-
Fs:
MAANOVA Fs moderated test with permutation (green), and iv) MAA
-
F1: MAANOVA F1
transcript by transcript test tabulated (blue)
30
Figure
8
-
value results
-
value results of non
differentially expressed transcripts using Fs
moderated test and F1 transcript by transcript test, with a tabulated (right) or permutation (left)
approach. In the simul
ated data
set
(A
-
B)
the permutation approach presented a more
characteristic uniform distribution, the plateau at the end is caused by the replacement of zeroes
by the minimum observed p
-
value when computing logarithm. Plasmodes generated from Cheung
datase
t, presented similar patterns either using a tabulated or a permutation approach
(C
-
D)
.
Plasmodes generated from Bottomly presented better patterns for Fs with tabulated and F1 with
permutation approach
(E
-
F)
31
2.4
Discussion
and conclusion
Validating and comparing methods to analyze RNA
-
seq data is essential for providing
powerful statistical packages that can detect differentially expressed genes in downstream
analyses
(Robles et al., 2012)
. In this paper we illustra
te how to utilize plasmode data
sets in
combination with simulations to evaluate analysis methods more co
mprehensively.
Parametric simulations can benefit a particular model depending on the distribution and
specifications used to generate the dataset. For example, it can be argued that in our simulation
study, edgeR and DESeq resulted too liberal compared to
MAANOVA due to the additive
generalized Poisson model that was used to simulate the dataset. However, results from two
independent plasmode datasets, generated without using specific parametric models, confirmed
the same behavior (
Figure
6
b and
Figure
7
b
). Moreover, a common problem of parametric
simulations is that genes are sim
ulated independently. Such misspecification is overcome in
plasmode datasets where the residual correlation structure among genes after adjusting for
systematic effects is preserved with respect to the original dataset.
Exploring the joint null distri
bution of p
-
values for a particular test helps to determine the
adequacy of a model and to decide the best method to correct for multiple comparisons, and
doing so requires generation of multiple accurate high
-
dimensional datasets
(Leek and Storey,
2011)
. For example, we compared null p
-
value distribution obtained for the two types of
MAANOVA F tests (Fs or F1) combined with two methods to compute the p
-
values (tabulated
F
or permutation). The choice of the best c
ombination varies for each data
set: in the simulation
study, either Fs or F1 using permutation provide a p
-
value distribution close to a uniform
distribution while none of the F tests using tabulated values provi
de a reasonable distribution
(
Figure
8
a
-
b). Plas
modes generated from Cheung dataset
presented similar patterns for all the
combinations (
Figure
8
c
-
d), then Fs and F1 using permutation were chosen as suggested by Cui
32
et al.
(Cui et al., 2005)
. Conversely, in the analysis of plasmodes generated from Bottomly data
set
,
F1 test using tabulated F values was the best approach (
Figure
8
e
-
f). According to Cui et al.
(2005)
,
the F1
test for a fixed effect model has a standard F distribution and critical values could
be obtained from F tables. These results are important because typical correction by FDR as
proposed by Benjamini and Hochber
(1995)
may not be appropriate if the underlying uniform
distribution is not supported. Other strategies have been adapted from Storey
(2002)
to estimate
FDR for RNA
-
seq data and which correction should be applied is a topic of re
seach
(Li et al.,
2012)
. All in all, these results emphasize the need to validate methods und
er realistic conditions
and to select a base dataset for a plasmode where total sample size and sequencing depth
(magnitude of counts) are considered.
In addition to the base dataset used to build a plasmode, the specific algorithm for plasmode
generation
should vary according to the objective of the study.
Gadbury et. al
(2008)
presented
an algorithm that generates the partition of samples into two groups and repeatedly samples
different e
ffect sets to be added to that unique partition.
In this work, we propose to make several
partitions from the original set of samples and add a set of effect in each case (
Figure
4
). This
approach constitutes a way to incorporate valuable information on biological variation. For
example, one can easily study the dispersion of patterns in the Q
-
Q plots or ROC curves.
Alternatively, both approaches, Gadbury et al.(2008)
and the one presented in this paper, can be
combined to study the influence of different sets of genes as well as sample variability.
Moreover, the construction of a plasmode must consider all the experimental conditions under
which the base data were
collected. Treatment and block effects may be easily identified from the
experimental design but further restrictions in randomization (flowcell, lane, time) or technical
issues (operator, use of technical replicates) may arise only from inspecting protoco
l details and
applying explorative statistical analyses. For instance, descriptive analysis of the Cheung dataset
and visualization of samples using multidimensional scaling analysis (
Figure
2
a) suggested that
33
no specific effects were present in the data structure; therefore we used it as an example to build
a null plasmode. However, the same procedu
re applied to the Bottomly data
set indicated that not
only the main st
rain, but also a characteristic effect due to flowcell number was an important
source of variation (
Figure
2
b). Consequently, strain and block (flowcell) were considered in two
parts of the plasmode generation algorithm: firstly, when defining the model to select the effects
(step 2 in
Figure
4
), and secondly, when partitioning samples within each flowcell (step 5 in
Figure
4
). These considerations allowed us to generate approp
riate null
and alternative data
sets. A
similar process should be followed with any new dataset plausible of being used as a base for
plasmode generation.
We used the plasmodes and simulated data to illustrate the selection of optimal differential
expression analysis
strategies. To this end, we focused on comparing true and false positive
rates of tests to assess type I error rates and power. While we did not intend to perform a
comprehensive evaluation of analysis protocols for RNA
-
seq data analysis, we did want to i
nclude
two broad types of methods: 1) those directly tailored to count data by using negative binomial
distributions (DESeq, EdgeR) or 2) a Gaussian model after transformation (MAANOVA). We
found that edgeR and DESeq incur
red
in inflated type I error rates
for small significance levels
(Figures
Figure
5
b
,
Figure
6
b
, and
Figure
7
b
-
values tend to be closer to the
nominal significance levels. Admittedly, after adjusting for type I error rates, power was similar for
edgeR and
DESeq and higher than that from MAANOVA (
Figure
7
c). However, in a real data
scenario, adjusting is not possible because the true status is unknown.
These results emph
asize the fact that RNA
-
seq data are complex and to decide what method
to use may be experiment
-
specific due to the unknown distributions of expression levels.
Plasmodes may contribute to decisions on which method to choose by us
ing a similar pre
-
existing
data
set and comparing results. It is critical to select a dataset that has a complete description of
the experimental design and detailed protocols of how the data were obtained. Using this
34
information, it is possible to design proper null and alternative
datasets. For example, it was easy
to find a set of differentially expressed genes in the mouse dataset that studied two inbred lines.
Contrarily, in the human dataset, it was not possible to explain the variation in expression only as
a consequence of ge
nder effects. The human subjects came from an outbred population and
factors such as age, weight, or other characteristics could have explained differences in gene
expression. Granted, any of the mentioned effects could have been included in the model if t
he
information was available. The promising results obtained from this approach emphasize the need
of promoting and improving systematic data sharing across the research community to facilitate
plasmode building.
Finally, the flexibility of plasmode const
ruction allows comparing model tuning selection for
downstream analysis but also upstream analysis, as normalization procedures or alignment
pipelines, could be contrasted. Future uses of plasmodes could be: comparison of alignment
programs for a given sta
tistical analysis model or even exploring interaction of statistical model
and read processing protocols to find optimal combined pipelines for data processing from reads
-
to
-
p
-
values.
35
APPENDIX
36
RNA
-
seq data processing
In a previous
experiment, we used the Pig
oligoarray for transcriptional profiling of developing
pig skeletal muscle, and results for a study comparing transcript profiles of
longis
s
imus dorsi
muscle from fetuses at 40 and 70 d of gestation in two different breed
(Soller
o et al., 2011)
. For
this study, we used the same RNA samples from one of the breed types profiled with the
Pigoligoarray (n = 3 for each developmental age) with deep sequencing technology (Illumina
GAIIx) to obtain 50nt paired end reads from 6 librarie
s (2 conditions, 3 bio
-
replicates each). The
processing steps are described in
the following
Figure.
Figure
9
Steps used to process reads to obtain matrix of counts
1. Filter passing read pairs were aligned to the S. scrofa reference genome (Sscrofa9, April 2009,
Ensemble release 61) using the spliced RNA aware alig
ner, TopHat. Reads from each library were
aligned separately. The reference gene annotation (same version as above) was provided to TopHat
to provided information about predicted splice junctions, but TopHat also predicts novel splice
junctions.
2. Novel s
plice junctions predicted from each of the 6 libraries were combined with the splice
annotations from the reference to create a single, non
-
redundant set of predicted splice sites.
TopHat is better able to map spliced reads if a list of potential junction
s is provided as input.
3. Each library was aligned to the reference a second time, providing the non
-
redundant set of potential
splice sites as input.
4. The alignments produced from each library were used as input to Cufflinks. Cufflinks was used to
exam
ine RNA
-
Seq alignments and to generate a set of predicted transcripts based on assembly of
overlapping reads.
5. The transcript models generated by Cufflinks for each library were combined into a single, non
-
redundant set of transcript/gene models with Cuf
fcompare. The reference annotation was also
provided to Cuffcompare to associate the predicted gene models with their most likely reference
model.
6. The models generated by Cuffcompare are filtered to remove those models with little support from
the under
lying read alignments. Specifically, if aligned reads are observed in only one of the six
libraries, that model is removed from the final set.
7. The alignments for each library produced during the second round of TopHat (step 3) and the
curated set of gen
e models (step 6) were used as input to htseq
-
count. This program compared a set
of alignments to an annotation file and reported the number of fragments uniquely aligned to each
gene in the annotation. The models generated by Cuffcompare may have mul
tipl
e transcripts
modeled for a
particular gene but htseq
-
count only reports the total fragments for a gene.
37
Figure
10
P
-
value distribution of differential expression analysis performed with edgeR for
Bottomly data using a model with block and treatment fixed effects
38
Figure
11
P
-
value distribution of non differentially expressed transcripts for a simulated
s
cenario with three
biological replicates
P
-
value distribution of non differentially expressed transcripts for a simulated scenario with 3
biological replicates using: 1)
edgeR
, 2) DESeq, 3) MAA
-
Fs: MAANOVA
Fs moderated test
(permutation), and 4) MAA
-
F1: MAANOVA F1 transcript by transcript test (permutation)
39
Figure
12
P
-
value distribution of non differentially expressed trans
cripts for plasmodes
g
enerated
from Cheung dataset
P
-
value distribution of non differentially expressed transcripts for plasmodes generated from
Cheung dataset using: 1) edgeR, 2) DESeq, 3) MAA
-
Fs: MAANOVA Fs moderated test
(permutation), and 4) MAA
-
F1: MAANOVA F1 transcript by
transcript test (permutation)
40
Figure
13
P
-
value distribution of non differentially expressed transcripts for plasmodes
generated from Bottomly dataset
P
-
value distribution of non differentially expressed transcripts for
plasmodes generated from
Bottomly dataset using: 1) edgeR, 2) DESeq, 3) MAA
-
Fs: MAANOVA Fs moderated test
(permutation), and 4) MAA
-
F1: MAANOVA F1 transcript by transcript test (tabulated)
41
Chapter
3
Assessing dissimilarity
measures for sample
-
based hier
archical
clustering of RNA sequencing data using
plasmode datasets
Assessing dissimilarity measures for
sample
-
based
hierarchical
clustering of
RNA sequencing data using plasmode
datasets
Reeb, P. D., Bramardi,
S. J., and Steibel, J. P. (2015
) Assessing dissimilarity
measures for sample
-
based
hierarchical
clustering of RNA sequencing d
a
ta
using plasmode datasets.
PLoS
One
10(7)
:
e0132310.
doi:10.1371/journal.pone.0132310
42
3.1
Introduction
Hierarchical cluster analysis has been a popular method for finding patterns in data and for
representing results of gene expression analysis
(Liu and Si, 2014)
. Clustering algorithms have
been widely studied for analyzing microarray data
(Jiang et al., 2004; Dalton et al., 2009)
,
however, such technology is being rapidly replaced by RNA sequencing technology (RNA
-
seq)
(Wang et al., 2009)
. In contrast to microarray experiments, RNA
-
seq generates count data of
discrete nature that may call for di
fferent analysis methods. One of the most obvious differences
between clustering gene expression data from RNA
-
seq or microarray is the choice of a
dissimilarity measure, or the need to transform and normalize RNA
-
seq data in order to use
dissimilarity mea
sures commonly used for microarray data
(Liu and Si, 2014)
.
Before implementing any statistical analysis of RNA
-
seq data, normalization and
transformation have to be performed.
(Bullard et al., 2010; Law et al., 2014; Liu and Si, 2014)
.
Normalization aims at reducing non
-
systematic va
riation within and between samples, such as
sequencing depth and library preparation. Data transformation could be very important because
it aims at reducing the effects of skewness, scale and presence of outliers that can be found in
read count data that
usually follow a Poisson
(Marioni et al., 2008)
or negative binomial distribution
(Robinson et al., 2010; Anders and Huber, 2010)
. Through appropriate transformation,
dissimilarity measures that are sensitive to asymme
tric distributions and scale magnitude, such
as Euclidean and 1
Pearson correlation
(Liu and Si, 2014; Jiang et al., 2004; Johnson and
Wichern, 2002)
could be used for clustering RNA
-
seq data.
Although a Gaussian distribution assumption is not required to compute Euclidean
and
correlation based distances, transformations that convert count data into a continuous and almost
Gaussianly distributed variable
(Law et al., 2014)
could be used for hierarchical clustering. For
instance, besides the classica
l logarithmic transformation, several functions have been proposed
43
to model the mean
-
variance relationship of RNA
-
seq data
(Anders and Huber, 2010; Love et al.,
2014
a; Law et al., 2014)
, while accounting for over
-
dispersion. But the properties of those
transformations need to be tested.
Finally, instead of using transformations to approximate the data to a pre
-
specified distribution
where available dissimilarity m
easures perform well, model based methods can be directly used
to compute dissimilarity measures
(Witten, 2011)
.
Evaluating the adequacy of alternative dissimilarity measures for hierarchical clustering
requires the fundamental
step of choosing reference datasets
(Handl et al., 2005)
. An ideal
reference dataset should mimic the technical and biological variability found in experimental data,
and it should also have some
a priori
known st
ructure in order to assess the goodness of results
from alternative analyses. Parametric simulations, exemplar datasets, and permutation sampling
have been used to generate such datasets in clustering analysis of biological data
(Sloutsky et
al., 2013)
. Similarly, plasmode datasets
(Mehta et al., 2004)
have been proposed for evaluating
differential expression analysis in RNA
-
seq experiments
(Reeb and Steibel, 2013)
. A plasmode
is a data
set obtained from experimental data from which some truth is known, thus, it is an ideal
way to generate data with an a priori defined structure that realistically mimics RNA
-
seq data.
Plasmodes were originally proposed for assessing multivariate analysis
methods
(Cattell and
Jaspers, 1967)
and have been used in behavioral science
(Waller et al., 2010)
and also in
genomics
(Gadbury et al., 2008; Steibel et al., 2009)
.
In this paper, we propose the use of pla
smode datasets to assess the properties of
dissimilarity measures for agglomerative hierarchical clustering or RNA
-
seq data. We present two
possible ways of creating plasmode datasets that depend on the available data structure, and we
use the resulting re
ference datasets to compare several commonly used dissimilarity measures.
44
3.2
Material and Methods
3.2.1
Datasets
Two experimental datasets were used in this study to create reference datasets. The first
dataset,
Bottomly
corresponds to an experiment descr
ibed elsewhere
(Bottomly et al., 2011)
.
Briefly, 21 samples of
striatum
tissue from two inbred mouse strains (C57BL/6J (B6), n=10; and
DBA/2J (D2), n=11) were sequenced in three Illumina GAIIx flowcells. Data were downloaded
from ReCount website
(Frazee et al., 2011)
.
A
fter filtering out genes with zero counts in all
sampl
es, the count matrix contained
13932 rows (transcripts) and 21 columns (samples). The
longissimus
muscle selected from the
MSU Pig Resource Popula
tion
(Steibel et al., 2011)
and sequenced
by our collaborators
(Steibel
et al., 2014)
. Total RNA from 24 F2 female pigs of Duroc b
y Pietrain ancestry was barcoded and
sequenced on Illumnina HiSeq 2000. Read mapping, gene modelling and read counting were
performed using Tophat
(Trapnell et al., 2009)
, Cufflinks
(Trapnell et al., 2012)
and HTSeq
(Anders et al., 2015)
, respectively. After processing the sequence reads, we
obtained a count
matrix with 26740 rows (transcripts) and 24 columns (samples). (Fo
r details, see supplementary
material
). The count matrix of the five samples (animals) used in this paper is available as
s
upporting information at PLoS ONE website
.
3.2
.2
Plasmodes
Plasmodes are synthetic datasets generated from experimental data for which some true
characteristic is known
(Mehta et al., 2004)
. For instance, we may know
a priori
which genes are
not differentially expressed or we may know group
membership of each sample. Then, we build
a plasmode by re
-
shuffling the existing data without assuming any probability distributions or
correlation structures. Thus, we can use the known characteristic of the synthetic dataset to
45
assess properties of anal
ysis methods. For instance we can apply resampling
-
based methods to
create plasmodes consistent with the null hypothesis (no differential expression) and use them to
evaluate the type I error rate hypothesis of testing procedures
(Mehta et al., 2006)
, or we can use
the known group memberships to assess the accuracy of clustering methods,
as we do in this
paper. Thus, plasmodes need to be constructed according to the validation objectives (i.e.
considering the statistical method that is being evaluated) and considering the available
experimental data.
In this paper, we present two example
s on how to create plasmodes to assess the effect of
choice of dissimilarity measures on the results of hierarchical clustering of RNA
-
seq data. In the
first experimental dataset, the natural structure of the data is known
a priori
and it was generated
thr
ough the experimental design (sequencing flowcells and mice strains), while in the second
experimental dataset there is not an
a priori
known structure, so we create a set of artificial
samples where the structure is generated by construction.
3.2.2.1
Plasmodes from Bottomly dataset
We built plasmodes for this dataset by using samples from B6 strain, partitioning them in two
groups and adding known effects for selected genes taken from the difference in gene expression
with strain D2.
Figure
14
presents the algorithmic steps used to generate the plasmodes. Two
main effects, strain and flowcell, were used to classify the 21 samples (Step 2.1 in
Figure
14
)
given the importance of both sources of variation has been described before
(Reeb and Steibel,
2013; Law et al., 2014)
. Then, a differential expression analysis including all the samples (both
strains) was conducted with edgeR
(Robinson et al., 2010)
and transcripts with q
-
value <0.05
were identified as differentially expressed
(set
G
1
in step 4 of
Figure
14
). Subsequently, samples
from strain B6 were randomly assigned to two groups (A or B) within each flowcell, and a subset
(
S
1
) of effects randomly selected from
G
1
was added to the correspondin
g genes in samples
46
labelled as B (Steps 5
-
6) . Therefore, samples from group A and B differ due to the strain effect
added by the subset (
S
1
) of differentially expressed genes, while samples within each group differ
due to the flowcell effect. We generated
50 plasmodes with 10% of differentially expressed
B and one or two samples, if available, to group A within each flowcell (Step 5.2 in
Figure
14
). As
a result, in each plasmode generation we obtained a total of 10 samples under two artificial
treatments (A or B) and three flowcell effects (1, 2 or 3), result
ing in a set of samples indexed by
such factors as:{(A
1
, A
1
, B
1
, B
1
),(A
2
,B
2
,B
2)
,(A
3
,B
3
,B
3
)}. If we use only differentially expressed
genes, we expect the samples with same letter to cluster together because of the treatment effect,
but as we add a large number of non differentially expressed genes, we can expect that samples
with the same subi
ndex (flowcell) will tend to cluster together because it has been shown before
that there is a strong flowcell effect in this experiment
(Law et al., 2014; Reeb and Steibel
, 2013)
.
To evaluate the performance of dissimilarity measures under various differentially expressed
/
non differentially expressed
ratios (DE/nonDE), we analyzed three scenarios for each plasmode:
1) only DE transcripts (
DE
[100%]
), 2) DE transcripts +
all nonDE transcripts (
DE
[10%]
+nonDE
[90%]
), and 3)
DE transcripts + a random sample of 50% from nonDE transcripts (
DE
[20%]
+nonDE
[80%]
).
47
Figure
14
Algorithm used to generate plasmodes from Bottomly dataset
3.2.2.2
Plasmodes
from MSUPRP dataset
Since this dataset did not have a natural sample structure derived from experimental
conditions, a structure had to be induced in order to know a priori the expected clustering
configuration. From a descriptive multidimensional scaling
analysis of the 24 pig samples
(animals), we selected 5 dissimilar samples (A, B, C, D, E) according to their configuration in the
main plane (
Figure
15
). Synthetic s
amples were generated by combining a known proportion of
randomly sampled read counts of individual genes from each of two of the five selected samples.
For instance, a new synthetic sample named AAC was generated combining 2/3 and 1/3 of read
counts of in
dividual genes from A and C respectively. A full plasmode consisted of 12 samples
that included the five selected samples {AAA, BBB, CCC, DDD, EEE}, five synthetic samples
{AAC, BBC, CCB, DDE, EED} obtained by combining 2/3:1/3 proportions from two of the
selected
samples, and two synthetic samples {CxB, ExD} obtained by combining 1/2:1/2 proportions of two
of the selected samples
(see
Figure
21
.in
supplemental material
with a representation of the
48
relationships among the 12 samples of each plasmode). Following this procedure, a total of 50
replicated plasmodes were generated. As a r
esult we created a synthetic dataset where the
samples were expected to resemble each other to a known degree given the proportions of shared
reads.
Figure
15
Multidimensional scaling analysis of MSUPRP dataset
Twenty four
samples were represented in the main plane (dimension 1 and dimension 2 explained
22.4% and 13.8% respectively) and five distant samples (A, B, C, D, E, marked with ovals) were
selected as input samples to generate plasmode datasets.
49
3.2.3
Clustering
Defining a dissimilarity measure and a linkage method are the two key decisions for
performing hierarchical cluster analysis. We focused on assessing the adequacy of dissimilarity
measures that have been commonly used for clustering gene expression data. W
e also include
a recently proposed dissimilarity measure for RNA
-
seq count data
(Witten, 2011)
. As linkage
method, we decided to use complete linkage because it is invariant under monotone
transformations
(Izenman, 2008)
, and hence dissimilarity measures that have the same relative
ranking result in the same cluste
r structure
(Liu and
Si, 2014)
. This robustness reduces the effect
of linkage method when comparing dendrograms and allowed us to concentrate in the evaluation
of dissimilarity measures. Hierarchical cluster analysis was applied to each plasmode using the
agglomerative proc
edure implemented in function
hclust
from R
(R Development Core Team,
2014)
to concatenate samples and to generate dendrograms.
Eight dissimilarity measures were compared, including 4 variants based on Euclidean
distance, 3 correlation based approaches, and one Poisson based measure. Euclidean distances
were computed between samples following one of 4 approaches: i) using raw count data (
raw
),
ii) after normalizing samples using the median ratio size factor propo
sed by Anders and Huber
(Anders and Huber, 2010)
(
rnr
), iii) after applying a variance stabilizing transformation computed
with DESeq2
(Love et al., 2014b)
(
vsd
), and iv) after applying a regularized logarithm
transformation implemented in DESeq2
(Love et al., 2014b)
(
rld
). Correlation based dis
similarities
comprised: i) 1
-
Pearson correlation between samples using raw counts (
pea
), ii) 1
-
Pearson
correlation between samples using counts transformed by logarithm of raw counts +1 (
plg
), and
iii) 1
-
Spearman correlation between samples using raw co
unts (
spe
). The Poisson dissimilarity
(
poi
), which is based on a log likelihood ratio statistic for a Poisson model
(Witten, 2011)
, was
computed on data that were transformed by a power function to account for overdispersion, an
d
normalized by total sum of counts for each sample.
50
3.2.4
Cluster validation using results from plasmodes
Cluster validation can be assessed using several indices
(Halkidi et al., 2001; Xiong and Li,
2013)
and the choice of a particular
measure is application dependent
(Jiang et al., 2004)
.
Cophenetic distances provide a way to quantify similarities among dendrograms in hierarchical
clustering. The cophenetic distance is the distance from the bottom of the tree at which two
elements (samples in this paper) are groupe
d in the same cluster for the first time in the hierarchy.
To represent a dendrogram in terms of a set of cophenetic distances, the distances between all
pairs of elements is computed and arranged into a matrix called cophenetic matrix that represents
the
whole hierarchy, as illustrated in
Figure
22
and
Table
5
in supplementary material
. Cophenetic
matrices can be used to compare dendrograms
(Sokal and Rohlf, 1962)
. For instance, to
compare how similar are two dendrograms, the Pearson correlation betwee
n the lower triangular
portions of two cophenetic matrices can be used.
We computed the correlation between cophenetic matrices
(Handl et al., 2005)
to compare
dendrograms obtained with different dissimilarity mea
sures (between dissimilarity measure
comparison) as well as to compare all dendrograms obtained with a particular dissimilarity
measure (within dissimilarity measure comparison). Mean and standard deviation of correlations
between dissimilarity measures we
re used as a measure of agreement while mean and standard
deviation of correlations within a dissimilarity measure were used as a measure of consistency.
We also visually compared the obtained dendrograms to a reference dendrogram built
according to the s
ample structure known
a priori
from the plasmode generation process in the
MSUPRP dataset. For the MSUPRP dataset, we defined the expected similarity between two
samples (
s
ij
) as the maximum proportion of shared reads, and we defined 1
-
s
ij
as a reference
dissimilarity (see
Tables
Table
3
and
Table
4
in
supplementary material
). With the correlation
between each of the dissimilarity matric
es and the reference dissimilarity, we assessed how well
each dissimilarity measure recovered the expected sample structure. An equivalent reference
51
dissimilarity matrix and reference dendrogram cannot be easily built for the Bottomly dataset
because we di
d not exploit relationships between samples to build the plasmode, except for their
group membership. In this case, we compared typical dendrograms obtained from plasmodes to
the known strain and experiment membership in the original data.
3.3
Results
3.3.1
Bottomly
Figure
16
shows the typical dendrograms obtained for plasmode datasets using two
dissimilarity measures,
poi
and
rnr
, which are representative examples of
two sets of results under
the three different scenarios (
DE
[100%]
,
DE
[10%]
+nonDE
[90%],
and DE
[20%]
+nonDE
[80%]
,). On the one hand,
scenario 1 (
DE
[100%]
) uses only differentially expressed transcripts, therefore the expected hierarchy
should arrange samples
in two separate groups according to main treatment labels. Such is the
structure obtained utilizing the Poisson (
poi
) dissimilarity measure (
Figure
16
a). Using the Po
isson
dissimilarity measure, samples were clustered in two groups corresponding to treatments A or B,
and within each of the groups, samples were arranged according to block numbers (4, 6, or 7).
Differently, the dendrogram based on Euclidean distance calc
ulated from raw normalized data
(
rnr
) (
Figure
16
b) mixed treatment labels and did not recover any expected structure. On the
other hand, scenario 2 (
DE
[10%]
+nonDE
[90%]
), uses information from differentially (10%) and non
differentially expressed (90%) transcripts. As a result, we expected that the dissimilarity
measures would tend to represent other aspects of samples in addition to the treatment effect. In
conco
rdance with such expected structure, dendrogram obtained using the Poisson dissimilarity
(
Figure
16
c) firstly separated samples according to block labels, block 4 bein
g the most different
group. Subgroups for treatments A and B were arranged within each block. Conversely,
dendrogram based on Euclidean distance calculated from raw normalized data (
rnr
) (
Figure
16
3d)
52
did not present any expected structure. Finally, scenario 3 (
DE
[20%]
+nonDE
[80%]
) represents an
intermediate case that is useful to further explore the performance of dissimilarity measures
because it is enriched in DE ge
nes with respect to scenario 1, but it still conserves 80% of
background (nonDE) genes. The dendrogram based on the Poisson dissimilarity (
Figure
16
e)
presented an in
termediate structure where we observed that samples from block 4 were clustered
together while the remaining samples were clustered in a separate group mainly classified by
treatment effect. Yet again, dendrogram based on
rnr
(
Figure
16
f) did not characterize any
expected configuration. To sum up, for this dataset, dendrograms generated from a Poisson
dissimilarity resemble the expected hierarchical structures in all three scenarios, however,
dendrograms based on Euclidean distance comp
uted on raw normalized data did not.
Comparison of hierarchies between clusters constructed using
poi
and
rnr
dissimilarities across
50 plasmodes presented correlation of cophenetic matrices with low means and high standard
deviations (0.52±0.24, , 0.65±
0.17, 0.59±0.23 , for scenarios 1, 2 and 3, respectively) (
Figure
17
).
These results emphasize a poor correspondence between hierarchies constructed upon
poi
and
rnr
dissimilarities.
53
Figure
16
Typical dendrograms obtained for plasmode datasets from Bottomly experimental data
with two dissimilarity measures under three scenarios
Dendrograms obtained using complete linkage hierarchical
clustering based on Poisson
dissimilarity (
poi
) are presented in the left column (a, c and e), and dendrograms based on
Euclidean distance calculated from raw normalized data (
rnr
) are presented in right column (b, d,
f). The rows correspond to three sce
narios with different percentage of differentially expressed
(DE) transcripts: 1)
DE
[100%]
(a and b)
, 2)
DE
[10%]
+nonDE
[90%]
(c and d)
, and 3)
DE
[20%]
+nonDE
[80%]
(e and f).
Sample labels correspond to main treatment (A or B) and flowcell number (4, 6 or 7
).
Dendrograms based on
poi
separates samples according to the expected sources of variation; in
(a), only DE transcripts, samples are arranged in two separate groups following treatment labels;
in (c), with a predominant number of non DE transcripts, the
structure of groups is dominated by
flowcell characteristics in addition to main treatment: and in (e) an in
-
between scenario, the
dendrogram presents an intermediate group structure. Dendrograms based on
rnr
do not
resemble any expected configuration.
54
Figure
17
Agreement between dissimilarity measures using Bottomly plasmode datasets
Each matrix contains means (upper triangle) and standard deviations (lower triangle) of
correlation between cophenetic matrices of dendrograms (N=
50 plasmode datasets) for eight
dissimilarity measures: Euclidean distances using raw count data (
raw
), Euclidean distances
using normalized samples (
rnr
), Euclidean distances using variance stabilizing transformation
(
vsd
), Euclidean distances using regul
arized logarithm (
rld
).) 1
-
Pearson correlation using raw
counts (
pea
), 1
-
Pearson correlation using counts transformed by logarithm of raw counts +1 (plg),
and 1
-
Spearman correlation using raw counts (
spe
), and Poisson dissimilarity (
poi
). Panel labels
55
(a), (b) and (c) correspond to one of three scenarios of proportion of differential expressed genes:
DE
[100%]
,
DE
[10%]
+nonDE
[90%]
, and
DE
[20%]
+nonDE
[80%]
, respectively. In all scenarios, we identified three sets
of dissimilarity measures: 1)
raw
, 2)
rnr
and
pea
, and 3)
poi
,
rld
,
vsd
,
plg
and
spe
. Results from
raw
, set 1, were poorly related to results from any other dissimilarity measure. Dendrograms from
dissimilarity measures in set 2 presented correlation of cophenetic matrices with medium to high
means and high variability with each other, and low correlation with dendrograms from other
dissimilarity measures. Dendrograms from dissimilarity measures in
set 3 exhibited high
correlations of cophenetic matrices and low to medium variability when compared to each other.
Correlations between hierarchies obtained with the eight dissimilarity measure approaches
for each of three scenarios (
DE
[100%]
,
DE
[10%]
+n
onDE
[90%]
and
DE
[20%]
+nonDE
[80%]
) are presented in
Figure
17
. Each matrix contains means and standard deviations of correlations between cophenetic
matrices, in the upper and lower triangle respectively. Regardless of the scenario, dissimilarity
measur
es can be apportioned to three groups with common patterns. First, Euclidean distance
computed on raw data (
raw
) is poorly related to any other dissimilarity measure. Second,
Euclidean distance computed on normalized data (
rnr
) and 1
-
Pearson correlation
dissimilarity
(
pea
) presented medium to high correlations of cophenetic matrices (mean from 0.67 to 0.89)
with high variability (standard deviation from 0.13 to 0.29) with each other and low correlation
values with other dissimilarity measures. Third, the
re is a subset comprising 1
-
Pearson correlation
dissimilarity computed on log
-
transformed counts (
plg
), 1
-
Spearman correlation dissimilarity
(
spe
), Euclidean distance computed on transformed counts after applying either a variance
stabilizing function (
vsd
) or a regularized logarithm (
rld
), and the Poisson dissimilarity (
poi
). This
last group of dissimilarity measures presents high correlations of cophenetic matrices (mean from
0.82 to 0.99) with low to medium variability (standard deviation from 0.01 to 0.
2). Only hierarchies
obtained with dissimilarity measures from the third group consistently presented the expected
natural structure created by design in the plasmode generation process, being
rld
,
vsd
,
spe
and
plg
the more consistent across all scenarios
(
Table
2
, columns 2, 3 and 4).
56
Table
2
Consistency for each dissimilarity measure
Dissimilarity
Bottomly
MSUPRP
DE
[100%]
DE
[10%]
+nonDE
[90%]
DE
[20%]
+nonDE
[80%]
raw
0.58 (0.22)
0.91 (0.13)
0.75 (0.20)
0.56 (0.23)
rnr
0.35 (0.27)
0.43 (0.28)
0.33 (0.28)
0.40 (0.22)
rld
0.98 (0.01)
0.90 (0.11)
0.88 (0.13)
0.99 (0.01)
vsd
0.96 (0.04)
0.91 (0.10)
0.86 (0.15)
0.99 (0.01)
pea
0.37 (0.31)
0.53
(0.30)
0.45 (0.30)
0.43 (0.28)
plg
0.98 (0.01)
0.89 (0.13)
0.75 (0.19)
0.98 (0.01)
spe
0.99 (0.01)
0.86 (0.14)
0.88 (0.14)
0.99 (0.01)
poi
0.86 (0.21)
0.92 (0.09)
0.88 (0.14)
0.99 (0.01)
Mean and standard deviation of correlation between cophenetic
matrices of dendrograms (N=50
plasmode datasets) for each of eight dissimilarity measures: Euclidean distances using raw count
data (
raw
), Euclidean distances using normalized samples (
rnr
), Euclidean distances using
regularized logarithm (
rld
) Euclidean
distances using variance stabilizing transformation (
vsd
), 1
-
Pearson correlation using raw counts (
pea
), 1
-
Pearson correlation using counts transformed by
logarithm (plg), and 1
-
Spearman correlation using raw counts (
spe
), and Poisson dissimilarity
(
poi
). Columns correspond to the three scenarios generated for Bottomly (with different proportion
of DE genes) and the MSUPRP dataset.
We considered a clustering from a dissimilarity measure
to be consistent if hierarchies obtained for different plasmode dat
asets within each dissimilarity
measure were highly correlated and presented a low standard deviation. Clustering based on
raw
,
rnr
and
pea
were generally inconsistent presenting a number of very different hierarchical
structures.
We considered a dissimil
arity measure to be consistent if hierarchies obtained for different
plasmode datasets within each dissimilarity measure were highly correlated and presented a low
standard deviation. Consequently, we computed correlations of cophenetic matrices for
dendro
gram within each dissimilarity measure and calculated the mean and standard deviation
for each ensemble of 50 plasmodes (
Table
2
, columns 2, 3 and 4). Dissimilarity
measures
raw
,
rnr
and
pea
were generally inconsistent, resulting in a number of different hierarchical structures.
For instance,
rnr
presented mean correlation values of 0.35±0.27, 0.43±0.28, and 0.33±0.28 for
57
the three respective scenarios. Conversely, al
l the other dissimilarity measures were much more
consistent. For example,
rld
presented mean correlation values of 0.98±0.01, 0.90±0.11, and
0.88±0.13 for the three respective scenarios. Such high values mean that hierarchies obtained
with
rld
for the 50
plasmodes were all very similar to each other.
3.3.2
MSUPRP
Plasmodes from MSUPRP were constructed by combining known proportions of sequence
reads from pairs of samples, including nonDE as well as potentially DE transcripts across
individual. We expect
that dendrograms cluster the samples according to the known proportions
of shared reads as presented in the reference dendrogram in
Figure
18
c (see
Table
4
in
supplementary material
with the corresponding ref
erence dissimilarity matrix).
Figure
18
a
-
b
present the typical dendrograms obtained for plasmode datasets using
rnr
and
poi
, which are
representative examples of the 8 dissimilarity metrics. Dendrogram
based on the Poisson
dissimilarity (
Figure
18
a) clustered the original samples A, B
,
and C and their synthetic
combinations in one group, and original samples E and D and their synthetic combinations in a
distinct group. The hierarchical structure of each of these two groups represented the degree of
shared reads between samples by joinin
that shared ½ of reads. Additionally, the separation between samples {A, B, C} and {D,E} agreed
with positions along the most important dimension (dim1) in
Figure
15
. In contrast, dendrogram
based on
rnr
(
Figure
18
b)
did not cluster samples according to the anticipated configur
ation.
Comparison of hierarchies between clusters constructed from
rnr
and
poi
dissimilarities for all
plasmodes presented low mean and high standard deviations (0.44±0.22) of correlation between
cophenetic matrices (
Figure
19
).
58
Figure
18
Typical dendrograms obtained for plasmode datasets from MSUPRP experimental
data with two dissimilarity measures
Dendrograms using complete linkage based
on (a) Poisson dissimilarity (
poi
),
(b) Euclidean
distance calculated from raw normalized data (
rnr
), and (c) reference dissimilarity based on
maximum proportion of shared reads. Original samples are labeled with 3 same letters (AAA,
BBB, CCC, DDD or DDD),
synthetic samples are labelled with 2 or 3 letters symbolizing the
clustered original samples A, B and C and their synthetic combinations in one group, and ori
ginal
samples E and D and their synthetic combinations in another group. The hierarchical structure of
each of these two groups represented the degree of shared reads between samples by joining
red ½ of reads. Dendrogram (b)
did not cluster samples according to the expected configuration.
59
Figure
19
Agreement between dissimilarity measures using MSUPRP plasmode datasets
The matrix contains means (upper triangle) and
standard deviations (lower triangle) of correlation
between cophenetic matrices of dendrograms (N= 50 plasmode datasets) for eight dissimilarity
measures: Euclidean distances using raw count data (
raw
), Euclidean distances using normalized
samples (
rnr
), E
uclidean distances using variance stabilizing transformation (
vsd
), Euclidean
distances using regularized logarithm (
rld)
1
-
Pearson correlation using raw counts (
pea
), 1
-
Pearson correlation using counts transformed by logarithm of raw counts +1 (
plg
), an
d 1
-
Spearman correlation using raw counts (
spe
). We identified the same three sets of dissimilarity
measures described before: 1)
raw
, 2)
rnr
and
pea
, and 3)
poi
,
rld
,
vsd
,
plg
and
spe
.
Correlations between hierarchies for the eight dissimilarity measures are summarized in
Figure
19
. It contains mean and standard deviations, in the upper and lower tr
iangle respectively,
of correlations between cophenetic matrices computed on 50 plasmode datasets. As observed in
the three scenarios for the Bottomly experiment, dissimilarity measures can be apportioned to
three groups: 1)
raw
, 2)
rnr
and
pea
, and 3)
poi
,
rld
,
vsd
,
plg
and
spe
. Dendrograms from
raw
did
not agree with dendrograms from other groups. Hierarchies from dissimilarity measures in group
2 presented a medium correlation of cophenetic matrices with high variability (0.69±0.22).
Dendrograms from t
he dissimilarity measures in group 3 presented high correlation values with
each other (>0.98) and low variation (<0.01).
60
The correlation of hierarchies within each of the dissimilarity measures (
Table
2
, column 5)
was low for
raw
,
rnr
and
pea
, whereas clusters were much more consistent (
r
>0.98) for
poi
,
rld
,
vsd
,
plg
and
spe
dissimilarities. Additionally, dissimilarity measures
raw
,
rnr
, and
pea
were poorly
correlated with the reference dissimilarity (0.57±0.07, 0.53±0.07, and 0.51±0.04, respectively)
while
poi
,
rld
,
vsd
,
plg
, and
spe
were highly correlated with the reference dissimilarity
(r>0.8±0.001, see
Table
6
in supplementary material
). Consequently, dissimilarities raw,
rnr
, and
pea
did not resemble the expected sample structure and resulted in dendrograms that were very
inconsistent over repeated samp
ling of the same dataset. In contrast, dissimilarities
poi
,
rld
,
vsd
,
plg
, and
spe
maintained the sample structure and produced highly reproducible results in
hierarchical dendrograms.
3.4
Discussion
Hierarchical cluster analysis is one of the most used techniques for exploring expression
patterns in sequencing data
(Liu and Si, 2014)
. In this paper, we showed how to assess the
adequacy of dissimilarity measures for clustering samples from RNA
-
seq experiments by
generat
ing plasmode datasets from experimental data.
Plasmode datasets are useful alternatives to parametric simulations for assessing statistical
methodologies as data are generated on more realistic conditions and do not depend on a specific
parametric model
(Gadbury et al., 2008)
. The algorithm used to build a plasmode dataset
depends on the characteristics of the experimental data and the objective of the study. We
presented two examples on how to build plasmode datasets from tw
o experiments with different
conditions.
61
The Bottomly dataset had an experimental design with two main sources of variation and used
highly inbred individuals. Such context allowed us to generate plasmode datasets with known
proportions of differentiall
y expressed transcripts (
Figure
14
) and focused on assessing the
adequacy of dissimilarity measures in recovering the main sources of variation in the
hierarchical
structure (
Figure
16
). Analogous plasmode generation algorithms have been used with different
objectives, for example to validate differ
ential expression methods for RNA
-
seq
(Reeb and
Steibel, 2013)
, microarray analysis
(Gadbury et al., 2008)
, and qPCR
(Steibel et al., 2009)
, but
this is the first time that they are used to assess the properties of sample
-
based clustering. These
procedur
es are by no means exhaustive of the possible ways of creating plasmodes for clustering.
For instance, the algorithm presented in
Figure
14
preserves t
he correlation among genes
(Reeb
and Steibel, 2013)
when generating plasmodes, but other sampling strategies could purposefully
select groups of genes with speci
fic correlation patterns. For instance, instead of sampling from
DE and nonDE groups, transcripts could be sampled from blocks of co
-
expressed genes, resulting
in more realistic datasets especially if the study is focused on gene
-
based clustering and co
-
ex
pression analysis
(Si et al., 2014; Rau et al., 2015)
.
Diffe
rent from the Bottomly dataset, the MSUPRP did not present any experimental treatment,
and individual characteristics were more important. Under these circumstances, we built
plasmodes creating synthetic individuals, by combining known proportions of read
counts from
original individuals and we evaluated the adequacy of dissimilarity measures in resembling the
different mixture proportions in the hierarchical sample structure (
Figure
18
). A similar plasmode
generation algorithm was proposed
(Vaughan et al., 2009)
to evaluate admixture estimation
originates from different founding populations, but using SN
P genotypes instead of sequence
read counts.
62
Although the utility of plasmode datasets has been recently highlighted in RNA
-
seq studies
(Reeb and Steibel, 2013; Zhou et al., 2014)
, only parametric simulations or exemplar experimental
datasets have been used to compare dissimilarity measures and clustering methods
(Witten,
2011; Ma and Wang, 2012)
. While extremely useful, parametric simulations are often criticized
as being too simplistic to appropriately
capture the complexity in gene expression data
(Gadbury
et al., 2009)
, thus limiting the scope and validity of the resulting conclusions. On the other hand,
using a single exemplar dataset with unknown properties is not an appropriate approach for
comparing statistical methods
(Mehta et al., 2004)
. As a partial sol
ution to these limitations, in this
paper we show that plasmodes can supplement the evaluation of clustering algorithms by
including agreement and consistency measures based on datasets that mimic read count
distributions more realistically. One likely cr
iticism of plasmode is that the results may heavily
depend on the original dataset
(Reeb and Steibel, 2013)
. However, this does not invalidate their
use. Moreove
r, as shown in this paper, using two very different datasets, some properties of
alternative metrics remain consistent, which encourages the use of the plasmode generation
methods presented here using alternative datasets.
We built plasmodes to evaluate al
ternative dissimilarity measures. The selected dissimilarity
measures allowed i) the comparison of traditional dissimilarity measures and dissimilarity
measures based on discrete count distributions specifically proposed for RNA
-
seq, and ii)
studying the
effect of normalization and transformation prior to computing dissimilarities.
The Euclidean distance or the Pearson correlation based dissimilarity computed after
transforming data is a routine method adopted from microarray gene expression analysis
(Jiang
et al., 2004; Liu and Si, 2014)
. In fact, the Pearson correlation based dissi
milarity is equivalent to
squared Euclidean distance of standardized data
(Hastie et al., 2009)
. On the other hand, the
most common transformations used for RNA
-
seq are the logarithm of counts, or logarithm of
counts plus a constant
(Severin et
al., 2010)
, but other variance stabilizing and regularized
63
logarithm transformation functions have been proposed to model the mean
-
variance relationship
of RNA
-
seq counts
(Anders and Huber, 2010; Law et al., 2014; Love et al., 2014b)
. The
Spearman correlation based dissimilarity uses the rank of the read count instead of the counts
themselves to compute correlation; co
nsequently it could be applied without transforming the
data. Although the use of Spearman correlation based dissimilarity has been discouraged for
gene
-
based clustering of RNA
-
seq data
(Liu and Si, 2014)
, because it uses a small number of
grouped samples to compute ranks,
we have used it for sample
-
based clustering where the
number of genes is potentially large enough to obtain more precise ranks. Finally, the Poisson
dissimilarity
(Witten, 2011)
was specifically proposed for clustering of sequenc
ing data based on
a Poisson log
-
linear model of normalized counts, and thus, it is a natural candidate to be included
in this comparison.
The eight evaluated dissimilarity measures presented a common pattern of agreement and
consistency in recovering the
expected sample structure for both plasmode datasets. Dissimilarity
measures with high level of agreement between them
correlations between cophenetic matrices
with high mean
and low standard deviation (Figures
Figure
17
and
Figure
19
)
produce
dendrograms with very similar hierarchical structures. However, if dissimilarity measures have
correlations of cophenetic matrices with either low mean or high standard deviation (
F
igures
Figure
17
and
Figure
19
), they genera
te dendrograms with different hierarchical structures. In
addition, correlation between dendrograms obtained with a particular dissimilarity measure
summarizes the consistency of such dissimilarity measure. If a dissimilarity measure has a within
cophenet
ic correlation with high mean and low standard deviation, it consistently generates
similar dendrograms.
To assess the adequacy of a dissimilarity measure, both agreement and consistency are
important. We showed this with
poi
,
rld
,
vsd
,
plg
and
spe
dissim
ilarities, which presented similar
level of agreement
in both datasets (Figures
Figure
17
and
Figure
19
). However, dendrograms
64
based on these dissimilarity measures were consi
stent in the MSUPRP dataset, but showed
different consistency under the three scenarios in the Bottomly datasets (
Table
2
). A counter
example is dissim
ilarity measures
rnr
and
pea
that agreed with each other but were very
inconsistent. This means that
rnr
and
pea
tended to reproduce similar clusters on each plasmode,
and wide range of dendrograms structures. This has not been reported before, because
co
nsistency of clustering under repeated sampling has not been studied in previous works. But a
reason for the agreement is that both
rnr
and
pea
are focusing on the same features, because
they are essentially normalized Euclidean distances. The reason for t
he low consistency could be
that these measures are expected to behave well with heterogeneous approximately Gaussian
data, but not too well with extremely non
-
Gaussian data. On the other side, the measure
raw
is
expected to be better suited for homogenous
Gaussian data (all variances are of similar
magnitude).
As mentioned before, we used plasmode to study the effect of normalizing data. Normalization
is an essential data processing step in RNA
-
seq analysis that aims at removing systematic biases
in order
to make consistent comparisons within and between samples
(Dillies et al., 2012)
.
Although several methods
(Bullard et al., 2010; Anders and Huber, 2010; Robinson and Oshlack,
2010)
have been proposed to normalize data, especially for differential expression analysis, the
impact of a particular normalization method seem
ed to be less important in classification and
clustering analyses
(Witten, 2011)
. We confirmed this, showing that normalizing counts to equal
library sizes was not enough to capture the natural structure of samples when it was th
e only
transformation applied. For instance, dissimilarity measures
raw
and
rnr
had low agreement with
dissimilarity measures that resemble better the true structure of data, e.g.
rld
,
vsd
,
spe
or
plg
or
poi
(
Figures
Figure
17
and
Figure
19
).
Accounting for the discrete nature of read counts in RNA
-
seq data is the most important issue
to consider when computing dissimilarity measures. For instance, Euclidean distance and
65
Pearso
n correlation based measures are known to be influenced by scale, skewness and outliers,
thus, they may not work well for count data
(Liu and Si, 2014)
. In support of this, we found that
dissimilarity measures
raw
,
rnr
,
pea
, based directly on counts, regardless of normaliza
tion or
standardization, did not resemble the expected dendrogram and were generally inconsistent.
Dendrograms obtained with Pearson based correlation resembled the expected structures only
when data were previously log
-
transformed. However, we found that
the Spearman correlation
based dissimilarity measure (
spe
) was suitable to represent the natural structure of samples even
without normalizing data, possibly because it preserves the relative rank relationships, and it is
less influenced by skewness and ou
tliers
(Kendall and Gobbons, 1990)
when it is based on a
large number of genes. The variance stabilizing (
vsd
) and regularized logarithm (
rld
) approaches
consistently retrieved the expected dendrogram structure. Both t
ransformations model the mean
-
variance relationship across all genes to stabilize the variance of counts across samples
(Anders
and Huber, 2010; Love et al., 2014b)
. The regularized logarithm transformation also accounts for
variation in sequencing depth across samples
(Love et al., 2014b)
. Both functions have been
suggested as appropriate transformations for c
lustering and classification of RNA
-
seq data with
less ambiguous results in hierarchical clustering than using simply log
-
transformed counts
(Love
et al., 2014b)
. Finally, directly using the Poisson dissimilarity (
poi
) generated dendrograms with
the
expected structure. This is not surprising considering that read counts are usually assumed
to fit over dispersed Poisson distributions
(Pachter, 2011)
. Similarly, Witten
(Witten, 2011)
obtained dendrograms with lower clustering error rates when using the Poisson dissimilarity
rather than
vsd
or Euclidean distances on normalized data, but using overdispersed Poisson
simulation
s. Our results are encouraging because we did not use a parametric model to produce
similar outcomes.
Sample
-
based hierarchical cluster analysis can be used as a tool to present results after
differential expression analysis or it can be us
ed as an explorative technique for finding patterns
66
in data. In the first approach only informative genes, i.e. differentially expressed genes (called
signal in data mining literature) are used while in the second approach informative as well as non
inform
ative genes (also known as noise) are utilized
(Jiang et al., 2004)
. As the signal
-
to
-
noise
ratio (proportion of DE to nonDE genes) is usually less than 1:10
(Jiang et al., 2004)
, particular
methods are applied to diminish the influence of non informative genes that can degrade the
reliability of clustering resul
ts
(Jiang et al., 2004)
. In R
NA
-
seq analysis, cluster analysis is
commonly applied only to differentially expressed genes or a subset of them
(Liu and Si, 2014)
.
We have assessed dissimilarity measures under scenarios that include not only a set of
differentially expressed transcripts but we also combi
ned differentially and non differentially
expressed transcripts (signal
-
to
-
noise ratio 1:9 and 1:4), as well as a mixture of individuals. We
found that
rld
,
vsd
,
plg
,
spe
and
poi
were highly consistent under all scenarios with a tendency to
diminish consis
tency as the number of non informative genes increases. Although we focused
our comparison on the effect of dissimilarity measures on hierarchical clustering results, the same
plasmodes could be used to investigate the effect of other decisions made when p
erforming
sample
-
based clustering as the selection of the hierarchical clustering algorithm per
-
se or even
the effect of pre
-
filtering transcript according to their level of expression
(Rau et al., 2013;
Bourgon et
al., 2010; van Iterson et al., 2010)
.
We did not explore those aspects of sample
clustering, but their investigation will be facilitated by the plasmode building strategies described
in this paper.
3.5
Conclusion
Generating plasmode datasets from experimental data is a reliable tool for evaluating
dissimilarity measures in agglomerative hierarchical cluster analysis of RNA
-
seq data. Depending
on the characteristics of the available datasets, several scenarios can b
e established to compare
67
dissimilarity measures upon a broad spectrum of more realistic conditions than using other
simulation approaches. Similar methodologies can be applied to study gene
-
based clustering as
well as other clustering analysis methods.
Exp
lorative sample
-
based hierarchical clustering of RNA
-
seq data needs as an input a
dissimilarity matrix that accounts for the mean
-
variance relationship of the discrete nature of read
counts. Euclidean distance calculated either on data that have been previ
ously logarithm
-
transformed or regularized with more complex
ad hoc
functions, as well as model
-
based
dissimilarity for RNA
-
seq data, were consistent in reproducing the expected sample structure in
hierarchical dendrograms.
68
APPENDIX
69
Sample preparation, sequencing and mapping
The MSUPRP dataset corresponds to 24 samples of
longissimus
dorsi
muscle extracted from
24 F2 females selected from the MSU Pig Resource Population
(Steibel et al., 2011)
. Total RNA
was obtained using TRIzol reagent (Invitrogen/Life Technologies, Ca
rlsbad, CA, USA), and RNA
reverse transcribed into cDNA, fragmented and labeled to generate 24 barcoded libraries that
were sequenced on an Illumina HiSeq 2000 (100
bp, paired
-
end reads) at the Michigan State
University Genomics Core Facility. Four technical replicates were collected from each library and
arranged in four different lanes of a flowcell allowing up to 12 barcode
s per lane as illustrated in
Figure
20
.
Each library has a tag with the sample number and a colour symbolizing
its barcode. A same set
of 12 tags is repeated 4 times (technical replicates).
The raw read data consisted of 96 pairs of fastq files (4 per sample) containing approximately
15 million short
-
reads (100bp) each. Those fastq files were pre
-
processed
Figure
20
Sequencing layout of 24 barcoded samples
70
using
FASTX
toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) and FASTQC
(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to assess read quality. Then, Tophat
(Trapnell et al., 2009)
was used for mapping the reads to the reference genome (Sus scrofa
10.2.69 retrieved from the Ensembl database) using an index generated by Bowtie2
(Langmead
and Salzberg, 2012)
. The aligned records were stored in BAM/SAM format
(Li et al., 2009a)
.
Alignment statistics and base coverage was cal
culated for each file using SAMt
ools
(Li et al.,
2009a)
. After that, Cufflinks software
(Trapnell et al., 2012)
was used to obtain gene models and
to merge gene models from all samples and reference annotation. Finally, transcript specific read
counts were estimated using HTSeq
(Anders et al., 2014)
.
Consistently, about 85% of reads were successfully mapped to referen
ce genome. We
detected a total of 26740 transcripts with at least one read aligned. Average coverage per base
across 96 pairs of fastq files was 45.79X.
To obtain the final count matrix, we filtered out transcript with zero expression in all samples
and
merged the 4 technical replicates from each sample. As a result we obtain a count matrix
with 26740 transcripts and 24 samples.
71
Plasmode generation for MSUPRP samples
A singular plasmode dataset comprised 12 samples: 5 original samples, ovals labelled as
{AAA,BBB,CCC,DDD,EEE}, and 7 synthetic samples, circles labelled as
{AAC,BBC,CXB,CCB,DDE,EXD,EED}, obtained by combining know
n proportions of transcripts
Figure
21
Plasmode generation for MSUPRP dataset
72
Reference similarity and dissimilarity matrices for MSUPRP plasmodes
The similarity between two samples (
s
ij
) was calculated as the maximum proportion of original
shared reads.
The dissimilarity between two samples (
d
ij
) was calculated as 1
-
s
ij
.
AAA AAC BBB BBC CXB CCB CCC DDD DDE EXD EED EEE
AAA
1.00 0.66 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
AAC
0.66 1.00 0.00 0.33 0.33 0.33 0.33 0.00 0.00 0.00 0.00 0.00
BBB
0.00 0.00 1.00 0.66 0.50 0.33 0.00 0.00 0.00 0.00 0.00 0.00
BBC
0.00 0.33 0.66 1.00 0.83 0.66 0.33 0.00 0.00 0.00 0.00 0.00
CXB
0.00 0.33 0.50 0.83 1.00 0.83 0.50 0.00 0.00 0.00 0.00 0.00
CCB
0.00 0.33 0.33 0.66 0.83 1.00 0.66 0.00 0.00 0.00 0.00 0.00
CCC
0.00 0.33 0.00 0.33 0.50 0.66 1.00 0.00 0.00 0.00 0.00 0.00
DDD
0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.66 0.50 0.30
0.00
DDE
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.66 1.00 0.83 0.66 0.33
EXD
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.83 1.00 0.83 0.50
EED
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.30 0.66 0.83 1.00 0.66
EEE
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.50
0.66 1.00
Table
3
Reference similarity matrix (S) for MSUPRP plasmodes
AAA AAC BBB BBC CXB CCB CCC DDD DDE EXD EED
AAC 0.34
BBB 1.00 1.00
BBC 1.00 0.67 0.34
CXB 1.00 0.67 0.50 0.17
CCB 1.00 0.67 0.67 0.34 0.17
CCC 1.00 0.67 1.00 0.67 0.50 0.34
DDD 1.00 1.00 1.00 1.00 1.00 1.00 1.00
DDE 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.34
EXD 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.50 0.17
EED 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.70 0.34 0.17
EEE 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.67 0.50 0.34
Table
4
Reference dissimilarity matrix (D) for MSUPRP plasmodes
73
Reference dendrogram and cophenetic matrix for MSUPRP plasmodes
Figure
22
Reference dendrogram for MSUPRP plasmodes
AAA AAC BBB BBC CXB CCB CCC DDD DDE EXD EED
AAC 0.34
BBB 0.87 0.87
BBC 0.87 0.87 0.63
CXB 0.87 0.87 0.63 0.26
CCB 0.87 0.87 0.63 0.26 0.17
CCC 0.87 0.87 0.63 0
.50 0.50 0.50
DDD 1.00 1.00 1.00 1.00 1.00 1.00 1.00
DDE 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.64
EXD 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.64 0.26
EED 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0
.64 0.26 0.17
EEE 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.64 0.50 0.50 0.50
Table
5
Cophenetic matrix of the reference dendrogram for MSUPRP plasmodes.
74
Correlation between dissimilarity measures and reference dissimilarity for the
MSUPRP plasmodes
Table
6
Correlation between dissimilarity measures and reference dissimilarity
Dissimilarity
Correlation
with
reference
dissimilarity
raw
0.57 (0.075
)
rnr
0.53 (0.069
)
rld
0.83
(0.0
0
1)
vsd
0.82
(0.0
0
1)
pea
0.51 (0.045
)
plg
0.82
(0.0
0
1)
spe
0.81
(0.0
0
1)
poi
0.81
(0.01)
Mean and standard deviation of correlation between each of the eight dissimilarity matrices and
the reference dissimilarity for MSUPRP plasmodes (N=50 plasmode datasets). Euclidean
distances using raw count data (
raw
), Euclidean
distances using normalized samples (
rnr
),
Euclidean distances using regularized logarithm (
rld
), Euclidean distances using variance
stabilizing transformation (
vsd
), 1
-
Pearson correlation using raw counts (
pea
), 1
-
Pearson
correlation using counts transfo
rmed by logarithm (plg), and 1
-
Spearman correlation using raw
counts (
spe
), and Poisson dissimilarity (
poi
). Distances measures
raw
,
rnr
, and
pea
poorly
preserved the expected sample structure (r<0.57) while
poi
,
rld
,
vsd
,
plg
, and
spe
highly preserved
th
e expected sample structure (r>0.8).
MSUPRP dataset
. This file contains raw count data (6 columns x 25798 rows) for the five
animals used to generate the MSUPRP plasmode datasets.
Available online
at PLoS ONE
website
.
75
Chapter
4
Assessing genotype cal
l
accuracy from RNA sequencing data
Assessing genotype call accuracy from RNA
sequencing data
76
4.1
Introduction
RNA sequencing technology (RNA
-
seq)
(Wang et al., 2009)
has become the platform of
choice for gene expression
profiling
(Oshlack et al., 2010)
. In addition to quantifying total gene
expression measures, RNA
-
seq experiments can be used for calling SNP genotypes to
investigate allele
-
specific expression (ASE)
(Pirinen et al., 2015; Steibel et al., 2015)
, and for
studying
RNA
-
editing
(Ramaswami et al., 2013)
.
Ano
ther proposed use of genotypes called from
RNA
-
seq data is for performing population genetics studies in nonmodel organisms
(De Wit et al.,
2012; Schunter et al., 2013; Lemay et al., 2013; Yu et al., 2014)
and for performing genetic
associations with phenotypes
(Cánovas et al., 2013)
. These experiments, which study and
quantify variation in transcripts nucleotide sequence rely on accurately calling genomic variants
and genotyping individuals. Moreover, calling variants from RNA
-
seq data poses
challenges that
are not present in variant calling from DNA sequence
(Piskol et al., 2013; Quinn et al., 2013)
. For
instance, alternative splicing
(Piskol et al., 2013; Quinn et al., 2013)
and ASE
(Steibel et al., 2015)
may substantially complicate genotype calling from RNA
-
seq data. Another difference between
RNA
-
seq and DNA
-
seq data
is that the sequence coverage of
certain genes across samples may
differ substantially, due to gene
-
specific
and subject to subject variation in gene expression.
Despite these challenges
, identification of genetic variants and calling genotypes from RNA
-
seq data is commonly performed using tools developed for whole genome sequencing (WGS) of
DNA samples
(Altmann et al., 2012; Nielsen et al., 2011)
. Consequently, the perfo
rmance of such
algorithms have to be studied separately for RNA
-
seq experiments.
Assessing the performance of variant calling programs from RNA
-
seq data is usually done by
comparing the called genotypes to a gold standard such as genotypes from a SNP chip
(Djari et
al., 2013; Cánovas et al., 2013)
or from WGS
(Piskol et al., 2013; Quinn et al., 2013)
and
estimating genotyping accuracy. But in the context of calling SNP genotypes from RNA
-
seq data,
77
estimating genotyping
accuracie
s separately for
heterozygous and homozygous is important due
to the different use of those genotypes. For instance: false heterozygous genotypes (a true
homozygous site that is called heterozygous in RNA
-
seq data) will likely lead to biases in
estimating
ASE
(Steibel et al., 2015; Pirinen et al., 2015)
and they w
ill lead to false positives in
RNA editing studies. On the other side, false homozygous genotypes (true heterozygous sites
called homozygous) will likely result in lack of power to detect ASE and RNA editing. In general,
notable exceptions
(Quinn et al., 2013)
, this issue has been ignored, and usually omnibus
genotype calling
accuracies or error
rates are defined and reported.
Calling reliable SNP genotypes may have other applications such as conducti
ng population
genetics studies for estimating effective population size, studying population substructure, and
performing genetic associations with ecologically or economically important phenotypes. In those
cases, estimating SNP
-
specific genotyping call
accuracies and error rates may not be relevant.
For those cases it is more important to correctl
y estimate the relatedness between
individual
s
.
For instance, reconstructing genomic relationship matrices
(VanRaden, 2008)
with SNP
s
called
from RNA
-
seq data and comparing
them
to a reference relationship matrix would definitely be
more infor
mative for animal breeding applications
(Habier et al., 2007)
, as well as for population
history and demography studies of nonmodel organisms
(Ekblom and Galindo
, 2011)
.
In this work, we compared sensitivity and false discovery rates for SNP calling using two well
-
established variant calling programs
BCFtools
(Li, 2011)
, and VarScan2
(Koboldt et al., 2012)
.
We also used Beagle
(Browning and Browning, 2007)
to impute genotype calls from BCFtools,
as previously recommended
(Nielsen et al., 2011; Li et al., 2013)
.
We apply these algorithms to
RNA
-
seq data from a pig resource population that has been extensively genotyped and
phenotyped. We report homozygous and heterozygous genotype call rates and we estimate
error
rates by comparing called genotypes against genotypes obtained from SNP chip array and DNA
sequencing.
78
4.2
Material and Methods
4.2.1
Resource population
The Michigan State University pig resource population (MSUPRP) is an F
2
cross originated
from 4 F
0
Duroc sires and 16 F
0
Pietrain dams. The full pedigree includes 20 F
0
, 56 F
1
and 954
F
2
animals. This population has been described in detail elsewhere
(Edwards et al., 2008)
.
For
this study we selected 24 F2
fe
males that presented extreme
phenotypes in loin eye area
(Steibel
et al., 2011)
.
Total RNA was extracted from
longissimus dorsi
muscle of the 24 F2 females using TRIzol
reagent (Invitrogen/Life Technologies, Carlsbad, CA, USA), and RNA quantity and quality were
determined using an Agilent 2100 Bi
fragmented and labeled to generate 24 barcoded libraries that were sequenced on an Illumina
HiSeq 2000 (100 bp, paired
-
end reads) at the Michigan State University Genomics Core Facility.
Four tech
nical replicates were collected from each library and arranged in four different lanes of
a flowcell allowing up to 12 barcodes per lane. The raw read data consisted of 96 pairs of fastq
files (4 per sample) containing approximately 15million short
-
reads (
100bp) each. Those fastq
files were pre
-
processed using FASTX toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) and
FASTQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to assess read quality.
Then, Tophat
(Trapnell et al., 2009)
was used for mapping the reads to the reference genome
(Sus scrofa 10.2.69 retrieved from the Ensembl database) using an index generated by Bowtie2
(Langmead and Salzberg, 2012)
. The aligned records were stored in BAM/SAM format
SNP
chip genotypes were obtained with the SNP60K Illumina chip
(Gualdrón Duarte et al.,
2013)
and used as gold standard to compare genotypes called from mRNA sequencing data.
79
In addition to the SNP
chip genotypes and muscle tissue RNA
-
seq
data, genomic DNA
sequence,
liver
and fat tissue RNA
-
seq data were
available for one of the 24 animals. Genomi
c
DNA was purified from white blood cells using the
Invitrogen Purelink Genomic DNA Mini Kit.
Library preparation was performed with the Illumina TruSeq Nano DNA Library Preparation Kit
HT and sequenced on 2 lanes of an Illumina HiSeq 2500 Rapid Run flow c
ell and sequenced in
a 2x150bp (PE150) configuration Rapid SBS reagents. Reads were trimmed with Condetri
(Smeds and Künstner, 2011)
and aligned using Bowtie 2
(Langmead and Salzberg, 2012)
. For
this animal, RNA
-
seq was performed
slightly
differently from the other 23 pigs. Total RNA was
extracted from subcutaneous fat, liver and
longissimus dorsi
muscle tissue. The samples were
prepared for seq
uencing using the Illumina TruSeq Stranded mRNA Library Preparation Kit. After
validation and quantitation the libraries were pooled and loaded on both lanes of an Illumina HiSeq
2500 Rapid Run flow cell (v1). Base calling was performed by Illumina Real Ti
me Analysis (RTA)
v1.18.61 and output of RTA was demultiplexed and converted to FastQ files with Illumina
Bcl2fastq v1.8.4.
Reads were trimmed with Condetri
(Smeds and Künstner, 2011)
and aligned
to the reference
genome (Sus scrofa 10.2.69) with Tophat
(Trapnell et al., 2009)
. The BAM files were merged and
unique alignments were e
xtracted and separated by strand.
4.2.2
Variant calling
First, base alignment files were obtained using the mpileup command of SAMtools
version
1.0
(Li et al., 2009a)
using options
-
Q20
-
C50
-
B of mpileup to eliminate reads with
quality scores
bel
ow 20,
to
downgrade mapping quality for reads with exces
sive mismatches, and to di
sable
the
calculation of base realignment quality in order to diminish false SNPs generate
d
by
misalignments. For the single animal with mRNA and DNA sequence data
, reads with quality
80
scores bel
ow 25 were filtered and the coefficient for downgrading mappin
g quality with excessive
mismatches was applied, but base realignment quality was allowed (option
-
Q25
-
C50
-
E
). These
options have been used before for the study of study RNA editing
(Chen et al., 2014)
.
The multiallelic calling version of
BCFtoo
ls
was used (bcftools
call
-
vm) to call genotypes,
w
hile to call SNP with VarScan2, the mpileup2snp command was used with default settings (a
minimum of 8 reads to make a call, with at least 2 supporting reads to call a variant, a minimum
average ba
se quality of 15, a
nd a p
-
value threshold of 0.01)
. Finally, Beagle
(Browning and
Browning, 2007)
was used to refine genotype calls from BCFtools using genotype imputation.
Bea
gle used as input vcf files obtained by BCFtools. We used default parameters for Beagle
version 4.0 (java
-
Xmx4000m
-
jar beagle.r1398.jar gl=bcfvariants).
4.2.3
Assessing genotype call accuracy
SNP genotypes obtained with each variant calling program were compared to genotypes
obtained from a DNA
SNP60K Illumina chip
to estimate genotype call rates. For each SNP
genotyped with both technologies, we built a contingency table (
Table
7
) and we computed
several genotyping accuracy and error measures. For instance: true homozygous called
homozygous as well as true heterozygous called heterozygous contribute to the accuracy of
the
variant calling method and they are the basis to estimate sensitivity. On the other hand, true
homozygous called heterozygous and true heterozygous called homozygous contribute to
genotyping errors.
81
Table
7
Contingency table w
ith classification rule used to compute sensitivity and error rate of
genotype calling
RNA
-
seq
Variant
calling
Genotype
SNP
c
hip Genotype
Total
Heterozygous
Homozygous
Heterozygous
True He
1
False He
2
Called He
3
Homozygous
False Ho
4
True Ho
5
Called Ho
6
Non Called
NC He
7
NC Ho
8
NC
9
Total
He
10
Ho
11
1
Number of he
terozygous genotyped in the SNP
chip called heterozygous from RNA
-
seq;
2
Number of
homozygous genotyped in the SNP
chip called heterozygous from RNA
-
seq;
3
Total Number of called heterozygous from RNA
-
seq;
4
Number of he
terozygous genotyped in the SNP
chip called homozygous from RNA
-
seq;
5
Number of homozygous
genotyped in the SNP
chip called homozygous from RNA
-
seq;
6
Total Number of called homozygous from
RNA
-
seq;
7
Number of het
erozygous genotyped in the SNP
chip non
-
called from RNA
-
seq;
8
Number of
homozygous genotyped in the SNP
chip non
-
called from RNA
-
seq;
9
Total number of non
-
called genotypes from RNA
-
seq;
10
Total number of he
terozygous genotyped in
the SNP
chip
11
Total number of
homozygous genotyped in the SNP
chip
To assess the performance of the variant calling methods, we computed true genotype
rates (sensitivity) and false discovery rates (FDR) for each genotype as:
82
4.2.4
Equivalence between relationship matrices and correlation with SNP chip data
The marker
-
based rel
ationship matrix among individuals
(VanRaden, 2008)
was calculated
using genotypes
from the 60K SNPchip (
) and compared to the marker
-
based relationship
using genotypes from RNA
-
seq
(
)
. In order to compute
we selected SNPs called from
RNA
-
seq or from the chip and built a genotype matrix
containing
standardized allelic dosages
, where
are the allelic counts of the reference allele and its expectation
and variance are computed assuming Hardy Weinberg equilibrium across SNP
(VanRaden,
2008)
. Finally, following Van Raden
(
2008)
,
To generate
,
we used SNP called by BCFtools with at
least 240 reads per base
across the 24 animals. This level of total depth ensure
d
that both heterozygous and homozygous
genotypes were accurately called (see results section). Additionally, only monomorphic (minor
allele frequency > 0.0) sites in autosomic
chromosomes with genotyping data in all 24 samples
were considered. After applying all the filters a total of 125
,
277 SNPs were used to compute
the
. Finally, the maximum correlation between SNPs in the SNPchip and SNPs within 1
Mb distance
in RNA
-
seq data were calculated to assess the redundancy of the genotype set
obtained with RNA
-
seq with respect to the SNP set in the 60K chip.
83
4.3
Results
4.3.1
Analysis across variant calling programs for 24 animals
Sensitivity curves presented similar p
atterns for heterozygous and homozygous genotypes,
but with specific quantitative differences.
Figure
23
a
-
b
shows the sensitivity for calling
heterozygous and homozyg
ous genotypes, respectively, for three variant calling programs as a
function of inverse cumul
ative read depth. Sensitivity for
calling heterozygous genotypes
was
higher wit
h BCFtools compared to VarScan2
for sequencing depth below 1000 reads per base
(across all 24 samples). Using Beagle to impute missi
ng genotypes called by BCFtools
provided
a slight increment in sensitivity o
f heterozygous genotype calling
at the expense of FDR as we
describe later. Reg
ardless of the variant calling program, heterozygous genotypes were called
with sensitivity > 94% when base read depth was equal or higher than 1000. When calling SNP
genotypes at homozygous sites, the sensitiv
ity was usually higher than
corresponding to
h
eterozygous sites, especially for sites covered by less than 1000 reads. Moreover, the reported
advantage in sensitivity of BCFtools compared to VarScan2 for calling
genotypes at heterozygous
sites
was still present but it was lower for homozygous sites. C
onsistently, imputing non
-
called
genotypes using Beagle improved the sensitivity of homozygous calling with respect to BCF
tools
even more markedly than for heterozygous sites. Similarly to heterozygous, calling homozygous
in s
ites covered by over 1000 read
s
re
sulted in sensitivity > 98%,
regardless of the variant calling
program.
84
Figure
23
Assessing SNP calling accuracy of RNA
-
seq samples of
longisimus dorsi
Sensitivity analysis for heterozygous
(A)
and
homozygous
(B)
. False discovery rate for
heterozygous
(C),
and homozygous
(D)
.
Raw read depth in the x axis is inversely cumulative in
the log
10
scale. Varcalling software correspond to: bcf=BCFtools, bea=Beagle, and
vsc=VarScan2
False discovery rates curves presented
more distinct patterns between heterozygous and
homozygous genotypes.
Figure
23
c
-
d
shows the FDR for heterozygous and homozygous calls,
respectively, for each of the
three variant calling programs as a function of inverse cumulative
read depth. Heterozygous sites called with VarScan2 were consistently correct regardless of the
85
coverage depth (FDR very close to 0). However this comes as a price of very low sensitivity.
For
example, in sites covered by 10 to 100 reads across all 24 samples, the sensitivity of VarScan2
was 56% on average (
Figure
23
a
). Contrastingly, at that same
depth range, FDR with BCFtools
was between 12% and 4%,
but sensitivity was between 50% and 87%. At high sequencing d
epth,
differences in FDR vanish
and all programs presented
FDR < 2% when base read coverage was
equal or higher than 1000. As mentioned befo
re, imputation with low coverage increased
sensitivity at the cost of FDR especially for sites with less than 100 reads per base. For instance,
the use of Beagle increased the FDR from 4% to 21% when coverage dropped below 10 reads
per base. The proportion
of incorrectly called homozygous was generally higher than the
proportion of incorrectly called heterozygous (
Figure
23
d
). Only sites with more than 1000 reads
per b
ase presented a FDR of homozygous below 3%. BCFtools had lower error rates than
VarScan2 for base read depth between 10 and 1000. The use of Beagle increased the FDR for
homozygous particularly for sites with very low coverage (less than 10 reads per base)
, and the
increase in FDR was less than when imputing heterozygous sites.
4.3.2
Comparison of genotype calls from DNA
-
seq and RNA
-
seq from multiple tissues
Comparing SNP calling from RNA
-
seq to DNA
-
seq data, genotypes called from DNA
-
seq data
showed unifo
rmly high sensitivity (sensitivity > 0.99) and
low FDR (2%). Conversely
, error rates
from genotypes called from RNA
-
seq were influenced by sequencing depth. Sensitivity for
heterozygous sites called from RNA
-
seq data varied with base read depth while sensi
tivity for
homozygous sites was always high (>96%) regardless of base read depth (
Figure
24
a
-
b
) and
comparable to the sensitivity of homozygous genotypes called from D
NA
-
seq. Sensitivity of
heterozygous genotype calls using RNA
-
seq for any of the tissues was below 90 % for sites with
less than 10 reads per base and almost constant (94%
-
98%) for sites with more than 10 reads
per base. A coverage of 10 reads per base in o
ne animal is on average equivalent to 240 reads
86
across 24 animals, at which point BCFtools provided a sen
si
tivity of 94% across all samples
(
Figure
23
a
).
False discovery rates for heterozygous sites were below 2% regardless of base read depth
either for DNA or RNA samples. Different patterns presented when analyzing homozygous sites.
For ho
mozygous sites, variant calling using RNA
-
seq incorrectly genotyped a large proportion of
sites (between 59% and 10%)
that were covered by 10 or fewer
reads per base. Homozygous
sites with more than 10 reads per base presented lower false discovery rates f
or RNA
-
seq
samples from
longissimus dorsi
than from liver or fat tissue. FDR for homozygous was always
below 1% when using the DNA sample.
87
Figure
24
Assessing SNP calling accuracy of DNA and RNA
-
seq samples of three tissues from
the same animal
Sensitivity analysis for heterozygous
(A)
and
homozygous
(B)
. False discovery rate for
heterozygous
(C),
and homozygous
(D)
.
Raw read depth in the x axis is in
versely cumulative in
the log
10
scale. Varcalling software correspond to: bcf=BCFtools, bea=Beagle, and
vsc=VarScan2
4.3.3
Equivalence between relationship matrices
Correlation between relationship matrices
and
(based on
SNPchip markers
and based on RNA
-
seq markers, respectively) was 0.99 (
Figure
25
). This value indicated that
88
both matrices were virtually equivalent and either of them could be used to account for the
relationship among the animals
. In
depth analysis of the correlation re
sults showed that
correlation between the diagonals of both relationship matrices was 0.66 and correlation between
off
-
diagonals was 0.95 (
Figure
25
). These results a
re of interest to distinguish the use of the
relationship matrices to compute genomic inbreeding coefficient from the diagonals or to compute
genomic relationships between individuals from the off
-
diagonals
(VanRaden, 2008)
.
An average of 110 SNPs called from RNA
-
seq were within the 1 Mb window defined for ea
ch
SNP in the SNP
60K
chip a
nd only n
ine percent of SNP
s
in the
SNP
60K
chi
p did not have any
SNP called f
r
o
m RNA
-
seq. The median corr
elation between SNPs in the SNP
60K
c
hip and SNPs
called from RNA
-
seq was 0.71 and increased to 0.78 for SNPs in the SNP
60K
chip with at least
one SNP
called from RNA
-
seq within the 1 Mb window. This result implied that on average the
information contained in all the SNPs in the SNP
60K
chip could be replaced by information on
SNP
s
called from RNA
-
seq.
89
Figure
25
Scatterplot of
correlation between relationship matrices for the 24 F2 females
The scatterplot shows a high equivalence (r=0.99) between the relationship matrix based on
SNPchip genotypes (
G
SNPchip
) and the relationship matrix based on SNPs genotypes called from
RNA
-
seq
sites (
G
RNA
-
seq
) with depth >=240 reads per base. Correlation between relationship
matrices for the same animal (diagonal values of the relationship matrices) correspond to values
in the right top quadrant (r=0.99) while correlation between relationship m
atrices for the different
animals (off
-
diagonal values of the relationship matrices) are represented in the left bottom
quadrant (r=0.95)
4.4
Discussion
This work assessed genotype calling from RNA
-
seq and DNA
-
seq data. We compared
genotypes called by two var
iant calling programs to genotypes obtained from a DNA SNP
60K
chip by computing sensitivity and false discover
y
rates separately. Accuracy of SNP calling varied
between programs for different genotypes. Based on these results we make suggestions for
implem
enting genotype calling from RNA
-
seq data, depending on its intended use, for instance
90
when using heterozygous genotypes for allele
-
specific expression studies or when using
homozygous for exploring RNA editing.
Identifying heterozygous SNP is the first s
tep to conduct allele
-
specific expression studies
(Pickrell et al., 2010)
. Heterozygous sites provide inform
ation to classify reads according to their
haplotype transcript of origin and to quantify the allelic expression ratio
(Lee et al., 2013)
. Two
major consequences from genotyping error in ASE are:1) the omission of calling a heterozygous
site
,
or 2) the incorrect assignment of a heterozygous genotype to a true homozygous genotype.
In the case of missing a heterozygou
s call the direct consequence will be a loss in power, while
when mislabeling the site the effect could be biases in estimation of the ASE ratio
(Steibel et al.,
2015)
. Quinn et
al.
(2013)
found that sensitivity for heterozygous varied between 40% to 80% at
coverage depth below 10X per sample, and it reached 90% for coverage depth above 10X per
sample. Similarly, we found that
for the single sample analysis more than 90% of heterozygous
were called in all tissues when depth was above 10 reads per base (
Figure
24
A). For the 24
multiple samples, the same proportion (>90%) was obtained using BCFtools for an equivalent
total depth of 240 reads or more per base across all samples (24 samples x 10X per sample,
Figure
23
a). However, VarScan2 called only 70% of heterozygous sites at the same depth of 240
reads per base. Interestingly, false discovery rates for heterozygous genotypes at 10 reads per
b
ase per sample, or equivalently
a
multi
-
sample total depth of 240 reads per base, were below
2%
.
Consequently, calling heterozygous from RNA
-
seq data with more than 10 reads on average
per base and per sample can provide a good number of sites to conduct ASE studies with
sufficient power a
nd unbiased estimations of the allelic expression rate. Additionally, BCFtools
and Beagle provided a larger set of true heterozygous genotypes (e.g. an average of 40% more
SNP compared to VarScan2), due to increased sensitivity and to more stringent filter
ing criteria
implemented in VarScan2. Yet, if a variant is called heterozygous by VarScan2, it will be likely be
a true heterozygous as the FDR is very low regardless of the depth (Figures
Figure
23
c and
91
Figure
24
c). However, more caution should be considered when using BCFtools and Beagle in
sites with low depth (Figures
Figure
23
c and
Figure
24
c).
Homozygous sites provide an initial list of candidate sites for studying RNA editing
(Li et al.,
2009b; Chen et al., 2014)
. The typical approach to study RNA editing compares the mismatches
between homozygous DNA and RNA sequences
(Peng et al., 2012; Ramaswami et al., 20
12)
from a single individual. Besides identifying the type of editing process, e.g. A
-
to
-
I editing, it is
possible to quantify editing levels in an analogous way as in ASE studies
(Bahn et al., 2012; Lee
et al., 2013)
. Therefore, similar implications as discussed above when analyzing sensitivity and
FDR for
calling heterozygo
us also applies
when analyzing homozygous. Calling genotypes from
DN
A
-
seq is reliable and has been
studied in detail
(Nielsen et al., 2011; Li, 2014; Li et al., 2013;
Li, 2011; Kumar et al., 2014)
.
We confirm
ed previously published results
for instance, when using
DNA
-
seq
,
we found that sensiti
vity for homozygous was > 0.99 (
Figure
24
b) with a corresponding
FDR < 2% (
Figure
24
D) irrespective of depth. Similar values and patterns were found for
heterozygous genotypes (
Figure
24
a
-
b). However, calling genotypes from RNA
-
seq protocols is
subject to diverse sources of false genotyping error, for instance: increased mapping errors due
to the complexity of the transcripto
me compared to the genome
(Bass et al., 2012)
. For that
reason, stringent mapping and filtering options are applied particularly when analyzing nucleotide
variation profiles from RNA
-
seq
(Quinn et al., 2013; Piskol et al., 2013; Lee et al., 2013)
.
In this
work, we found that calling homozygous from RNA
-
seq presented high sensitivity (>95%) in all
tissues and irrespective of depth (
Figure
24
b), but it could result in a high number of false positives
(
Figure
24
D)
. Similar results for
calling
homozygous were found in Quinn et al.
(Quinn et al.,
2013)
. Consequently, depth of RNA
-
seq reads should be considered when analyzing
homozygous sites using RNA
-
seq. This is usually not considered in RNA
-
editing studies where
protocols for including editing sites emphasize requirements in DNA depth but not in RNA
-
seq
depth
(Chen et al., 2014; Lee et al., 2013)
. Adding a depth filter to RNA
-
seq reads could help to
92
lower the number of false posi
tives in addition to other filtering strategies that are currently used
(Chen et al., 2014; Lee et al., 2013)
.
Imputation of genotyp
es using Beagle after calling variants with BCFtools neither improve
sensitivity nor worsen FDR in either heterozygous or homozygous sites covered by more than
100 reads across 24 samples (
Figure
23
). At lower coverage, the effect was detrimental as
imputation provided little increase in sensitivity at the cost of expanding FDR in both genotypes
(
Figure
23
c
-
d). These F2 pigs were sampled from an admixed population originated from two
different founding populations, thus it is expected long range persistence of linkage disequilibrium
(Vaughan et al., 2009)
and heterogeneity of haplotype blocks
(Gualdrón Duarte et al., 2013)
in
comparison to a panmictic population. Consequently, imputation based only on linkage
dis
equilibrium may not result in high imputation accuracy
(Gualdrón Duarte et al., 2013)
. Ho
wever,
if RNA
-
seq data alone or a mixture of DNA
-
seq and RNA
-
seq data were available across multiple
generations, pedigree based imputation could result in a more accurate imputation
(Gualdrón
Duarte et al., 2013)
. We did not explore this aspect in this paper due to small sample size and
lack of sequence genotypes in ancestors of the F2 pigs.
Accurate genotypes obtained by RNA
-
seq
can
be used to estimate genetic relationships
between individuals in a similar way to using genotypes from a SNP
60K
chip (
Figure
25
). Even
though the SNP sets from the chip and the RNA
-
seq analysis did not completely overlap, we
found that they are highly correlated and they represent the relationships among individual in a
very similar way (
Figure
25
). We know from previous analyses that the
SNP
60K
chip represents
the relationships among individuals in this population with high accuracy
(Gualdrón Duarte et al.,
2013)
.
Additionally
,
RNA
-
seq derived SNP
s
may be used to estimate relationships in populations
that are not well re
presented by SNPs in the SNP
60K
chip to avoid problems related to
ascertainment bias. In nonmodel organisms of little agricul
tural interest
SNP
chip
s
are generally
not available. In this situations,
RNA
-
seq data could be used to estimate relationships amon
g
93
individuals
(Seeb et al., 2011; Helyar et al., 2011; Weinman et al., 2014)
or to complement
genotyping by sequencing studies
(Narum et al., 2013)
,
and to improve genome
-
wide association
s
tudies by supplementing randomly selected SNP sets with more likely functional SNP sets
(Cánovas et al., 2013)
.
94
Chapter
5
General d
iscussion
General discussion
95
5.1
Introduction
RNA sequencing (RNA
-
seq)
(Wang et al., 2009)
has emerged as a revolutionary technology
to study transcriptomes. Both quantitative and qualitative aspects are being intensively stud
ied in
a broad spectrum of biological applications, such as gene differential expression analysis and
single nucleotide variation profiling in both model and nonmodel organisms
(Wickramasinghe et
al., 2014; Qian et al., 2014)
. Simultaneously, a number of methods are being pr
oposed for
analyzing RNA
-
seq data in each of those applications (Nookaew et al., 2012). Moreover, methods
are being periodically updated and new algorithms are being continuously published to improve
the analyses (Seyednasrollah et al., 2013; Kvam et al.,
2012). As a consequence, researchers
often find themselves having to decide between competing models and assessing the reliability
of results obtained with a designated analysis program. Choosing the most appropria
te software
can help to get
better results
and to achieve the goals of the investigation
(Mehta et al., 2006)
.
The overarchin
g goal of this dissertation was to propose and implement validation procedures
based on experimental data to estimate the properties of widely used statistical analysis methods
of RNA
-
seq experiments. Using synthetic and reference datasets, I compared stat
istical models
to perform differential expression analysis, sampled
-
base hierarchical cluster analysis, and
variant calling and genotyping
.
5.2
Objectives revisited
1.
To evaluate statistical models for differential expression analysis in RNA
-
seq experiments
Evaluation of statistical models for differential expression analysis has been based on
parametric simulated datasets using count data distributions, such as Poisson and negative
binomial. In this dissertation, I provided a more comprehensively approach us
ing pla
s
modes from
96
experimental datasets as well as parametric simulations. Two methods based on negative
binomial (edgeR and DESeq) presented higher type I error rates than a transformed Gaussian
model (MAANOVA) despite the use of simulations or plasmodes
. The complement between the
tw
o types of reference datasets was
particularly useful when comparing
the
methods. Since
parametric simulation can benefit a model that is based on the same distribution, the use of
plasmode
s
provide
d
an ind
ependent reference
to supplement
the analysis.
Additionally, I presented the comparison of models by exploring the joint null distribution of p
-
values that is expected to resemble a uniform distribution. The MAANOVA program used to fit the
transformed Gaussian model can gene
rate p
-
values using different strategies, such as using
moderated test or transcript
-
by
-
transcirp test. The best approach to compute the p
-
values differed
between datasets. The comparison of the p
-
value distributions helped to decide on the best set
of spe
cifications to use in each datase
t. Usually, this measure of comparison is
not reported when
evaluating the adequacy of a model. However, researchers typically correct p
-
values by FDR as
proposed by Benjamini and Hochber
(Benjamini and Hochberg, 1995)
and such correction may
not be appropriate if the uniform distribution is not supported, thus, leading to wrong decisions
when comparing treatments.
Overall, Gaussian models had p
-
values closer to nominal signifi
cance levels and presented
p
-
value distributions closer to the uniform distribution
. Researchers using models with these
characteristics will have less false positive when distinguishing differentiall
y expressed
transcripts. In addition
, Gaussian transform
ed model can include random effects and hierarchical
structures that arise in complex experiments, such as when collecting data in fisheries or wildlife
studies. For instance, landscape genomics has been proposed as a framework for studying
adaptive and ne
utral genetic variation at the population level in a spatially explicit context
(Joost
et al., 2007)
, however, flexible models that can account for environmental and spatial
autocorrelations are nee
ded
(Schoville et al., 2012)
. Similar to
macroscale biology
applications
,
97
more complex analysis methods are also required at the molecular scale for analyzing gene
families or regulatory networks that need to account for time course and gene
-
set correlations
(Yang and Wei, 2015; Emmert
-
Str
eib, 2013)
.
Additionally, the proposed plasmode algorithm is an alternative to previous
implemented
plasmode
s
. The
algorithm allows incorporation
of
biological variation by making several partitions
from the original set of samples instead of only one a
s implemented in Gadbury et al.
(
2008)
. This
way of creating
plasmodes can be of interest in field studies where the individual variability can
be studied
a priori
to improve experimental designs aspects such as sample size. In conclusion,
I proposed and implemented a general validation method that was applied to specifically compare
count data based models versus a Gaussian transformed data model. Furthermore, the
procedure
allow
ed
to set the best
analysis option for
the Gaussian transformed model
according to
specific
datasets,
and presented other uses of the information generated by the validation process that
can help researchers to plan data collection.
2.
To
assess the properties of dissimilarity measures for agglomerative hierarchical
clustering of RNA
-
seq data
Hierarchical cluster analysis is one of the most used techniques for representing pattern
recognition and representation
of results from
sequencing d
ata
(Liu and Si, 2014
)
. Arguably,
selection of a distance or similarity metrics is one of the most important decision in the
implementation of hierarchical clustering
(Izenman, 2008)
. As discussed for differential
expression methods, the validity of alternatives metrics for hierarchical clustering has been
previously assessed using parametric si
mulations. In t
his dissertation
, I presented two ways for
creating plasmode datasets for assessing the suitability of distance metrics for clustering RNA
-
seq expression data. For that, I used data from two experiments conducted under different
conditions. One experiment
was the result of performing RNA
-
seq from animals in a pre
-
designed
98
experiment based on two inbred strains of mice, while the other experiment consisted in
sequencing the transcriptome of randomly sampled (untreated) individuals from an outbred
population.
The common goal of performing clustering in both datasets was to uncover population
structure due
to
genetic differences in gene expression (differences between breeds of mice or
differences within a crossbred population)
.
In both situations, the proposed
plasmode building
strategy allowed for individual
-
to
-
individual variability and it extend
ed
the validation of cluster
analysis for datasets with different experimental structures, e.g the first dataset is a typical
example of wet lab experiment for studyi
ng the influence of controlled factors, e.g. Smith et al.
(2013)
, whereas the second is a common example of field studies for establishing relationships
among individuals, e.g. Lamichhaney et al.
(2012)
, or for studying evolution of gene differential
expression, e.g. Gu et al.
(2013
)
.
The validation included the comparison of a wide range of dissimilarity measures, including
the one based on the traditional Euclidean dist
ance, correlation
-
based dissimilarities using raw,
transformed and normalized count data, as well as measures specifically developed for discrete
count data. Dissimilarity measures based on non
-
transformed count data (Euclidean and Pearson
correlation), re
sulted in dendrograms that did not resemble the expected sample structure.
Normalization did not help with the use of those measures either. The dissimilarity based on
Spearman correlation was the only correlation
-
based dissimilarity that recovered the nat
ural
sample structure. Dissimilarities calculated using appropriate transformations for count data were
consistent in reproducing the expected dendrograms.
Additionally, the validation procedure proposed two metrics
, agreement and consistency,
for
measurin
g the adequacy of dissimilarity measures
.
Agreement measures the correlation between
dissimilarities
,
then two dissimilarities with high agreement produce dendrograms with similar
hierarchical structures. Consistency measures the correlation between dendro
grams obtained
with a particular dissimilarity
,
then a dissimilarity with high consistency always generates similar
99
dendrograms. These type of objective measures had not been reported before when comparing
dendrograms in RNA
-
seq studies.
Finally, plasmode
s allowed the analysis
to be extended
to different hypothetical scenarios. In
one scenario only differentially expressed genes were used in clustering analysis while in another
scenario all genes were included (differentially and non differentially express
ed). All the
dissimilarities that accounted for appropriate transformations were highly consistent in all
scenarios. This dual use of representing information is useful to explore relationships but has to
be interpreted in the correct context. For instance
, Lamichhaney et al.
(
2012)
presented
philoge
netic trees
using all markers or
using only markers that showed significantly genetic
differentiation when studying local adaptation to salinity levels in Atlantic herring populations
. The
study
failed to uncover the population structure, except when the gene set was
restricted to pre
-
selected genes known for harboring SNP segregating at extreme frequencies among diverse
populations.
All in all, the proposed validation method compared the most common dissimilarity measures
used in hierarchical clustering and dissimila
rity measures potentially appropriate for count
overdispersed data. The validation included a variety of scenarios suitable for RNA
-
seq
applications in biology studies and specifically in fisheries and wildlife applications, and proposed
agreement and cons
istency metrics to objectively compare dendrograms structures.
3.
To compare sensitivity and false discovery rates for SNP genotyping in variant calling
programs
Calling and genotyping variants from DNA
-
seq has been actively studied
(Nie
lsen et al., 2011;
Li et al., 2014, 2013)
and the same programs are being used for calling variant from RNA
-
seq
(Piskol et al., 2013; Lee et al., 2013)
. In
this thesis I compared three well known programs, namely
100
BCFtools, Beagle and VarScan2, in terms of sensitivity and false discovery rate when genotyping
heterozygous and homozygous sites using as reference genotypes obtained from a 60K SNPchip.
Additional
ly, the comparison considered multi samples and multi tissue scenarios, as well as
studied the influence of read depth. Calling heterozygous with more than 10 reads per base and
per sample provided sensitivity of 70% for VarScan2 and more that 90% for BCFt
ools and Beagle
with low FDR in all programs. Homozygous were called with higher sensitivity in all tissues and
samples irrespective of depth but presented higher FDR.
Distinguishing the sensitivity and false genotyping rate of homozygous from heterozygou
s
sites separately allows using the information for designing analysis for differ
ent applications such
as allele
-
specific expression and RNA edi
ting. This type of study is
emerging in fish and wildlife
populations. For instance, Wang et al.
(
2013)
used RNA
-
seq to study differential
expression
analysis and allele
-
specific expression related to disease resistance against enteric septicemia of
catfish. Furthermore, other uses of SNP genotyping from RNA
-
seq are expected to increase in
the future. In nonmodel organisms, SNP genotyping will be relevant for ecological an
d
evolutionary studies and population genetics
(Seeb et al., 2011; Helyar et al., 2011)
. For instance,
Weinman et al.
(
2014)
used RNA
-
seq for discovering SNP and establishing parentage and kinship
in cooperatively breeding super starlings that live in highly kin
-
structured groups. In this sens
e, I
demonstrated in this dissertation
that
accurate genotypes obtained by RNA
-
seq can be used to
estimate genetic relationships between individuals in a similar way to using genotypes from a
SNPchip. SNPchip are usually not available or are expensive to design for nonmodel organisms
of little agri
cultural interest, thus RNA
-
seq can be used to complement other genotyping
approaches like genotyping
-
by
-
sequencing
(Narum et al., 2013)
or genotyping
-
in
-
thousands by
sequencing
(Campbell et al., 2015; Pavey, 2015)
.
In summary, the comparison of variant calling programs provided information on the accuracy
of calling and genotyping homozygous and heterozygous sites including multiple samples and
101
multiple tissues scenarios, and considering the effect of read depth. The
evaluation explored the
results in the context of applications for allele
-
specific and RNA
-
editing studies, as well as for
studying relationships in a population.
5.3
Future research directions
The g
enomics era is providing unprecedented opportunities for r
esearchers to study
mechanisms underlying biological processes. The amount and type of data that are being
having data is just the first step towards generating
information that can help to elucidate the
biological processes. This context requires researchers that are able to understand the biology
and to develop and apply valid statistical methods for extracting information
(Gomez
-
Ca
brero et
al., 2014)
.
Systems biology is a field of study in which disciplines converged to study biological systems
in a holistic approach. Neuroscience, physiology, and ecology all converged on the idea that it is
as important to study and model the whole system and its inter
actions as to analyze the individuals
parts alone
(Conesa an
d Mortazavi, 2014)
. For instance the molecular systems biology needs to
integrate data from complex interactions among biomolecules such as DNA, different types of
RNA, proteins, and metabolites but similar integrative challenges exist at other multiple
scales,
such as cellular, organismal and ecological organization
(Conesa and Mortazavi, 2014)
.
Moreover, any integrative approach in omics data has to deal with high dimensional data and in
most cases
a priori
uncertainty in the definition of the hypotheses to be tested. Thus, a field of
active research is the
improvement of analysis methods that allow integration of multiple data
types to produce statistically valid and biologically relevant interpretations.
102
The procedures studied and developed in this dissertation are extremely useful to continue
research in
integrative analysis method in life sciences. The flexibility of plasmode construction,
as presented in chapters 2 and 3, can be extended to validate further analysis methods. In chapter
2, the algorithm to generate plasmodes repeatedly sampled a random se
t of genes. However, it
may be customized to sample particular related sets of genes in order to validate explorative and
inference methodologies of gene and molecular networks such as pathway analysis. In chapter
3, the assessment of dissimilarity measure
s for sample
-
based clustering analysis can be
extended to gene
-
based or bi
-
clustering, and other integrative clustering analysis for multiple
omics datasets
(Milone et al., 2014; Consortia, 2014)
. Importantly, the validation could be easily
tested in dimension reduction techniques that are important to reduce
the noisy high
-
dimensional
characteristic of RNA
-
seq and other omics. Results obtained with principal components analysis,
multidimensional scaling or correspondence analysis can be compared and validated including
new approaches of dimensionality reductio
n recently proposed for integrative analysis
(Reverter
et al., 2014; Reshetova et al., 2014; Simmons et al., 2015; Tomescu et al., 2014)
.
Chapter 3,
provided and in
-
depth exploration of calling and genotyping SNPs. SNP datasets are one of the
fundamental data sources used when integrati
ng datasets in genomics. For instance it is typical
to integrate SNP with gene expression
(Conesa and Mortazavi, 2014)
as in eQTL studies
.
103
REFERENCES
104
REFERENCES
Alamancos, G., Agirre, E., and Eyras, E. (2014).
-
Spliceosomal Pre
-
mRNA Splicing
Methods in
Molecular Biology., ed. K. J. Hertel (Totowa, NJ: Humana Press), 357
397.
doi:10.1007/978
-
1
-
62703
-
980
-
2.
Altmann, A., Weber, P., Bader, D., P
reuss, M., Binder, E. B., and Müller
-
Myhsok, B. (2012). A
beginners guide to SNP calling from high
-
throughput DNA
-
sequencing data.
Hum. Genet.
131, 1541
54. doi:10.1007/s00439
-
012
-
1213
-
z.
Anders, S., and Huber, W. (2010). Differential expression analysis f
or sequence count data.
Genome Biol.
11, R106. doi:10.1186/gb
-
2010
-
11
-
10
-
r106.
Anders, S., Pyl, P. T., and Huber, W. (2014). HTSeq
-
A Python framework to work with high
-
throughput sequencing data.
bioRxiv
, 0
4. doi:10.1101/002824.
Anders, S., Pyl, P. T.,
and Huber, W. (2015). HTSeq A Python framework to work with high
-
throughput sequencing data.
Bioinformatics
31, 166
169. doi:10.1101/002824.
Auer, P. L., and Doerge, R. W. (2011). A Two
-
Stage Poisson Model for Testing RNA
-
Seq Data.
Stat. Appl. Genet. Mol.
Biol.
10, 1
28. doi:10.2202/1544
-
6115.1627.
Auer, P. L., and Doerge, R. W. (2010). Statistical design and analysis of RNA sequencing data.
Genetics
185, 405
416. doi:10.1534/genetics.110.114983.
Babbitt, C. C., Tung, J., Wray, G. A., and Alberts, S. C. (20
12). Changes in Gene Expression
Associated with Reproductive Maturation in Wild Female Baboons.
Genome Biol. Evol.
4,
102
109. doi:10.1093/gbe/evr134.
Badke, Y. M., Bates, R. O., Ernst, C. W., Schwab, C., Fix, J., Van Tassell, C. P., and Steibel, J.
P. (20
13). Methods of tagSNP selection and other variables affecting imputation accuracy
in swine.
BMC Genet.
14, 8. doi:10.1186/1471
-
2156
-
14
-
8.
Bahn, J. H., Lee, J., Li, G., Greer, C., and Peng, G. (2012). Accurate identification of A
-
to
-
I RNA
editing in human
by transcriptome sequencing Accurate identification of A
-
to
-
I RNA editing
in human by transcriptome sequencing. 142
150. doi:10.1101/gr.124107.111.
Bass, B., Hundley, H., Li, J. B., Peng, Z., Pickrell, J., Xiao, X. G., and Yang, L. (2012). The
difficult ca
lls in RNA editing.
Nat. Biotechnol.
30, 1207
1209. doi:nbt.2452
[pii]
\
r10.1038/nbt.2452.
Bates, D., Maechler, M., and Bolker, B. (2013). lme4: Linear and mixed
-
effects models using S4
classes.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the False
Discovery Rate
-
a Practical and
Powerful Approach to Multiple Testing.
J. R. Stat. Soc. Ser. B
-
Methodological
57, 289
300.
105
Bennett, S. (2004). Solexa Ltd.
Pharmacogenomics
5, 433
8. doi:10.1517/14622416.5.4.433.
Blekhman, R., Marioni, J. C., Zumbo, P., St
ephens, M., and Gilad, Y. (2010). Sex
-
specific and
lineage
-
specific alternative splicing in primates.
Genome Res.
20, 180
189.
doi:10.1101/gr.099226.109.
Bottomly, D., Walter, N. A. R., Hunter, J. E., Darakjian, P., Kawane, S., Buck, K. J., Searles, R.
P.,
Mooney, M., McWeeney, S. K., and Hitzemann, R. (2011). Evaluating Gene Expression
in C57BL/6J and DBA/2J Mouse Striatum Using RNA
-
Seq and Microarrays.
PLoS One
6,
e17820. doi:10.1371/journal.pone.0017820.
Bourgon, R., Gentleman, R., and Huber, W. (2010). Independent filtering increases detection
power for high
-
throughput experiments.
Proc. Natl. Acad. Sci.
107, 9546
9551.
doi:10.1073/pnas.0914005107.
Browning, S. R., and Browning, B. L. (2007). Rapid and ac
curate haplotype phasing and
missing
-
data inference for whole
-
genome association studies by use of localized haplotype
clustering.
Am. J. Hum. Genet.
81, 1084
1097. doi:10.1086/521987.
Bullard, J. H., Purdom, E., Hansen, K. D., and Dudoit, S. (2010). Evalu
ation of statistical
methods for normalization and differential expression in mRNA
-
Seq experiments.
BMC
Bioinformatics
11, 94.
Campbell, N. R., Harmon, S. a., and Narum, S. R. (2015). Genotyping
-
in
-
Thousands by
sequencing (GT
-
seq): A cost effective SNP gen
otyping method based on custom amplicon
sequencing.
Mol. Ecol. Resour.
15, 855
867. doi:10.1111/1755
-
0998.12357.
Cánovas, a, Rincón, G., Islas
-
Trejo, a, Jimenez
-
Flores, R., Laubscher, a, and Medrano, J. F.
(2013).
RNA sequencing to study gene expression and single nucleotide polymorphism
variation associated with citrate content in cow milk.
J. Dairy Sci.
96, 2637
48.
doi:10.3168/jds.2012
-
6213.
Cánovas, A., Rincon, G., Islas
-
Trejo, A., Wickramasinghe, S., and Medran
o, J. F. (2010). SNP
discovery in the bovine milk transcriptome using RNA
-
Seq technology.
Mamm. Genome
21, 592
8. doi:10.1007/s00335
-
010
-
9297
-
z.
Cattell, R. B., and Jaspers, J. (1967). General Plasmode No. 30
-
10
-
5
-
2 for factor analytic
exercises and resear
ch.
Multivariate Behav. Res.
Chen, G., Wang, C., and Shi, T. (2011). Overview of available methods for diverse RNA
-
Seq
data analyses.
Sci. China. Life Sci.
54, 1121
8. doi:10.1007/s11427
-
011
-
4255
-
x.
Chen, J. Y., Peng, Z., Zhang, R., Yang, X. Z., Tan, B. C.
M., Fang, H., Liu, C. J., Shi, M., Ye, Z.
Q., Zhang, Y. E., et al. (2014). RNA Editome in Rhesus Macaque Shaped by Purifying
Selection.
PLoS Genet.
10, 8
12. doi:10.1371/journal.pgen.1004274.
Cheung, V. G., Nayak, R. R., Wang, I. X., Elwyn, S., Cousins, S
. M., Morley, M., and Spielman,
R. S. (2010). Polymorphic Cis
-
and Trans
-
Regulation of
Human Gene Expression.
PLoS Biol
8, e1000480. doi:10.1371/journal.pbio.1000480.
106
Conesa, A., and Mortazavi, A. (2014). The common ground
of genomics and systems biology.
BMC Syst. Biol.
8 Suppl 2, S1. doi:10.1186/1752
-
0509
-
8
-
S2
-
S1.
Consortia, Stat. (2014). STATegRa: Classes and methods for multi
-
omics data integration. R
package v 1.2.1.
Cui, X., Hwang, J. T. G., Qiu, J., Blades, N. J., an
d Churchill, G. A. (2005). Improved statistical
tests for differential gene expression by shrinking variance components estimates.
Biostatistics
6, 59
75. doi:10.1093/biostatistics/kxh018.
Dalton, L., Ballarin, V., and Brun, M. (2009). Clustering algorithm
s: on learning, validation,
performance, and applications to genomics.
Curr. Genomics
10, 430
45.
doi:10.2174/138920209789177601.
Di, Y., Schafer, D. W., Cumbie, J. S., and Chang, J. H. (2011). The NBP Negative Binomial
Model for Assessing Differential Gen
e Expression from RNA
-
Seq.
Stat. Appl. Genet. Mol.
Biol.
10. doi:10.2202/1544
-
6115.1637.
Dillies, M. A., Rau, A., Aubert, J., Hennequet
-
Antier, C., Jeanmougin, M., Servant, N., Keime,
C., Marot, G., Castel, D., Estelle, J., et al. (2012). A comprehensive e
valuation of
normalization methods for Illumina high
-
throughput RNA sequencing data analysis.
Brief.
Bioinform.
doi:10.1093/bib/bbs046.
Djari, A., Esquerré, D., Weiss, B., Martins, F., Meersseman, C., Boussaha, M., Klopp, C., and
Rocha, D. (2013). Gene
-
bas
ed single nucleotide polymorphism discovery in bovine muscle
using next
-
generation transcriptomic sequencing.
BMC Genomics
14, 307.
doi:10.1186/1471
-
2164
-
14
-
307.
Edwards, D. B., Ernst, C. W., Tempelman, R. J., Rosa, G. J. M., Raney, N. E., Hoge, M. D., and
Bates, R. O. (2008). Quantitative trait loci mapping in an F2 Duroc x Pietrain resource
population: I. Growth traits.
J. Anim. Sci.
86, 241
253. doi:10.2527/jas.2006
-
625.
Ekblom, R., and Galindo, J. (2011). Applications of next generation sequencing in mo
lecular
ecology of non
-
model organisms.
Heredity (Edinb).
107, 1
15. doi:10.1038/hdy.2010.152.
Emmert
-
Streib, F. (2013). Influence of the experimental design of gene expression studies on
the inference of gene regulatory networks: environmental factors.
Pe
erJ
1, e10.
doi:10.7717/peerj.10.
Ernst, C. W., Steibel, J. P., Sollero, B. P., Strasburg, G. M., Guimarães, J. D., and Raney, N. E.
(2011). Transcriptional profiling during pig fetal skeletal muscle development using direct
high
-
throughput sequencing and
crossplatform comparison with gene expression
microarrays. in
Annual Meeting American Dairy Science Association and American Society
of Animal Science.
(New Orleans, Louisiana: J. Anim. Sci).
Fang, Z., and Cui, X. (2011). Design and validation issues in RN
A
-
seq experiments.
Brief.
Bioinform.
12, 280
287. doi:10.1093/bib/bbr004.
107
Frazee, A., Langmead, B., and Leek, J. (2011). ReCount: A multi
-
experiment resource of
analysis
-
ready RNA
-
seq gene count datasets.
BMC Bioinformatics
12, 449.
Design and Inference in High
-
Plant Systems Biology SE
-
9
Methods in Molecular Biology
TM
., ed. D. A. Belostotsky (Humana Press), 181
206 LA
English. doi:10.1007/978
-
1
-
60327
-
563
-
7_9.
Gadbury, G. L., Xiang, Q., Yang, L., Barnes, S., Page, G. P., and Allison, D. B. (2008).
Evaluating statistical methods using plasmode data sets in the age of massive public
databases: an illustration usi
ng false discovery rates.
PLoS Genet.
4, e1000098.
doi:10.1371/journal.pgen.1000098.
Gentleman, R., Carey, V., Bates, D., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L.,
Ge, Y., Gentry, J., et al. (2004). Bioconductor: open software developm
ent for
computational biology and bioinformatics.
Genome Biol.
5, R80.
Gomez
-
Cabrero, D., Abugessaisa, I., Maier, D., Teschendorff, A., Merkenschlager, M., Gisel,
A., Ballestar, E., Bongcam
-
Rudloff, E., Conesa, A., and Tegnér, J. (2014). Data integration
i
n the era of omics: current and future challenges.
BMC Syst. Biol.
8, I1. doi:10.1186/1752
-
0509
-
8
-
S2
-
I1.
Griffith, M., Griffith, O. L., Mwenifumbo, J., Goya, R., Morrissy, a S., Morin, R. D., Corbett, R.,
Tang, M. J., Hou, Y.
-
C., Pugh, T. J., et al. (2010)
. Alternative expression analysis by RNA
sequencing.
Nat. Methods
7, 843
7. doi:10.1038/nmeth.1503.
Gu, X. (2015). Statistical Detection of Differentially Expressed Genes based on RNA
-
seq: from
Biological to Phylogenetic Replicates.
Brief. Bioinform.
,
bbv035. doi:10.1093/bib/bbv035.
Gu, X., Zou, Y., Huang, W., Shen, L., Arendsee, Z., and Su, Z. (2013). Phylogenomic distance
method for analyzing transcriptome evolution based on RNA
-
seq data.
Genome Biol. Evol.
5, 1746
1753. doi:10.1093/gbe/evt121.
Gualdr
ón Duarte, J. L., Bates, R. O., Ernst, C. W., Raney, N. E., Cantet, R. J. C., and Steibel, J.
P. (2013). Genotype imputation accuracy in a F2 pig population using high density and low
density SNP panels.
BMC Genet.
14, 38. doi:10.1186/1471
-
2156
-
14
-
38.
Habi
er, D., Fernando, R. L., and Dekkers, J. C. M. (2007). The impact of genetic relationship
information on genome
-
assisted breeding values.
Genetics
177, 2389
2397.
doi:10.1534/genetics.107.081190.
Halkidi, M., Batistakis, Y., and Vazirgiannis, M. (2001). On
Clustering Validation Techniques.
J.
Intell. Inf. Syst.
17, 107
145.
Handl, J., Knowles, J., and Kell, D. B. (2005). Computational cluster validation in post
-
genomic
data analysis.
Bioinformatics
21, 3201
12. doi:10.1093/bioinformatics/bti517.
108
Hardcastle,
T. J., and Kelly, K. A. (2010). baySeq: empirical Bayesian methods for identifying
differential expression in sequence count data.
BMC Bioinformatics
11, 422.
doi:10.1186/1471
-
2105
-
11
-
422.
Hastie, T., Tibshirani, R., and Friedman, J. H. (2009).
The Elemen
ts of Statistical Learning. data
Mining, Inference, and Prediction
. Second Edi. New York, New York, USA: Springer.
Helyar, S. J., Hemmer
-
Hansen, J., Bekkevold, D., Taylor, M. I., Ogden, R., Limborg, M. T.,
Cariani, a., Maes, G. E., Diopere, E., Carvalho, G
. R., et al. (2011). Application of SNPs for
population genetics of nonmodel organisms: New opportunities and challenges.
Mol. Ecol.
Resour.
11, 123
136. doi:10.1111/j.1755
-
0998.2010.02943.x.
Hu, M., Zhu, Y., Taylor, J. M. G., Liu, J. S., and Qin, Z. S. (2
011). Using Poisson mixed
-
effects
model to quantify transcript
-
level gene expression in RNA
-
Seq.
Bioinformatics
28, 63
68.
doi:10.1093/bioinformatics/btr616.
Van Iterson, M., Boer, J. M., and Menezes, R. X. (2010). Filtering, FDR and power.
BMC
Bioinformat
ics
11, 450. doi:10.1186/1471
-
2105
-
11
-
450.
Izenman, A. J. (2008).
Modern Multivariate Statistical Techniques. Regression, Classification,
and Manifold Learning
. New York, New York, USA: Springer.
Jiang, D., Tang, C., and Zhang, A. (2004). Cluster analysis for gene expression data: a survey.
IEEE Trans. Knowl. Data Eng.
16, 1370
1386. doi:10.1109/TKDE.2004.68.
Johnson, R. A., and Wichern, D. W. (2002).
Applied multivariate statistical analysis
. 5th
ed.
Upper Saddle River, N.J.: Prentice Hall.
Joost, S., Bonin, a., Bruford, M. W., Després, L., Conord, C., Erhardt, G., and Taberlet, P.
(2007). A spatial analysis method (SAM) to detect candidate loci for selection: Towards a
landscape genomics approach
to adaptation.
Mol. Ecol.
16, 3955
3969.
doi:10.1111/j.1365
-
294X.2007.03442.x.
Kendall, M. G., and Gobbons, J. D. (1990).
Rank Correlation Methods
. 5th ed. USA: Oxford
University Press.
Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., Mclellan, M. D., L
in, L., Miller, C. a, Mardis,
copy number alteration discovery in cancer by exome se
quencing. 568
576.
doi:10.1101/gr.129684.111.
Kumar, P., Al
-
shafai, M., Ahmed, W., Muftah, A., Chalhoub, N., Elsaid, M. F., Aleem, A. A., and
Suhre, K. (2014). Evaluation of SNP calling using single and multiple
-
sample calling
algorithms by validation agai
nst array base genotyping and Mendelian inheritance. 7, 1
13.
doi:10.1186/1756
-
0500
-
7
-
747.
Lamichhaney, S., Martinez Barrio, A., Rafati, N., Sundström, G., Rubin, C.
-
J., Gilbert, E. R.,
Berglund, J., Wetterbom, A., Laikre, L., Webster, M. T., et al. (2012)
. Population
-
scale
109
sequencing reveals genetic differentiation due to local adaptation in Atlantic herring.
Proc.
Natl. Acad. Sci. U. S. A.
109, 19345
50. doi:10.1073/pnas.1216128109.
Langmead, B., Hansen, K. D., and Leek, J. T. (2010). Cloud
-
scale RNA
-
sequ
encing differential
expression analysis with Myrna.
Genome Biol.
11, R83. doi:10.1186/gb
-
2010
-
11
-
8
-
r83.
Langmead, B., and Salzberg, S. L. (2012). Fast gapped
-
read alignment with Bowtie 2.
Nat.
Methods
9, 357
360. doi:10.1038/nmeth.1923.
Law, C. W., Chen, Y
., Shi, W., and Smyth, G. K. (2014). Voom: precision weights unlock linear
model analysis tools for RNA
-
seq read counts.
Genome Biol.
15, R29. doi:10.1186/gb
-
2014
-
15
-
2
-
r29.
Lee, J.
-
H., Ang, J. K., and Xiao, X. (2013). Analysis and design of RNA sequencing
experiments
for identifying RNA editing and other single
-
nucleotide variants.
RNA
, rna.037903.112
.
doi:10.1261/rna.037903.112.
Leek, J. T., and Storey, J. D. (2011). The Joint Null Criterion for Multiple Hypothesis Tests.
Stat.
Appl. Genet. Mol. Biol.
10.
doi:10.2202/1544
-
6115.1673.
Legendre, P., and Legendre, L. (2012).
Numerical ecology
. Third. Amsterdam: Elsevier.
Lemay, M. A., Donnelly, D. J., and Russello, M. A. (2013). Transcriptome
-
wide comparison of
sequence variation in divergent ecotypes of kokan
ee salmon.
BMC Genomics
.
doi:10.1186/1471
-
2164
-
14
-
308.
Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping
and population genetical parameter estimation from sequencing data.
Bioinformatics
27,
2987
2993. doi:10.
1093/bioinformatics/btr509.
Li, H. (2014). Toward better understanding of artifacts in variant calling from high
-
coverage
samples.
Bioinformatics
30, 2843
2851. doi:10.1093/bioinformatics/btu356.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J.,
Homer, N., Marth, G., Abecasis, G.,
and Durbin, R. (2009a). The Sequence Alignment / Map ( SAM ) Format and SAMtools
1000 Genome Project Data Processing Subgroup. 1
2.
Li, J. B., Levanon, E. Y., Yoon, J.
-
K., Aach, J., Xie, B., LePoust, E., Zhang, K., Gao,
Y., and
Church, G. M. (2009b). Genome
-
Wide Identification of Human RNA editing sites by parallel
DNA capturing and sequencing.
Science (80
-
. ).
324, 1210:1213.
Li, J., Witten, D. M., Johnstone, I. M., and Tibshirani, R. (2012). Normalization, testing, and
false
discovery rate estimation for RNA
-
sequencing data.
Biostatistics
13, 523
38.
doi:10.1093/biostatistics/kxr031.
-
Y., Wang, M.,
Wang, C., et al. (2014). Detecting and corre
cting systematic variation in large
-
scale RNA
sequencing data.
Nat. Biotechnol.
32. doi:10.1038/nbt.3000.
110
Li, Y., Chen, W., Liu, E. Y., and Zhou, Y. H. (2013). Single Nucleotide Polymorphism (SNP)
Detection and Genotype Calling from Massively Parallel Sequ
encing (MPS) Data.
Stat.
Biosci.
5, 3
25. doi:10.1007/s12561
-
012
-
9067
-
4.
-
Statistical Analysis of
Next Generation Sequencing Data SE
-
10
Frontiers in Probability and the Statistical
Sciences., eds. S. Datta and D. Nettleton (Springer International Publishing), 191
217.
doi:10.1007/978
-
3
-
319
-
07212
-
8_10.
Love, M. I., Huber, W., and Anders, S. (2014a). Moderated estimation of fold change and
dispersion for RNA
-
Seq data with DESeq2.
Genome Biol.
15, 550. doi:10.1186/s13059
-
014
-
0550
-
8.
Love, M. I., Huber, W., and Anders, S. (2014b). Moderated estimation of fold change and
dispersion for RNA
-
Seq data with DESeq2.
bioRxiv
. doi:10.1101/002832.
Ma,
C., and Wang, X. (2012). Application of the Gini correlation coefficient to infer regulatory
relationships in transcriptome analysis.
Plant Physiol.
160, 192
203.
doi:10.1104/pp.112.201962.
Marguerat, S., Bähler, J., and Bahler, J. (2010). RNA
-
seq: from te
chnology to biology.
Cell. Mol.
Life Sci.
67, 569
579. doi:10.1007/s00018
-
009
-
0180
-
6.
Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., and Gilad, Y. (2008). RNA
-
seq: an
assessment of technical reproducibility and comparison with gene expression arr
ays.
Genome Res.
18, 1509
1517.
McCarthy, D. J., Chen, Y., and Smyth, G. K. (2012). Differential expression analysis of
multifactor RNA
-
Seq experiments with respect to biological variation.
Nucleic Acids Res.
,
gks042
. doi:10.1093/nar/gks042.
Mehta, T. S.,
Zakharkin, S. O., Gadbury, G. L., and Allison, D. B. (2006). Epistemological issues
in omics and high
-
dimensional biology: give the people what they want.
Physiol. Genomics
28, 24
32. doi:10.1152/physiolgenomics.00095.2006.
Mehta, T., Tanik, M., and Allis
on, D. B. (2004). Towards sound epistemological foundations of
statistical methods for high
-
dimensional biology.
Nat. Genet.
36, 943
947.
doi:10.1038/ng1422.
Metzker, M. L. (2010). Sequencing technologies
-
the next generation.
Nat. Rev. Genet.
11, 31
46.
doi:10.1038/nrg2626.
Milone, D. H., Stegmayer, G., Lopez, M., Kamenetzky, L., and Carrari, F. (2014). Improving
clustering with metabolic pathway data.
BMC Bioinformatics
15, 101. doi:10.1186/1471
-
2105
-
15
-
101.
Mortazavi, A., Williams, B. A., McCue, K., Sch
aeffer, L., and Wold, B. (2008). Mapping and
quantifying mammalian transcriptomes by RNA
-
Seq.
Nat. Methods
5, 621
628.
doi:http://www.nature.com/nmeth/journal/v5/n7/suppinfo/nmeth.1226_S1.html.
111
Narum, S. R., Buerkle, C. A., Davey, J. W., Miller, M. R., and
Hohenlohe, P. a. (2013).
Genotyping
-
by
-
sequencing in ecological and conservation genomics.
Mol. Ecol.
, n/a
n/a.
doi:10.1111/mec.12350.
Nielsen, R., Paul, J. S., Albrechtsen, A., and Song, Y. S. (2011). Genotype and SNP calling
from next
-
generation sequenc
ing data.
Nat. Rev. Genet.
12, 443
51. doi:10.1038/nrg2986.
Oshlack, A., Robinson, M., and Young, M. (2010). From RNA
-
seq reads to differential
expression results.
Genome Biol.
11, 220.
Ozsolak, F., and Milos, P. M. (2011). RNA sequencing: advances, challe
nges and opportunities.
Nat. Rev. Genet.
12, 87
98. doi:10.1038/nrg2934.
Pachter, L. (2011). Models for transcript quantification from RNA
-
Seq.
ArXiv
1104.3889, 1
28.
Pavey, S. a (2015). High
-
throughput SNPs for all: genotyping
-
in
-
thousands.
Mol. Ecol. Res
our.
15, 685
687. doi:10.1111/1755
-
0998.12405.
Peng, Z., Cheng, Y., Tan, B. C.
-
M., Kang, L., Tian, Z., Zhu, Y., Zhang, W., Liang, Y., Hu, X.,
Tan, X., et al. (2012). Comprehensive analysis of RNA
-
Seq data reveals extensive RNA
editing in a human transcript
ome.
Nat. Biotechnol.
30, 253
260. doi:10.1038/nbt.2122.
Pickrell, J. K., Marioni, J. C., Pai, A. A., Degner, J. F., Engelhardt, B. E., Nkadori, E., Veyrieras,
J.
-
B., Stephens, M., Gilad, Y., and Pritchard, J. K. (2010). Understanding mechanisms
underlying
human gene expression variation with RNA sequencing.
Nature
464, 768
772.
doi:http://www.nature.com/nature/journal/v464/n7289/suppinfo/nature08872_S1.html.
Pirinen, M., Lappalainen, T., Zaitlen, N. A., Dermitzakis, E. T., Donnelly, P., McCarthy, M. I., and
Rivas, M. A. (2015). Assessing allele
-
specific expression across multiple tissues from
RNA
-
seq read data.
Bioinformatics
, btv074
. doi:10.1093/bioinformati
cs/btv074.
Piskol, R., Ramaswami, G., and Li, J. B. (2013). Reliable Identification of Genomic Variants
from RNA
-
Seq Data.
Am. J. Hum. Genet.
93, 641
51. doi:10.1016/j.ajhg.2013.08.008.
Qian, X., Ba, Y., Zhuang, Q., and Zhong, G. (2014). RNA
-
Seq technology
and its application in
fish transcriptomics.
OMICS
18, 98
110. doi:10.1089/omi.2013.0110.
Quinn, A., Juneja, P., and Jiggins, F. M. (2014). Estimates of allele
-
specific expression in
Drosophila with a single genome sequence and RNA
-
seq data.
Bioinformatics
30, 2603
10. doi:10.1093/bioinformatics/btu342.
Quinn, E. M., Cormican, P., Kenny, E. M., Hill, M., Anney, R., Gill, M., Corvin, A. P., and Morris,
D. W. (2013). Development of Strategies for SNP Detection in RNA
-
Seq Data: Application
to Lym
phoblastoid Cell Lines and Evaluation Using 1000 Genomes Data.
PLoS One
8.
doi:10.1371/journal.pone.0058815.
R Development Core Team (2014). R: A language and environment for statistical computing.
112
Ramaswami, G., Lin, W., Piskol, R., Tan, M. H., Davis, C.,
and Li, J. B. (2012). Accurate
identification of human Alu and non
-
Alu RNA editing sites.
Nat. Methods
9, 579
581.
doi:10.1038/nmeth.1982.
(2013). Identifying RN
A editing sites using RNA sequencing data alone.
Nat. Methods
10,
128
32. doi:10.1038/nmeth.2330.
Rapaport, F., Khanin, R., Liang, Y., Krek, A., Zumbo, P., Mason, C. E., Socci, N. D., and Betel,
D. (2013). Comprehensive evaluation of differential expressio
n analysis methods for RNA
-
seq data. 1
21.
Rau, A., Gallopin, M., Celeux, G., and Jaffrézic, F. (2013). Data
-
based filtering for replicated
high
-
throughput transcriptome sequencing experiments.
Bioinformatics
29, 2146
2152.
doi:10.1093/bioinformatics/btt35
0.
Rau, A., Maugis
-
Rabusseau, C., and Celeux, G. (2015). Co
-
expression analysis of high
-
throughput transcriptome sequencing data with Poisson mixture models.
Bioinformatics
31,
1420
1427. doi:10.1093/bioinformatics/btu845.
Reeb, P. D., and Steibel, J. P. (
2013). Evaluating statistical analysis models for RNA
sequencing experiments.
Front. Genet.
4, 1
9. doi:10.3389/fgene.2013.00178.
Reshetova, P., Smilde, A. K., van Kampen, A. H., and Westerhuis, J. a (2014). Use of prior
knowledge for the analysis of high
-
throughput transcriptomics and metabolomics data.
BMC Syst. Biol.
8 Suppl 2, S2. doi:10.1186/1752
-
0509
-
8
-
S2
-
S2.
Reverter, F., Vegas, E., and Oller, J. M. (2014). Kernel
-
PCA data integration with enhanced
interpretability.
BMC Syst. Biol.
8 Suppl 2, S6. doi
:10.1186/1752
-
0509
-
8
-
S2
-
S6.
Roberts, A., Pimentel, H., Trapnell, C., and Pachter, L. (2011). Identification of novel transcripts
in annotated genomes using RNA
-
Seq.
Bioinformatics
27, 2325
9.
doi:10.1093/bioinformatics/btr355.
Robinson, M. D., McCarthy, D.
J., and Smyth, G. K. (2010). edgeR: a Bioconductor package for
differential expression analysis of digital gene expression data.
Bioinformatics
26, 139
140. doi:10.1093/bioinformatics/btp616.
Robinson, M. D., and Oshlack, A. (2010). A scaling normalizatio
n method for differential
expression analysis of RNA
-
seq data.
Genome Biol.
11, R25. doi:10.1186/gb
-
2010
-
11
-
3
-
r25.
Robinson, M. D., and Smyth, G. K. (2008). Small
-
sample estimation of negative binomial
dispersion, with applications to SAGE data.
Biostatist
ics
9, 321
332.
doi:10.1093/biostatistics/kxm030.
Robles, J. a, Qureshi, S. E., Stephen, S. J., Wilson, S. R., Burden, C. J., and Taylor, J. M.
(2012). Efficient experimental design and analysis strategies for the detection of differential
113
expression using
RNA
-
Sequencing.
BMC Genomics
13, 484. doi:10.1186/1471
-
2164
-
13
-
484.
Rosa, G. J. M., Steibel, J. P., and Tempelman, R. J. (2005). Reassessing design and analysis
of two
-
colour microarray experiments using mixed effects models.
Comp. Funct. Genomics
6, 123
31. doi:10.1002/cfg.464.
Salem, M., Vallejo, R. L., Leeds, T. D., Palti, Y., Liu, S., Sabbagh, A., Rexroad, C. E., and Yao,
J. (2012). RNA
-
seq identifies SNP markers for growth traits in rainbow trout.
PLoS One
7.
doi:10.1371/journal.pone.0036264.
Schoville, S. D., Bonin, A., François, O., Lobreaux, S., Melodelima, C., and Manel, S. (2012).
Adaptive Genetic Variation on the Landscape: Methods and Cases.
Annu. Rev. Ecol. Evol.
Syst.
43, 23
43. doi:10.1146/annurev
-
ecolsys
-
110411
-
160248.
Schunter, C.,
Garza, J. C., Macpherson, E., and Pascual, M. (2013). SNP development from
RNA
-
seq data in a nonmodel fish: how many individuals are needed for accurate allele
frequency prediction?
Mol. Ecol. Resour.
, n/a
n/a. doi:10.1111/1755
-
0998.12155.
Seeb, J. E., Car
valho, G., Hauser, L., Naish, K., Roberts, S., and Seeb, L. W. (2011). Single
-
nucleotide polymorphism (SNP) discovery and applications of SNP genotyping in
nonmodel organisms.
Mol. Ecol. Resour.
11, 1
8. doi:10.1111/j.1755
-
0998.2010.02979.x.
Severin, A. J.
, Woody, J. L., Bolon, Y.
-
T., Joseph, B., Diers, B. W., Farmer, A. D., Muehlbauer,
G. J., Nelson, R. T., Grant, D., Specht, J. E., et al. (2010). RNA
-
Seq Atlas of Glycine max:
a guide to the soybean transcriptome.
BMC Plant Biol.
10, 160. doi:10.1186/1471
-
2229
-
10
-
160.
Si, Y., Liu, P., Li, P., and Brutnell, T. P. (2014). Model
-
Based Clustering for RNA
-
Seq Data.
Bioinformatics
30, 197
205. doi:10.1093/bioinformatics/btt632.
Simmons, S., Peng, J., Bienkowska, J., and Berger, B. (2015). Discovering What Dimensi
onality
Reduction Really Tells Us About RNA
-
Seq Data.
J. Comput. Biol.
22, 150622063210002.
doi:10.1089/cmb.2015.0085.
Sloutsky, R., Jimenez, N., Swamidass, S. J., and Naegle, K. M. (2013). Accounting for noise
when clustering biological data.
Brief. Bioin
form.
14, 423
36. doi:10.1093/bib/bbs057.
Smeds, L., and Künstner, A. (2011). ConDeTri
--
a content dependent read trimmer for Illumina
data.
PLoS One
6, e26314. doi:10.1371/journal.pone.0026314.
Smith, S., Bernatchez, L., and Beheregaray, L. B. (2013). RNA
-
seq analysis reveals extensive
transcriptional plasticity to temperature stress in a freshwater fish species.
BMC Genomics
14, 375. doi:10.1186/1471
-
2164
-
14
-
375.
Bioinformatics
Statistics
for Biology and Health., eds. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, and W. Huber
(New York: Springer), 397
420. doi:10.1007/0
-
387
-
29362
-
0_23.
114
Smyth, G. K. (2004). Linear models and empirical bayes methods for assessing differential
expression in microarray experiments.
Stat. Appl. Genet. Mol. Biol.
3, Article3.
doi:10.2202/1544
-
6115.1027.
Smyth, G. K., Ritchie, M., and Thorne, N. (2012). limma:
Linear Models for Microarray Data
-
Seq Data Analysis ). Melbourne, Australia:
Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research.
Sokal, R. R., and Rohlf, F. J. (1962). The comparison of dendr
ograms by objective methods.
Taxon
11, 33
40.
Sollero, B. P., Guimarães, S. E. F., Rilington, V. D., Tempelman, R. J., Raney, N. E., Steibel, J.
P., Guimarães, J. D., Lopes, P. S., Lopes, M. S., and Ernst, C. W. (2011). Transcriptional
profiling during foe
tal skeletal muscle development of Piau and Yorkshire
Landrace cross
-
bred pigs.
Anim. Genet.
42, 600
612. doi:10.1111/j.1365
-
2052.2011.02186.x.
Srivastava, S., and Chen, L. (2010). A two
-
parameter generalized Poisson model to improve the
analysis of RNA
-
se
q data.
Nucleic Acids Res.
38, e170. doi:10.1093/nar/gkq670.
Steibel, J. P., Bates, R. O., Rosa, G. J. M., Tempelman, R. J., Rilington, V. D., Ragavendran,
A., Raney, N. E., Ramos, A. M., Cardoso, F. F., Edwards, D. B., et al. (2011). Genome
-
wide linkage a
nalysis of global gene expression in loin muscle tissue identifies candidate
genes in pigs.
PLoS One
6, e16766. doi:10.1371/journal.pone.0016766.
Steibel, J. P., Poletto, R., Coussens, P. M., and Rosa, G. J. M. (2009). A powerful and flexible
linear mixed
model framework for the analysis of relative quantification RT
-
PCR data.
Genomics
94, 146
52. doi:10.1016/j.ygeno.2009.04.008.
Steibel, J. P., Reeb, P. D., Ernst, C. W., and Bates, R. O. (2014). Mapping cis and trans
-
acting
eQTL in swine populations. in
10
th WCGALP
(Vancouver, Canada).
Steibel, J. P., Wang, H., and Zhong, P.
-
S. (2015). A hidden Markov approach for ascertaining
cSNP genotypes from RNA sequence data in the presence of allelic imbalance by
exploiting linkage disequilibrium.
BMC Bioinformatics
16, 1
12. doi:10.1186/s12859
-
015
-
0479
-
2.
Storey, J. D. (2002). A direct approach to false discovery rates.
J. R. Stat. Soc. Ser. B
(Statistical Methodol.
64, 479
498. doi:10.1111/1467
-
9868.00346.
Storey, J. D., and Tibshirani, R. (2003). Statistical methods for identifying differentially
expressed genes in DNA microarrays.
Methods Mol. Biol.
224, 149
57. doi:10.1385/1
-
59259
-
364
-
X:149.
Tomescu, O. a, Mattanovich, D., and Thallinger, G. G. (2014). In
tegrative omics analysis. A
study based on Plasmodium falciparum mRNA and protein data.
BMC Syst. Biol.
8 Suppl
2, S4. doi:10.1186/1752
-
0509
-
8
-
S2
-
S4.
Trapnell, C., Pachter, L., and Salzberg, S. L. (2009). TopHat: discovering splice junctions with
RNA
-
Seq.
Bioinformatics
25, 1105
1111.
115
Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., Salzberg, S.
L., Rinn, J. L., and Pachter, L. (2012). Differential gene and transcript expression analysis
of RNA
-
seq experiments with TopH
at and Cufflinks.
Nat. Protoc.
7, 562
578.
doi:10.1038/nprot.2012.016.
Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S.
L., Wold, B. J., and Pachter, L. (2010). Transcript assembly and quantification by RNA
-
Seq
reveals unannotated transcripts and isoform switching during cell differentiation.
Nat.
Biotechnol.
28, 511
5. doi:10.1038/nbt.1621.
VanRaden, P. M. (2008). Efficient methods to compute genomic predictions.
J. Dairy Sci.
91,
4414
4423. doi:10.3168/jds
.2007
-
0980.
Vaughan, L. K., Divers, J., Padilla, M., Redden, D. T., Tiwari, H. K., Pomp, D., and Allison, D. B.
(2009). The use of plasmodes as a supplement to simulations: A simple example
evaluating individual admixture estimation methodologies.
Comput.
Stat. Data Anal.
53,
1755
1766. doi:10.1016/j.csda.2008.02.032.
Vijay, N., Poelstra, J. W., Künstner, A., and Wolf, J. B. W. (2013). Challenges and strategies in
transcriptome assembly and differential gene expression quantification. A comprehensive
in sil
ico assessment of RNA
-
seq experiments.
Mol. Ecol.
22, 620
34.
doi:10.1111/mec.12014.
Waller, N. G., Underhill, J. M., and Heather, A. (2010). Multivariate Behavioral A Method for
Generating Simulated Plasmodes and Artificial Test Clusters with User
-
Defined
Shape ,
Size , and. 37
41.
Wang, R., Sun, L., Bao, L., Zhang, J., Jiang, Y., Yao, J., Song, L., Feng, J., Liu, S., and Liu, Z.
(2013). Bulk segregant RNA
-
seq reveals expression and positional candidate genes and
allele
-
specific expression for disease resi
stance against enteric septicemia of catfish.
BMC
Genomics
14, 929. doi:10.1186/1471
-
2164
-
14
-
929.
Wang, Z., Gerstein, M., and Snyder, M. (2009). RNA
-
Seq: a revolutionary tool for
transcriptomics.
Nat Rev Genet
10, 57
63. doi:10.1038/nrg2484.
Weinman, L. R.
, Solomon, J. W., and Rubenstein, D. R. (2014). A comparison of single
nucleotide polymorphism and microsatellite markers for analysis of parentage and kinship
in a cooperatively breeding bird.
Mol. Ecol. Resour.
, n/a
n/a. doi:10.1111/1755
-
0998.12330.
Wick
ramasinghe, S., Cánovas, A., Rincón, G., and Medrano, J. F. (2014). RNA
-
Sequencing: A
tool to explore new frontiers in animal genetics.
Livest. Sci.
166, 206
216.
doi:10.1016/j.livsci.2014.06.015.
Van De Wiel, M. a, Leday, G. G. R., Pardo, L., Rue, H., Van
Der Vaart, A. W., and Van
Wieringen, W. N. (2013). Bayesian analysis of RNA sequencing data by estimating multiple
shrinkage priors.
Biostatistics
, 113
128. doi:10.1093/biostatistics/kxs031.
De Wit, P., Pespeni, M. H., Ladner, J. T., Barshis, D. J., Senec
a, F., Jaris, H., Therkildsen, N.
116
genomics via RNA
-
Seq: an introduction to high
-
throughput sequencing data analysis.
Mol.
Ecol. Resour.
12, 1058
67. doi:10.1111/1755
-
0998.12
003.
Witten, D. M. (2011). Classification and clustering of sequencing data using a Poisson model.
Ann. Appl. Stat.
5, 2493
2518. doi:10.1214/11
-
AOAS493.
Data Clustering
Chapman &
Hall/CRC
Data Mining and Knowledge Discovery Series. (Chapman and Hall/CRC).
doi:doi:10.1201/b15410
-
24.
Yang, C., and Wei, H. (2015). Designing Microarray and RNA
-
Seq Experiments for Greater
Systems Biology Discovery in Modern Plant Genomics.
Mol. Plant
8, 196
206.
doi:10.1016/j.molp.2014.11.012.
Yang, H., and Churchill, G. (2007). Estimating p
-
values in small microarray experiments.
Bioinformatics
23, 38
43. doi:10.1093/bioinformatics/btl548.
Yu, Y., Wei, J., Zhang, X., Liu, J., Liu, C., Li, F., and Xiang, J. (2014). SNP discovery in the
transcriptome of white Pacific shrimp Litopenaeus vannamei by next generation
sequencing.
PLoS One
9, e87218. doi:10.1371/journal.pone.0087218.
Zhou, X., Lindsay, H., and Robinson, M. D. (2014). Robustly detecting differential expression in
RNA sequencing data using observation weights.
Nucleic Acids Res.
, 1
10.
doi:10.1093/nar/gku310.
Zhou, Y.
-
H., Xia, K., and Wright, F. a (2011). A powerful and
flexible approach to the analysis of
RNA sequence count data.
Bioinformatics
27, 2672
8. doi:10.1093/bioinformatics/btr449.